DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is in response to the application submitted on 11/16/2017.
Claims 1-20 are presented for examination.

Information Disclosure Statement
The information disclosure statements submitted on 5/29/2018 and 10/02/2018 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are considered by the examiner.

Drawings
The Drawings filed on 11/16/2017 are acceptable for examination purposes.

Specification
The disclosure is objected to because of the following informalities: 
Page 24 Lines 3-4 of Para. [00073] recites, in part, "can be used to display live images captured by the sensor 810, fused images, such as the image 335,". The image labelled as 335 is not annotated in any of the drawings.    
Page 6 Line 4 of Para. [00018] recites, in part, “solving a double layer optimization problem minimizing a difference betwe”. The sentence continues but runs off the page.
Appropriate correction is required.

Claim Objections
Claims 1-20 are objected to because of the following informalities:  
Claims 1-20 are not properly indented. (See MPEP § 608.01(i)).  
Claims 1-3 and 7-8 recite, in part, “double layer optimization”. Examiner recommends changing to “double layer optimization problem” by adding “problem” as in claims 11-13 and 17-18.
Claim 1 Line 2 recites, in part, “the neural network”. Examiner suggests changing to “a neural network” as it is presented for the first time. Line 4 recites, in part, “to solve a double layer optimization”. Examiner suggests changing to “to solve a double layer optimization problem” as recited in claim 11 on line 5 and claim 20 on line 5. Line 12 recites, in part, “the previous layer”. Examiner suggests changing to “a previous layer”. Lines 12-13 recite, in part, “previous layer; and and an output interface”. Examiner suggests changing to “previous layer; and an output interface” by removing the extra “and”.
	Claims 9 and 19 Line 2 recite, in part, “determines a set of variable”. Examiner suggests changing to “determines a set of variables” as reflected on line 3.
	Claim 11 Line 5 recites, in part, “solving a double layer optimization problem minimizing a difference”. Examiner suggests changing to “solving a double layer optimization problem includes minimizing a difference” as reflected in claim 1. Line 8 recites, in part, “the first layer”. Examiner suggests changing to “a first layer”. Line 11 recites, in part, “the second layer”. Examiner suggests changing to “a second layer”. Line 13 recites, in part, “the previous layer”. Examiner suggests changing to “a previous layer”.
	Claim 18 Lines 12-13 recite, in part, “the weight matrix set W̃. solving”. Examiner suggests changing to “the weight matrix set W̃; and solving”.
	Claim 19 Lines 1-2 recite, in part, “wherein the solving using the block coordinate descent solves a set”. Examiner recommends changing to “wherein solving the Tikhonov regularized problem using the block coordinate descent determines a set” to make it clearer that the solving is from claim 18 line 14 and to use “determines” as reflected in parallel claim 9.
	Claim 20 is objected to under the same rationale as claim 11.


Allowable Subject Matter
Claims 8-9 and 18-19 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 7-9 and 11-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 11 recites the limitation "the neural network" in line 4.  There is insufficient antecedent basis for this limitation in the claim.

Claims 12-19 are dependent upon claim 11 and are rejected under the same rationale.
Claims 7 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being incomplete for omitting essential elements, such omission amounting to a gap between the elements.  See MPEP § 2172.01.  The omitted elements are: “s.t. ui,n” on line 2 of each claim, respectively.
Claims 8-9 and 18-19 are dependent upon claims 7 and 17, respectively and are rejected under the same rationale.
The term "at least some of the steps of the method" in claim 11 is a relative term which renders the claim indefinite.  The term "at least some of the steps of the method" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.  The term "at least some" under the broadest reasonable interpretation is interpreted as one or more. In the context of claim 11, the term used in "at least some of the steps of the method" allows for one or more steps of the method to be omitted or not all the steps of the method to be performed. This makes it unclear which steps of the method are necessary. Additionally, the use of this term is not present in parallel claims 1 and 20. For the purposes of examination, Examiner will interpret “at least some of the steps of the method” to be all the steps as reflected in claims 1-10 and 20.
Claims 12-19 are dependent upon claim 11 and are rejected under the same rationale.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 10-12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 20160307565 A1, hereinafter Liu) in view of Hershey et al. (US 20160034810 A1, hereinafter Hershey).

Regarding claim 1,
Liu discloses a computer-based system (Liu fig. 1 element 100 and [0019] recites “Among other components not shown, system 100 includes network 110 communicatively coupled to one or more data source(s) 108, storage 106, client devices 102 and 104, and DNSVM model generator 120.”), comprising: 
an input interface to receive an input to the neural network and labels of the input to the neural network (Liu [0004] and [0020-0021] recites “The new DNN is described herein as a deep neural support vector machine (DNSVM). [0021] The data provided by data source(s) 108 may include labeled and un-labeled data, such as transcribed and un-transcribed data. [0022] In one embodiment, the client device is capable of receiving input data such as audio and image information usable by a ASR system described herein that is operating in the device.” Client device, input data, labeled data and DNSVM (i.e. input interface, input, labels, and neural network)); 
a processor to solve a double layer optimization to produce parameters of the neural network (Liu [0050] and [0051] recite “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine ("DNSVM") performed by one or more computing devices having a processor and a memory is described. [0051] In one aspect, steps 420-450 are repeated iteratively 470 to retrain the top layer and the previous layers until parameters change less than a threshold between iterations. When the parameters change less than the threshold then the training stops and the DNSVM model is saved at step 480.” Processor in computing device performing top layer and previous layers optimization of parameters in the DNSVM (i.e. processor solving double layer optimization producing parameters of neural network)), 
wherein the double layer optimization includes an optimization of a first layer subject to an optimization of a second layer (Liu [0052] recites “At step 440 initial values are assigned to the top layer parameters according to the solution and fixed. At step 450, the previous layers of the DNSVM are trained while keeping the initial values of the top layer parameters fixed. The training uses the maximum margin objective function of step 430 to generate updated values for parameters of the one or more previous layers.” Training the previous layers while keeping top layer parameters fixed (i.e. optimization of a first layer subject to optimization of second layer)),
wherein the optimization of the first layer minimizes a difference between an output of the neural network processing the input and the labels of the input to the neural network (Liu [0032] recites “Given the training observations and their corresponding state labels, 
[equation image: media_image1.png]
, where St ∈ {1,..., N}, in frame-level training, the parameters of DNNs can be estimated by minimizing the cross-entropy.” Minimizing cross-entropy with training observations and state labels (i.e. minimize a difference between output and labels)),
wherein the optimization of the second layer minimizes a distance between an output vector of each layer and a corresponding input vector to each layer (Liu [0039] and [0040] recite “The sequence-level training can be used when a structured SVM is used for one or more layers. The sequence-level trained DNSVM can act like an acoustic model and a language model. In the max-margin sequence training, for simplicity, first consider one training utterance (O, S), where O = { o1, ... , oT} is the observation sequence and S = {s1, ... , sT} is the corresponding reference states. The parameters of the model can be estimated by maximizing 
[equation image: media_image2.png]
[0040] Here the margin is defined as the minimum distance between the reference state sequence S and competing state sequence S in the log posterior domain.” Minimizing distance between reference sequence and competing sequence when using one or more layers (i.e. Minimizing distance of output vector of an output layer at each layer and corresponding input vector)),
wherein the input vector of a current layer is a linear transformation of the non-negative output vector of the previous layer (Liu [0031] and [0034]-[0035] recite “Each node in the hidden layer 312 performs a calculation to generate an output that is then fed into each node in the second hidden layer 314. The different nodes may give different weight to different inputs resulting in a different output. [0034] Thus, substituting the slack variable εt from the constraints into the objective function, equation (5) can be reformulated as the minimization of: 
[equation image: media_image3.png]
 where w=[w_1^T, . . . , w_N^T]^T are the parameter vectors for each state and [·]_+ is the hinge function. Note the maximum of a set of linear functions is convex, thus equation (7) is convex with respect to w.” Calculations in current hidden layer nodes to next hidden layer nodes (i.e. linear transformation of previous layer as input to current layer)); and
an output interface to output the parameters of the neural network (Liu [0055] recites “Although the various blocks of FIG. 5 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component.” I/O component (i.e. output interface)).
However, Liu does not explicitly disclose a non-negative output vector. 
Hershey teaches a non-negative output vector (Hershey [0056] recites “NMF operates on a matrix of F-dimensional non-negative spectral features, usually the power or magnitude spectrogram of the mixture M=[m1 . . . mT], where T is the number of frames and 
[equation image: media_image4.png]
, t=1, . . . , T are obtained by short-time Fourier transformation of the time-domain signal. With L sources, a set of Rl non-negative basis vectors w1l, . . . , 
[equation image: media_image5.png]
 is assumed for each source l∈{1, . . . , L}.” Non-negative basis vector (i.e. non-negative vector)).
Hershey and Liu are both directed to machine learning and deep neural networks. In view of the teachings of Hershey, it would have been obvious to one of ordinary skill in the art to apply the teachings of Hershey to Liu before the effective filing date of the claimed invention in order to combine the advantages of a neural network with the internal structure of a model-based approach for optimized performance (cf. Hershey [0014]-[0015] recites “The resulting method combines the expressive power of the neural network with the internal structure of the model-based approach, while allowing inference to be performed in a number of layers whose parameters can be optimized for best performance. [0015] This framework can be applied to a number of model-based methods. In particular, it can be applied to non-negative matrix factorization (NMF) to obtain a novel non-negative neural network architecture, that can be trained with a multiplicative back-propagation-style update procedure. The method can also be applied to loopy belief propagation (BP) for Markov random fields, or variational inference procedures for intractable generative models.”).
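For illustration only, the following minimal Python sketch shows one reading of the two claim 1 limitations mapped above: a non-negative output vector that minimizes the distance to a layer's input vector, and a current-layer input vector obtained as a linear transformation of that non-negative output. The function names, the weight values, and the choice of squared Euclidean distance are assumptions of this sketch; none of them is taken from Liu, Hershey, or the application.

```python
# Illustrative sketch only; all names and numbers are hypothetical.

def nonneg_closest(x):
    """Return the non-negative vector u that minimizes sum((u_i - x_i)^2).

    The problem separates per coordinate, so the minimizer is max(x_i, 0).
    """
    return [max(xi, 0.0) for xi in x]


def linear_transform(W, u):
    """Compute the matrix-vector product W u for a list-of-rows matrix W."""
    return [sum(w_ij * u_j for w_ij, u_j in zip(row, u)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


x_prev = [1.5, -2.0, 0.25]            # input vector to the previous layer
u_prev = nonneg_closest(x_prev)       # non-negative output of the previous layer
W = [[1.0, 0.0, 2.0],
     [-1.0, 3.0, 0.5]]                # hypothetical weights of the current layer
x_curr = linear_transform(W, u_prev)  # input vector of the current layer

print(u_prev)
print(x_curr)
```

Under these assumptions the non-negative minimizer coincides with an element-wise rectifier, which is why the sketch needs no iterative solver for this step.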

Regarding claim 2, 
The Liu/Hershey Combination teaches the system of claim 1, wherein, to solve the double layer optimization, the processor is configured to (Liu [0050] recites “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine (DNSVM) performed by one or more computing devices having a processor and a memory is described.”)
transform the double layer optimization into a single layer optimization problem by adding an objective function in the second layer into an objective function in the first layer and merging constraints in the second layer with constraints in the first layer (Liu [0032]-[0034] recites, in part, “The frame-level training can be used when a multi-class SVM is used for one or more layers in the DNSVM model… Herein, let 
[equation image: media_image6.png]
 as the feature space derived from the DNN, the parameters of the last layer are first estimated using the multi-class SVM training algorithm: 
[equation image: media_image7.png]
 s.t. for every training frame t=1, . . . , T, [0033] for every competing state 
[equation image: media_image8.png]
: 
[equation image: media_image9.png]
 where εt ≥ 0 is the slack variable which penalizes the data points that violate the margin requirement. Note that the objective function is essentially the same as the binary SVM. The only difference comes from the constraints, which basically say that the score of the correct state label, 
[equation image: media_image10.png]
, has to be greater than the scores of any other states, 
[equation image: media_image11.png]
, by a margin determined by the loss… [0034] Thus, substituting the slack variable εt from the constraints into the objective function, equation (5) can be reformulated as the minimization of 
[equation image: media_image12.png]
” Objective function of frame-level training for a multi-class SVM for 1 or more layers and score constraints (i.e. adding objective function from layers and combining constraints). Reformulation using slack variables (i.e. transform optimization)); and 
solving the single layer optimization problem by an alternating optimization (AO) (Liu [0050] and [0051] recites “At step 420, initial values for parameters of one or more previous layers within the DNSVM are determined and fixed. At step 430, a top layer of the DNSVM is trained while keeping the initial values fixed using a maximum-margin objective function to find a solution. The top layer can be a support vector machine. The top layer could be multi-class, structured, or another type of support vector machine. [0051] At step 440, initial values are assigned to the top layer parameters according to the solution and fixed. At step 450, the previous layers of the DNSVM are trained while keeping the initial values of the top layer parameters fixed. The training uses the maximum-margin objective function of step 430 to generate updated values for parameters of the one or more previous layers.” Previous layers within the DNSVM are determined and fixed while training the top layer using a maximum-margin objective function to find a solution (430) and training the previous layers (450) while the top layer parameters are fixed (i.e. alternating optimization)).
See motivation for claim 1 above.
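For illustration only, the alternating pattern relied on in the claim 2 mapping (update one group of parameters while the other is held fixed, then swap, and stop once the parameters change by less than a threshold) can be sketched in Python on a toy objective. The objective function, the variable names `prev` and `top`, and the tolerance are assumptions of this sketch; it is not Liu's DNSVM training procedure and not the applicant's alternating optimization.

```python
# Illustrative sketch only; the toy objective and names are hypothetical.

def alternating_optimization(prev=0.0, top=0.0, tol=1e-10, max_iter=1000):
    """Minimize f(prev, top) = (prev - 1)^2 + (top - 2)^2 + (prev - top)^2.

    Each pass solves exactly for one variable while the other is fixed,
    and the loop stops when neither variable moves by more than tol.
    """
    it = 0
    for it in range(1, max_iter + 1):
        old_prev, old_top = prev, top
        # Fix prev, minimize over top: 2(top - 2) - 2(prev - top) = 0.
        top = (2.0 + prev) / 2.0
        # Fix top, minimize over prev: 2(prev - 1) + 2(prev - top) = 0.
        prev = (1.0 + top) / 2.0
        if max(abs(prev - old_prev), abs(top - old_top)) < tol:
            break
    return prev, top, it


prev_opt, top_opt, n_iter = alternating_optimization()
print(prev_opt, top_opt, n_iter)
```

For this toy objective the joint minimizer is prev = 4/3 and top = 5/3, so the alternation can be checked against a known answer.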

Regarding claim 10, 
The Liu/Hershey Combination teaches the system of claim 1, further comprising: an application interface to perform a computer-based application using the neural network (Liu [0026] recites “The DNSVM models generated by generator 120 may be deployed on a user device such as device 104 or 102, a server, or other computer system. DNSVM model generator 120 and its components 122, 124, 126, and 128 may be embodied as a set of compiled computer instructions or functions, program modules, computer software services, or an arrangement of processes carried out on one or more computer systems, such as computing device 500, described in connection to FIG. 5, for example. DNSVM model generator 120, components 122, 124, 126, and 128, functions performed by these components, or services carried out by these components may be implemented at appropriate abstraction layer(s) such as the operating system layer, application layer, hardware layer, etc., of the computing system(s). Alternatively, or in addition, the functionality of these components, generator 120, and/or the embodiments of technology described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.” DNSVM with embodiment of computer software services deployed on computing devices or systems implemented on application layer (i.e. application interface to perform computer-based application using the neural network)).
See motivation for claim 1 above.

Regarding claims 11-12,
Claims 11-12 are directed to a method whose steps are substantially identical to the limitations recited in claims 1-2. Therefore, the rejections of claims 1-2 apply equally here.
In addition, Liu discloses the additional limitation of stored instructions (Liu [0054] recites “The technology described herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.” Computer-executable instructions executed by a computer (i.e. processor coupled with stored instructions)).

Regarding claim 20,
Liu discloses a non-transitory computer readable storage medium embodied thereon a program executable by a processor for performing a method, the method comprising (Liu [0050] and [0056] recites “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine (DNSVM) performed by one or more computing devices having a processor and a memory is described. [0056] Computing device 500 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 500 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.” Non-volatile computer storage media and program modules (i.e. non-transitory computer readable storage medium and executable program)): 
receiving ground truth labels of input to the neural network (Liu [0004], [0020-0021] and [0050] recites “The new DNN is described herein as a deep neural support vector machine (DNSVM). [0021] The data provided by data source(s) 108 may include labeled and un-labeled data, such as transcribed and un-transcribed data. [0050] The method comprises receiving a corpus of training material at step 410. The corpus of training material can comprise one or more labeled acoustic features.” Labeled data such as labeled acoustic features and DNSVM (i.e. ground truth labels and neural network));
wherein the input vector of a current layer is a linear transformation of the non-negative output vector of the previous layer (Liu [0031] recites “Each node in the hidden layer 312 performs a calculation to generate an output that is then fed into each node in the second hidden layer 314. The different nodes may give different weight to different inputs resulting in a different output” Calculations in current hidden layer nodes to next hidden layer nodes (i.e. linear transformation of previous layer as input to current layer)); and 
outputting the parameters of the neural network (Liu fig. 4 and [0051] recites “In one aspect, steps 420-450 are repeated iteratively 470 to retrain the top layer and the previous layers until parameters change less than a threshold between iterations. When the parameters change less than the threshold, then the training stops and the DNSVM model is saved at step 480.” Saving the DNSVM model (i.e. outputting the parameters of the neural network)).
However, Liu does not explicitly disclose solving a double layer optimization problem minimizing a difference between an output of the neural network processing the input to the neural network and the ground truth labels of the input to the neural network to produce parameters of the neural network, wherein the minimizing the difference is the first layer of the double layer optimization problem that is subject to minimizing a distance between a non-negative output vector of each layer and a corresponding input vector to each layer forming the second layer of the double layer optimization problem.
Hershey teaches solving a double layer optimization problem minimizing a difference between an output of the neural network processing the input to the neural network and the ground truth labels of the input to the neural network to produce parameters of the neural network (Hershey [0060] and [0061] recites “In a similar way, we can discriminatively train NMF bases for source separation. The following optimization problem for training bases is called discriminative NMF (DNMF): 

[equation image: media_image13.png]

[0061] For example, in speech de-noising, we focus on reconstructing the speech signal from a noisy mixture… The second part in equation (11) ensures that Ĥ are the activations that arise from the test-time inference Objective… Nonetheless, the above is a difficult bi-level optimization problem, because the bases W occur in both levels.” Bi-level optimization and second part in equation ensuring activations during test time (i.e. solving a double layer optimization and minimizing a difference between output and input)), 
wherein the minimizing the difference is the first layer of the double layer optimization problem that is subject to minimizing a distance between a non-negative output vector of each layer and a corresponding input vector to each layer forming the second layer of the double layer optimization problem (Hershey [0060] and [0061] recite “In a similar way, we can discriminatively train NMF bases for source separation. [0061] For example, in speech de-noising, we focus on reconstructing the speech signal from a noisy mixture. The first part in equation (10) minimizes the reconstruction error given Ĥ… Nonetheless, the above is a difficult bi-level optimization problem, because the bases W occur in both levels.” Non-negative Matrix Factorization (NMF) and first part of equation minimizes the reconstruction (i.e. Non-negative output and minimizes output vector)).
Hershey and Liu are both directed to machine learning and deep neural networks. In view of the teachings of Hershey, it would have been obvious to one of ordinary skill in the art to apply the teachings of Hershey to Liu before the effective filing date of the claimed invention in order to combine the advantages of a neural network with the internal structure of a model-based approach for optimized performance (cf. Hershey [0014]-[0015] recites “The resulting method combines the expressive power of the neural network with the internal structure of the model-based approach, while allowing inference to be performed in a number of layers whose parameters can be optimized for best performance. [0015] This framework can be applied to a number of model-based methods. In particular, it can be applied to non-negative matrix factorization (NMF) to obtain a novel non-negative neural network architecture, that can be trained with a multiplicative back-propagation-style update procedure. The method can also be applied to loopy belief propagation (BP) for Markov random fields, or variational inference procedures for intractable generative models.”).

Claims 3-4 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Hershey and in further view of He et al. (US 20130091081 A1, hereinafter He).

Regarding claim 3, 
The Liu/Hershey Combination teaches the system of claim 1, wherein, to solve the double layer optimization, the processor is configured to (Liu [0050] recites “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine (DNSVM) performed by one or more computing devices having a processor and a memory is described.”)
transform the double layer optimization into a single layer optimization by adding an objective function in the second layer into an objective function in the first layer and merging constraints in the second layer with constraints in the first layer (Liu [0032]-[0034] recites, in part, “The frame-level training can be used when a multi-class SVM is used for one or more layers in the DNSVM model… Herein, let 
[equation image: media_image6.png]
 as the feature space derived from the DNN, the parameters of the last layer are first estimated using the multi-class SVM training algorithm: 
[equation image: media_image7.png]
 s.t. for every training frame t=1, . . . , T, [0033] for every competing state 
[equation image: media_image8.png]
: 
[equation image: media_image9.png]
 where εt ≥ 0 is the slack variable which penalizes the data points that violate the margin requirement. Note that the objective function is essentially the same as the binary SVM. The only difference comes from the constraints, which basically say that the score of the correct state label, 
[equation image: media_image10.png]
, has to be greater than the scores of any other states, 
[equation image: media_image11.png]
, by a margin determined by the loss… [0034] Thus, substituting the slack variable εt from the constraints into the objective function, equation (5) can be reformulated as the minimization of 
[equation image: media_image12.png]
” Objective function of frame-level training for a multi-class SVM for 1 or more layers and score constraints (i.e. adding objective function from layers and combining constraints). Reformulation using slack variables (i.e. transform optimization)); 
However, The Liu/Hershey Combination does not teach perform a variable replacement in the single layer optimization problem to produce a Tikhonov regularized problem with a regularization term including a matrix representing an architecture and the parameters of the neural network; and solve the Tikhonov regularized problem with a block coordinate descent.
He teaches perform a variable replacement in the single layer optimization problem to produce a Tikhonov regularized problem with a regularization term including a matrix representing an architecture and the parameters of the neural network (He [0025], [0036-0037] and [0040] recites “By introducing a generalized Tikhonov regularization, a method according to the present disclosure enforces the interaction of latent factors to have an influence on learning latent factors and basis vectors. [0036] From this point on in this disclosure, we consider a special case of the learning problem in Eq. (8) when x follows a multivariate normal distribution and s follows a sparse Gaussian graphical model (SGGM)…  Then the objective function in Eq. (8) becomes 
[equation image: media_image14.png]
 [0037] If Φ is fixed, the problem in Eq. (12) is a matrix factorization method with generalized Tikhonov regularization: trace(S^T Φ S). [0040] The hyper-parameter ρ controls the sparsity of Φ. A large ρ will result in a diagonal precision matrix Φ, indicating that the latent factors are conditionally independent… Therefore, this regularization term makes SLFA produce a collaborative reconstruction based on the conditional dependencies between latent factors.” Special case of the learning problem in equation 8 when x follows a multivariate normal distribution and s follows a sparse SGGM and Φ is fixed, matrix factorization, and hyper-parameter ρ controlling the precision matrix (i.e. Tikhonov regularization from variable replacement, involves a matrix, matrix representing architecture and parameters)); and 
solve the Tikhonov regularized problem with a block coordinate descent (He [0033] recites “The objective function in the above equation (8) is not convex with respect to all three unknowns (B, S and θ) together. Therefore, a good algorithm in general exhibits convergence behavior to a stationary point and we can use Block Coordinate Descent algorithm to iteratively update B, S and θ as follows: 
[Image: media_image15.png]
Block coordinate descent algorithm to solve (i.e. block coordinate descent)”).
He and The Liu/Hershey Combination are both directed to machine learning. In view of the teachings of He, it would have been obvious to one of ordinary skill in the art to apply the teachings of He to Liu as modified by Hershey before the effective filing date of the claimed invention in order to learn higher quality similarity functions with faster operation, higher performance, and theoretical guarantees (cf. He [0006] recites “As will become apparent to those skilled in the art, a method according to the present disclosure: 1) advantageously learns higher quality similarity functions and kernels that facilitate higher performance; 2) allows for easy incorporation of past and future advances in binary classification techniques, including, but not limited to, stochastic gradient descent, sparse learning, semi-supervised learning and transfer learning; 3) has faster operation than known methods and scales to large-scale data by taking advantage in large-scale classification; 4) is simpler and easier to use than known methods; and 5) has theoretical guarantees.”).
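For illustration of the technique named in this limitation, the following is a minimal sketch of block coordinate descent applied to a Tikhonov regularized factorization problem. It is not taken from He: it assumes a rank-1 factorization X ≈ b sᵀ and replaces He's generalized regularizer trace(S^T ΦS) with a plain ridge penalty (Φ taken as a scaled identity). The names `objective`, `block_coordinate_descent`, `lam`, and `iters` are hypothetical.

```python
def objective(X, b, s, lam):
    """Squared reconstruction error plus Tikhonov (ridge) penalties on both blocks."""
    fit = sum((X[i][j] - b[i] * s[j]) ** 2
              for i in range(len(b)) for j in range(len(s)))
    return fit + lam * (sum(v * v for v in b) + sum(v * v for v in s))


def block_coordinate_descent(X, lam=0.1, iters=20):
    """Alternate exact minimization over the blocks s and b (assumed rank-1 model)."""
    m, n = len(X), len(X[0])
    b = [1.0] * m
    s = [1.0] * n
    history = [objective(X, b, s, lam)]
    for _ in range(iters):
        # Block 1: with b fixed, each s[j] has a closed-form ridge solution.
        bb = sum(v * v for v in b)
        s = [sum(b[i] * X[i][j] for i in range(m)) / (bb + lam) for j in range(n)]
        # Block 2: with s fixed, each b[i] has a closed-form ridge solution.
        ss = sum(v * v for v in s)
        b = [sum(s[j] * X[i][j] for j in range(n)) / (ss + lam) for i in range(m)]
        history.append(objective(X, b, s, lam))
    return b, s, history


if __name__ == "__main__":
    X = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
    b, s, history = block_coordinate_descent(X)
    print(history[0], history[-1])
```

Because each block update is an exact minimizer with the other block held fixed, the recorded objective values cannot increase from one sweep to the next, which is the convergence behavior toward a stationary point that the quoted passage refers to.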



[Image: media_image16.png]


Regarding claim 4, 
The Liu/Hershey/He Combination teaches the system of claim 3, wherein the input interface receives architecture constraints indicative of the architecture of the neural network (Hershey fig. 1 element 101, [0025] and [0031] recite, in part, “A model 101 with constraints and model parameters is used to derive an iterative inference procedure 102, which makes use of corresponding procedure parameters, as well as the model parameters. The iterative inference procedure and parameters are unfolded 110. The unfolding converts each iteration of the iterative inference procedure into a layer-wise structure analogous to a neural network with a set of layers L.sub.k 111, k=0, . . . , K where K is a number of iterations in the iterative inference procedure, and a set of network parameters θ 112. [0031] The steps can be performed in one or more processors connected to memory and input/output interfaces as known in the art.” Model constraints and model parameters (101) used to derive structure and set of network parameters (i.e. receive architecture constraints indicative of the neural network architecture)), 
wherein the processor solves the Tikhonov regularized problem subject to the architecture constraints (Liu [0050] recites “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine (DNSVM) performed by one or more computing devices having a processor and a memory is described.” Additionally, He [0025] recites “By introducing a generalized Tikhonov regularization, a method according to the present disclosure enforces the interaction of latent factors to have an influence on learning latent factors and basis vectors.” Hershey [0045] recites “Rather than considering the iterations as a procedure, we unfold 110 the procedure 102 as a sequence of layers 111 in a neural network-like architecture, where the iteration index is now interpreted as an index to the neural network layer. The intermediate variables φ.sup.1, . . . , φ.sup.K are the nodes of layers 1 to K and equation (3) determines the transformation and activation function between the layers.” Step performed by processor as known in the art which is subject to the constraints (101) and equation determines the transformation and activation function between layers (i.e. processor solving a problem subject to architecture constraints)).
See motivation for claim 3 above.
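The unfolding described in the Hershey passages quoted above can be pictured with a minimal sketch. It is not Hershey's procedure: the update rule (a gradient step on 0.5 * (x - target)**2) and the names `iterate`, `unfold`, `forward`, and `step_sizes` are assumptions chosen only to show how an iteration index becomes a layer index with per-layer parameters.

```python
def iterate(x0, target, step, num_iters):
    """Plain iterative procedure: repeated gradient steps on 0.5 * (x - target)**2."""
    x = x0
    for _ in range(num_iters):
        x = x - step * (x - target)
    return x


def unfold(step_sizes):
    """Turn K iterations into K layers, each holding its own step size theta_k."""
    def make_layer(theta_k):
        return lambda x, target: x - theta_k * (x - target)
    return [make_layer(theta_k) for theta_k in step_sizes]


def forward(layers, x0, target):
    """Run the unfolded network: layer k plays the role of iteration k."""
    x = x0
    for layer in layers:
        x = layer(x, target)
    return x


if __name__ == "__main__":
    tied = forward(unfold([0.5] * 4), 0.0, 1.0)
    print(tied, iterate(0.0, 1.0, 0.5, 4))
```

With all step sizes tied to one value, the unfolded network reproduces the original iteration; untying them gives each layer its own trainable parameter.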

Regarding claims 13-14,
Claims 13-14 are directed to a method performed in a manner substantially identical to those recited in claims 3-4. Therefore, the rejections to claims 3-4 apply equally here.
In addition, Liu discloses the additional limitation of stored instructions (Liu [0054] recites “The technology described herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.” Computer-executable instructions executed by a computer (i.e. processor coupled with stored instructions)).


Claims 5-6 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Hershey and in further view of He and in further view of Simard (US 10685285 B2, hereinafter Simard).


Regarding claim 5, 
The Liu/Hershey/He Combination teaches the system of claim 3, the processor, the regularization term (Liu [0050] recites “Turning now to FIG. 4, a method 400 for training a deep neural support vector machine (DNSVM) performed by one or more computing devices having a processor and a memory is described.” Additionally, He [0040] recites “The hyper-parameter ρ controls the sparsity of Φ. A large ρ will result in a diagonal precision matrix Φ, indicating that the latent factors are conditionally independent… Therefore, this regularization term makes SLFA produce a collaborative reconstruction based on the conditional dependencies between latent factors.”) and 
solves the Tikhonov regularized problem subject to connectivity and symmetry constraints on blocks of the matrix (He [0025] recites “Accordingly, one contribution of this disclosure is a general LFM method that models the pairwise relationships between latent factors by sparse graphical models. By introducing a generalized Tikhonov regularization, a method according to the present disclosure enforces the interaction of latent factors to have an influence on learning latent factors and basis vectors. As a result, a method according to the present disclosure will learn meaningful latent factors and simultaneously obtain a graph where the nodes represent hidden groups and the edges represent their pairwise relationships.” Additionally, He [0027] recites the following: 

[Image: media_image17.png]

Pairwise interactions/relationships, basis vectors of the basis matrix, and symmetric θ used to model (i.e. connectivity, blocks of the matrix, and symmetry)).
	However, The Liu/Hershey/He Combination does not explicitly teach wherein the processor initializes the regularization term with a number of layers of the neural network.
	Simard teaches wherein the processor initializes the regularization term with a number of layers of the neural network (Simard Pg. 28, Col. 16, Ln. 25-35 recites “Linear functions are very easy to learn and far more plausible and useful than the identity. So we start from the following regularizer: 
[Image: media_image18.png]
 at every layer. If r≠0, this deep network will regularize to a linear network.” Start with regularizer at every layer (i.e. initialize regularization term with a number of layers)).
Simard and The Liu/Hershey/He Combination are both directed to machine learning. In view of the teachings of Simard, it would have been obvious to one of ordinary skill in the art to apply the teachings of Simard to Liu as modified by Hershey and further modified by He before the effective filing date of the claimed invention in order to use linear functions which are easier to learn for optimizing a deep neural network (cf. Simard Pg. 29, Col. 17, Ln. 56-64 recites “Starting from this initialization, the Mirror DNN starts with the performance of the residual DNN since they compute the same function. Learning is started with the linear regularizer. If the linear regularizer is small, the pull toward a linear function will not have a negative effect on the performance. However, it will bring the learning dynamic closer to a linear dynamic, which is far easier to optimize. Network performance can improve as a result of superior optimization.”).

Regarding claim 6, 
The Liu/Hershey/He/Simard Combination teaches the system of claim 5, wherein the input interface receives the number of layers of the neural network (Liu [0025] recites “In an embodiment, storage 106 stores data from one or more data source(s) 108, one or more DNSVM models, information for generating and training DNSVM models, and the computer-usable information outputted by one or more DNSVM models. As shown in FIG. 1, storage 106 includes DNSVM models 107 and 109. Additional details and examples of DNSVM models are described in connection to FIGS. 2-5. Although depicted as a single data store component for the sake of clarity, storage 106 may be embodied as one or more information stores, including memory on user device 102 or 104, DNSVM model generator 120, or in the cloud.” Client/user device that includes DNSVM models or information for generating DNSVM model (i.e. interface received neural network which has layers)).
See motivation for claim 5 above.

Regarding claims 15-16,
Claims 15-16 are directed to a method performed in a manner substantially identical to those recited in claims 5-6. Therefore, the rejections to claims 5-6 apply equally here.
In addition, Liu discloses the additional limitation of stored instructions (Liu [0054] recites “The technology described herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.” Computer-executable instructions executed by a computer (i.e. processor coupled with stored instructions)).


Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Hershey and in further view of Bishop (“Training with Noise is Equivalent to Tikhonov Regularization”).


Regarding claim 7, 
The Liu/Hershey Combination teaches the system of claim 1, nonempty closed convex sets, and a convex loss function (Liu [0016] and [0034]-[0035] recites “The support vector machine (SVM) has several prominent features. First, it has been shown that maximizing the margin is equivalent to minimizing an upper bound on the generalization error. Second, the optimization problem of SVM is convex, which is guaranteed to have a global optimal solution. [0034] Thus, substituting the slack variable εt from the constraints into the objective function, equation (5) can be reformulated as the minimization of: 
[Image: media_image19.png]
 [0035] where 
[Image: media_image20.png]
 are the parameter vectors for each state and 
[Image: media_image21.png]
 is the hinge function. Note the maximum of a set of linear functions is convex, thus equation (7) is convex with respect to w.” Deep Neural SVM (DNSVM), Hinge function and set of linear functions as in equation 7 (i.e. system, convex loss function, and convex sets)).
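The convexity statement in the Liu passage quoted above (the maximum of a set of linear functions is convex) can be checked numerically with a minimal sketch. The definition `hinge(x) = max(0, x)` is an assumption about the form of Liu's hinge function, and the helper names and the linear pieces used below are hypothetical.

```python
def hinge(x):
    """Assumed hinge function: max(0, x)."""
    return max(0.0, x)


def max_of_linear(w, pieces):
    """Pointwise maximum of linear functions a * w + b, which is convex in w."""
    return max(a * w + b for a, b in pieces)


def is_convex_on_grid(f, points, weights=(0.25, 0.5, 0.75), tol=1e-12):
    """Check f(t*u + (1-t)*v) <= t*f(u) + (1-t)*f(v) for all grid pairs."""
    for u in points:
        for v in points:
            for t in weights:
                if f(t * u + (1 - t) * v) > t * f(u) + (1 - t) * f(v) + tol:
                    return False
    return True


if __name__ == "__main__":
    grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
    pieces = [(1.0, 0.0), (-2.0, 1.0), (0.5, -1.0)]
    print(is_convex_on_grid(hinge, grid))
    print(is_convex_on_grid(lambda w: max_of_linear(w, pieces), grid))
```

A grid check of this kind can only refute convexity, not prove it; it is offered solely to make the quoted property concrete.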
 However, The Liu/Hershey Combination does not explicitly teach wherein the double layer optimization is

[Image: media_image22.png]
 wherein 
[Image: media_image23.png]
 is i-th training data, 
[Image: media_image24.png]
as the output vector for xi from the n-th (1 ≤ n ≤ N) hidden layer in the neural network), 
[Image: media_image25.png]
 as a weight matrix between the n-th and m-th hidden layers, Mn as an index set for the n-th hidden layer, 
[Image: media_image26.png]
 as the weight matrix between the last hidden layer and an output layer of the neural network, U, V, W as nonempty closed convex sets, 
[Image: media_image27.png]
 as a convex loss function.
Bishop teaches wherein the double layer optimization is

[Image: media_image22.png]
(Bishop Pg. 108-109, Section 1 Regularization recites “A common choice of error function is the sum-of-squares error… Substituting 1.3 into 1.1 gives the sum-of-squares error in the form
[Image: media_image28.png]
” Sum-of-squares error in form 1.4 (i.e. double layer optimization))
wherein 
[Image: media_image23.png]
 is i-th training data, 
[Image: media_image24.png]
as the output vector for xi from the n-th (1 ≤ n ≤ N) hidden layer in the neural network (Bishop Pg. 108, Section 1 Regularization recites “A feedforward neural network can be regarded as a parameterized nonlinear mapping from a d-dimensional input vector x = (x1,..., xd) into a c-dimensional output vector y = (y1,..., yc). Supervised training of the network involves minimization, with respect to the network parameters, of an error function, defined in terms of a set of input vectors x and corresponding desired (or target) output vectors t.” D-dimensional input vector x with element x1, c-dimensional output vector y with element y1 and target output vectors t (i.e. xi i-th training data, ui,n output vector from hidden layer, y label from set of labels)), 

[Image: media_image25.png]
 as a weight matrix between the n-th and m-th hidden layers, Mn as an index set for the n-th hidden layer, 
[Image: media_image26.png]
 as the weight matrix between the last hidden layer and an output layer of the neural network (Bishop Pg. 112-113, Section 2 Training with Noise, in part, recites “However, the total regularized error in each case is minimized by the same network function y(x) (and hence by the same set of network weight values).” Same network function y(x) where x is a d-dimensional input vector and y is a c-dimensional output vector with same set of network weights (i.e. weight matrix, W and V, for the layers of the neural network)). 
U, V, W, 
[Image: media_image27.png]
 (Bishop Pg. 108, Section 1 Regularization, in part, recites “Supervised training of the network involves minimization, with respect to the network parameters, of an error function, defined in terms of a set of input vectors x and corresponding desired (or target) output vectors t.” Additionally, Bishop Pg. 112-113, Section 2 Training with Noise, in part, recites “However, the total regularized error in each case is minimized by the same network function y(x) (and hence by the same set of network weight values).” Target output vectors t, network weights, and error function (i.e. U, V, W, and ℓ)).
Bishop and The Liu/Hershey Combination are both directed to machine learning. In view of the teachings of Bishop, it would have been obvious to one of ordinary skill in the art to apply the teachings of Bishop to Liu as modified by Hershey before the effective filing date of the claimed invention in order to provide a practical alternative to training with noise (cf. Bishop Pg. 108 Abstract, in part, recites “In this paper we show that for the purposes of network training, the regularization term can be reduced to a positive semi-definite form that involves only first derivatives of the network mapping. For a sum-of-squares error function, the regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized error function provides a practical alternative to training with noise.”).
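The relationship between training with noise and a Tikhonov regularizer that the Bishop abstract refers to can be seen in a minimal sketch. This is not Bishop's derivation: it assumes a scalar linear model y = w * x, a sum-of-squares error, and input noise that takes the values +sigma and -sigma with equal probability, so that the expectation is computed exactly. In that simplified setting the expected noisy error equals the noise-free error plus the penalty N * sigma**2 * w**2. All function names are hypothetical.

```python
def sum_of_squares(w, data):
    """Noise-free sum-of-squares error for the assumed linear model y = w * x."""
    return sum((w * x - t) ** 2 for x, t in data)


def noisy_expected_error(w, data, sigma):
    """Exact expected error when each input is perturbed by +sigma or -sigma (prob. 1/2 each)."""
    return sum(0.5 * ((w * (x + sigma) - t) ** 2 + (w * (x - sigma) - t) ** 2)
               for x, t in data)


def tikhonov_error(w, data, sigma):
    """Noise-free error plus the Tikhonov (ridge) term N * sigma**2 * w**2."""
    return sum_of_squares(w, data) + len(data) * sigma ** 2 * w ** 2


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 3.5), (-1.0, -1.5)]
    for w in (-1.0, 0.0, 0.5, 2.0):
        print(w, noisy_expected_error(w, data, 0.3), tikhonov_error(w, data, 0.3))
```

For the linear case the two quantities agree exactly for every w, so minimizing the regularized error directly and minimizing the expected noisy error select the same weight.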


Regarding claim 17,
Claim 17 is directed to a method performed in a manner substantially identical to that recited in claim 7. Therefore, the rejection of claim 7 applies equally here.
In addition, Liu discloses the additional limitation of stored instructions (Liu [0054] recites “The technology described herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.” Computer-executable instructions executed by a computer (i.e. processor coupled with stored instructions)).








Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Najibi et al. (U.S. 20180285682 A1) teaches an artificial intelligence framework and addresses the vanishing gradient descent problem with a loss function related to computer vision and convolutional neural networks.
Deng et al. (U.S. 20120254086 A1) teaches deep neural network training and convex optimization.
Bouchard et al. (US 20140156579 A1) teaches collaborative filtering or multi-view learning where relationships are represented by a matrix and matrix factorization.
Arnold et al. (US 20190150764 A1) teaches a bi-input convolutional neural network framework to estimate parameter values.
Su et al. ("A Systematic Evaluation of the Bag-of-Frames Representation for Music Information Retrieval", 2014, IEEE Transactions on Multimedia ( Volume: 16, Issue: 5, Aug. 2014), Pages 1188-1200) teaches a dual layer feature learning framework with bag-of-frames from complex, high-dimensional data.
Zhang et al. ("Efficient Training of Very Deep Neural Networks for Supervised Hashing", 21 April 2016, arXiv:1511.04524v2) teaches deep neural network supervised training using alternating direction method of multipliers (ADMM) and an auxiliary variable.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEON W CHEUNG whose telephone number is (571) 272-9930.  The examiner can normally be reached on 8:30AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571) 270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/LWC/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124