Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-15 are pending.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 13 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 13 recites the limitation ‘performing the following operations for …’. The claim does not specify what the ‘following operations’ are.
For purposes of examination, the claim is interpreted as: repeatedly performing the training operations recited in claim 1 for each sampled dataset.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 5-6, 9, 12, 14-15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) and further in view of Wang (US 20190244103 A1).

Regarding claim 1, Li teaches a computer-implemented method for training an optimizer neural network having optimizer parameters, wherein the optimizer neural network is configured to generate an output that defines updated values of target parameters of a target neural network during training of the target neural network to perform one or more target neural network tasks ([Li, 0019] “In one embodiment, a method is provided. The method includes operations for training a teacher model based on teacher speech data and for initializing a student model with parameters obtained from the trained teacher model”, the teacher model corresponds to the optimizer neural network, the student model corresponds to the target neural network. The target parameter means any parameter of the target neural network), 
wherein the optimizer neural network is associated with an outer loss function that measures how well the optimizer neural network generates updated values of target parameters for the target neural network ([Li, 0019] “Training the student model with adversarial teacher-student learning further includes minimizing a teacher-student loss that measures a divergence of outputs between the teacher model and the student model; minimizing a classifier condition loss with respect to parameters of a condition classifier, the classifier condition loss measuring errors caused by acoustic condition classification;”), and 
wherein the method comprises: 
repeatedly performing the following operations to determine trained values of the optimizer parameters ([Li, 0050] “Models may be run against a training dataset for several epochs (e.g., iterations), in which the training dataset is repeatedly fed into the model to refine its results”, teaches that the training process repeats for each training value): 
training an instance of target neural network to perform a neural network task associated with the instance of target neural network by updating values of parameters of the instance of target neural network ([Li, 0021] “In yet another embodiment, a machine-readable storage medium includes instructions that, when executed by a machine, cause the machine to perform operations comprising: training a teacher model based on teacher speech data; initializing a student model with parameters obtained from the trained teacher model; training the training the student model with adversarial teacher-student learning based on the teacher speech data and student speech data, and recognizing speech with the trained student model. Training the student model with adversarial teacher-student learning further includes minimizing a teacher-student loss that measures a divergence of outputs between the teacher model and the student model; minimizing a classifier condition loss with respect to parameters of a condition classifier; and maximizing the classifier condition loss with respect to parameters of a feature extractor”), 
evaluating a performance of the trained instance of target neural network on the neural network task associated with the instance of target neural network to determine one or more performance metrics for the trained instance of the target neural network on the neural network task ([Li, 0053] “Once the learning phase is complete, the models are finalized. In some example embodiments, models that are finalized are evaluated against testing criteria. In a first example, a testing dataset that includes known outputs for its inputs is fed into the finalized models to determine an accuracy of the model in handling data that is has not been trained on. In a second example, a false positive rate or false negative rate may be used to evaluate the models after finalization. In a third example, a delineation between data clusterings is used to select a model that produces the clearest bounds for its clusters of data”, the accuracy of the model corresponds to the performance metric of the neural network), 
determining, using the one or more performance metrics for the trained instances of target neural network, a gradient estimate of the outer loss function associated with the optimizer neural network ([Li, 0106] “In some example embodiments, stochastic gradient descent (SGD) is used to optimize $L_{\text{total}}(\theta_f, \theta_y, \theta_C)$. SGD, also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization and iterative method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration”), and 
Li does not specifically teach initializing current values of the optimizer parameters; adjusting the current values of the optimizer parameters of the optimizer neural network based on the gradient estimate of the outer loss function associated with the optimizer neural network, generating perturbed values of the parameters by applying a perturbation to the current values of the parameters, training a network in accordance with the perturbed values of the parameters. 
Andrychowicz teaches initializing current values of the optimizer parameters ([Andrychowicz, page 5, 3.1 Quadratic functions, line 11-13] “the solid curve shows the learned optimizer performance and dashed curves show the performance of the standard baseline optimizers. It is clear the learned optimizers substantially outperform the baselines in this setting”, teaches the existence of the baseline optimizer, and comparing trained optimizer and the baseline optimizer); 
adjusting the current values of the optimizer parameters of the optimizer neural network based on the gradient estimate of the outer loss function associated with the optimizer neural network ([Andrychowicz, page 5, 3 Experiments] “In all experiments the trained optimizers use two-layer LSTMs with 20 hidden units in each layer. Each optimizer is trained by minimizing Equation 3 using truncated BPTT (Backpropagation Through Time) as described in Section 2. The minimization is performed using ADAM with a learning rate chosen by random search”, Equation 3 involves using the gradient of $L(\phi)$, which corresponds to using the outer loss function to update the optimizer parameters $\phi$. 
[Andrychowicz, page 3, 2 Learning to learn with recurrent neural networks, line 1-15] “In this work we consider directly parameterizing the optimizer. As a result, in a slight abuse of notation we will write the final optimizee parameters $\theta^*(f, \phi)$ as a function of the optimizer parameters $\phi$ and the function in question. We can then ask the question: What does it mean for an optimizer to be good? Given a distribution of functions f we will write the expected loss as $L(\phi) = E_f[f(\theta^*(f, \phi))]$. As noted earlier, we will take the update steps $g_t$ to be the output of a recurrent neural network m, parameterized by $\phi$, whose state we will denote explicitly with $h_t$. Next, while the objective function in (2) depends only on the final parameter value, for training the optimizer it will be convenient to have an objective that depends on the entire trajectory of optimization, for some horizon T, $L(\phi) = E_f[\sum_{t=1}^{T} \omega_t f(\theta_t)]$ where $\theta_{t+1} = \theta_t + g_t$ (3) … We can minimize the value of $L(\phi)$ using gradient descent on $\phi$. The gradient estimate $\partial L(\phi) / \partial \phi$ can be computed by sampling a random function f and applying backpropagation to the computational graph in Figure 2. We allow gradients to flow along the solid edges in the graph, but gradients along the dashed edges are dropped …”, teaches Equation 3, which defines $L(\phi)$).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Li and Andrychowicz, to adjust the optimizer parameters according to the loss function of Andrychowicz in implementing the optimizer neural network system of Li. The suggestion and/or motivation for doing so is to improve the accuracy of the model, because the optimizer adjusts the parameters of the model and adjusting the optimizer affects the performance of the optimizer ([Andrychowicz, 3 Experiments, 2nd paragraph]).
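For illustration only, the trajectory objective of Equation 3 quoted above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and is not the method of Andrychowicz or of the claims: the optimizee is a hypothetical scalar quadratic f(θ) = (θ − target)², the learned optimizer is reduced to a single parameter φ that scales the gradient (a stand-in for the LSTM), the weights ω_t are all 1, and the gradient of L(φ) is estimated by central finite differences rather than by backpropagation through time. All function names are hypothetical.

```python
def trajectory_loss(phi, target, theta0=0.0, horizon=5):
    """L(phi) = sum over t = 1..T of f(theta_t), with f(theta) = (theta - target)^2."""
    theta = theta0
    total = 0.0
    for _ in range(horizon):
        grad = 2.0 * (theta - target)    # gradient of the optimizee loss f
        theta = theta + (-phi * grad)    # theta_{t+1} = theta_t + g_t (assumed form of g_t)
        total += (theta - target) ** 2   # accumulate f(theta_t) with weight 1
    return total


def outer_gradient(phi, target, eps=1e-5):
    """Central finite-difference estimate of dL/dphi (assumed stand-in for BPTT)."""
    upper = trajectory_loss(phi + eps, target)
    lower = trajectory_loss(phi - eps, target)
    return (upper - lower) / (2.0 * eps)


def train_optimizer(phi=0.05, targets=(1.0, -2.0, 3.0), steps=200, lr=0.002):
    """Gradient descent on phi, cycling through a small set of optimizee functions."""
    for step in range(steps):
        target = targets[step % len(targets)]    # pick the next function f
        phi -= lr * outer_gradient(phi, target)  # update the optimizer parameter
    return phi


if __name__ == "__main__":
    print(train_optimizer())
```

In this toy setting the outer loss is smallest when φ is near 0.5, since a step of that size reaches the minimum of the quadratic in one update.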
Li in view of Andrychowicz does not specifically teach generating perturbed values of the parameters by applying a perturbation to the current values of the parameters, training a network in accordance with the perturbed values of the parameters.
Wang teaches generating perturbed values of the parameters by applying a perturbation to the current values of the parameters, training a network in accordance with the perturbed values of the parameters ([Wang, 0121] “In one embodiment, a method of training a neural network model includes injection of randomness in the model weight space during optimization. In each iteration of the optimization, the gradients to be backpropagated are generated from the model slightly perturbed in the weight space. For example, the neural network model may be trained iteratively using models selected from a model distribution. At each step, a different model is sampled from the model distribution. The model distribution may be generated by sampling neural network parameters”).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Li, Andrychowicz, and Wang, to generate perturbed parameter values and train a network using the perturbed values, as taught by Wang, in implementing the optimizer neural network system of Li and Andrychowicz. The suggestion and/or motivation for doing so is to improve the accuracy of the model by simulating the effect of unexpected error ([Wang, 0015]).
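For illustration only, the kind of weight-space perturbation described in the quoted Wang passage can be sketched as follows. This is a minimal sketch under stated assumptions, not Wang's implementation: the model is a hypothetical two-parameter linear model, the perturbation is independent Gaussian noise with an assumed scale sigma, and at each step the gradient is computed at the perturbed parameters and then applied to the unperturbed ones. All names and values are hypothetical.

```python
import random


def perturb(params, sigma, rng):
    """Return a copy of params with independent Gaussian noise of scale sigma added."""
    return [p + sigma * rng.gauss(0.0, 1.0) for p in params]


def loss_gradient(params, data):
    """Gradient of the mean squared error of the linear model y = w * x + b."""
    w, b = params
    n = len(data)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return [grad_w, grad_b]


def train_with_perturbation(data, steps=500, lr=0.05, sigma=0.01, seed=0):
    """Gradient descent in which each gradient is taken at a perturbed model."""
    rng = random.Random(seed)
    params = [0.0, 0.0]
    for _ in range(steps):
        perturbed = perturb(params, sigma, rng)   # sample a slightly perturbed model
        grads = loss_gradient(perturbed, data)    # gradient evaluated at the perturbed model
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


if __name__ == "__main__":
    # Hypothetical data generated from y = 2x + 1.
    samples = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    print(train_with_perturbation(samples))
```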

Regarding claim 14, Li in view of Andrychowicz and further in view of Wang teaches a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of a method for training an optimizer neural network having optimizer parameters, wherein the optimizer neural network is configured to generate an output that defines updated values of target parameters of a target neural network during training of the target neural network to perform one or more target neural network tasks ([Li, 0019] “In one embodiment, a method is provided. The method includes operations for training a teacher model based on teacher speech data and for initializing a student model with parameters obtained from the trained teacher model”, the teacher model corresponds to the optimizer neural network, the student model corresponds to the target neural network. The target parameter means any parameter of the target neural network, [Li, 0021] “In yet another embodiment, a machine-readable storage medium includes instructions that, when executed by a machine, cause the machine to perform operations comprising: training a teacher model based on teacher speech data; … and maximizing the classifier condition loss with respect to parameters of a feature extractor”, teaches that the system comprises a computer and storage).

Claim 14 is a system claim reciting limitations similar to those of method claim 1. Therefore, it is rejected under the same rationale as claim 1.

Regarding claim 15, Li in view of Andrychowicz and further in view of Wang teaches one or more non-transitory computer readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations of a method for training an optimizer neural network having optimizer parameters, wherein the optimizer neural network is configured to generate an output that defines updated values of target parameters of a target neural network during training of the target neural network to perform one or more target neural network tasks associated with the target neural network ([Li, 0019] “In one embodiment, a method is provided. The method includes operations for training a teacher model based on teacher speech data and for initializing a student model with parameters obtained from the trained teacher model”, the teacher model corresponds to the optimizer neural network, the student model corresponds to the target neural network. The target parameter means any parameter of the target neural network, [Li, 0021] “In yet another embodiment, a machine-readable storage medium includes instructions that, when executed by a machine, cause the machine to perform operations comprising: training a teacher model based on teacher speech data; … and maximizing the classifier condition loss with respect to parameters of a feature extractor”, teaches that the system comprises a computer and storage).
Claim 15 is a non-transitory computer readable storage media claim reciting limitations similar to those of method claim 1. Therefore, it is rejected under the same rationale as claim 1.

Regarding claim 5, Li in view of Andrychowicz and further in view of Wang teaches wherein the outer loss function is an expected training loss of the target neural network ([Li, 0018] “In some example embodiments, using AT/S, a student acoustic model and a condition classifier are jointly optimized by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and the student models, while simultaneously min-maximizing classification losses (e.g., the accuracy loss due to the acoustic condition variability in a condition classifier)”).

Regarding claim 6, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 1, wherein the outer loss function is an expected validation loss of the target neural network ([Li, 0018] “In some example embodiments, using AT/S, a student acoustic model and a condition classifier are jointly optimized by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and the student models, while simultaneously min-maximizing classification losses (e.g., the accuracy loss due to the acoustic condition variability in a condition classifier)”, Li teaches the process of validating whether the classification process classifies the input data correctly, which corresponds to the validation loss).

Regarding claim 9, Li in view of Andrychowicz teaches the method of claim 1, wherein evaluating the performance of the trained instance of target neural network on the neural network task associated with the instance of target neural network to determine the one or more performance metrics for the trained instance of the target neural network on the neural network task ([Li, 0106] “In some example embodiments, stochastic gradient descent (SGD) is used to optimize $L_{\text{total}}(\theta_f, \theta_y, \theta_C)$. SGD, also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization and iterative method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration”) comprises: 
Li in view of Andrychowicz does not specifically teach determining one or more perturbed values of the outer loss function based on the perturbed values of the optimizer parameters.
Wang teaches determining one or more perturbed values of the outer loss function based on the perturbed values of the optimizer parameters ([Wang, 0121] “In one embodiment, a method of training a neural network model includes injection of randomness in the model weight space during optimization. In each iteration of the optimization, the gradients to be backpropagated are generated from the model slightly perturbed in the weight space. For example, the neural network model may be trained iteratively using models selected from a model distribution. At each step, a different model is sampled from the model distribution. The model distribution may be generated by sampling neural network parameters”).

Regarding claim 12, Li in view of Andrychowicz and further in view of Wang teaches wherein generating the perturbed values of the optimizer parameters comprises sampling the perturbation from a multivariate standard normal distribution ([Wang, 0121] “In one embodiment, a method of training a neural network model includes injection of randomness in the model weight space during optimization. In each iteration of the optimization, the gradients to be backpropagated are generated from the model slightly perturbed in the weight space. For example, the neural network model may be trained iteratively using models selected from a model distribution. At each step, a different model is sampled from the model distribution. The model distribution may be generated by sampling neural network parameters”, teaches injecting perturbation to the parameter and sampling the perturbed model. [Wang, 0131] “In some embodiments, the modified model may be selected from a model distribution. The model distribution may be generated by sampling neural network parameters”, teaches the distribution is generated from the NN parameter).

Claim 2 is rejected under 35 U.S.C. 103 over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) in view of Wang (US 20190244103 A1), and further in view of Geoffrey (Geoffrey et al, 2017, “Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference”), and further in view of Naesseth (Naesseth, 2017, “Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms”).

Regarding claim 2, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 1. 
Li in view of Andrychowicz and further in view of Wang does not specifically teach wherein determining the gradient estimate of the outer loss function comprises determining (i) a reparameterization-based gradient estimate of the outer loss function, and (ii) a log-derivative gradient estimate of the outer loss function.
Geoffrey teaches wherein determining the gradient estimate of the outer loss function comprises determining (i) a reparameterization-based gradient estimate of the outer loss function ([Geoffrey, page 4, 3 Implementation Details, 2nd paragraph] “Algorithm 1 shows the standard reparameterized gradient for the ELBO. We require three function definitions: q_sample to generate a reparameterized sample from the variational approximation, and functions that implement log p(x, z) and log q(z|x, φ). Once the loss $\hat{L}_t$ is defined, we can leverage automatic differentiation to return the standard gradient evaluated at $\phi_t$. This yields equation (7)”). 
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Li, Andrychowicz, Wang, and Geoffrey, to calculate the reparameterization-based gradient estimate of Geoffrey in implementing the optimizer neural network system of Li, Andrychowicz, and Wang. The suggestion and/or motivation for doing so is to improve the performance of the model, because the reparameterization trick provides lower-variance gradient estimates than a general gradient estimator ([Geoffrey, 1 Introduction, 1st paragraph]).
Li in view of Andrychowicz in view of Wang and further in view of Geoffrey does not specifically teach (ii) a log-derivative gradient estimate of the outer loss function. 
Naesseth teaches (ii) a log-derivative gradient estimate of the outer loss function ([Naesseth, page 2, right column, 2nd paragraph, Score function estimator.] “The score function estimator, also known as the log-derivative trick or reinforce [Williams, 1992, Glynn, 1990], is a general way to estimate the gradient of the elbo [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014]. The score function estimator expresses the gradient as an expectation with respect to q(z ; θ): $\nabla_\theta L(\theta) = E_{q(z;\theta)}[f(z) \nabla_\theta \log q(z;\theta)] + \nabla_\theta H[q(z;\theta)]$”).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Li, Andrychowicz, Wang, Geoffrey, and Naesseth, to calculate the log-derivative gradient estimate of Naesseth in implementing the optimizer neural network system of Li, Andrychowicz, Wang, and Geoffrey. The suggestion and/or motivation for doing so is to improve the performance of the model, because the log-derivative gradient estimate is a general way to estimate the gradient of a function ([Naesseth, Score function estimator, 1st paragraph]).
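For illustration only, the two estimator families cited from Geoffrey and Naesseth can be contrasted on a toy problem. This is a minimal sketch under stated assumptions and does not reproduce either reference's algorithm: the sampling distribution is assumed to be a unit-variance Gaussian q(z; μ), the objective is a hypothetical f(z) = z², and there is no entropy term, so both estimators target the derivative of E[f(z)] with respect to μ, which equals 2μ. All names are hypothetical.

```python
import random
import statistics


def gradient_samples(mu, n=20000, seed=0):
    """Per-sample gradient estimates of d/dmu E[z^2] for z ~ Normal(mu, 1)."""
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + eps                     # reparameterized sample: z = mu + eps
        reparam.append(2.0 * z)          # pathwise derivative of f(mu + eps) w.r.t. mu
        score.append(z * z * (z - mu))   # f(z) * d/dmu log q(z; mu), unit variance assumed
    return reparam, score


if __name__ == "__main__":
    reparam, score = gradient_samples(1.5)
    print("reparameterization:", statistics.mean(reparam), statistics.variance(reparam))
    print("log-derivative:    ", statistics.mean(score), statistics.variance(score))
```

Both sample means estimate the same quantity; in this toy setting the per-sample variance of the log-derivative estimator is considerably larger than that of the reparameterization estimator.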

Claim 3 is rejected under 35 U.S.C. 103 over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) in view of Wang (US 20190244103 A1), and further in view of Geoffrey (Geoffrey et al, 2017, “Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference”), in view of Naesseth (Naesseth, 2017, “Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms”), and further in view of Grosse (US 20030176650 A1).

Regarding claim 3, Li in view of Andrychowicz in view of Wang and further in view of Geoffrey teaches wherein determining the gradient estimate of the outer loss function further comprises: determining the gradient estimate by the reparameterization-based gradient estimate ([Geoffrey, page 4, 3 Implementation Details, 2nd paragraph] “Algorithm 1 shows the standard reparameterized gradient for the ELBO. We require three function definitions: q_sample to generate a reparameterized sample from the variational approximation, and functions that implement log p(x, z) and log q(z|x, φ). Once the loss $\hat{L}_t$ is defined, we can leverage automatic differentiation to return the standard gradient evaluated at $\phi_t$. This yields equation (7)”).
Li in view of Andrychowicz in view of Wang and further in view of Geoffrey does not specifically teach the log-derivative gradient estimate, or combining the gradient estimates using an inverse variance weighting technique.
Naesseth teaches calculating the log-derivative gradient estimate ([Naesseth, page 2, right column, 2nd paragraph, Score function estimator.] “The score function estimator, also known as the log-derivative trick or reinforce [Williams, 1992, Glynn, 1990], is a general way to estimate the gradient of the elbo [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014]. The score function estimator expresses the gradient as an expectation with respect to q(z ; θ): ∇θL(θ) = Eq(z ;θ) [f(z)∇θ log q(z ; θ)] + ∇θH[q(z ; θ)]”).
Naesseth does not specifically teach combining the gradient estimates using an inverse variance weighting technique.
Grosse teaches combining two values using an inverse variance weighting technique ([Grosse, 0478] “Total The total test combines the estimates of b from the unrelated, mean, and difference tests, which are statistically independent. A minimum variance estimator of b is built by weighting each of the three tests by the inverse of their sampling variance, and the variance of the combined estimator is the inverse of the sum of the inverse variances of the independent estimates. This test is more sensitive than either of the three independent tests in the absence of stratification, but is not as robust as the difference or non-parametric difference test in the presence of stratification”).
Before the effective filing date of the invention to a person of ordinary skill in the art, it would have been obvious, having the teachings of Li, Andrychowicz, Wang, Geoffrey, Naesseth, and Grosse, to combine estimated values using the inverse variance weighting technique of Grosse to implement the optimizer neural network system of Li, Andrychowicz, Wang, Geoffrey, and Naesseth. The suggestion and/or motivation to do so is to improve the performance of the model, as the technique is more sensitive than other techniques for measuring loss ([Grosse, 0478]).
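For illustration only, the inverse variance weighting described in the Grosse passage can be sketched as below. This sketch is not taken from any cited reference. It assumes the estimates are statistically independent and that their sampling variances are known and positive; the function name and the numeric values are hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent estimates by weighting each by 1 / variance.

    Returns the combined estimate and its variance, where the combined
    variance is the inverse of the sum of the inverse variances.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    weight_sum = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / weight_sum
    return combined, 1.0 / weight_sum


# Assumed example: two gradient estimates of the same quantity, the first
# with variance 1.0 and the second with variance 3.0.
combined, combined_variance = inverse_variance_combine([2.0, 4.0], [1.0, 3.0])
```

In this example the lower-variance estimate receives three times the weight of the other, so the combined value is 2.5 with variance 0.75, which is smaller than either input variance.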

Claims 7-8 are rejected under 35 U.S.C. 103 over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) in view of Wang (US 20190244103 A1) and further in view of Berrada (Berrada, 02/2018, “SMOOTH LOSS FUNCTIONS FOR DEEP TOP-K CLASSIFICATION”).

Regarding claim 7, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 1 and calculating a training loss of the target neural network ([Li, 0018] “In some example embodiments, using AT/S, a student acoustic model and a condition classifier are jointly optimized by minimizing the Kullback-Leibler (KL) divergence between the output distributions of the teacher and the student models, while simultaneously min-maximizing classification losses (e.g., the accuracy loss due to the acoustic condition variability in a condition classifier)”).
Li in view of Andrychowicz and further in view of Wang does not specifically teach wherein the outer loss function is a smoothed expected training loss of the target neural network.
Berrada teaches wherein the loss function is a smoothed expected loss ([Berrada, page 4, Smoothing] “In the form of equation (8), the loss function can be smoothed with a temperature parameter τ >0: … Note that we have changed the notation to use Lk,τ to refer to the smooth loss. In what follows, we first outline the properties of Lk,τ and its relationship with cross-entropy. Then we show the empirical advantage of Lk,τ over its non-smooth counter-part”).
Before the effective filing date of the invention to a person of ordinary skill in the art, it would have been obvious, having the teachings of Li, Andrychowicz, Wang, and Berrada, to calculate the smoothed training loss of Berrada to implement the optimizer neural network system of Li, Andrychowicz, and Wang. The suggestion and/or motivation to do so is to improve the performance of the model, as a smoothed loss function offers better performance in practice ([Berrada, page 4, Difficulty of the Optimization, 2nd paragraph]).

Regarding claim 8, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 1. 
Li in view of Andrychowicz and further in view of Wang does not specifically teach wherein the outer loss function is a smoothed expected validation loss of the target neural network.
Berrada teaches wherein the outer loss function is a smoothed expected validation loss of the target neural network ([Berrada, page 4, Smoothing] “In the form of equation (8), the loss function can be smoothed with a temperature parameter τ >0: … Note that we have changed the notation to use Lk,τ to refer to the smooth loss. In what follows, we first outline the properties of Lk,τ and its relationship with cross-entropy. Then we show the empirical advantage of Lk,τ over its non-smooth counter-part”).

Claim 10 is rejected under 35 U.S.C. 103 over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) in view of Wang (US 20190244103 A1) and further in view of Ali (US 20140287698 A1).

Regarding claim 10, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 9, wherein determining the one or more perturbed values of the outer loss function ([Wang, 0121] “In one embodiment, a method of training a neural network model includes injection of randomness in the model weight space during optimization. In each iteration of the optimization, the gradients to be backpropagated are generated from the model slightly perturbed in the weight space. For example, the neural network model may be trained iteratively using models selected from a model distribution. At each step, a different model is sampled from the model distribution. The model distribution may be generated by sampling neural network parameters”).
Li in view of Andrychowicz and further in view of Wang does not specifically teach determining a positively-perturbed value of the outer loss function based on the positively-perturbed values of the parameters, wherein the positively-perturbed values of the parameters are the current values of the parameters plus the perturbation, and determining a negatively-perturbed value of the outer loss function based on the negatively-perturbed values of the parameters, wherein the negatively-perturbed values of the parameters are the current values of the parameters minus the perturbation.
Ali teaches determining a positively-perturbed value based on the positively-perturbed values of the parameters, wherein the positively-perturbed values of the parameters are the current values of the parameters plus the perturbation, and determining a negatively-perturbed value based on the negatively-perturbed values of the parameters, wherein the negatively-perturbed values of the parameters are the current values of the parameters minus the perturbation ([Ali, Claim 16] “said second power parameter corresponds to a received power measurement obtained with said matching circuit operating at perturbed frequencies corresponding to one of the operating frequency minus a frequency perturbation step interval or the operating frequency plus the frequency perturbation step interval”, which teaches the original operating frequency plus the perturbation and minus the perturbation).
Before the effective filing date of the invention to a person of ordinary skill in the art, it would have been obvious, having the teachings of Li, Andrychowicz, Wang, and Ali to determine a positively perturbed value and negatively perturbed value of Ali to implement the optimizer neural network system of Li, Andrychowicz and Wang. The suggestion and/or motivation to do so is to improve the accuracy of the model, as testing the model with more diverse perturbation may increase the chance of training the model more accurately. 
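For illustration only, the positively-perturbed and negatively-perturbed values recited in claim 10 can be sketched as below. This sketch is not taken from any cited reference or from the application. It assumes a hypothetical quadratic loss and a fixed perturbation vector, and it shows one common use of the two values, a symmetric difference that approximates the directional derivative of the loss along the perturbation.

```python
def perturbed_losses(loss_fn, params, perturbation):
    """Evaluate the loss at (params + perturbation) and (params - perturbation)."""
    plus = [p + e for p, e in zip(params, perturbation)]
    minus = [p - e for p, e in zip(params, perturbation)]
    return loss_fn(plus), loss_fn(minus)


def quadratic_loss(params):
    """Hypothetical stand-in for an outer loss: sum of squared parameters."""
    return sum(p * p for p in params)


params = [1.0, -2.0]          # assumed current parameter values
perturbation = [0.1, -0.05]   # assumed perturbation
loss_plus, loss_minus = perturbed_losses(quadratic_loss, params, perturbation)

# Half the difference approximates (gradient of loss) . perturbation.
# For this quadratic loss the approximation is exact up to rounding.
directional_estimate = (loss_plus - loss_minus) / 2.0
```

Here the positively-perturbed loss is 5.4125, the negatively-perturbed loss is 4.6125, and half their difference is 0.4, which matches the analytic value 2θ·ε for this assumed loss.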

Claims 11 and 13 are rejected under 35 U.S.C. 103 over Li (US 20190287515 A1) in view of Andrychowicz (Andrychowicz et al, 2016, “Learning to learn by gradient descent by gradient descent”) in view of Wang (US 20190244103 A1), and further in view of Oliver (US 20040002930 A1).

Regarding claim 11, Li in view of Andrychowicz and further in view of Wang teaches the method of claim 1.
Li in view of Andrychowicz and further in view of Wang does not specifically teach further comprising sampling a dataset from a set of datasets, wherein each dataset is used for training the target neural network to perform a respective neural network task, and wherein each dataset includes respective training data and respective validation data.
Oliver teaches further comprising sampling a dataset from a set of datasets, wherein each dataset is used for training the target neural network to perform a respective neural network task, and wherein each dataset includes respective training data and respective validation data ([Oliver, 0094] “In a first case, 10 datasets of randomly sampled synthetic discrete data were generated with 3 hidden states, 3 observation values and random additive observation noise, for example. In one example, the experiment employed 120 samples per dataset for training, 120 per dataset for testing and a 10-fold cross validation to estimate a. The training was supervised for both HMMs and MIHMMs. MIHMMs had an average improvement over the 10 datasets of about 11%, when compared to HMMs of similar structure. The a.sub.optimal determined and selected was 0.5 (a range from about 0.3 to 0.8 was suitable). A mean classification error over the ten datasets for HMMs and MIHMMs with respect to a is depicted in FIG. 7. A summary of the mean accuracies of HMMs and MIHMMs is depicted below in Table 1”).
Before the effective filing date of the invention to a person of ordinary skill in the art, it would have been obvious, having the teachings of Li, Andrychowicz, Wang, and Oliver, to sample a dataset from a set of datasets, each including training data and validation data, as taught by Oliver, to implement the optimizer neural network system of Li, Andrychowicz, and Wang. The suggestion and/or motivation to do so is to improve the accuracy of the model, as training and testing the model with more diverse datasets may increase the chance of training the model more accurately.

Regarding claim 13, Li in view of Andrychowicz in view of Wang and further in view of Oliver teaches wherein repeatedly performing the following operations to determine trained values of the optimizer parameters comprises: repeatedly performing the following operations for each of the datasets sampled from the set of datasets ([Li, 0050] “Models may be run against a training dataset for several epochs (e.g., iterations), in which the training dataset is repeatedly fed into the model to refine its results. For example, in a supervised learning phase, a model is developed to predict the output for a given set of inputs, and is evaluated over several epochs to more reliably provide the output that is specified as corresponding to the given input for the greatest number of inputs for the training dataset. In another example, for an unsupervised learning phase, a model is developed to cluster the dataset into n groups, and is evaluated over several epochs as to how consistently it places a given input into a given group and how reliably it produces the n desired clusters across each epoch”; the details about the ‘models’ are given in paragraphs 0045 and 0046, and the models comprise the teacher model and the student model. Training the teacher model corresponds to the process of repeatedly performing the operations).

Allowable Subject Matter
Claim 4 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Regarding optimizer neural network.
Ito, 2016, “Gradient-based global features for seam carving”
Bello, 2017, “Neural Optimizer Search with Reinforcement Learning”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on M-F 7:30AM – 4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JUN KWON/
Patent Examiner, Art Unit 2127
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126