DETAILED ACTION
This action is in response to the claims filed 09/13/2018. Claims 1-20 are pending and have been examined.	
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claim Interpretation
Claim 1 is directed to a method claim that recites a comparison between a candidate learning rate control value and a maximum previously observed learning rate control value. The method of claim 1 recites contingent limitations, and any prior art meets the broadest reasonable interpretation of the claim if only one of those limitations is met. Therefore the prior art only needs to read on “when the candidate learning rate control value is greater…” OR “when the candidate learning rate control value is less than…” There is only a single initial comparison, and therefore only a single condition is required in the claim language. See MPEP 2111.04(II); see also Ex parte Schulhauser. Examiner notes that even though the broadest reasonable interpretation of the method claim merely requires one condition, the prior art cited below reads on the structure for performing all of the functionality in the apparatus and product claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1
Under step 1, the claim is directed to a computer-implemented method for optimizing machine-learning models that provides improved convergence properties, which is directed to a process, one of the statutory categories. The claim recites the following limitations: “determining…a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters”, “when the candidate learning rate control value is greater than the maximum previously observed learning rate control value: …setting a current learning rate control value equal to the candidate learning rate control value; and setting the maximum previously observed learning rate control value equal to the candidate learning rate control value;”, “when the candidate learning rate control value is less than the maximum previously observed learning rate control value: …setting the current learning rate control value equal to the maximum previously observed learning rate control value;” Under Step 2A Prong 1, these limitations amount to mathematical calculations. In particular, determining a gradient of a loss function is simply a mathematical evaluation. Further, assigning/setting parameter values according to a conditional comparison between two values is an evaluation of a mathematical operation; these steps are equivalent to taking the maximum value between a set of values and/or taking the minimum value between a set of values.
Furthermore, the limitations: “comparing…the candidate learning rate control value to a maximum previously observed learning rate control value;”, “determining…a candidate learning rate control value based at least in part on the gradient of the loss function;”, “determining…a current learning rate based at least in part on the current learning rate control value;”, “determining…an updated set of values for the plurality of parameters of the machine-learned model based at least in part on the gradient of the loss function and according to the current learning rate.” Under Step 2A Prong 1, these limitations correspond to an evaluation performed in the human mind. Determining and comparing values associated with a machine learning model based on loss functions can be performed simply in the human mind or with aid of pen and paper.
Furthermore under step 2A Prong 2 and 2B the claims recite the additional element(s) that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. (“by the one or more computing devices”) See MPEP 2106.05(f). Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
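By way of illustration only, the stated equivalence between the conditional setting steps and taking a maximum can be sketched in Python. The function names conditional_update and max_update are hypothetical and appear neither in the claims nor in the cited art; the sketch also assumes the unaddressed case of equal values is handled by retaining the previous maximum.

```python
def conditional_update(candidate, max_prev):
    """Conditional assignment in the style of claim 1 (illustrative names)."""
    if candidate > max_prev:
        # Candidate exceeds the stored maximum: use it and store it.
        current = candidate
        max_prev = candidate
    else:
        # Candidate does not exceed the stored maximum: keep the stored value.
        current = max_prev
    return current, max_prev


def max_update(candidate, max_prev):
    """The same result expressed as a single max() evaluation."""
    current = max(candidate, max_prev)
    return current, current
```

For any pair of numeric inputs the two functions return the same (current, stored maximum) pair.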
Regarding Claim 2
The claim is directed to a process. The claim recites the following limitations “determining, … an exponential moving average of squared past gradients and a square of the gradient of the loss function.” Under Step 2A Prong 1, these limitations correspond to a mathematical concept.  
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the independent claim. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 3
The claim is directed to a process. The claim recites the following limitations “updating… a current momentum value based at least in part on the gradient of the loss function and one or more previous momentum values respectively from one or more previous iterations;”, “determining, … the updated set of values for the plurality of parameters of the machine-learned model based at least in part on the current momentum value and according to the current learning rate” Under Step 2A Prong 1, these limitations correspond to an evaluation performed in the human mind. Determining and updating values associated with a machine learning model based on loss functions can be performed simply in the human mind or with aid of pen and paper. By way of example an update may simply be a determination that the parameters should be incremented if the gradient is positive.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the independent claim. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 4
The claim is directed to a process. The claim recites the following limitations “determining a moving average of the one or more previous momentum values and the gradient of the loss function.” Under Step 2A Prong 1, these limitations correspond to an evaluation performed in the human mind. Determining values associated with a machine learning model based on loss functions can be performed simply in the human mind or with aid of pen and paper.
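Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 3. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.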
Regarding Claim 5
The claim is directed to a process. The claim recites the following limitations “performing… a projection operation on a current set of values for the plurality of parameters minus the current momentum value times the current learning rate.” Under Step 2A Prong 1, these limitations correspond to a mathematical concept.  
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 3. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 6
The claim is directed to a process. The claim recites the following limitations “dividing, by the one or more computing devices, a step size by a square root of a matrix version of the current learning rate control value.” Under Step 2A Prong 1, these limitations correspond to a mathematical concept.  
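Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the independent claim. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.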
Regarding Claim 7
The claim is directed to a process. The claim recites no additional abstract ideas other than those addressed in the independent claim.
Furthermore under step 2A Prong 2 and 2B the claims recite the additional element(s) “performing…the method of claim 1 for each of a plurality of iterations.” that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). The additional element in the context of the claim amounts to performing repetitive calculations, corresponding to simply appending well-understood, routine, conventional activities. (See MPEP 2106.05(d)). Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 8
The claim is directed to a process. The claim recites the following limitations “wherein… a second order moment decay factor used to determine the candidate learning rate control value based at least in part on the gradient of the loss function is held constant.” Under Step 2A Prong 1, these limitations only serve to further describe the abstract ideas recited in independent claim 1 and dependent claim 7.
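Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 7. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.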
Regarding Claim 9
The claim is directed to a process. The claim recites the following limitations “wherein…a second order moment decay factor used to determine the candidate learning rate control value based at least in part on the gradient of the loss function is increased so as to provide increasing influence to past learning rate control values.” Under Step 2A Prong 1, these limitations only serve to further describe the abstract ideas recited in independent claim 1 and dependent claim 7.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 7. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 10
The claim is directed to a process. The claim recites the following limitations “wherein…a momentum decay factor used to update a current momentum value based at least in part on the gradient of the loss function is held constant.” Under Step 2A Prong 1, these limitations only serve to further describe the abstract ideas recited in independent claim 1 and dependent claim 7.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 7. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 11
The claim is directed to a process. The claim recites the following limitations “wherein…a momentum decay factor used to update a current momentum value based at least in part on the gradient of the loss function is decreased according to a decay schedule.” Under Step 2A Prong 1, these limitations only serve to further describe the abstract ideas recited in independent claim 1 and dependent claim 7.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the dependent claim 7. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 12
Under step 1, the claim is directed to a computing system, which is directed to a machine, one of the statutory categories. The claim recites the following limitations: “determining a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters;”, “selecting a minimum of the candidate learning rate and a minimum previously observed learning rate to serve as a current learning rate;”. Under Step 2A Prong 1, these limitations amount to mathematical calculations. In particular, determining a gradient of a loss function is simply a mathematical evaluation. Further, selecting a minimum parameter value between values is an evaluation of a mathematical operation; these steps are equivalent to implementing a min() function on a set of values in order to evaluate or select a minimum. Furthermore, the limitations: “determining a candidate learning rate based at least in part on the gradient of the loss function;”, “updating at least one of the plurality of parameters of the machine-learned model based at least in part on the gradient of the loss function and according to the current learning rate.” Under Step 2A Prong 1, these limitations correspond to an evaluation performed in the human mind. Determining/updating values associated with a machine learning model based on loss functions can be performed simply in the human mind or with aid of pen and paper. By way of example an update may simply be a determination that the parameters should be incremented if the gradient is positive.
Furthermore under step 2A Prong 2 and 2B the claims recite the additional element(s) that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. (“one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations”) See MPEP 2106.05(f). Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.

Regarding Claim 13
The claim is directed to a machine. The claim limitations “determining a candidate learning rate control value based at least in part on the gradient of the loss function, wherein the candidate learning rate is a function of and has an inverse relationship to the candidate learning rate control value; and selecting the minimum of the candidate learning rate and the minimum previously observed learning rate as the current learning rate comprises: identifying a maximum of the candidate learning rate control value and a maximum previously observed learning rate control value; determining the current learning rate based on the maximum of the candidate learning rate control value and the maximum previously observed learning rate control value.” only serve to further describe the recited abstract idea of “determining…” discussed in the rejection of claim 12.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the independent claim 12. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 14
Claim 14 recites the same features as claim 2 and is rejected for at least the same reasons as claim 2 in connection with dependent claim 13. 
Regarding Claim 15
Claim 15 recites the same features as claim 6 and is rejected for at least the same reasons as claim 6 in connection with dependent claim 13. 
Regarding Claim 16
Claim 16 recites the same features as claim 3 and is rejected for at least the same reasons as claim 3 in connection with independent claim 12.
Regarding Claim 17
Claim 17 recites the same features as claim 4 and is rejected for at least the same reasons as claim 4 in connection with dependent claim 16.
Regarding Claim 18
Claim 18 recites the same features as claim 5 and is rejected for at least the same reasons as claim 5 in connection with dependent claim 16.
Regarding Claim 19
Under step 1, the claim is directed to one or more non-transitory computer-readable media, which is directed to a machine, one of the statutory categories. The claim recites the following limitations: “determining a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters;”, “determining a candidate learning rate control value based at least in part on the gradient of the loss function”, “selecting a maximum of the candidate learning rate control value and a maximum previously observed learning rate control value as a current learning rate control value;”, “updating … the machine-learned model based at least in part on the gradient of the loss function and according to a current learning rate that is a function of the current learning rate control value.” Under Step 2A Prong 1, these limitations correspond to an evaluation performed in the human mind. Determining/comparing/selecting values associated with a machine learning model based on loss functions can be performed simply in the human mind or with aid of pen and paper.
Furthermore under step 2A Prong 2 and 2B the claims recite the additional element(s) that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. (“One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors, cause the one or more processors to perform operations”) See MPEP 2106.05(f). In addition, the claim recites the additional element “for each of a plurality of iterations:” that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). The additional element in the context of the claim amounts to performing repetitive calculations, corresponding to simply appending well-understood, routine, conventional activities. (See MPEP 2106.05(d)). Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Regarding Claim 20
The claim is directed to a machine. The claim limitation “wherein the current learning rate is inversely correlated to the candidate learning rate control value” only serves to further describe the recited abstract idea of “selecting…”.
Furthermore under step 2A Prong 2 and 2B, the claim does not recite additional elements to consider other than those considered in the independent claim 19. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception because they do not impose any meaningful limits on practicing the abstract idea.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kingma et al. “ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION” hereinafter Kingma. 

Regarding claim 1
	Kingma teaches, A computer-implemented method for optimizing machine-learned models that provides improved convergence properties, the method comprising: (Abstract pg 1 “We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements… Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.” A training approach refers to a method of optimizing machine learned models. The Adamax algorithm is described on pg 3) determining, by one or more computing devices, a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters; (pg 9 Algorithm 2 
[Image: Kingma, Algorithm 2 (AdaMax)]
The adamax algorithm is an iterative algorithm, as seen in the while loop from line 4 to line 10. On line 6 the gradients are determined and assigned to the variable gt. The gradients reflect the performance of the algorithm for the current iteration.) determining, by the one or more computing devices, a candidate learning rate control value based at least in part on the gradient of the loss function; (pg 9 Algorithm 2 in line 8 the absolute value of the gradient, corresponding to the candidate learning rate control value, is determined for the current time step. This determination is based at least in part on the value of the gradient, gt.) comparing, by the one or more computing devices, the candidate learning rate control value to a maximum previously observed learning rate control value; (pg 9 Algorithm 2 again in line 8 of the algorithm the max function has two parameters; the first is the previously observed learning rate, u_t-1. Accordingly, u_t-1*B_2 is either an initial value or the previous result of the max() function, thus this previously observed learning rate is the “maximum previously observed learning rate”. Furthermore, the second parameter is the previously mapped candidate learning rate control value, and the max() function performs the comparison.) 
when the candidate learning rate control value is greater than the maximum previously observed learning rate control value: setting a current learning rate control value equal to the candidate learning rate control value; and setting the maximum previously observed learning rate control value equal to the candidate learning rate control value; when the candidate learning rate control value is less than the maximum previously observed learning rate control value: setting the current learning rate control value equal to the maximum previously observed learning rate control value; (pg 9 Algorithm 2 the conditional assignment of the “current learning rate value” is implemented by the max() function in line 8 of the algorithm. When the candidate is the maximum value it is assigned as the current learning rate u_t; further, in the next iteration this u_t will be u_t-1, the maximum previously observed learning rate. Similarly, u_t-1*B_2 is assigned when it is greater than the candidate.) determining, by the one or more computing devices, a current learning rate based at least in part on the current learning rate control value; (pg 9 Algorithm 2 in line 9 of the algorithm the current learning rate is a function of the control value ut, 
[Image: expression excerpted from Kingma, Algorithm 2]
.) and determining, by the one or more computing devices, an updated set of values for the plurality of parameters of the machine-learned model based at least in part on the gradient of the loss function and according to the current learning rate. ( pg 9 Algorithm 2  in line 9 of the algorithm the updated parameters theta_t are determined based on the current learning control value, which as previously described is based at least in part on the gradient and used to determine the current learning rate.) 
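For reference, the steps of Kingma's Algorithm 2 relied upon in the mapping above (lines 7 through 9) can be sketched for a single scalar parameter as follows. This is a minimal sketch under stated assumptions, not the reference's own code: the function names adamax_step and minimize_quadratic, the scalar setting, and the example loss f(θ) = θ² are illustrative choices; the default values for alpha, beta1 and beta2 follow the defaults stated in Kingma.

```python
def adamax_step(theta, grad, m_prev, u_prev, t,
                alpha=0.002, beta1=0.9, beta2=0.999):
    """One scalar AdaMax update; returns (theta_new, m, u)."""
    # Line 7: first moment (momentum) as an exponential moving average.
    m = beta1 * m_prev + (1.0 - beta1) * grad
    # Line 8: max() compares the decayed previous value with |g_t|.
    u = max(beta2 * u_prev, abs(grad))
    # Line 9: the effective learning rate is inversely related to u.
    # Assumes u > 0, i.e. at least one nonzero gradient has been seen.
    lr = alpha / ((1.0 - beta1 ** t) * u)
    theta_new = theta - lr * m
    return theta_new, m, u


def minimize_quadratic(theta0=1.0, steps=200):
    """Runs the update on the assumed example loss f(theta) = theta**2."""
    theta, m, u = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * theta  # gradient of theta**2 (line 6 in the algorithm)
        theta, m, u = adamax_step(theta, grad, m, u, t)
    return theta
```

In the first iteration with theta = 1.0 the gradient is 2.0, so m = 0.2, u = 2.0, the learning rate is 0.002 / (0.1 × 2.0) = 0.01, and the updated parameter is approximately 0.998.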
Regarding claim 2
	Kingma teaches claim 1
	Kingma teaches, determining, by the one or more computing devices, an exponential moving average of squared past gradients and a square of the gradient of the loss function ( pg 8 Section 7.1 “In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L2 norm of their individual current and past gradients. We can generalize the L2 norm based update rule to a L p norm based update rule….” Equation 6 
[Image: Kingma, Equation 6]
 The generalized update rule for the candidate learning control value vt is given by equation 6. In ADAM vt is determined as an exponential moving average of squared past gradients and the squared current gradient when p=2. While the adamax algorithm computes the infinity norm to compute the candidate gradient, it is derived from this generalized equation. Adamax is equivalent to ADAM but with p=infinity instead of 2.)
Regarding claim 3
	Kingma teaches claim 1
	Kingma teaches, updating, by the one or more computing devices, a current momentum value based at least in part on the gradient of the loss function and one or more previous momentum values respectively from one or more previous iterations; (pg 9 algorithm 2 line 7. The first moment corresponding to the momentum values is updated based on the gradient gt and the previous momentum values m_t-1) and determining, by the one or more computing devices, the updated set of values for the plurality of parameters of the machine-learned model based at least in part on the current momentum value and according to the current learning rate. (pg 9 Algorithm 2 in line 9 of the algorithm the updated parameters theta_t are determined based on the current learning control value, which as previously described is based at least in part on the gradient and used to determine the current learning rate, and also based on the parameter m_t)
Regarding claim 4
	Kingma teaches claim 3
	Kingma teaches, determining a moving average of the one or more previous momentum values and the gradient of the loss function. (pg 9 algorithm 2 line 7. The first moment corresponding to the momentum values is updated based on the gradient gt and the previous momentum values m_t-1. The equation given here is by definition an exponential moving average, which is a type of moving average calculation.)
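To illustrate why the recursion in line 7 is an exponential moving average, the following minimal sketch compares the recursive form with its unrolled form, a weighted sum of all past gradients with exponentially decaying weights. The function names ema_recursive and ema_explicit are hypothetical, and the initial value m_0 = 0 is assumed as in Kingma's initialization.

```python
def ema_recursive(grads, beta1=0.9):
    """m_t = beta1 * m_{t-1} + (1 - beta1) * g_t, starting from m_0 = 0."""
    m = 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
    return m


def ema_explicit(grads, beta1=0.9):
    """Unrolled form: sum over i of (1 - beta1) * beta1**(t - i) * g_i."""
    t = len(grads)
    return sum((1.0 - beta1) * beta1 ** (t - i) * g
               for i, g in enumerate(grads, start=1))
```

Both functions agree up to floating-point rounding, and older gradients receive geometrically smaller weights.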
Regarding claim 5
	Kingma teaches claim 3
	Kingma teaches, performing, by the one or more computing devices, a projection operation on a current set of values for the plurality of parameters minus the current momentum value times the current learning rate. (
[Image: equation excerpt from Kingma] the current momentum value corresponds to mt while the current learning rate is α / ((1 - β) * U_t). The difference between these values and the current parameters θ_t-1 corresponds to a projection operation.)

Regarding claim 6
	Kingma teaches claim 2
Kingma teaches, dividing, by the one or more computing devices, a step size by a square root of a matrix version of the current learning rate control value. ( pg 5 Section AdaGrad “AdaGrad corresponds to a version of Adam with β1 = 0... 
[Image: equation from the quoted AdaGrad passage in Kingma]
” When hyperparameters are selected such that Adam corresponds to AdaGrad, the update step, which uses the current learning rate to update parameters, is defined by the step size alpha divided by the square root of the sum of squared gradients. A matrix version of the gradient, or current learning rate control value, is equivalent to an element-wise squaring of a gradient vector.)

Regarding claim 7
	Kingma teaches claim 1
Kingma teaches, performing, by the one or more computing devices, the method of claim 1 for each of a plurality of iterations. (pg 9 Algorithm 2 the steps described in independent claim 1 including the parameter updates are performed within a while loop, thus they are performed through a plurality of iterations.)
Regarding claim 8
	Kingma teaches claim 7
Kingma teaches, wherein, over the plurality of iterations, a second order moment decay factor used to determine the candidate learning rate control value based at least in part on the gradient of the loss function is held constant (pg 9 Algorithm 2 and pg 2 Algorithm 1 B2 is the second order moment decay factor because it is used in the infinity norm update. The infinity norm is derived from the second order norm described in the ADAM algorithm described on page 2. In both versions of the algorithm B2 is set to a constant value near 1, namely .999.)
Regarding claim 9
	Kingma teaches claim 7
Kingma teaches, wherein, over the plurality of iterations, a second order moment decay factor used to determine the candidate learning rate control value based at least in part on the gradient of the loss function is increased so as to provide increasing influence to past learning rate control values. (pg 9 Section 7.2 “Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loop of algorithms 1 and 2… 
[Image: equation from Kingma, Section 7.2] … [Image: equation from Kingma, Section 7.2]
” The authors note that higher weight is given to more recent parameter values, which corresponds to increasing influence to past learning rate control values because the learning rate control values are the values used to update past parameters. The term (1- β2^t) is the second order moment decay factor, which increases as the number of iterations t increases.)


Regarding claim 10
	Kingma teaches claim 7
Kingma teaches, wherein, over the plurality of iterations, a momentum decay factor used to update a current momentum value based at least in part on the gradient of the loss function is held constant.(pg 9 Algorithm 2 and pg 2 Algorithm 1 B1 is the first order moment decay factor because it is used in the first moment update. In both the Adamax algorithm and the ADAM algorithm the B1 is used to calculate the momentum parameter mt. In both versions of the algorithm B1 is set to .9.)
Regarding claim 11
	Kingma teaches claim 7
Kingma teaches, wherein, over the plurality of iterations, a momentum decay factor used to update a current momentum value based at least in part on the gradient of the loss function is decreased according to a decay schedule. (pg 9 Algorithm 2. B1 is the first order moment decay factor because it is used in the first moment update. In line 9 of the algorithm the parameter is raised to the iteration count t. Thus this value, which is strictly less than 1 and equal to .9 in the example, decreases over the course of the iterations. The decrease corresponds to the decay schedule.)
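For illustration only, the behavior of the term B1^t discussed above can be checked numerically. The following is a minimal Python sketch and is not taken from Kingma: the variable names are hypothetical, B1 = .9 is the example value cited above, and the step-size scale 1/(1 - B1^t) reflects the examiner's reading of line 9 of Algorithm 2.

```python
# Example value of the first order moment decay factor cited above.
beta1 = 0.9

# B1 raised to the iteration count t, for t = 1..5.
powers = [beta1 ** t for t in range(1, 6)]

# Step-size scale 1 / (1 - B1^t) that multiplies the base step size
# (assumed form of the factor in line 9 of Algorithm 2).
step_scale = [1.0 / (1.0 - p) for p in powers]

for t, (p, s) in enumerate(zip(powers, step_scale), start=1):
    print(f"t={t}  beta1^t={p:.5f}  1/(1-beta1^t)={s:.5f}")
```

In this sketch both sequences shrink as t grows, while the stored value of B1 itself is never reassigned.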

Regarding claim 12
Kingma teaches, A computing system, comprising: one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: (Section 6 Experiments pg 5 “To empirically evaluate the proposed method, we investigated different popular machine learning Models… Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems…. We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000 dimension BoW feature vector for each review is highly sparse” Conclusion pg 10 “The method is straightforward to implement and requires little memory” The method described by Kingma is clearly performed on processors, as the operations are described in the context of utilizing “little memory” and operating on very high dimension vectors.) determining a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters; (pg 9 Algorithm 2 
[image: media_image1.png]
The Adamax algorithm is an iterative algorithm, as seen in the while loop from line 4 to line 10. On line 6 the gradients are determined and assigned to the variable gt. The gradients reflect the performance of the algorithm for the current iteration.) determining a candidate learning rate based at least in part on the gradient of the loss function; (pg 9 Algorithm 2. As described in the rejection of claim 1, in line 8 of the algorithm the absolute value of the gradient, corresponding to the candidate learning rate control value, is determined for the current time step. This determination is based at least in part on the value of the gradient, gt. The candidate learning rate control value is inversely proportional to the learning rate. Thus determining the candidate learning rate control value is akin to determining the candidate learning rate.) selecting a minimum of the candidate learning rate and a minimum previously observed learning rate to serve as a current learning rate; (As described in the rejection of claim 1, the maximum of the candidate learning rate control value and the maximum previously observed learning rate control value is selected. Because the candidate learning rate is inversely proportional to the candidate learning rate control value, selecting the maximum learning rate control value is equivalent to selecting the minimum learning rate.) and updating at least one of the plurality of parameters of the machine-learned model based at least in part on the gradient of the loss function and according to the current learning rate. (pg 9 Algorithm 2. In line 9 of the algorithm the updated parameters theta_t are determined based on the current learning rate control value, which as previously described is based at least in part on the gradient and used to determine the current learning rate.)
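For illustration only, the steps mapped above (gradient determination, the max() update of the control value, and the parameter update) can be sketched in Python. This is a simplified sketch under stated assumptions and is not a reproduction of Kingma's Algorithm 2: it handles a single scalar parameter, the function name adamax_step and the quadratic loss are hypothetical, and the base step size alpha = 0.002 is an assumed default. B1 = .9 and B2 = .999 are the values cited in this action.

```python
def adamax_step(theta, m, u, grad, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One Adamax-style update for a scalar parameter (t counts from 1)."""
    # First moment (momentum) estimate of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Control value: maximum of the decayed previous value and |gradient|.
    u = max(beta2 * u, abs(grad))
    # Parameter update; the effective learning rate is proportional to 1/u.
    # Assumes u > 0, i.e. at least one nonzero gradient has been observed.
    theta = theta - (alpha / (1.0 - beta1 ** t)) * m / u
    return theta, m, u


# Hypothetical loss f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, u = 1.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2.0 * theta
    theta, m, u = adamax_step(theta, m, u, grad, t)

print(theta)
```

In this sketch the control value u only decreases through the B2 decay and is otherwise replaced by a larger gradient magnitude, which is the selection behavior relied upon above.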

Regarding claim 13
	Kingma teaches claim 12
Further, Kingma teaches, determining a candidate learning rate control value based at least in part on the gradient of the loss function, (pg 9 Algorithm 2. In line 8 of the algorithm the absolute value of the gradient, corresponding to the candidate learning rate control value, is determined for the current time step. This determination is based at least in part on the value of the gradient, gt.) wherein the candidate learning rate is a function of and has an inverse relationship to the candidate learning rate control value; (pg 9 Algorithm 2. In line 9 of the algorithm the learning rate is a function of the control value u_t, 
[equation image: media_image2.png]
. The learning rate is the value used to update the parameters in line 9 of the algorithm; this value is 1/u_t, which is inversely related to the control value u_t.) and selecting the minimum of the candidate learning rate and the minimum previously observed learning rate as the current learning rate comprises: identifying a maximum of the candidate learning rate control value and a maximum previously observed learning rate control value; (pg 9 Algorithm 2. Again in line 8 of the algorithm the max function has two parameters; the first is the previously observed learning rate, u_t-1. Accordingly, u_t-1*B_2 is either an initial value or the previous result of the max() function, thus this previously observed learning rate is the “maximum previously observed learning rate”. Furthermore, the second parameter is the previously mapped candidate learning rate control value, and the max() function performs the identification and selection. As stated in the rejection of claim 12, selecting a minimum candidate learning rate corresponds to selecting a maximum candidate learning rate control value, because of the inverse relationship.) and determining the current learning rate based on the maximum of the candidate learning rate control value and the maximum previously observed learning rate control value. (pg 9 Algorithm 2. The conditional assignment of the “current learning rate value” is implemented by the max() function in line 8 of the algorithm. When the candidate is the maximum value it is assigned as the current learning rate value u_t; further, in the next iteration this u_t will be u_t-1, the maximum previously observed learning rate. Similarly, u_t-1*B_2 is assigned when it is greater than the candidate. The current learning rate is then simply a function of the current learning rate value.)
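For illustration only, the inverse relationship relied upon above can be checked numerically. The following minimal Python sketch uses hypothetical function names and assumes strictly positive control values and a learning rate of the form 1/u; it shows that taking the maximum of two control values yields the same result as taking the minimum of the two corresponding learning rates.

```python
def rate_from_max_control(candidate_u, previous_u):
    """Learning rate obtained by first selecting the larger control value."""
    return 1.0 / max(candidate_u, previous_u)


def rate_from_min_rate(candidate_u, previous_u):
    """Learning rate obtained by selecting the smaller of the two rates."""
    return min(1.0 / candidate_u, 1.0 / previous_u)


# Hypothetical positive control values (candidate, previously observed).
pairs = [(2.0, 4.0), (4.0, 2.0), (0.5, 0.5), (3.0, 0.25)]
for candidate_u, previous_u in pairs:
    a = rate_from_max_control(candidate_u, previous_u)
    b = rate_from_min_rate(candidate_u, previous_u)
    print(candidate_u, previous_u, a, b)
```

The two functions agree for every pair in this sketch because 1/u decreases as u increases over positive values.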

Claim 14
	Claim 14 is rejected for at least the same reasons set forth in claim 13 in connection with claim 2.
Claim 15
	Claim 15 is rejected for at least the same reasons set forth in claim 13 in connection with claim 6.
Claim 16
	Claim 16 is rejected for at least the same reasons set forth in claim 12 in connection with claim 3.
Claim 17
	Claim 17 is rejected for at least the same reasons set forth in claims 12 and 16 in connection with claim 4.
Claim 18
	Claim 18 is rejected for at least the same reasons set forth in claims 12 and 16 in connection with claim 5.
Regarding claim 19
Kingma teaches, One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising: (Section 6 Experiments pg 5 “To empirically evaluate the proposed method, we investigated different popular machine learning Models… Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems…. We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000 dimension BoW feature vector for each review is highly sparse” Conclusion pg 10 “The method is straightforward to implement and requires little memory” The method described by Kingma is clearly performed on processors, as the operations are described in the context of utilizing “little memory” and operating on very high dimension vectors.) for each of a plurality of iterations: (pg 9 Algorithm 2. The steps described in independent claim 1, including the parameter updates, are performed within a while loop; thus they are performed through a plurality of iterations.) determining a gradient of a loss function that evaluates a performance of a machine-learned model that comprises a plurality of parameters; (pg 9 Algorithm 2 
[image: media_image1.png]
The Adamax algorithm is an iterative algorithm, as seen in the while loop from line 4 to line 10. On line 6 the gradients are determined and assigned to the variable gt. The gradients reflect the performance of the algorithm for the current iteration.) determining a candidate learning rate control value based at least in part on the gradient of the loss function; (pg 9 Algorithm 2. In line 8 of the algorithm the absolute value of the gradient, corresponding to the candidate learning rate control value, is determined for the current time step. This determination is based at least in part on the value of the gradient, gt.) selecting a maximum of the candidate learning rate control value and a maximum previously observed learning rate control value as a current learning rate control value; (pg 9 Algorithm 2. Again in line 8 of the algorithm the max function has two parameters; the first is the previously observed learning rate, u_t-1. Accordingly, u_t-1*B_2 is either an initial value or the previous result of the max() function, thus this previously observed learning rate is the “maximum previously observed learning rate”. Furthermore, the second parameter is the previously mapped candidate learning rate control value, and the max() function performs the selecting.) updating at least one of the plurality of (pg 9 Algorithm 2. In line 9 of the algorithm the updated parameters theta_t are determined based on the current learning rate control value, which as previously described is based at least in part on the gradient and used to determine the current learning rate.)
Regarding claim 20
	Kingma teaches claim 19
Further, Kingma teaches, wherein the current learning rate is inversely correlated to the candidate learning rate control value. (pg 9 Algorithm 2. In line 9 of the algorithm the learning rate is a function of the control value u_t, 
[equation image: media_image2.png]
. The learning rate is the value used to update the parameters in line 9 of the algorithm; this value is 1/u_t, which is inversely related to the control value u_t.)

Conclusion
Prior Art
Chen et al., “Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks”, discloses PADAM, an improved variant of the AMSGRAD algorithm.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/J.R.G./Examiner, Art Unit 2122
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145