Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is responsive to the Reply filed on 06/29/2022. Claims 1-25 are pending in the case. Claims 1, 14, and 20 are independent claims.

Response to Arguments
Applicant's amendments to claims 2, 3, 4, 15, 16, 21, and 24; and cancellation of claims n, n, and n; and arguments regarding rejections of claims 2-4, 15-17, 21-23, and 24 are persuasive. Accordingly, these rejections are hereby withdrawn. The remaining rejections under 35 U.S.C. § 112, claims 7, 10-12, and 25, were left unaddressed, and are therefore maintained.
Applicant's prior art arguments have been fully considered but are not persuasive. Applicant argues that the limitation “with at least two different time constants for the same type of the moment using the gradient” is not disclosed by the cited references. (Reply at pages 10 and 11). Specifically, Applicant argues that while “Schaul et al. may employ two different time constants, the reference is not employing these time constants at ‘the same type of moment using the gradient’.” (Id.). Applicant is merely applying a piecemeal analysis of the cited reference. One cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Schaul explicitly teaches that τ is a time constant and that the estimate of the gradient is computed at a particular moment. Therefore, Examiner respectfully asserts that the cited art sufficiently teaches the limitations recited in the claims and the rejections are maintained.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 7, 10-12, and 25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 7 and 25 recite "inconsistency," and the metes and bounds of this limitation are unclear. Additionally, it is arguably a relative term of degree. It is unclear how one measures the inconsistency and what this inconsistency represents. The Specification does not define how inconsistency is measured, and the language "depending on inconsistency" implies that there are varying amounts of inconsistency, the measure of which would not be known to one of ordinary skill in the art. See Specification para. [0011]. For examination purposes, this will be treated as the estimates not being identical.
Claim 10 recites the limitation "streaming manner," which is unclear. For purposes of examination, the examiner has interpreted "provided in a streaming manner" as "online".
Claims 11-12 depend on claim 10, inherit the same issue, and are rejected under the same reasoning as above.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6, 8-16, 18, 20-21 and 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Kingma et al. (“ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION”) and in view of Schaul et al. (“No More Pesky Learning Rates”).
Regarding Claim 1
Kingma teaches 
	A computer-implemented method for training a model, comprising: 
- obtaining a training example for a model having model parameters (§[4] “our goal is to predict the parameter θt and evaluate it on a previously unknown cost function ft.”) stored on one or more computer readable storage mediums operably coupled to the hardware processor, the training example including an outcome and features to explain the outcome; (§[6.1] “We examine the sparse feature problem using IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words.”; “movie review dataset” reads on “the training example including an outcome and features to explain the outcome”)
- calculating a gradient with respect to the model parameters of the model using the training example; (§[Algorithm 1] “gt← ∇θft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)”; θ is a parameter)
- computing at least two estimates of a moment of the gradient [[with at least two different time constants for the same type of the moment using the gradient]]; (§[Algorithm 1] “mt ← β1·mt−1 + (1 − β1)·gt (Update biased first moment estimate), vt ← β2·vt−1 + (1 − β2)·gt² (Update biased second raw moment estimate)”)
	Kingma does not distinctly disclose
- with at least two different time constants for the same type of the moment using the gradient
- updating, using a hardware processor, the model parameters of the model using the at least two estimates of the moment with the at least two different time constants to reduce errors while calculating the at least two estimates of the moment of the gradient
	However, Schaul teaches
- with at least two different time constants for the same type of the moment using the gradient ([4.1][4.2] “We use an exponential moving average with time constant τ (the approximate number of samples considered from recent memory)” )
[Image: media_image1.png]
[Image: media_image2.png]
[Image: media_image3.png]
τi(t+1) and τi(t) reads on “two different time constants”)
- updating, using a hardware processor, the model parameters of the model using the at least two estimates of the moment with the at least two different time constants to reduce errors while calculating the at least two estimates of the moment of the gradient (§[4.1][Algorithm 1] discloses the parameter update using two estimates of the moment (gi and vi). 
Examiner note: gi and vi are each calculated from their previous values, which were themselves calculated using the previous time constant. Under the broadest reasonable interpretation, those moments are therefore calculated using the two different time constants.)
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the learning optimization system of Kingma with the adaptive time constants of Schaul in order to effectively remove the need for learning rate tuning. (Schaul [Abstract] “Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.”)
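For illustration only, the two moving-average updates quoted above from Kingma's Algorithm 1 can be sketched in Python as follows. This is a minimal sketch and not part of the record: the function name `update_moments` is hypothetical, and the default decay rates (0.9 and 0.999) are the values commonly associated with Adam rather than values stated in this action.

```python
def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """Apply one step of the two exponential moving averages to a gradient vector.

    m, v, and grad are lists with one entry per model parameter.
    beta1 and beta2 are assumed defaults, not values taken from this action.
    """
    # mt = beta1 * mt-1 + (1 - beta1) * gt        (biased first moment estimate)
    new_m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    # vt = beta2 * vt-1 + (1 - beta2) * gt^2      (biased second raw moment estimate)
    new_v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    return new_m, new_v


# Hypothetical usage: both estimates start at zero and are refreshed per gradient.
m, v = [0.0, 0.0], [0.0, 0.0]
for grad in ([1.0, -2.0], [1.0, -2.0], [1.0, -2.0]):
    m, v = update_moments(m, v, grad)
```

In this sketch the first moment and the second raw moment are different types of moment, each tracked with its own decay rate.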

Regarding Claim 2
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Kingma further teaches
- wherein each of the model parameters is updated with an amount determined individually by respective components of the at least two estimates of the moment. (§[Algorithm 1] discloses updating parameter using mt and vt which are two estimates of the moment)

[Image: media_image4.png]

Regarding Claim 3
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Schaul further teaches
wherein a first model parameter of the model is updated by lesser amount than a second model parameter in response to the at least two estimates of the moment that is not fitting a component corresponding to the first model parameter.
(§[4.1][Algorithm 1] the first order moment gi is different from the previous gi-1 which reads on “inconsistency between at least two estimates” and gi affects learning rate which is a component for parameter update.)

[Image: media_image3.png]


Regarding Claim 4
The combination of Kingma and Schaul teaches all of the limitations of claim 3 and Schaul further teaches
- wherein, in response to the at least two estimates of the moment being consistent in the component corresponding to the first model parameter, the first model parameter is updated according to a value generated by combining respective components of the at least two estimates of the moment corresponding to the first model parameter. 
(§[4.1][Algorithm 1] the first order moment gi is calculated from the previous gi-1 which reads on “two estimates of the moment” and gi affects learning rate which is a component for parameter update.)

[Image: media_image3.png]

Examiner note: Since the term “consistent” is indefinite, “two estimates of the moment being consistent” is interpreted as “two estimates can be calculated” upon the broadest reasonable interpretation.

Regarding Claim 6
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Kingma further teaches
- wherein the moment includes a first order moment and a second order moment as different types, wherein the first order moment represents average of the gradient and the second order moment scales individual learning rates for the model parameters of the model. (§[Algorithm 1]
[Image: media_image5.png]
; α is learning rate and vt is the second order moment; In the formula, α(learning rate) is scaled by the square root of vt(the second order moment) for the parameter.)
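As a rough sketch of the scaling relied upon above, the following Python fragment applies a parameter update in which the learning rate α is divided, per parameter, by the square root of the second moment estimate. The function name `adam_step`, the bias-correction terms, and the default values are assumptions drawn from the commonly published form of Adam; they are not quoted in the text of this action.

```python
import math


def adam_step(theta, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated parameters given biased moment estimates m and v at step t >= 1.

    All defaults are assumed values for illustration only.
    """
    # Bias correction compensates for m and v having been initialized at zero.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Each parameter gets its own effective step size alpha / (sqrt(v_hat) + eps).
    return [
        th - alpha * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]


# Hypothetical usage: estimates after one step with gradient [1.0, -2.0].
theta = adam_step([0.0, 0.0], m=[0.1, -0.2], v=[0.001, 0.004], t=1)
```

With these inputs both parameters move by roughly α in magnitude even though the gradient components differ by a factor of two, which is the per-parameter scaling effect of the second order moment.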

Regarding Claim 8
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Schaul further teaches
- wherein the time constants change exponential decay rates for moment estimation and the time constants include a first time constant and a second time constant that is larger or smaller than the first time constant (§[4.2 Adaptive Time-constant] “We want the size of the memory to increase when the steps taken are small (increment by 1), and to decay quickly if a large step (close to the Newton step) is taken”; Time-constant itself is a decay rate and it is exponentially changed; 
[Image: media_image6.png];
[Image: media_image1.png]
This formula shows the second time constant is different(larger or smaller) from the first time constant.)
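For illustration only, the relationship between a time constant and an exponential decay rate, and an adaptive time constant of the kind quoted above, can be sketched as follows. The function names are hypothetical. The rule in `next_time_constant`, τ ← (1 − ḡ²/v̄)·τ + 1, is an assumed form chosen to match the quoted behavior (increment by 1 for small steps, quick decay after a large step); the formulas in the cited images are not reproduced in the text of this action.

```python
def decay_rate(tau):
    """Decay rate of a moving average that remembers roughly tau samples."""
    return 1.0 - 1.0 / tau


def ema_update(avg, sample, tau):
    """Exponential moving average with time constant tau (weight 1/tau on the new sample)."""
    return decay_rate(tau) * avg + (1.0 / tau) * sample


def next_time_constant(tau, g_bar, v_bar):
    """Assumed adaptive rule: tau <- (1 - g_bar^2 / v_bar) * tau + 1.

    g_bar is the running mean of the gradient and v_bar the running mean of
    its square, both for a single parameter. v_bar must be positive.
    """
    return (1.0 - (g_bar * g_bar) / v_bar) * tau + 1.0


# Hypothetical usage: a noisy gradient (mean near zero) lengthens the memory,
# while a consistent gradient (g_bar^2 close to v_bar) shortens it.
longer = next_time_constant(10.0, g_bar=0.0, v_bar=1.0)
shorter = next_time_constant(10.0, g_bar=1.0, v_bar=1.0)
```

Under this assumed rule the time constant at step t+1 generally differs from the time constant at step t, so successive moving-average updates for the same parameter use different decay rates.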

Regarding Claim 9
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Kingma further teaches
- wherein calculating the gradient, computing the at least two estimates of the moment, and updating the model parameters are iteratively performed in response to a new training example. (§[Algorithm 1] “mt ← β1·mt−1 + (1 − β1)·gt (Update biased first moment estimate), vt ← β2·vt−1 + (1 − β2)·gt² (Update biased second raw moment estimate)”; Algorithm 1 shows iteration for a new training example)

Regarding Claim 10
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Kingma further teaches
- wherein the training example is provided in a streaming manner (§[4 Convergence Analysis] “We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich, 2003).”; “online” reads on “streaming manner”), wherein the model to be trained is updated each time a new training example arrives and the model is used to predict a value of the outcome based on input features (§[4] “At each time t, our goal is to predict the parameter θt”; §[6.1] “We examine the sparse feature problem using IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words.”; “movie review dataset” reads on “the training example including an outcome and features to explain the outcome”)
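The "online" interpretation applied above can be sketched, for illustration only, as a loop in which the model first predicts the outcome for each arriving example and is then updated before the next example arrives. The one-weight linear model, the squared-error loss, the plain gradient step, and the name `online_sgd` are all assumptions of this sketch and are not taken from the cited references.

```python
def online_sgd(stream, lr=0.1):
    """Process (feature, outcome) pairs one at a time.

    Returns the final weight and the prediction made for each example
    before the model saw that example's outcome.
    """
    w = 0.0
    predictions = []
    for x, y in stream:
        y_hat = w * x                    # predict with the current model
        predictions.append(y_hat)
        grad = 2.0 * (y_hat - y) * x     # gradient of (y_hat - y)^2 w.r.t. w
        w -= lr * grad                   # update as soon as the example arrives
    return w, predictions


# Hypothetical stream in which the outcome is always twice the feature.
final_w, preds = online_sgd([(1.0, 2.0)] * 50)
```

The sketch uses a plain gradient step to keep the loop short; a moment-based update would take the place of the line that modifies `w`.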

Regarding Claim 11
The combination of Kingma and Schaul teaches all of the limitations of claim 10 and Kingma further teaches
- wherein the input features include a plurality of elements representing past value fluctuations of the outcome observed over a predetermined period (§[6.1] “We examine the sparse feature problem using IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words.”; “sparse” implies “fluctuations” and “first 10,000” reads on “predetermined”).

Regarding Claim 12
The combination of Kingma and Schaul teaches all of the limitations of claim 10 and Kingma further teaches 
- wherein the input features include a plurality of elements related to the outcome. (§[6.1] “We examine the sparse feature problem using IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words.”; “feature vectors including the first 10,000 most frequent words” reads on “input features include a plurality of elements related to the outcome”)
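For illustration only, the bag-of-words pre-processing described in the quoted passage can be sketched as follows. The function names, the whitespace tokenization, and the alphabetical tie-breaking rule are assumptions of this sketch; the quoted passage states only that the most frequent words are used.

```python
from collections import Counter


def build_vocab(docs, k):
    """Return the k most frequent words across docs (ties broken alphabetically)."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:k]


def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in a single document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]


# Hypothetical miniature corpus standing in for the movie reviews.
docs = ["great great movie", "bad movie", "great acting"]
vocab = build_vocab(docs, k=2)
features = bow_vector("great movie great bad", vocab)
```

Each element of the resulting vector is a word count, so most entries are zero when the vocabulary is large and the document is short.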

Regarding Claim 13
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Kingma further teaches 
- wherein the gradient is a stochastic gradient of an objective function(§[Abstract] “We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions”) at an iteration step, wherein the objective function evaluates a loss between the outcome in the training example and a prediction done by the model with current values of the model parameters from the features in the training example (§[Algorithm 1] “f(θ): Stochastic objective function with parameters θ, Get gradients w.r.t. stochastic objective at timestep t”; it is well known that the objective function evaluates a loss between outcome and prediction in machine learning) and the training example includes a single training example or a group of training examples (§[2 Algorithm] “The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise.”)
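For illustration only, a stochastic gradient computed from a single training example or from a group of training examples can be sketched as follows. The linear model, the mean squared error loss, and the name `squared_loss_grad` are assumptions of this sketch and are not taken from the cited references.

```python
def squared_loss_grad(w, batch):
    """Gradient of the mean squared error of a linear model over a batch.

    w is a list of weights; batch is a list of (features, outcome) pairs,
    and may contain a single example or several.
    """
    grad = [0.0] * len(w)
    for x, y in batch:
        # Loss compares the model's prediction with the recorded outcome.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(batch)
    return grad


# Hypothetical data: two examples with two features each.
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
single_grad = squared_loss_grad([0.0, 0.0], batch[:1])
batch_grad = squared_loss_grad([0.0, 0.0], batch)
```

The batch gradient is the average of the per-example gradients, so the two cases differ only in how many examples contribute to the estimate.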

Regarding Claim 14
	Claim 14 is a computer system claim corresponding to the methods of claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. Note that Kingma teaches a memory and computer (§[1 Introduction] “We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients;”).

Regarding Claim 15
	Claim 15 is a computer system claim corresponding to the methods of claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 2. Note that Kingma teaches a memory and computer (§[1 Introduction] “We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients;”).

Regarding Claim 16
The combination of Kingma and Schaul teaches all of the limitations of claim 15 and Schaul further teaches
- wherein the processing circuitry is configured to update a first model parameter of the model by lesser amount than a second model parameter in response to the at least two estimates of the moment that is not fitting a component corresponding to the first model parameter(§[4.1][Algorithm 1] the first order moment gi is different from the previous gi-1 which reads on “inconsistency between at least two estimates” and gi affects learning rate which is a component for parameter update.) and in response to the at least two estimates of the moment is matching the component corresponding to the first model parameter, wherein the first model parameter is updated according to a value generated by combining respective components of the at least two estimates of the moment corresponding to the first model parameter. (§[4.1][Algorithm 1] the first order moment gi is calculated from the previous gi-1 which reads on “two estimates of the moment” and gi affects learning rate which is a component for parameter update.)


[Image: media_image3.png]


Regarding Claim 18
	Claim 18 is a computer system claim corresponding to the methods of claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 6. Note that Kingma teaches a memory and computer (§[1 Introduction] “We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients;”).

Regarding Claim 20
	Claim 20 is a program product claim corresponding to the methods of claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. 

Regarding Claim 21
Claim 21 is a program product claim corresponding to the methods of claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 2. 

Regarding Claim 23
The combination of Kingma and Schaul teaches all of the limitations of claim 21 and Kingma further teaches
- wherein the moment includes a first order moment and a second order moment as different types, wherein the first order moment represents average of the gradient and the second order moment scales individual learning rates for the model parameters of the model. (§[Algorithm 1]
[Image: media_image5.png]
; α is learning rate and vt is the second order moment; In the formula, α(learning rate) is scaled by the square root of vt(the second order moment) for the parameter.)


Regarding Claim 24
The combination of Kingma and Schaul teaches all of the limitations of claim 15 and Schaul further teaches
- wherein the computer is configured to update a first model parameter of the model by lesser amount than a second model parameter in response to the at least two estimates of the moment that is not fitting a component corresponding to the first model parameter(§[4.1][Algorithm 1] the first order moment gi is different from the previous gi-1 which reads on “inconsistency between at least two estimates” and gi affects learning rate which is a component for parameter update.) and in response to the at least two estimates of the moment is matching the component corresponding to the first model parameter, wherein the first model parameter is updated according to a value generated by combining respective components of the at least two estimates of the moment corresponding to the first model parameter. (§[4.1][Algorithm 1] the first order moment gi is calculated from the previous gi-1 which reads on “two estimates of the moment” and gi affects learning rate which is a component for parameter update.)


[Image: media_image3.png]


Claims 5, 7, 17, 19, 22 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Kingma in view of Schaul and further in view of Ginsburg et al. (“Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments”).
Regarding Claim 5
The combination of Kingma and Schaul teaches all of the limitations of claim 2 but does not appear to distinctly disclose
- wherein a first model parameter of the model is updated according to a maximum or a mean of components of the at least two estimates of the moment corresponding to the first model parameter.
	However, Ginsburg teaches
- wherein a first model parameter of the model is updated according to a maximum or a mean of components of the at least two estimates of the moment corresponding to the first model parameter. ([3. algorithm] 
[Image: media_image7.png]
; parameter is updated using a maximum of two estimates)
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the learning optimization system of Kingma and Schaul with the layer-wise moment of Ginsburg in order to achieve robustness to the learning rate. (Ginsburg [1. Introduction] “Our motivation for this work was to build an algorithm which: (1) performs equally well for image classification, speech recognition, machine translation, and language modeling, (2) is robust to learning rate (LR) and weight initialization, (3) has strong regularization properties.”)
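Purely to illustrate the claim language "a maximum or a mean of components of the at least two estimates of the moment", the following sketch combines two moment estimates component by component. The function name `combine_estimates` and the sample values are hypothetical; the sketch is not asserted to be the formula of Ginsburg or the implementation described in the Specification.

```python
def combine_estimates(est_a, est_b, mode="max"):
    """Combine two estimates of the same moment, one component per model parameter."""
    if mode == "max":
        return [max(a, b) for a, b in zip(est_a, est_b)]
    if mode == "mean":
        return [(a + b) / 2.0 for a, b in zip(est_a, est_b)]
    raise ValueError("mode must be 'max' or 'mean'")


# Hypothetical second moment estimates obtained with two different decay rates.
v_fast = [1.0, 4.0]
v_slow = [3.0, 2.0]
v_max = combine_estimates(v_fast, v_slow, mode="max")
v_mean = combine_estimates(v_fast, v_slow, mode="mean")
```

Taking the component-wise maximum yields the more conservative (smaller) step for each parameter when the result is used to scale a learning rate.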

Regarding Claim 7
The combination of Kingma and Schaul teaches all of the limitations of claim 1 and Schaul further teaches 
- wherein the moment includes a first order moment and a second order moment as different types and a first model parameter of the model is updated in a manner depending on inconsistency between at least two estimates of the first order moment in a component corresponding to the first model parameter (§[4.1][Algorithm 1] the first order moment gi is different from the previous gi-1 which reads on “inconsistency between at least two estimates” and gi affects learning rate which is a component for parameter update.)

[Image: media_image3.png]

The combination of Kingma and Schaul does not appear to distinctly disclose
- magnitude relationship between at least two estimates of the second order moment in the component.
	However, Ginsburg teaches
- magnitude relationship between at least two estimates of the second order moment in the component. ([3. algorithm] 
[Image: media_image7.png]
; parameter is updated using a maximum of two estimates of second order moments; maximum reads on “magnitude relationship” )
	The motivation to combine is the same as that given for claim 5.

Regarding Claim 17
	Claim 17 is a computer system claim corresponding to the methods of claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 5. Note that Kingma teaches a memory and computer (§[1 Introduction] “We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients;”).

Regarding Claim 19
	Claim 19 is a computer system claim corresponding to the methods of claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 7. Note that Kingma teaches a memory and computer (§[1 Introduction] “We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients;”).

Regarding Claim 22
Claim 22 is a program product claim corresponding to the methods of claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 5. 

Regarding Claim 25
Claim 25 is a program product claim corresponding to the methods of claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 7. 
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Casey R. Garner whose telephone number is 571-272-2467. The examiner can normally be reached on Monday to Friday, 8am to 5pm, Eastern Time.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached on 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Casey R. Garner/
Primary Examiner, Art Unit 2123