DETAILED ACTION
Claims 1-20 are pending in the present application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/02/2020 and 08/03/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 6, 8-10, 13, and 15-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ruder (RUDER, SEBASTIAN, "An Overview of Gradient Descent Optimization Algorithms", In repository of arXiv:1609.04747, June 15, 2017, 14 Pages.) in view of U.S. PGPubs 2020/0380369 to Case et al., further in view of U.S. PGPubs 2019/0034784 to Li et al.

Regarding claim 1, Ruder teaches a method in a system comprising a gradient optimizer, the method comprising: using the gradient optimizer, performing gradient optimization using the third set of momentum values and the fourth set of momentum values (section 4: gradient descent optimization is performed using gradient descent optimization algorithms that maintain two sets of momentum values (see sections 4.6 and 4.7)).
However, Ruder is silent as to a method in a system comprising a memory configured to store momentum values associated with a neural network model comprising L layers, wherein L is an integer greater than one, the method comprising: retrieving from the memory a first set of momentum values, corresponding to a layer of the neural network model, having a selected storage format and retrieving a second set of momentum values from the memory, corresponding to the layer of the neural network model, having the selected storage format.
In a related endeavor, Case et al. teach a method in a system comprising a gradient optimizer (par 0043-0046, par 0056, “a system performs backpropagation by at least computing a gradient and an optimization algorithm is used to learn from said computed gradient. In at least one embodiment, a stochastic gradient descent algorithm is a type of optimization algorithm. In at least one embodiment, a system computes a gradient estimate”) and a memory configured to store momentum values associated with a neural network model comprising L layers, wherein L is an integer greater than one (Fig. 1, par 0043, “a solver used in neural network training uses terms in addition to gradient to update weight information 102 (e.g., weight values). In at least one embodiment, weight information W_t 102 of a neural network is updated using momentum V_t 104 in a stochastic gradient descent solver: V_{t+1} = μV_t − α∇L(W_t) where W_t is weight at a step t, V_t is momentum at a step t, ∇L(W_t) is gradient with respect to weight which is a combination of derivative of each individual weight, and α and μ are scalar values”; par 0056, par 0065, “a system performs backpropagation by at least computing a gradient and an optimization algorithm is used to learn from said computed gradient. In at least one embodiment, stochastic gradient descent algorithm is a type of optimization algorithm. In at least one embodiment, a system computes a gradient with respect to a weight which is a combination of derivative of each individual weight: ∇L(W_t) where L( ) is a per-sample loss function. In at least one embodiment, a system computes a momentum update V_{t+1} = μV_t − α∇L(W_t) where W is weight, V is momentum, ∇L(W_t) is gradient with respect to weight which is a combination of derivative of each individual weight, α is a learning rate, and μ is a momentum coefficient”; par 0066-0067, a neural network, weight values, metadata, and momentum values are stored in memory), the method comprising:
retrieving from the memory a first set of momentum values, corresponding to a layer of the neural network model, having a selected storage format and retrieving a second set of momentum values from the memory, corresponding to the layer of the neural network model, having the selected storage format (par 0043-0046, par 0054, par 0061, data such as weight information, momentum data, metadata, and momentum coefficients are retrieved and used to train a neural network; par 0211, the data is stored in memory in a certain numerical format such as integer or floating point).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Ruder to include a method in a system comprising a memory configured to store momentum values associated with a neural network model comprising L layers, wherein L is an integer greater than one, the method comprising: retrieving from the memory a first set of momentum values, corresponding to a layer of the neural network model, having a selected storage format and retrieving a second set of momentum values from the memory, corresponding to the layer of the neural network model, having the selected storage format, as taught by Case et al., so that the memory stores the second moment (the uncentered variance), as taught by Ruder, together with the first moment (the mean), as taught by both references, in order to train a neural network with improved computational efficiency.
However, Ruder as modified by Case et al. does not explicitly teach converting the first set of momentum values having the selected storage format to a third set of momentum values having a training format associated with the gradient optimizer and converting the second set of momentum values having the selected storage format to a fourth set of momentum values having a training format associated with the gradient optimizer.
In a related endeavor, Li et al. teach converting the first set of momentum values having the selected storage format to a third set of momentum values having a training format associated with the gradient optimizer and converting the second set of momentum values having the selected storage format to a fourth set of momentum values having a training format associated with the gradient optimizer (par 0041-0051, par 0066, par 0085-0089, par 0130-0131, disclosing a dynamic conversion scheme that converts training parameters from floating point data to fixed point data to perform training of a neural network using a gradient descent algorithm).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Ruder as modified by Case et al. to include converting the first set of momentum values having the selected storage format to a third set of momentum values having a training format associated with the gradient optimizer and converting the second set of momentum values having the selected storage format to a fourth set of momentum values having a training format associated with the gradient optimizer, as taught by Li et al., which converts neural network parameters to a different numerical format for computation, so that training parameters including the first moment and second moment values are converted from a first numerical format to a second numerical format to perform gradient optimization for the neural network, in order to solve the technical problem of reducing computational and storage consumption when training a neural network.
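For illustration only, the sequence of steps mapped above for claim 1 (retrieve two sets of momentum values in a storage format, convert each to a training format, then perform gradient optimization) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the storage format is assumed to be 16-bit half precision, the training format is assumed to be an ordinary Python float, and the names memory, to_storage, to_training, adam_step, and optimize_layer are hypothetical. The update rule follows the Adam equations summarized in Ruder, section 4.6.

```python
import math
import struct

def to_storage(values):
    """Pack floats as 16-bit half precision (assumed storage format)."""
    return struct.pack(f"<{len(values)}e", *values)

def to_training(blob):
    """Unpack stored half-precision bytes into Python floats (assumed training format)."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

def adam_step(weights, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m is the first moment, v the second moment, t the step (t >= 1)."""
    new_w, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weights, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # decaying mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # decaying mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)             # bias-corrected second moment
        new_w.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Hypothetical per-layer memory holding both momentum sets in the storage format.
memory = {"layer0": {"m": to_storage([0.0, 0.0]), "v": to_storage([0.0, 0.0])}}

def optimize_layer(memory, layer, weights, grads, t):
    m = to_training(memory[layer]["m"])  # first set  -> third set (training format)
    v = to_training(memory[layer]["v"])  # second set -> fourth set (training format)
    return adam_step(weights, grads, m, v, t)

weights, m, v = optimize_layer(memory, "layer0", [1.0, -1.0], [0.5, -0.5], t=1)
```

The sketch keeps the conversion separate from the optimizer so that the optimizer itself only ever sees training-format values; writing the updated momentum values back to memory is intentionally left out.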

Regarding claim 2, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 1, and Li et al. further teach wherein the selected storage format comprises a reduced single-precision format (par 0057, par 0060, less than 32 bits). This would have been obvious for the same reason given in the rejection of claim 1.

Regarding claim 3, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 1, and Li et al. further teach wherein the training format comprises a single-precision format or a double-precision format (par 0057, 32 bits). This would have been obvious for the same reason given in the rejection of claim 1.

Regarding claim 6, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 1, and Ruder further teaches wherein performing gradient optimization comprises implementing an adaptive moment estimation algorithm (sections 4.6-4.8).

Regarding claim 8, Ruder as modified by Case et al. and Li et al. teaches a system, including a gradient optimizer, comprising: a memory configured to store momentum values associated with a neural network model comprising L layers, wherein L is an integer greater than one (Case et al.: par 0043-0046, par 0056). The remaining limitations of the claim are similar in scope to claim 1 and are rejected under the same rationale.

Regarding claims 9-10 and 13, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 8. Claims 9-10 and 13 are similar in scope to claims 2-3 and 6 and are rejected under the same rationale.

Regarding claim 15, Ruder as modified by Case et al. and Li et al. teaches a method in a system comprising a gradient optimizer and a memory configured to store weights and momentum values associated with a neural network model comprising L layers, wherein L is an integer greater than one (Case et al.: par 0043-0046, par 0056). The remaining limitations of the claim are similar in scope to claim 1 and are rejected under the same rationale.

Regarding claims 16-18, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 15. Claims 16-18 are similar in scope to claims 2-3 and 6 and are rejected under the same rationale.

Regarding claim 19, Ruder as modified by Case et al. and Li et al. teaches all the limitations of claim 15, and Li et al. further teach wherein the training format comprises 32-bit floating point format (par 0057, 32 bits) and wherein the storage format comprises 8-bit floating point format (par 0057, par 0060, less than 32 bits). This would have been obvious for the same reason given in the rejection of claim 1.

Allowable Subject Matter
Claims 4-5, 7, 11-12, 14, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
The cited prior art fails to teach the combination of elements recited in claim 4, including "wherein the converting the first set of momentum values having the selected storage format to the third set of momentum values having the training format comprises padding extra zero bits to form single-precision format momentum values".
The cited prior art fails to teach the combination of elements recited in claims 5 and 12, including "further comprising generating a fifth set of momentum values and a sixth set of momentum values for a next iteration of gradient optimization and prior to storing each of the fifth set of momentum values and the sixth set of momentum values converting each of the fifth set of momentum values and the sixth set of momentum values into the storage format by storing only the sign bit and seven most-significant bits associated with each of respective momentum values".
The cited prior art fails to teach the combination of elements recited in claims 7, 14, and 20, including "wherein the gradient optimizer is implemented using a field programmable gate array (FPGA), and wherein the gradient optimizer is configured to operate in a burst mode such that successive burst cycles result in streaming of gradients through the gradient optimizer".
The cited prior art fails to teach the combination of elements recited in claim 11, including "wherein the system is further configured to pad extra zero bits to form single-precision format momentum values or double-precision format momentum values".
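For illustration only, the two bit-level conversions quoted above (storing only the sign bit and seven most-significant bits, and padding extra zero bits to form a single-precision value) can be sketched as follows. This is one literal reading and not the applicant's implementation: it assumes an IEEE 754 single-precision bit pattern and assumes that the "seven most-significant bits" are the seven bits immediately following the sign bit. The function names are hypothetical.

```python
import struct

def float32_bits(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float32(bits):
    """Interpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_storage_8bit(x):
    """Keep only the top 8 bits: the sign bit and the next seven bits (assumed reading)."""
    return float32_bits(x) >> 24

def to_training_32bit(stored):
    """Pad 24 zero bits below the stored byte to rebuild a single-precision value."""
    return bits_to_float32((stored & 0xFF) << 24)

stored = to_storage_8bit(-2.0)        # 0xC0
restored = to_training_32bit(stored)  # -2.0
```

Under this reading the stored byte preserves the sign and a coarse magnitude only, so most values do not round-trip exactly; for example, 1.0 is stored as 0x3F and is restored as 0.5.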

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556. The examiner can normally be reached 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee M. Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JIN GE/
Primary Examiner, Art Unit 2616