Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 8 is objected to because of the following informalities: 
In line 11, “at least one gradient buffer” appears intended to read “the at least one gradient buffer.”
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.



Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Independent Claims 1, 8 and 15 recite the limitation “the new weights” at lines 11, 14 and 17, respectively, which lacks antecedent basis. The dependent claims are rejected on the same ground by virtue of their dependency.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Matthews, US 11,057,318 B1, in view of Chetlur, US 2021/0133583 A1.

Regarding Claim 1, Matthews teaches:
A method in a system comprising a memory configured to store weights associated with a neural network model comprising L layers, wherein L is an integer greater than one (C2, L1-23; C5, L4-6, 29-30: memory for storing compute data that includes gradients and weights of the multilayer neural network), 
a gradient optimizer (C2, L40-46; C16, L43-55: the gradient descent algorithm that is the gradient optimizer), 
and a plurality of workers, wherein each of the plurality of workers is configured to perform a forward pass and a backward pass on any one of the L layers associated with the neural network model, the method comprising (C6, L35-39, 54-63: compute/worker nodes performing forward and backward passes on the neural network): 
during a single burst cycle moving a first set of gradients, received from each of the plurality of workers, from at least one gradient buffer to the gradient optimizer and moving weights from at least one buffer, coupled to the memory, to the gradient optimizer (C5, L3-9; C18, L15-17; C50, L53-58; C51, L13-14: compute subsystem for buffering gradients and other compute data that would include the weights used in the gradient optimization, and performing processing/moving of this data in a given clock/burst cycle); 
during the single burst cycle writing back the new weights, calculated by the gradient optimizer, to the memory (C5, L29-31: storing the compute data that would include the weights in the compute memory. Examiner’s note: see also Chetlur, for example paragraph 73, updated weights are in a shared memory).
Although Matthews may indirectly teach the following limitation, and teaches the single burst cycle and the gradient optimizer as pointed out above, Chetlur is relied upon to show it more directly:
and during the single burst cycle transmitting the new weights, from the gradient optimizer, to each of the plurality of workers (paragraph 52: distributing/transmitting the updated/new weights among the workers). (Emphasis added).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the teachings of Chetlur with that of Matthews for transmitting the new weights to each of the plurality of workers.
The ordinary artisan would have been motivated to modify Matthews in the manner set forth above for the purpose of improving training of a machine learning model by allowing node weights to be updated in parallel [Chetlur: paragraph 52].

Regarding Claim 2, Matthews further teaches:
The method of claim 1 further comprising, during each of successive burst cycles, providing gradients from the at least one buffer to the gradient optimizer such that the successive burst cycles result in streaming of gradients through the gradient optimizer (C53, L21-25: processing the compute data, that includes the gradients, over a number of clock cycles. Examiner’s note: see also Chetlur, for example paragraph 329, processing performed over consecutive clock cycles).

Regarding Claim 3, Matthews further teaches:
The method of claim 1, wherein during each single burst cycle the gradient optimizer operates on a gradient burst, and wherein each gradient burst comprises a fixed number of gradients (C32, L51-55; C104, L5-7: the data sent back by each node, which includes the gradients, is of a fixed size. Examiner’s note: see also Chetlur, for example paragraph 69).

Regarding Claim 4, Matthews further teaches:
The method of claim 1, wherein during each single burst cycle the gradient optimizer operates on a gradient burst having a burst size related to a number of gradients in each gradient burst, and wherein the burst size is configurable (C32, L51-55; C32, L66 to C33, L5; C104, L5-7: the data sent back by each node, which includes the gradients, is of a fixed size, but this size can also be configured).

Regarding Claim 5, Matthews further teaches:
The method of claim 1, wherein the gradient optimizer is configured to operate only on gradients buffered in the at least one gradient buffer, and wherein gradients buffered in the at least one gradient buffer are discarded after processing by the gradient optimizer (C5, L4-6, L29-37: buffer storing the gradient that is overwritten/discarded after use).

Regarding Claim 6, Matthews further teaches:
The method of claim 1, wherein the system further comprises a reduction block coupled to provide reduced gradients to the gradient optimizer (Abstract; C4, L33-38: reduction operations on the gradients).

Regarding Claim 7, Chetlur further teaches:
The method of claim 1, wherein the gradient optimizer is configured to implement any one of a plurality of optimization algorithms, wherein the plurality of optimization algorithms comprises stochastic gradient descent (SGD) algorithm, SGD with momentum algorithm, and adaptive moment estimation algorithm (paragraphs 60, 103: stochastic gradient descent, SGD with momentum and Adam. Examiner’s note: see also the NPL of Kingma, section 4).

Claim 15 is similar to Claim 1 (the expanded limitations on the successive first and second burst cycles for transmitting gradients and fetching weights also being taught by Matthews, see C38, L15-18; C50, L53-58; C53, L21-25; C108, L54-59: processing the compute data, that would include the gradients and weights, over a number of clock cycles; see also Chetlur, for example paragraph 329, processing performed over consecutive clock cycles) and is rejected under the same rationale as stated above for that claim.
 
Regarding Claim 20, Matthews further teaches:
The method of claim 15, wherein the moving the first set of gradients from the at least one gradient buffer to the gradient optimizer and moving the fetched weights from the at least one buffer to the gradient optimizer is aligned in time (C93, L34-37: synchronization/alignment in time of transferring/moving data, that would include the gradients and weights, from the various compute nodes. Examiner’s note: see also Chetlur paragraphs 54, 302: synchronous data transfer).

Claims 8-14 are similar to Claims 1-7 respectively (the FPGA of Claim 9 also being taught by Matthews, see C11, L18-25; C31, L1-9; and Chetlur, see paragraph 311) and are rejected under the same rationale as stated above for those claims.
Claims 16-19 are similar to Claims 2-5 respectively and are rejected under the same rationale as stated above for those claims.

Examiner's Note:
The Examiner cites particular pages, sections, columns, line numbers, and/or paragraphs in the references as applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner, and the additional related prior art made of record that is considered pertinent to applicant's disclosure to further show the general state of the art. The Examiner's interpretations in parentheses are provided with the cited references to help the applicant better understand how the examiner interprets the prior art to read on the claims. Such comments are entirely consistent with the intent and spirit of compact prosecution.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892 for the relevant prior art pertaining to this application; for example, the NPL of Kingma teaches ADAM, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVE MISIR whose telephone number is (571)272-5243. The examiner can normally be reached M-R 8-5 pm, F some hours.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DAVE MISIR/Primary Examiner, Art Unit 2127