DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on October 21, 2019, January 28, 2021, and September 2, 2022 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Drawings
The drawings are objected to because (a) in Fig. 6, reference character 618, “tasks are executed” should be “tasks executed”, and (b) Figure 9 has reference characters oriented both horizontally and vertically, see 37 CFR § 1.84(p)(3).  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification
The disclosure is objected to because of the following informalities:
In paragraphs 2, 32, and 136, “such as, for example” should be merely “such as”.
In paragraph 5, “an example a” should be “an example of a”.
In paragraph 18, “multiple iterations … is performed” should be “multiple iterations … are performed”.
In paragraph 24, “factors including for example” should be “factors, including, for example”; “each of other worker” should be “each of the other worker”.
In paragraph 27, “arrangements enables” should be “arrangements enable”; “entries that stores” should be “entries that store”.
In paragraph 33, “multilayer perception” should be “multilayer perceptron”.
In paragraph 34, “x2, … xn” should be “x2, …, xn”.
In paragraph 45, “may include such as” should be merely “may include”.
In paragraph 46, “for example performing” should be “for example, performing”.
In paragraph 48, “receive data … combine with … generate output” should be “receives data … combines them with … generates output”.
In paragraph 55, “operation can be performed” should be “operations can be performed”.
In paragraph 56, “a forward propagations” should be “a forward propagation”; “cannot performed” should be “cannot be performed”.
In paragraph 58, “each of other worker” should be “each of the other worker”.
In paragraph 60, “systems that supports” should be “systems that support”.
In paragraph 63, “prior to next” should be “prior to the next”.
In paragraph 66, “forward full set” should be “forward a full set”.
In paragraph 67, “arrangements … allows” should be “arrangements … allow”.
In paragraph 69, “delaying … do not necessarily” should be “delaying … does not necessarily”; “operations starts … and ends” should be “operations start … and end”.
In paragraph 71, “including for example” should be “including, for example”.
In paragraph 72, “after” should be deleted from the first sentence.
In paragraph 74, “stores second plurality” should be “stores the second plurality”; “[a]fter second plurality … complete” should be “[a]fter the second plurality … is complete”.
Paragraph 83 states that Fig. 5E illustrates a state of a buffer when all exchange tasks have been completed, yet Fig. 5E itself contains multiple incomplete tasks.
In paragraph 84, “a neural network hardware” should be “neural network hardware”; “a second neural network” should be “a second neural network layer”.
In paragraphs 86, 88, and 90, “each of other” should be “each of the other”.
In paragraph 93, “different … than” should be “different … from”.
In paragraph 94, “nodes, at least” should be “nodes; at least”.
In paragraph 102, “data arrives” should be “data arrive”.
In paragraph 104, “data is a value” should be “data are values”.
In paragraph 106, “memory banks 714 can” should be “memory banks 714, can”; “communication fabric 710, to” should be “communication fabric 720 to”.
In paragraph 109, “include for example” should be “include, for example”; “that can operating” should be “that can operate”.
In paragraph 111, “for example one-dimensional” should be “for example, one-dimensional”.
In paragraph 129, “data was” (two instances) should be “data were”.
In paragraph 136, “computer(s), may” should be “computers may”.
Appropriate correction is required.
The use of the terms UNIX, LINUX, WINDOWS, MAC OS, iOS, and ANDROID (paragraphs 121 and 140), which are trade names or marks used in commerce, has been noted in this application. The terms should be accompanied by the generic terminology; furthermore, the terms should be capitalized wherever they appear or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the terms.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

Claim Objections
Examiner objects to claims 7-20.
Claims 7 and 17 are objected to because of the following informalities: a “second computer system” is recited, but not a “first computer system,” causing confusion as to how many computer systems are required by the claims.  Examiner will presume for purposes of examination that the “one or more hardware processors” of claim 7 and the “neural network processor” of claim 17 are the first computer system.
Claim 9 is objected to because of the following informalities:  the “and” should be deleted from the end of the store limitation and the colon at the end of the receive limitation should be a semicolon; furthermore, the indentation for the execute limitation and all subsequent limitations should be reduced by one tab.
Claim 10 is objected to because of the following informalities: “gradients complete” should be “gradients is complete”.
Claim 17 is objected to because of the following informalities: “processor is configured” should be “processor configured”; “gradients complete” should be “gradients is complete”.
Claim 19 is objected to because of the following informalities:  the “and” should be deleted from the end of the store limitation and the colon at the end of the receive limitation should be a semicolon; furthermore, the indentation for the retrieve limitation and all subsequent limitations should be reduced by one tab.  
Appropriate correction is required.
All claims dependent on a claim objected to hereunder are also objected to for being dependent on an objected-to base claim.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 6-8, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Huo et al., “Decoupled Parallel Backpropagation with Convergence Guarantee,” in Int’l Conf. Machine Learning 2098-2106 (2018) (“Huo”) in view of Javadi et al. (US 20200053029) (“Javadi”).
Regarding claim 1, Huo discloses “[a] method of training a neural network model in a distributed system, the distributed system comprising a first worker node and a second worker node, the neural network model comprising a first neural network layer and a second neural network layer, the method being performed by the first worker node1 (Huo Fig. 2 and accompanying text show that a multilayer feedforward neural network can be split into three modules, where each module is a stack of layers; module A [first worker node] is depicted as performing a backward pass using a stale error gradient $\delta_A^{t-2}$ and a forward pass of $h_A^{t}$ to module B; see also sec. 1, second paragraph (disclosing that each module can be instantiated as a separate GPU [worker node])) and comprising:
performing backward propagation computations for the second neural network layer to generate second layer data gradients and second layer weight gradients (Huo Fig. 2 discloses that module C generates error gradients δB [second layer gradients] for module B of the neural network [which contains a second layer assuming a three-layer network]; sec. 3.1 discloses that backward computation in each module comprises computation of an error gradient and a weight gradient);
generating a first plurality of exchange tasks each corresponding to an exchange of a portion of the second layer weight gradients with the second worker node (Huo Fig. 2 discloses that module C [first worker node] sends the error gradients $\delta_B^{t-1}$, $\delta_B^{t}$, etc. to module B [second worker node] by time step [each exchange constituting a separate exchange task of a portion of the second layer gradients δB]);
executing a first exchange task of the first plurality of exchange tasks to exchange a first portion of the second layer weight gradients with the second worker node (Huo Fig. 2 discloses that module C [first worker node] sends the error gradients $\delta_B^{t-1}$, $\delta_B^{t}$, etc. to module B [second worker node] by time step [each exchange constituting a separate exchange task of a portion of the second layer gradients δB; e.g., module C executes a first exchange task to send $\delta_B^{t-1}$ to module B]);
performing backward propagation computations for the first neural network layer based on the second layer data gradients to generate first layer data gradients and first layer weight gradients (Huo Fig. 2 discloses that module B calculates error gradients δA for module A [containing a first layer of the neural network]; sec. 3.1 discloses that the weight gradients $\frac{\partial f}{\partial w}$ for each layer are a function of the data gradients $\frac{\partial f}{\partial h}$, that the data gradients are a function of the data gradients for a subsequent layer [i.e., the weight gradients for layer l – 1 are a function of the data gradients for layer l], and that both weight gradients and data gradients are computed in each module);
generating a second plurality of exchange tasks each corresponding to an exchange of a portion of the first layer weight gradients with the second worker node (Huo Fig. 2 discloses that first layer error gradients δA are sent [exchanged] from module B [second worker node] to module A and that the exchange tasks are divided by time step into $\delta_A^{t-2}$, $\delta_A^{t-1}$, etc.); …
updating weights for the first neural network layer based on … exchanged first layer weight gradients (in a backward pass, all modules except the last one have delayed error gradients in store such that they can execute the backward computation without locking; the last module updates with the up-to-date gradients – Huo, sec. 3.1; Figure 2 and accompanying text show that, in one iteration, module A [containing the first layer] can perform a backward pass [weight update] using the stale error gradient $\delta_A^{t-2}$ [first layer weight gradient]);
performing, by the first worker node, forward propagation computations for the first neural network layer based on the updated weights (after a data sample is input to the network, a forward pass is run from the first module [i.e., module A, the first worker node] to the last module – Huo, sec. 3.1; see also Fig. 2 (displaying that the forward passes and backward passes are cyclical in nature, so the forward propagation of $h_A^{t}$ from module A to module B after updating would be based on the weights updated with error gradient $\delta_A^{t-2}$));
executing the remaining exchange tasks of the first plurality of exchange tasks to exchange the remaining portions of the second layer weight gradients with the second worker node (Huo Fig. 2 and accompanying text disclose that the error gradients $\delta_B^{t}$ [remaining portions of second layer weight gradients] are transmitted from module C to module B [second worker node] after transmission of the first set of second weight gradients $\delta_B^{t-1}$); and
updating weights for the second neural network layer based on the exchanged second layer weight gradients (Huo sec. 3.1 discloses that the backward computation in each module of the weight gradients is based on a data gradient for that layer and that the data gradient for each layer is based on the data gradient for a subsequent layer [so that the updated weights for module B [containing the second neural network layer], for instance, would be based on, inter alia, the error gradients $\delta_B^{t}$ passed from module C to module B, as shown in Fig. 2 and accompanying text]).”
Huo appears not to disclose explicitly the further limitations of the claim. However, Javadi discloses “after the execution of the first exchange task completes, executing the second plurality of exchange tasks to exchange the first layer weight gradients with the second worker node (in a data plane forwarding circuit that has a parameter collecting circuit to store and distribute parameter values computed by several machines in a network, each machine learning machine sends a data message to the data plane with 32 weight gradients identified by the numbers 0-31 and the letter associated with the ML machine; after collecting the weight gradients from all of the ML machines, the data plane sends four messages to each of the ML machines, including message 705 containing weight gradients A0-L7, message 710 including weight gradients A8-L15, message 715 including weight gradients A16-L23, and message 720 including weight gradients A24-A31 – Javadi, paragraphs 77-78 and Figs. 7-8 [first layer weight gradients = weight gradients other than those transmitted by the receiving ML model corresponding to the first layer, first exchange task = transmission of an ML machine’s weight gradients to the data plane, second worker node = any one of the ML systems other than the one that executes the exchange of its own weight gradients with the data plane])….”
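For illustration only, the collection-and-redistribution pattern attributed to Javadi above can be sketched as follows. This is a minimal sketch, not Javadi's implementation: it assumes twelve ML machines (inferred from the labels A through L), assumes each returned message covers all machines for one contiguous range of eight gradient indices, and uses a hypothetical helper name `build_messages`.

```python
from string import ascii_uppercase

# Assumptions for illustration: twelve machines labeled "A".."L", each
# reporting 32 weight gradients numbered 0-31, returned in four messages.
MACHINES = ascii_uppercase[:12]
GRADIENTS_PER_MACHINE = 32
MESSAGES = 4


def build_messages(machines=MACHINES, n=GRADIENTS_PER_MACHINE, k=MESSAGES):
    """Group collected gradient labels into k messages by contiguous index range."""
    per_message = n // k
    messages = []
    for m in range(k):
        indices = range(m * per_message, (m + 1) * per_message)
        # Each message holds, for every machine, the gradients in this index range.
        messages.append([f"{name}{i}" for name in machines for i in indices])
    return messages


messages = build_messages()
# 705, 710, 715, 720 are the message reference numerals cited from Javadi.
for numeral, message in zip((705, 710, 715, 720), messages):
    print(numeral, message[0], "...", message[-1], len(message))
```

Under these assumptions, each of the four messages carries 96 gradient labels, and the first message runs from A0 to L7, matching the range cited for message 705.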
Javadi and the instant application both relate to distributed machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huo to transmit other weight gradients to a second worker node after collecting first weight gradients, as disclosed by Javadi, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the system to distribute computing operations to each individual computer system in accordance with its individual powers, thus allowing operations to be performed at faster rates.  See Javadi, paragraph 1.
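As a reading aid, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal sketch that assumes the steps are performed in the order recited; the names `split_into_tasks`, `training_step`, and `exchange` are hypothetical, the gradient values are placeholders, and nothing here is taken from the application's specification, Huo, or Javadi.

```python
from collections import deque


def split_into_tasks(layer, weight_grads, portions):
    """Create one exchange task per contiguous portion of a layer's weight gradients."""
    size = -(-len(weight_grads) // portions)  # ceiling division
    return deque(
        (layer, weight_grads[i:i + size]) for i in range(0, len(weight_grads), size)
    )


def training_step(exchange, grads2, grads1, portions=4):
    """Walk through the recited steps; `exchange` stands in for the network transfer."""
    log = ["backward layer 2"]
    tasks2 = split_into_tasks(2, grads2, portions)   # first plurality of exchange tasks
    exchange(*tasks2.popleft())                      # first exchange task only
    log.append("exchange layer 2 (first portion)")
    log.append("backward layer 1")
    tasks1 = split_into_tasks(1, grads1, portions)   # second plurality of exchange tasks
    while tasks1:                                    # runs after the first task completes
        exchange(*tasks1.popleft())
    log.append("exchange layer 1 (all portions)")
    log.append("update layer 1 weights")
    log.append("forward layer 1")
    while tasks2:                                    # remaining layer 2 exchange tasks
        exchange(*tasks2.popleft())
    log.append("exchange layer 2 (remaining portions)")
    log.append("update layer 2 weights")
    return log


sent = []
log = training_step(
    lambda layer, part: sent.append((layer, len(part))),
    grads2=list(range(8)),
    grads1=list(range(8)),
)
print([layer for layer, _ in sent])
```

The point of the ordering is that the lower layer's gradients are exchanged and applied before the remainder of the higher layer's gradients, so that the next forward pass through the first layer need not wait for the full second layer exchange.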

Regarding claim 2, Huo, as modified by Javadi, discloses that “the backward propagation computations for the first neural network layer are performed in parallel with the exchange of the first portion of the second layer weight gradients with the second worker node (Huo Fig. 2 and accompanying text disclose that the gradients $\delta_B^{t-1}$ [portion of second layer weight gradients] are transmitted from module C to module B [second worker node] at the same time as the computation of the weight updates of module A [containing the first neural network layer] based on the gradients $\delta_A^{t-2}$; see also sec. 1, second paragraph (disclosing that each module may correspond to a separate GPU), sec. 3.1 (disclosing that parallel updating may be achieved in the backward pass)).”

Regarding claim 6, Huo, as modified by Javadi, discloses “performing forward propagation computations for the first neural network layer based on the updated weights for the first neural network layer (after a data sample is input to the network, a forward pass is run from the first module [i.e., module A, containing the first neural network layer] to the last module – Huo, sec. 3.1; see also Fig. 2 (displaying that the forward passes and backward passes are cyclical in nature, so the forward propagation of $h_A^{t}$ from module A to module B after updating would be based on the weights updated with error gradient $\delta_A^{t-2}$)),
wherein the forward propagation computations for the first neural network layer are performed in parallel with the exchange of at least some of the remaining portions of the second layer weight gradients (Huo Fig. 2 and accompanying text disclose that the gradients $\delta_B^{t}$ [second layer weight gradients] are transmitted from module C to module B [containing the second layer] for the same time step t as the activations $h_A^{t}$ forward propagated from module A to module B; see also sec. 1, second paragraph (disclosing that each module may correspond to a separate GPU), sec. 3.1 (disclosing that parallel updating may be achieved in the backward pass)).”
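The time-step relationships relied upon in the citations to Huo Fig. 2 can be sketched as follows. This is a minimal sketch under stated assumptions: three modules A, B, and C as described above, with the staleness of the gradient consumed by a module equal to the number of modules above it. That rule reproduces the cited $\delta_A^{t-2}$ for module A and the up-to-date gradient for the last module; the value for module B is an assumption by the same pattern, and the function name is hypothetical.

```python
MODULES = ["A", "B", "C"]  # lowest module first, last module at the end


def gradient_timestamp(module, t, modules=MODULES):
    """Return the time step of the error gradient a module consumes at iteration t."""
    # Assumed rule: one step of staleness for each module above this one.
    delay = len(modules) - 1 - modules.index(module)
    return t - delay


for name in MODULES:
    print(name, gradient_timestamp(name, t=10))
```

Because each module other than the last works from a stored, delayed gradient, no module has to wait for the module above it to finish the current iteration, which is the basis for the parallelism discussed for claims 2 and 6.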

Regarding claim 7, Huo discloses “[a] non-transitory computer readable medium storing instructions that, when executed by one or more hardware processors, cause the one or more hardware processors (deep neural network may be split into modules [instructions, stored in a non-transitory computer readable medium] and distributed across multiple GPUs [processors] – Huo, sec. 1, second paragraph) to:
perform backward propagation computations for a second layer of a neural network to generate second weight gradients (Huo Fig. 2 discloses a multilayer feedforward neural network split into three modules, where each module is a stack of layers; module C is depicted as transmitting error gradients δB [second weight gradients] to module B [assuming a three-layer neural network, module B corresponds to the second layer; thus, module C is computing the error gradients for the second layer]);
split the second weight gradients into portions (Huo Fig. 2 shows that the error gradients for each layer are split up by time step, so that the error gradients δB are split up into gradients $\delta_B^{t}$ and $\delta_B^{t-1}$);
cause a hardware interface to exchange a first portion of the second weight gradients with a second computer system (Huo Fig. 2 shows that $\delta_B^{t-1}$ [first portion of second weight gradients] is transmitted from module C to module B [second computer system]; sec. 1, second paragraph discloses that each module may correspond to a separate GPU [containing a hardware interface]);
perform backward propagation computations for a first layer of the neural network to generate first weight gradients when the exchange of the first portion of the second weight gradients is underway, the first layer being a lower layer than the second layer in the neural network (Huo Fig. 2 shows that module B calculates error gradients δA [first weight gradients] and sends them to module A [corresponding to a first layer of the neural network, lower than the second layer]; sec. 3.1 discloses that all modules except the last one have delayed error gradients in store such that they can execute computation without locking; the last module updates with the up-to-date gradients [i.e., the computation of error gradients $\delta_A^{t-2}$ takes place concurrently with the transmission of the gradients $\delta_B^{t-1}$]); … [and]
after the transmission of the first weight gradients completes, cause the hardware interface to transmit the remaining portions of the second weight gradients to the second computer system (Huo Fig. 2 discloses that after transmission of error gradients $\delta_A^{t-2}$ [first weight gradients] to module A, module C transmits error gradients $\delta_B^{t}$ [remaining portions of second weight gradients] to module B [second computer system]).”
Huo appears not to disclose explicitly the further limitations of the claim.  However, Javadi discloses “after transmission of the first portion of the second weight gradients completes, caus[ing] the hardware interface to transmit the first weight gradients to the second computer system (in a data plane forwarding circuit that has a parameter collecting circuit to store and distribute parameter values computed by several machines in a network, each machine learning machine sends a data message to the data plane with 32 weight gradients [first portion of second weight gradients] identified by the numbers 0-31 and the letter associated with the ML machine; after collecting the weight gradients from all of the ML machines, the data plane sends four messages [via a hardware interface] to each of the ML machines, including message 705 containing weight gradients A0-L7, message 710 including weight gradients A8-L15, message 715 including weight gradients A16-L23, and message 720 including weight gradients A24-L31 – paragraphs 77-78 and Figs. 7-8 [first weight gradients = all weight gradients other than those transmitted by the receiving ML model, second computer system = any one of the ML systems other than the one that transmits the first portion of the second weight gradients, so, for instance, weight gradients I0-I31 transmitted from ML machine I to the data plane are the first portion of the second weight gradients, and the first weight gradients are any combination of the weight gradients not including I0-I31 sent to, say, second computer system ML machine H])….”
Javadi and the instant application both relate to distributed machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huo to transmit other weight gradients to a second computer system after collecting first weight gradients, as disclosed by Javadi, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the system to distribute computing operations to each individual computer system in accordance with its individual powers, thus allowing operations to be performed at faster rates.  See Javadi, paragraph 1.
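For illustration only, the transmission ordering recited in the claim as mapped above can be sketched as follows. This is a minimal sketch, not the applicant's or either reference's implementation; the function names `split_into_portions` and `exchange_schedule`, the flat-list gradient representation, and the default of two portions are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def split_into_portions(gradients: Sequence[float], num_portions: int) -> List[List[float]]:
    """Split a flat gradient list into contiguous, near-equal portions (hypothetical helper)."""
    size, extra = divmod(len(gradients), num_portions)
    portions, start = [], 0
    for i in range(num_portions):
        # The first `extra` portions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        portions.append(list(gradients[start:end]))
        start = end
    return portions


def exchange_schedule(
    second_grads: Sequence[float],
    first_grads: Sequence[float],
    num_portions: int = 2,
) -> List[Tuple[str, List[float]]]:
    """Return the order in which gradient payloads are sent to the second computer system."""
    portions = split_into_portions(second_grads, num_portions)
    # 1. First portion of the second-layer gradients goes out first.
    schedule = [("second", portions[0])]
    # (Backward propagation for the first layer would run while that exchange is underway.)
    # 2. Once the first portion has been sent, the first-layer gradients are sent.
    schedule.append(("first", list(first_grads)))
    # 3. The remaining second-layer portions follow.
    schedule.extend(("second", p) for p in portions[1:])
    return schedule
```

Under these assumptions, `exchange_schedule([1, 2, 3, 4], [9, 8])` yields the second-layer portion `[1, 2]`, then the first-layer gradients `[9, 8]`, then the remaining second-layer portion `[3, 4]`.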

Regarding claim 18, Huo, as modified by Javadi, discloses that “the second weight gradients are generated before the first weight gradients (see Huo Fig. 2 and note that the gradients are generated in multiple time steps, such that there are certain of the second gradients $\delta_B^{t-2}$, for instance, that would be generated before certain of the first gradients $\delta_A^{t-1}$).”

Regarding claim 116, Huo, as modified by Javadi, discloses “instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to:
perform forward propagation computations for the first layer (Huo Fig. 2 and accompanying text disclose that a forward pass [forward propagation] is executed among modules A-C, and in particular a set of activations $h_A^{t}$ is passed from module A [containing the first layer] to module B); and
cause a hardware interface to transmit at least a portion of the second weight gradients to the second computer system when performing the forward propagation computations for the first layer (Huo Fig. 2 discloses that the gradients $\delta_B^{t}$ [portion of second weight gradients] are transmitted from module C to module B [second computer system] for the same time step t as the activations $h_A^{t}$ passed from module A to module B; see also sec. 1, second paragraph (disclosing that each module may correspond to a separate GPU [containing a hardware interface])).”

Regarding claim 17, Huo discloses “[a]n apparatus, comprising:
a neural network processor (Huo sec. 3.2 discloses that when there are multiple modules of a neural network system, the network can be distributed across multiple GPUs [processors]);
a hardware interface (Huo sec. 3.2 discloses that when there are multiple modules of a neural network system, the network can be distributed across multiple GPUs [processors]; Fig. 2 and accompanying text disclose that error gradients and activations can be passed from one module to another [i.e., there is a hardware interface within each module that allows this transfer of data]);
an integrated circuit comprising a weight gradients splitter and an exchange processor (Huo Fig. 2 and accompanying caption disclose that, for instance, module C transmits error gradients δB in timesteps $\delta_B^{t}$, $\delta_B^{t-1}$, etc. [suggesting that module C contains circuitry to split the error gradients by timestep – i.e., has a weight gradients splitter] and that those error gradients are transmitted to module B [suggesting that module C contains circuitry to transmit the weight gradients to module B – i.e., an exchange processor]; see also sec. 1, second paragraph (disclosing that each module can be instantiated as a separate GPU [integrated circuit])); and
a controller (Huo Fig. 2 and accompanying caption and sec. 3.1 disclose that each module performs splitting of the error gradients by timestep, backpropagation computations, and exchange of weight gradients [all of which are functions attributed to the claimed controller infra]; see also sec. 1, second paragraph (disclosing that each module can be instantiated as a separate GPU [which performs the claimed functions and thus contains the claimed controller])) configured to:
control the neural network processor … configured to perform backward propagation computations for a second layer of a neural network to generate second weight gradients (Huo Fig. 2 and accompanying text disclose that gradients δB [second weight gradients] are sent from module C to module B, where each module is a stack of layers of a neural network [i.e., in the case of a three-layer network, module B holds the second layer, so the gradients δB are for the second layer]; sec. 3.1 discloses backpropagation by computation in each module of weight gradients and error gradients);
control the weight gradients splitter to split the second weight gradients into portions (Huo Fig. 2 discloses that module C sends gradients δB to module B in stages $\delta_B^{t}$, $\delta_B^{t-1}$, etc. [suggesting that the GPU that executes module C has a gradients splitter to send the gradients to module B incrementally instead of in batches]);
control, via the exchange processor, the hardware interface to exchange a first portion of the second weight gradients with a second computer system (Huo Fig. 2 and accompanying text disclose that gradients $\delta_B^{t-1}$ [first portion of the second weight gradients] are transmitted [exchanged] from module C to module B [second computer system]; see also sec. 1, second paragraph (disclosing that each module can comprise a separate GPU [containing an exchange processor for performing the transfer]));
control the neural network processor to perform backward propagation computations for a first layer of the neural network to generate first weight gradients when the exchange of the first portion of the second weight gradients is underway, the first layer being a lower layer than the second layer in the neural network (Huo Fig. 2 shows that module B calculates error gradients δA [first weight gradients] and sends them to module A [corresponding to a first layer of the neural network, lower than the second layer]; sec. 3.1 discloses that all modules except the last one have delayed error gradients in store such that they can execute computation without locking; the last module updates with the up-to-date gradients [i.e., the computation of error gradients $\delta_A^{t-2}$ takes place concurrently with the transmission of the gradients $\delta_B^{t-1}$]); … [and]
after the transmission of the first weight gradients [is] complete, control, via the exchange processor, the hardware interface to transmit the remaining portions of the second weight gradients to the second computer system (Huo Fig. 2 discloses that after transmission of error gradients $\delta_A^{t-2}$ [first weight gradients] to module A, module C transmits error gradients $\delta_B^{t}$ [remaining portions of second weight gradients] to module B [second computer system]).”
Huo appears not to disclose explicitly the further limitations of the claim.  However, Javadi discloses “after transmission of the first portion of the second weight gradients completes, control[ling], via the exchange processor, the hardware interface to transmit the first weight gradients to the second computer system (in a data plane forwarding circuit that has a parameter collecting circuit to store and distribute parameter values computed by several machines in a network, each machine learning machine sends a data message to the data plane with 32 weight gradients [first portion of second weight gradients] identified by the numbers 0-31 and the letter associated with the ML machine; after collecting the weight gradients from all of the ML machines, the data plane [exchange processor] sends four messages [via a hardware interface] to each of the ML machines, including message 705 containing weight gradients A0-L7, message 710 including weight gradients A8-L15, message 715 including weight gradients A16-L23, and message 720 including weight gradients A24-L31 – paragraphs 77-78 and Figs. 7-8 [first weight gradients = all weight gradients other than those transmitted by the receiving ML model, second computer system = any one of the ML systems other than the one that transmits the first portion of the second weight gradients, so, for instance, weight gradients I0-I31 transmitted from ML machine I to the data plane are the first portion of the second weight gradients, and the first weight gradients are any combination of the weight gradients not including I0-I31 sent to, say, second computer system ML machine H])….”  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huo to transmit other weight gradients to a second computer system after collecting first weight gradients, as disclosed by Javadi, and an ordinary artisan could reasonably expect to have done so successfully.
Doing so would allow the system to distribute computing operations to each individual computer system in accordance with its individual powers, thus allowing operations to be performed at faster rates.  See Javadi, paragraph 1.  
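For illustration only, the grouping of collected weight gradients into four outgoing messages, as characterized above with respect to Javadi, can be sketched as follows. This is a minimal sketch under stated assumptions: the count of twelve ML machines is inferred from the letters A through L, the ordering of labels within a message is a guess, and the name `build_messages` is hypothetical and does not appear in Javadi.

```python
import string
from typing import List

# Assumption: twelve ML machines, labeled "A" through "L".
MACHINES = string.ascii_uppercase[:12]
GRADIENTS_PER_MACHINE = 32  # each machine contributes gradients numbered 0-31
NUM_MESSAGES = 4            # the data plane sends four messages to each machine


def build_messages() -> List[List[str]]:
    """Group labels such as "A0" or "L7" into four messages by gradient index range."""
    per_message = GRADIENTS_PER_MACHINE // NUM_MESSAGES  # 8 indices per message
    messages = []
    for m in range(NUM_MESSAGES):
        lo, hi = m * per_message, (m + 1) * per_message
        # Assumed ordering: machine by machine, then ascending gradient index.
        messages.append([f"{machine}{i}" for machine in MACHINES for i in range(lo, hi)])
    return messages
```

Under these assumptions, the first message spans labels A0 through L7 and each message carries 96 labels (12 machines times 8 indices).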

Regarding claim 118, Huo, as modified by Javadi, discloses that “the controller is configured to control the neural network processor to perform forward propagation computations for the second layer in parallel with the transmission of at least a part of the remaining portions of the second weight gradient to the second computer system (Huo Fig. 2 and accompanying text disclose that the gradients $\delta_B^{t}$ [portion of second weight gradients] are transmitted from module C to module B [second computer system] for the same time step t as the activations $h_A^{t}$ forward propagated from module A to module B; see also sec. 1, second paragraph (disclosing that each module may correspond to a separate GPU [containing a hardware interface]), sec. 3.1 (disclosing that parallel updating may be achieved in the backward pass)).”

Allowable Subject Matter
Claims 3-5, 9-15, and 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RYAN C VAUGHN whose telephone number is (571)272-4849. The examiner can normally be reached M-R 7a-5:30p ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at 571-272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RYAN C VAUGHN/
Examiner, Art Unit 2125

        1 Examiner does not read this limitation as requiring that the entire method be performed by the first worker node.  As long as the method is performed using the first worker node, this will be deemed by Examiner to satisfy the limitation even if some portions of the method are performed using other worker nodes.