DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-2, 4, 7-8, 10, 14-15, 17, and 20-25 have been amended. Claims 1-25 are pending and have been examined.

Response to Arguments/Amendments
The claim amendments have overcome the prior rejection under 35 USC § 101, which has been withdrawn accordingly.
Applicant’s arguments, see p. 10, filed 4/20/2022, with respect to the rejections of claims 1, 7, 14, and 20 under 35 USC § 103 have been fully considered and are persuasive. Therefore, the rejections have been withdrawn. However, upon further consideration, new grounds of rejection are made in view of Tanaka et al., U.S. Patent Application Publication 20210357760.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-25 are rejected under 35 U.S.C. 103 as being unpatentable over Malaya (US 20190391850 A1) in view of Renggli (Renggli et al., SPARCML: High-Performance Sparse Communication for Machine Learning, February 22, 2018, arXiv, v1) and Tanaka (Tanaka et al., U.S. Patent Application Publication 20210357760).


With respect to Claim 1, Malaya teaches A machine learning system, (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata.”)
comprising: 
memory; and logic communicatively coupled to the memory and a neural network (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata.” para. [0022], “FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a kernel mode driver 122, and applications 126. These control logic modules control various features of the operation of the processor 102 and the APD 116.”)
during a … phase associated with the neural network, overlap execution of a first layer of the neural network with transmission of a first message,… (See Malaya, ¶ 0039-0040, e.g. “Convolutions in a neural network are a local operation, in that only the output from a few neurons is necessary to compute some of the neurons in a subsequent layer. As a result, computations in the subsequent layer can progress in parallel without waiting for all the neuron computations to be finished in the current layer. … Modern DNNs have many hidden layers (hundreds or even thousands) and so it could be advantageous to begin computations many layers deep before finishing the computations in a single layer.” Also see ¶ 0048, “Once the scheduler has an entire graph of anticipated computations and the metadata, the scheduler uses a heuristic to schedule runs across the available computational resources to ensure efficient processing.”) Malaya’s scheduler provides messages for execution. 
Malaya does not expressly disclose: during backward propagation … wherein the first message is associated with a first collective operation associated with a second layer of the neural network; and. However, this is taught by Renggli and Tanaka. (Renggli, section 5 on p. 11, e.g. “We also implement the previous algorithms in a non-blocking way. Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way. This enables the thread to proceed with local computations while the operation is performed in the background. For deep networks, we can nicely overlap communication and computation during the gradient aggregation phase by calling the aggregation per layer in a non-blocking fashion. As of MPI-3, implementations support nonblocking collective operations.”) Renggli does not expressly disclose backward propagation. However, this is taught by Tanaka. (Tanaka, ¶ 0110, e.g. “As described above, the GPU 103 that realizes the forward propagation calculation unit 11 and the backpropagation calculation unit 12 is a device that can execute a plurality of processing in parallel with each other.” ¶ 0112, e.g. “Meanwhile, in the backpropagation calculation, the gradients are output by performing calculation in the order of the output layer, the middle layer, and the input layer. Therefore, the adjustment unit 16 according to this embodiment changes the order of the gradients on which the Allreduce processing has been performed that are input to the forward propagation calculation unit 11 to an order of the input layer, the middle layer, and the output layer.” Also note ¶ 0065, e.g. “Allreduce processing apparatus 2 … reduces the gradients for the entirety of the output layers, and returns the reduced gradients of the output layers to the computers 1-0 to 1-2.” Thus, Allreduce is regarded as a collective operation.) 
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the logic to overlap messages for a current layer of the neural network with messages of one or more prior layers of the neural network in a backward propagation phase in order to efficiently exchange messages for neural network training (Renggli, Introduction). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Malaya’s neural network with Tanaka’s backpropagation in order to provide distributed deep learning at higher speed, as suggested by Tanaka (Tanaka ¶ 0023).
during a … propagation phase associated with the neural network, … (Malaya, ¶ 0039, “As a result, computations in the subsequent layer can progress in parallel without waiting for all the neuron computations to be finished in the current layer.”)
Malaya does not expressly disclose: during forward propagation … provide a second message related to a second collective operation for the neural network, wherein the second collective operation is an operation of a third layer of the neural network that was not completed during the backward propagation phase. However, this is taught by Renggli and Tanaka. (Renggli, section 5 on p. 11, e.g. “Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way. This enables the thread to proceed with local computations while the operation is performed in the background. For deep networks, we can nicely overlap communication and computation during the gradient aggregation phase by calling the aggregation per layer in a non-blocking fashion.”) As described by Renggli, collective operations are nonblocking, which means that network computation continues without having to wait for the collective operations to return. While generally describing deep networks, Renggli does not expressly disclose forward propagation. However, this is further taught by Tanaka. (Tanaka, ¶ 0081, e.g. “In parallel with the above, the GPU 103 acquires the outputs for the layers on which the Allreduce processing has been performed from the main memory 102 and executes the forward propagation calculation.” ¶ 0087, e.g. “The communication units 15 transmit the gradients of the output layers to the Allreduce processing apparatus 2 first.” ¶ 0092, e.g. “The forward propagation calculation units 11 perform the forward propagation calculation by using new learning data 2, 4, and 6 and the updated weights of the layers as inputs.” ¶ 0093, e.g. “the backpropagation calculation and the Allreduce processing can be executed in parallel with each other, and hence the waiting time from the backpropagation calculation to the start of the forward propagation calculation can be decreased and the distributed deep learning processing can be performed at a higher speed.” ¶ 0106, e.g. “the Allreduce processing and the forward propagation calculation are executed in parallel with each other.” ¶ 0110, e.g. “Therefore, the GPU 103 can execute the forward propagation calculation while acquiring the gradient information for each layer on which the Allreduce processing has been performed from the storage unit 14 that is a main memory of each of the computers 1.”) Each of Tanaka’s forward propagation layers must wait for the Allreduce collective processing and updated weights of that respective layer. Once the collective operation for that particular layer is complete, parallel execution of that layer can begin. That is, forward propagation for one layer may proceed while the Allreduce collectives and weight updates that began during backpropagation of other layers are still pending.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the embedding of one or more trigger operations in one or more messages related to collective operations for the neural network in order to efficiently exchange messages for neural network training (Renggli, Introduction). It would also have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Malaya’s parallel neural network with the collective operations and parallel forward propagation of Renggli and Tanaka in order to provide parallel distributed deep learning at higher speed, as suggested by Tanaka (Tanaka ¶ 0023).
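
For illustration only, the overlap relied upon above can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Malaya, Renggli, or Tanaka: the names `allreduce` and `train_step` are hypothetical, gradients are plain numbers, and a thread pool stands in for a nonblocking collective. Each layer's collective is triggered during backward propagation without blocking, and each forward propagation layer of the next iteration waits only on the result for that same layer.

```python
from concurrent.futures import ThreadPoolExecutor


def allreduce(grads_per_worker):
    # Hypothetical stand-in for a collective: sum one layer's gradient over all workers.
    return sum(grads_per_worker)


def train_step(num_layers, worker_grads):
    """worker_grads[w][l] is worker w's local gradient for layer l (illustrative scalars)."""
    events = []   # order in which the compute thread processes layers
    pending = {}  # layer index -> future for that layer's in-flight collective
    with ThreadPoolExecutor(max_workers=num_layers) as comm:
        # Backward propagation runs from the output layer to the input layer.
        for layer in reversed(range(num_layers)):
            events.append(("backward", layer))
            grads = [g[layer] for g in worker_grads]
            # Trigger the collective without blocking; the loop moves on to the next layer.
            pending[layer] = comm.submit(allreduce, grads)
        # Forward propagation of the next iteration runs from the input layer to the output layer.
        reduced = {}
        for layer in range(num_layers):
            # Wait only for this layer's collective; other layers may still be in flight.
            reduced[layer] = pending[layer].result()
            events.append(("forward", layer))
    return reduced, events
```

Calling `train_step(3, [[1, 2, 3], [10, 20, 30]])` in this sketch returns the per-layer sums together with the backward-then-forward layer ordering; the real cited systems exchange tensors over a network rather than scalars in a thread pool.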

	With respect to Claim 2, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the logic is further to: construct a directed acyclic graph corresponding to the collective operations for the neural network including the first and second collective operations; (para. [0039], “Convolutions in a neural network are a local operation, in that only the output from a few neurons is necessary to compute some of the neurons in a subsequent layer. As a result, computations in the subsequent layer can progress in parallel without waiting for all the neuron computations to be finished in the current layer. FIG. 6 illustrates two representative layers, layer 1 605 and layer 2 610, from a DNN 600 in a directed acyclic graph (DAG) representation. Layer 1 605 includes, for example, neurons 620, 622, and 624 and layer 2 610 includes, for example, neurons 630, 632, 634, 636 and 638. In some cases, some of the second layer neurons would be capable of running before the entire first layer had been evaluated. This is shown in FIG. 6 by the bolder and thicker lines, where two neurons, e.g., neurons 630 and 632, in the layer 2 610 can be evaluated before the final neuron, e.g., neuron 624, in the layer 1 605 was computed.” A directed acyclic graph represents the required ordering of operations related to a neural network. As established with respect to Claim 1, these operations may be collective operations.)
and offload execution of the directed acyclic graph to a hardware-based message scheduler. (para. [0044], “In another illustrative example, the location is dependent on which scheduler is using the metadata. If metadata is utilized at the O/S level, such as the O/S 120 in FIG. 2, the metadata can be embedded in a job request message, which is sent to the OS scheduler. If metadata is utilized at the hardware scheduler or dispatcher level, such as the scheduler 136 in FIG. 2 or hardware dispatcher 720 in FIG. 7, the metadata can be stored in a hardware table, such as hardware table 710, which is co-located with the scheduler 136 or hardware dispatcher 720. In the event that the hardware scheduler also utilizes metadata that are captured by the software (e.g., activation functions), the metadata can be passed from the software to the job request message, then to the OS, and finally to the hardware table.”)
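
For illustration only, the readiness property of a directed acyclic graph discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not Malaya's scheduler: the function name `runnable` is hypothetical, and the dependency edges below are assumed for the example rather than read from FIG. 6. A node is treated as ready once every node it depends on has completed, so some layer 2 nodes become ready before all of layer 1 has been evaluated.

```python
def runnable(deps, done):
    """Return the nodes not yet completed whose dependencies are all in `done`."""
    return sorted(n for n, d in deps.items() if n not in done and d <= done)


# Assumed example graph: layer 1 nodes have no dependencies,
# and each layer 2 node depends on a subset of layer 1 nodes.
deps = {
    "n620": set(),
    "n622": set(),
    "n624": set(),
    "n630": {"n620", "n622"},
    "n632": {"n620", "n622"},
    "n634": {"n622", "n624"},
}

# With n620 and n622 complete but n624 still outstanding,
# n630 and n632 are already ready while n634 is not.
ready_now = runnable(deps, {"n620", "n622"})
```

A scheduler holding such a graph can dispatch any ready node without waiting for the remainder of the preceding layer, which is the ordering constraint a directed acyclic graph encodes.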

With respect to Claim 3, modified Malaya teaches the machine learning system of Claim 1, and Renggli also teaches wherein the logic is further to: organize a set of collective operations for gradient exchange based on all layers of the neural network. (Section 1, "We take on this challenge in SPARCML. Our implementation is efficient both in theory and in practice: for some workload parameters, it can be shown to be within constant factors of optimal in terms of bandwidth and latency cost. At the same time, our implementation achieves order-of-magnitude speedups versus highly optimized dense collective implementations, or over naive sparse implementations, both in synthetic tests and in real application scenarios. SPARCML has several additional features. It has efficient support for reduced-precision collectives and for non-blocking operations. For example, we can perform all-to-all sparse reductions for gradient exchange at 4bits of precision per coordinate, overlapping computation and communication.")
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the logic to organize a set of collective operations for gradient exchange based on all layers of the neural network in order to efficiently exchange messages for neural network training. (Renggli, Introduction)

With respect to Claim 4, modified Malaya teaches the machine learning system of Claim 3, and Renggli and Tanaka also teach wherein the logic is further to: overlap messages for a current layer of the neural network with messages of one or more prior layers of the neural network in the backward propagation phase. (Section 4.1, "The key distinction is that each node has some subset of non-zero (non-neutral) elements assigned initially. We can obtain instances of the classical problems as follows: First, if none of these non-zero index sets $H_i$ overlap, we obtain an AllGather instance on a subspace of N. The resulting set of non-zero indices therefore would have size $\sum_{i=1}^{P} \lvert H_i \rvert$. Second, the AllReduce problem is obtained if $H_i = H_j$ for all nodes i and j, that is, the sets fully overlap and therefore the reduced result has $\lvert H_1 \rvert$ elements. If for all i, this is just dense AllReduce. If $\lvert H_i \rvert = k < N$, we say that the problem is equivalent to a dense AllReduce on a subspace of dimension k rather than N. Those two instances are illustrated in Figure 1."; Section 5, “We also implement the previous algorithms in a non-blocking way. Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way. This enables the thread to proceed with local computations while the operation is performed in the background. For deep networks, we can nicely overlap communication and computation during the gradient aggregation phase by calling the aggregation per layer in a non-blocking fashion. As of MPI-3, implementations support nonblocking collective operations. However, rendering a custom operation implementation non-blocking is not entirely straightforward [22, 23] and needs to consider subtle message progression issues [21].” Communication for a current layer of the neural network is overlapped with backpropagation operations for previous layers.)
wherein the first collective operation is an Allreduce operation, and wherein the second collective operation is an Allreduce operation. (Renggli, section 5 on p. 11, “Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way.” Also see Tanaka, ¶ 0023, “Allreduce processing.”)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the logic to overlap messages for a current layer of the neural network with messages of one or more prior layers of the neural network in a backward propagation phase as taught by Renggli and Tanaka in order to efficiently exchange messages for neural network training. (Renggli, Introduction)

	With respect to Claim 5, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the logic is further to: issue messages for a subsequent iteration of collective operations based on information corresponding to a previous iteration of collective operations. (para. [0051], “In an implementation, metadata is applicable to dynamic pruning and sparsity of the DNN. In these cases, individual neurons are randomly cut out or removed from the DNN. A dynamic means of evaluating and dispatching work would permit load balancing between individual iterations. A means to accomplish this would be permitting the scheduler or a separate helper thread, for example, to check the readiness of a neuron to be computed, where readiness refers to or accounts for the dependencies of each neuron (which can be precomputed). This information, i.e., the readiness, would be tagged as metadata for that neuron. For example, when a neuron is pruned, the readiness of all dependent neurons is updated and is used by the scheduler to load balance between individual iterations.” The neural network may be evaluated between iterations, and may undergo pruning. As a result, messages for subsequent iterations of collective operations will be modified based on the changes to previous iterations of collective operations.)

	With respect to Claim 6, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the neural network comprises a deep learning neural network. (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata. The parallelism of DNN computations is fully leveraged by exposing the entire graph of computations or at least portions thereof to a hardware scheduler, compiler, dispatcher or operating system (O/S) scheduler (collectively “scheduler”). In an implementation, computation kernels, neurons, layers or other architectural, functional or computational aspects, portions, characteristics and/or features of the DNN are tagged with metadata so that the scheduler can more effectively and intelligently predict computational complexity, and load balance across existing resources. These metadata provide basic information on the computational complexity of the computation kernel, permitting accurate load balancing. For example, convolutional neural networks, which exhibit repeated computations with regularity and frequency, are particularly amenable to improved load balancing and job scheduling. However, the method is applicable to other types of networks that have regular computational patterns. In an implementation, the method is applicable to dataflow-like architectures, where explicitly exposing the entire computational graph permits fully leveraging the parallelism inherent in DNNs.”)

	With respect to Claim 7, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches A semiconductor package apparatus, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
comprising: one or more substrates; (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
and logic coupled to the one or more substrates, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process. FPGAs are configurable and ASICs are fixed-functionality.)

With respect to Claim 8, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 9, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 10, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 11, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 12, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 13, modified Malaya teaches the semiconductor package apparatus of Claim 7, and Malaya also teaches wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates. (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)

With respect to Claim 14, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches A method of machine learning, (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata. The parallelism of DNN computations is fully leveraged by exposing the entire graph of computations or at least portions thereof to a hardware scheduler, compiler, dispatcher or operating system (O/S) scheduler (collectively “scheduler”). In an implementation, computation kernels, neurons, layers or other architectural, functional or computational aspects, portions, characteristics and/or features of the DNN are tagged with metadata so that the scheduler can more effectively and intelligently predict computational complexity, and load balance across existing resources. These metadata provide basic information on the computational complexity of the computation kernel, permitting accurate load balancing. For example, convolutional neural networks, which exhibit repeated computations with regularity and frequency, are particularly amenable to improved load balancing and job scheduling. However, the method is applicable to other types of networks that have regular computational patterns. In an implementation, the method is applicable to dataflow-like architectures, where explicitly exposing the entire computational graph permits fully leveraging the parallelism inherent in DNNs.”)

With respect to Claim 15, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 16, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 17, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 18, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 19, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 20, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches At least one non-transitory computer readable storage medium, (para. [0056], “The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).”)

With respect to Claim 21, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 22, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 23, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 24, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 25, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
U.S. Patent Application Publication 2018/0032911 by Yamazaki. See Yamazaki, ¶ 0088-0089, e.g. “Accordingly, compared to the comparative example of FIG. 4, the aggregation process of each layer is started earlier than executing the process of aggregating the variations (Δw) of the weights (w) after completing the backward propagation processes with respect to all the neuron layers as in FIG. 4 illustrating the comparative example … The inter-node communication of another neuron layer L+1 can be performed in parallel during the execution of the aggregation process of a certain neuron layer L.” ¶ 0101, e.g. “Hereat, in parallel with the aggregation process of the thread 1, the thread 2 already starts up the thread of the inter-node communication process about the segmented variation (Δw2), and pipeline-executes the inter-node communication process and the aggregation process in the same way as by the thread 1. The thread 3 also pipeline-executes the inter-node communication process and the aggregation process in the same way as by the threads 1, 2.”
U.S. Patent Application Publication 2019/0312772 by Zhao. See Zhao, ¶ 0047, e.g. “gradients for the output layer parameters are available significantly before gradients for the previous layers. Since the AllReduce operation can operate on a subset of the parameters of the network at a time, the AllReduce operations can start on the output layer parameters while the other gradients are still being computed. This allows the communication to be overlaid with the rest of the computation in the backpropagation step, which effectively reduces the total amount of time each GPU needs to wait for the communication to be complete. In other words, for a DL backpropagation process, we can overlap the layer (i−1) computing and the layer (i) gradient communication, thereby avoiding massive bust traffic.”

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to James D Rutten whose telephone number is (571)272-3703. The examiner can normally be reached M-F 9:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/James D. Rutten/
Primary Examiner, Art Unit 2121