DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/11/2018 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 20-25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because the claims are directed to signals per se. Computer readable media (CRM), under the broadest reasonable interpretation (BRI), will cover an ineligible signal per se unless defined otherwise in the application as filed to specifically exclude transitory, propagating signals. "At least one computer readable storage medium", as recited in the claims, does not explicitly exclude transitory, propagating signals.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-25 are rejected under 35 U.S.C. 103 as being unpatentable over Malaya (US 20190391850 A1) in view of Renggli (Renggli et al., SPARCML: High-Performance Sparse Communication for Machine Learning, February 22, 2018, arXiv, v1).

With respect to Claim 1, Malaya teaches A machine learning system, (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata.”)
comprising: a neural network; (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata.”)
memory communicatively coupled to the neural network; (para. [0022], “FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a kernel mode driver 122, and applications 126. These control logic modules control various features of the operation of the processor 102 and the APD 116.”)
logic communicatively coupled to the neural network (para. [0022], “FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a kernel mode driver 122, and applications 126. These control logic modules control various features of the operation of the processor 102 and the APD 116.”)
and issue the one or more messages related to the collective operations to a hardware-based message scheduler in a desired order of execution. (para. [0044], “In another illustrative example, the location is dependent on which scheduler is using the metadata. If metadata is utilized at the O/S level, such as the O/S 120 in FIG. 2, the metadata can be embedded in a job request message, which is sent to the OS scheduler. If metadata is utilized at the hardware scheduler or dispatcher level, such as the scheduler 136 in FIG. 2 or hardware dispatcher 720 in FIG. 7, the metadata can be stored in a hardware table, such as hardware table 710, which is co-located with the scheduler 136 or hardware dispatcher 720. In the event that the hardware scheduler also utilizes metadata that are captured by the software (e.g., activation functions), the metadata can be passed from the software to the job request message, then to the OS, and finally to the hardware table.” The job request messages are sent to a scheduler, and in some embodiments this scheduler is implemented in hardware. As established below in view of Renggli, these messages include collective operations.)
But Malaya does not explicitly teach to embed one or more trigger operations in one or more messages related to collective operations for the neural network.
Renggli, however, does teach to: embed one or more trigger operations in one or more messages related to collective operations for the neural network (Section 5, "We also implement the previous algorithms in a non-blocking way. Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way. This enables the thread to proceed with local computations while the operation is performed in the background. For deep networks, we can nicely overlap communication and computation during the gradient aggregation phase by calling the aggregation per layer in a non-blocking fashion. As of MPI-3, implementations support nonblocking collective operations. However, rendering a custom operation implementation non-blocking is not entirely straightforward [22, 23] and needs to consider subtle message progression issues [21]." MPI refers to a message passing interface. Threads, communicating via the message passing interface, trigger collective operations.)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with embedding one or more trigger operations in one or more messages related to collective operations for the neural network in order to efficiently exchange messages for neural network training. (Renggli, Introduction)
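
The non-blocking pattern described in the quoted Section 5 passage can be illustrated with a minimal Python sketch. It is not Renggli's MPI implementation: the workers are simulated as in-memory gradient lists, a thread pool stands in for the communication layer, and the names `allreduce_sum`, `nonblocking_allreduce`, and `backward_pass` are hypothetical. The sketch only shows a thread triggering an aggregation per layer and proceeding without waiting for it.

```python
from concurrent.futures import ThreadPoolExecutor

def allreduce_sum(per_worker_grads):
    """Element-wise sum of one layer's gradients across simulated workers."""
    return [sum(values) for values in zip(*per_worker_grads)]

def nonblocking_allreduce(pool, per_worker_grads):
    """Trigger the reduction in the background and return a handle (Future)."""
    return pool.submit(allreduce_sum, per_worker_grads)

def backward_pass(layer_grads):
    """layer_grads maps layer name -> list of per-worker gradient vectors,
    ordered from the last layer to the first (backward propagation order)."""
    handles = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        for layer, grads in layer_grads.items():
            # Trigger aggregation for this layer, then continue with the
            # next layer without waiting for the result.
            handles[layer] = nonblocking_allreduce(pool, grads)
        # Wait for every outstanding aggregation before returning.
        return {layer: h.result() for layer, h in handles.items()}

# Two simulated workers, two layers (hypothetical values).
example = {
    "layer2": [[1.0, 2.0], [3.0, 4.0]],
    "layer1": [[0.5, 0.5], [0.5, 0.5]],
}
reduced = backward_pass(example)
```

In an MPI-3 setting the handle would correspond to a request object returned by a non-blocking collective call; that mapping is an assumption made here for illustration and is not drawn from either reference.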

	With respect to Claim 2, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the logic is further to: construct a directed acyclic graph corresponding to the collective operations for the neural network; (para. [0039], “Convolutions in a neural network are a local operation, in that only the output from a few neurons is necessary to compute some of the neurons in a subsequent layer. As a result, computations in the subsequent layer can progress in parallel without waiting for all the neuron computations to be finished in the current layer. FIG. 6 illustrates two representative layers, layer 1 605 and layer 2 610, from a DNN 600 in a directed acyclic graph (DAG) representation. Layer 1 605 includes, for example, neurons 620, 622, and 624 and layer 2 610 includes, for example, neurons 630, 632, 634, 636 and 638. In some cases, some of the second layer neurons would be capable of running before the entire first layer had been evaluated. This is shown in FIG. 6 by the bolder and thicker lines, where two neurons, e.g., neurons 630 and 632, in the layer 2 610 can be evaluated before the final neuron, e.g., neuron 624, in the layer 1 605 was computed.” A directed acyclic graph represents the required ordering of operations related to a neural network. As established with respect to Claim 1, these operations may be collective operations.)
and offload execution of the directed acyclic graph to the hardware-based message scheduler. (para. [0044], “In another illustrative example, the location is dependent on which scheduler is using the metadata. If metadata is utilized at the O/S level, such as the O/S 120 in FIG. 2, the metadata can be embedded in a job request message, which is sent to the OS scheduler. If metadata is utilized at the hardware scheduler or dispatcher level, such as the scheduler 136 in FIG. 2 or hardware dispatcher 720 in FIG. 7, the metadata can be stored in a hardware table, such as hardware table 710, which is co-located with the scheduler 136 or hardware dispatcher 720. In the event that the hardware scheduler also utilizes metadata that are captured by the software (e.g., activation functions), the metadata can be passed from the software to the job request message, then to the OS, and finally to the hardware table.”)
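
The ordering property attributed to the directed acyclic graph of Malaya's FIG. 6 can be sketched in Python with the standard-library `graphlib` module. The neuron labels (620-638) come from the quoted paragraph; the specific edges are assumed for illustration only, since the passage does not list them, and the sketch does not model a hardware scheduler.

```python
from graphlib import TopologicalSorter

# Map each neuron to the neurons it depends on. The edges are assumed
# for illustration; only the neuron labels come from the quoted passage.
dependencies = {
    620: set(), 622: set(), 624: set(),
    630: {620, 622},
    632: {620, 622},
    634: {622, 624},
    636: {622, 624},
    638: {624},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

first_ready = set(sorter.get_ready())       # layer 1 neurons, no dependencies
sorter.done(620, 622)                       # neuron 624 is still being computed
ready_before_624 = set(sorter.get_ready())  # layer 2 neurons already runnable
sorter.done(624)
ready_after_624 = set(sorter.get_ready())   # remaining layer 2 neurons
```

Under these assumed edges, neurons 630 and 632 become ready while neuron 624 is still outstanding, which matches the behavior the quoted paragraph describes for the bolder lines of FIG. 6.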

With respect to Claim 3, modified Malaya teaches the machine learning system of Claim 1, and Renggli also teaches wherein the logic is further to: organize a set of collective operations for gradient exchange based on all layers of the neural network. (Section 1, "We take on this challenge in SPARCML. Our implementation is efficient both in theory and in practice: for some workload parameters, it can be shown to be within constant factors of optimal in terms of bandwidth and latency cost. At the same time, our implementation achieves order-of-magnitude speedups versus highly optimized dense collective implementations, or over naive sparse implementations, both in synthetic tests and in real application scenarios. SPARCML has several additional features. It has efficient support for reduced-precision collectives and for non-blocking operations. For example, we can perform all-to-all sparse reductions for gradient exchange at 4bits of precision per coordinate, overlapping computation and communication.")
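It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the logic to organize a set of collective operations for gradient exchange based on all layers of the neural network in order to efficiently exchange messages for neural network training. (Renggli, Introduction)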

With respect to Claim 4, modified Malaya teaches the machine learning system of Claim 3, and Renggli also teaches wherein the logic is further to: overlap messages for a current layer of the neural network with messages of one or more prior layers of the neural network in a backward propagation phase. (Section 4.1, "The key distinction is that each node has some subset of non-zero (non-neutral) elements assigned initially. We can obtain instances of the classical problems as follows: First, if none of these non-zero index sets $H_i$ overlap, we obtain an AllGather instance on a subspace of N. The resulting set of non-zero indices therefore would have size $\sum_{i=1}^{P} |H_i|$. Second, the AllReduce problem is obtained if $H_i = H_j$ for all nodes i and j, that is, the sets fully overlap and therefore the reduced result has $|H_1|$ elements. If for all i, this is just dense AllReduce. If $|H_i| = k < N$, we say that the problem is equivalent to a dense AllReduce on a subspace of dimension k rather than N. Those two instances are illustrated in Figure 1."; Section 5, “We also implement the previous algorithms in a non-blocking way. Specifically, we allow a thread to trigger a collective operation, such as AllReduce, in a nonblocking way. This enables the thread to proceed with local computations while the operation is performed in the background. For deep networks, we can nicely overlap communication and computation during the gradient aggregation phase by calling the aggregation per layer in a non-blocking fashion. As of MPI-3, implementations support nonblocking collective operations. However, rendering a custom operation implementation non-blocking is not entirely straightforward [22, 23] and needs to consider subtle message progression issues [21].” Communication for a current layer of the neural network is overlapped with backpropagation operations for previous layers.)
It would have been obvious to an artisan of ordinary skill before the effective filing date of the claimed invention to combine the machine learning system of Malaya with the logic to overlap messages for a current layer of the neural network with messages of one or more prior layers of the neural network in a backward propagation phase in order to efficiently exchange messages for neural network training. (Renggli, Introduction)
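
The two limiting cases in the quoted Section 4.1 passage (disjoint index sets versus identical index sets) can be checked with a small worked example in Python. This is a sketch only: each node's sparse vector is modeled as a dictionary from index to value, the function name `sparse_allreduce` is hypothetical, and the values are invented for illustration.

```python
def sparse_allreduce(sparse_vectors):
    """Sum sparse vectors given as {index: value} dicts, one per node."""
    result = {}
    for vec in sparse_vectors:
        for idx, val in vec.items():
            result[idx] = result.get(idx, 0.0) + val
    return result

# Disjoint index sets H_i: the result has sum(|H_i|) non-zero indices,
# so the reduction behaves like an AllGather.
disjoint = [{0: 1.0, 1: 2.0}, {2: 3.0}, {5: 4.0, 7: 5.0}]
gathered = sparse_allreduce(disjoint)

# Identical index sets (H_i = H_j for all i, j): the result has |H_1|
# elements, so the reduction behaves like an AllReduce on those indices.
identical = [{0: 1.0, 3: 2.0}, {0: 10.0, 3: 20.0}, {0: 100.0, 3: 200.0}]
reduced = sparse_allreduce(identical)
```

Here the disjoint case yields 2 + 1 + 2 = 5 entries, while the identical case yields 2 entries, each the sum of three contributions.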

	With respect to Claim 5, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the logic is further to: issue messages for a subsequent iteration of collective operations based on information corresponding to a previous iteration of collective operations. (para. [0051], “In an implementation, metadata is applicable to dynamic pruning and sparsity of the DNN. In these cases, individual neurons are randomly cut out or removed from the DNN. A dynamic means of evaluating and dispatching work would permit load balancing between individual iterations. A means to accomplish this would be permitting the scheduler or a separate helper thread, for example, to check the readiness of a neuron to be computed, where readiness refers to or accounts for the dependencies of each neuron (which can be precomputed). This information, i.e., the readiness, would be tagged as metadata for that neuron. For example, when a neuron is pruned, the readiness of all dependent neurons is updated and is used by the scheduler to load balance between individual iterations.” The neural network may be evaluated between iterations, and may undergo pruning. As a result, messages for subsequent iterations of collective operations will be modified based on the changes to previous iterations of collective operations.)
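
The readiness metadata described in the quoted paragraph [0051] can be sketched as follows in Python. The neuron names, the dependency sets, and the helper names `build_readiness` and `prune` are hypothetical; the sketch assumes that a neuron is ready once all of its dependencies have been computed, and that pruning a neuron removes it from the dependency set of every dependent neuron.

```python
def build_readiness(dependencies, computed):
    """Return {neuron: bool}; a neuron is ready when all of its
    dependencies are in the set of already computed neurons."""
    return {n: deps <= computed for n, deps in dependencies.items()}

def prune(dependencies, neuron):
    """Remove a neuron and drop it from every dependent's dependency set."""
    return {n: deps - {neuron} for n, deps in dependencies.items() if n != neuron}

# Hypothetical four-neuron graph: c depends on a and b, d depends on b.
deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"b"}}
computed = {"a"}

before = build_readiness(deps, computed)
after = build_readiness(prune(deps, "b"), computed)
```

Before pruning, neurons c and d are not ready because b has not been computed; after b is pruned, both become ready, which is the kind of update a scheduler could use when balancing load between iterations.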

	With respect to Claim 6, modified Malaya teaches the machine learning system of Claim 1, and Malaya also teaches wherein the neural network comprises a deep learning neural network. (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata. The parallelism of DNN computations is fully leveraged by exposing the entire graph of computations or at least portions thereof to a hardware scheduler, compiler, dispatcher or operating system (O/S) scheduler (collectively “scheduler”). In an implementation, computation kernels, neurons, layers or other architectural, functional or computational aspects, portions, characteristics and/or features of the DNN are tagged with metadata so that the scheduler can more effectively and intelligently predict computational complexity, and load balance across existing resources. These metadata provide basic information on the computational complexity of the computation kernel, permitting accurate load balancing. For example, convolutional neural networks, which exhibit repeated computations with regularity and frequency, are particularly amenable to improved load balancing and job scheduling. However, the method is applicable to other types of networks that have regular computational patterns. In an implementation, the method is applicable to dataflow-like architectures, where explicitly exposing the entire computational graph permits fully leveraging the parallelism inherent in DNNs.”)

	With respect to Claim 7, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches A semiconductor package apparatus, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
comprising: one or more substrates; (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
and logic coupled to the one or more substrates, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)
wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process. FPGAs are configurable and ASICs are fixed-functionality.)
With respect to Claim 8, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 9, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 10, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 11, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 12, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 13, modified Malaya teaches the semiconductor package apparatus of Claim 7, and Malaya also teaches wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates. (para. [0055], “The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.” The claim language is descriptive of an integrated circuit produced as the result of a semiconductor manufacturing process.)

With respect to Claim 14, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches A method of machine learning, (para. [0016], “Described herein is a method and system for opportunistic load balancing in deep neural networks (DNNs) using metadata. The parallelism of DNN computations is fully leveraged by exposing the entire graph of computations or at least portions thereof to a hardware scheduler, compiler, dispatcher or operating system (O/S) scheduler (collectively “scheduler”). In an implementation, computation kernels, neurons, layers or other architectural, functional or computational aspects, portions, characteristics and/or features of the DNN are tagged with metadata so that the scheduler can more effectively and intelligently predict computational complexity, and load balance across existing resources. These metadata provide basic information on the computational complexity of the computation kernel, permitting accurate load balancing. For example, convolutional neural networks, which exhibit repeated computations with regularity and frequency, are particularly amenable to improved load balancing and job scheduling. However, the method is applicable to other types of networks that have regular computational patterns. In an implementation, the method is applicable to dataflow-like architectures, where explicitly exposing the entire computational graph permits fully leveraging the parallelism inherent in DNNs.”)

With respect to Claim 15, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 16, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

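With respect to Claim 17, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 18, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.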

With respect to Claim 19, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 20, it is substantially similar to Claim 1 and is rejected in the same manner, the same art and reasoning applying. Further, Malaya also teaches At least one computer readable storage medium, (para. [0056], “The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).”)

With respect to Claim 21, it is substantially similar to Claim 2 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 22, it is substantially similar to Claim 3 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 23, it is substantially similar to Claim 4 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 24, it is substantially similar to Claim 5 and is rejected in the same manner, the same art and reasoning applying.

With respect to Claim 25, it is substantially similar to Claim 6 and is rejected in the same manner, the same art and reasoning applying.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK J TURNER whose telephone number is (571)272-8469. The examiner can normally be reached Monday-Thursday 9am-7pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/M.J.T./           Examiner, Art Unit 2121  




/Li B. Zhen/           Supervisory Patent Examiner, Art Unit 2121