Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on February 28, 2022, in which claims 1, 5, 7-10, 16, and 20 are amended. Claims 21 and 22 have been added. Claims 1-10 and 16-22 are currently pending.

Specification
Applicant's amendments to the specification are acknowledged. The Examiner’s objection to the specification is hereby withdrawn in view of Applicant’s amendments to the specification.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on February 2, 2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
The rejections of claim 9 under 35 U.S.C. § 112(b)/(f) are hereby withdrawn in view of Applicant's amendments and remarks regarding the rejections.

Applicant’s arguments with respect to claim 5 have also been considered but are not deemed persuasive. The MPI standard presents point-to-point communication and collective communication as opposite forms of communication. As further outlined below, it is unclear from the instant specification how Applicant intends to overcome this definition, and under the Examiner’s interpretation the claim amounts to simply using a collective communication pattern, which is explicitly addressed in the MPI standard.
The remaining arguments are moot in view of a new ground of rejection set forth below.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(a):
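(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.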

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

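Claims 5 and 20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.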

Regarding claims 5 and 20, the limitation "wherein the communication pattern is a collective communication pattern that is implemented according to the set of point-to-point primitives" is not supported by the specification. Therefore, this limitation is considered to introduce new matter into the claims that lacks support in the original disclosure. The only mention of a collective communication routine in the specification is directed towards known routines of the MPI library, such as all-gather, reduce, and broadcast.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 5, 6, and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claims 5 and 20, the limitation "wherein the communication pattern is a collective communication pattern that is implemented according to the set of point-to-point primitives" is indefinite. One of ordinary skill in the art would recognize that point-to-point communication and collective communication are two distinctly different communication patterns, such that a point-to-point collective communication pattern would be contradictory. The instant specification does not detail how such a pattern might be implemented. For purposes of further examination, this limitation is interpreted as simply providing a collective communication pattern to the plurality of nodes.

	The remaining claims are rejected based on their dependence on a rejected claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zou (US 2016/0321776 A1) and the Message Passing Interface Forum (“MPI: A Message-Passing Interface Standard Version 3.1”, 2015), and further in view of Iandola (“Distributed deep neural network training: A measurement study”, 2016).

	Regarding claim 1, Zou teaches A system to configure distributed training of a neural network, the system comprising: ([¶0066] "FIG. 1 The memory 102 may be configured to store a software program and module, such as a program" [¶0038] "FIG. 8 is a schematic diagram of gradient and parameter updating during training of a convolutional neural network (CNN) model;") 
	memory to store a library to facilitate transmission of data during distributed training of the neural network, the data associated with trainable parameters of the neural network; ([¶0083] "when the training begins, each worker group replicates the data on the main memory to a video memory of a corresponding GPU (e.g., through cudaMemcpyHostToDevice calling in NVIDIA's CUDA)")
	a network interface to transmit and receive gradient data associated with the trainable parameters; (Gradient data is described in [¶0061] "computing an updated gradient on each hidden layer through an updated error value, and finally feeding back the updated gradient to the input." Updating and feeding back to the input are interpreted as synonymous with receiving and sending.)
	a general-purpose processor to execute instructions provided by the library, the instructions to cause the general-purpose processor to configure the network interface to transmit and receive the gradient data associated with the trainable parameters during a workflow of a machine learning framework; and ([¶0038] "FIGS. 9-11 are schematic diagrams of parameter exchange in a data processing method according to one embodiment of the present invention" [¶0066] "FIG. 1 and the processors 104 execute different functional applications and perform data processing by running the software program and module stored in the memory 102" [¶0067] "FIG. 1 the memory 102 may further include memories remotely disposed relative to the processor 106, and these remote memories may be connected to an electronic terminal 100 through a network" Gradient data is described in [¶0061] "(explanation of DNN) computing an updated gradient on each hidden layer through an updated error value, and finally feeding back the updated gradient to the input." Updating and feeding back to the input are interpreted as synonymous with receiving and sending.)
	a graphics processor to perform compute operations associated with machine learning framework workflow to generate the gradient data associated with the trainable parameters, wherein, based on the machine learning framework workflow, the library, via the set of point-to-point primitives, is to enable interleave of the compute operations on the graphics processor with transmission and receipt of gradient data via the network interface ([¶0063] "multi-GPU technology makes effective use of characteristics of parallelism, which can speed up the training process of the DNN." [¶0061] "computing an updated gradient on each hidden layer through an updated error value, and finally feeding back the updated gradient to the input" [¶0089] "It should be noted that, the parameter server herein may be a server configured to update parameters connected with the server 100 through a 
	However, Zou does not explicitly teach wherein the library includes a set of point-to-point primitives to facilitate the transmission of the data during distributed training of the neural network; wherein the network interface is a fabric interface to enable a connection to a communication fabric, the communication fabric includes a plurality of point-to-point interconnects between worker nodes of a distributed training network, 
	and the network interface includes hardware logic to accelerate the set of point-to-point primitives.  

The Message Passing Interface Forum, which teaches a related art of parallelizing distributed compute operations, teaches wherein the library includes a set of point-to-point primitives to facilitate the transmission of the data [during distributed training of the neural network] ([p. 23 §3.1] “Sending and receiving of messages by processes is the basic MPI communication mechanism. The basic point-to-point communication operations are send and receive. Their use is illustrated in the example below”).


 
	However, the combination of Zou and the Message Passing Interface Forum does not explicitly teach wherein the network interface is a fabric interface to enable a connection to a communication fabric, the communication fabric includes a plurality of point-to-point interconnects between worker nodes of a distributed training network, 
	and the network interface includes hardware logic to accelerate the set of point-to-point primitives.  

Iandola, who teaches a related art of distributed neural network training, teaches wherein the network interface is a fabric interface to enable a connection to a communication fabric, the communication fabric includes a plurality of point-to-point interconnects between worker nodes of a distributed training network, ([p. 6 §5] "In our cluster, each server has a GPU. When executing FireCaffe, we perform the computation on the GPUs, and we use the Infiniband fabric to communicate among servers. The GPUs are connected to the Infiniband cards (without going through the CPU) by PLXLink PCIe hubs" Figure 1 shows workers connected via point-to-point interconnects. Infiniband is interpreted as the fabric, and the Infiniband cards/PCIe hubs are interpreted as the fabric interface.)
	and the network interface includes hardware logic to accelerate the set of point-to-point primitives ([p. 6 §5] "In our cluster, each server has a GPU. When executing FireCaffe, we perform the computation on the GPUs, and we use the Infiniband fabric to communicate among servers. The GPUs are connected to the Infiniband cards (without going through the CPU) by PLXLink PCIe hubs. The PCIe bus is offers 14 GB/s communication, and NVIDIA has released even higher intra-server communication in the form of NVLink" The comparison of performance of NVLink to PLXLink is interpreted as showing the network acceleration of each hardware accelerator.  NVLink is interpreted as having hardware logic to accelerate the point-to-point primitives.). 

	Zou and Iandola are both directed towards distributed neural network training. Therefore, Zou and Iandola are analogous arts since they are in the same field of endeavor.  Similarly, Zou, MPI, and Iandola are all directed towards synchronous distributed communications, therefore Zou, MPI, and Iandola are analogous arts in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the 

	Regarding claim 2, the combination of Zou, the Message Passing Interface Forum, and Iandola teaches The system as in claim 1, wherein a compute operation is configured to overlap with a communication operation to send or receive gradient data via the network interface. (the Message Passing Interface Forum [Chapter 5, Section 12] "performance of many applications can be improved by overlapping communication and computation, and many systems enable this"; see FIG. 9 of Zou for sending and receiving gradient data.)

	Regarding claim 3, the combination of Zou, the Message Passing Interface Forum, and Iandola teaches The system as in claim 2, the machine learning framework workflow to cause the graphics processor to perform a compute operation associated with a first portion of a first layer of the neural network (Zou [¶0061] " the backward propagation means back-propagating output errors in a certain form layer by layer through each hidden layer, computing an updated gradient on each 

	Regarding claim 4, the combination of Zou, the Message Passing Interface Forum, and Iandola teaches The system as in claim 3, wherein in response to a notification of completion of the compute operation associated with the first portion of the first layer of the neural network, the library is to cause the network interface to transmit a result of the compute operation (MPI [p. 484 Chapter 12 Section 4] "In a thread-compliant implementation, an MPI process is a process that may be multithreaded. Each thread can issue MPI calls; however, threads are not separately addressable: a rank in a send or receive call identifies a process, not a thread. A message sent to a process can be received by any thread in this process." See also the MPI_GREQUEST_COMPLETE function callback. The claim limitation is interpreted as synonymous with thread callbacks, which are well known in the art and further described in the MPI standard.).

	Regarding claim 5, the combination of Zou, the Message Passing Interface Forum, and Iandola teaches The system as in claim 4, the network interface to transmit the result of the compute operation according to a communication pattern for messages to be transmitted between worker nodes during distributed training of the neural network. (Zou [¶0078] "After the data is replicated to the video memory, the GPU takes out the mini-batch data each time, to perform mini-batch training, a gradient Δw is obtained according to a result of the mini-batch training…in this way, the plurality of GPUs in the parallel training all has the latest model parameter.").
	wherein the communication pattern is a collective communication pattern that is implemented according to the set of point-to-point primitives. (the Message Passing Interface Forum [p. 6 §1.12] "Chapter 5, Collective Communication, defines process-group collective communication operations. Well known examples of this are barrier and broadcast over a group of processes (not necessarily all the processes). With MPI-2, the semantics of collective communication was extended to include intercommunicators. It also adds two new collective operations." Broadcast is interpreted as a collective communication pattern implemented according to the set of point-to-point primitives.).

	Regarding claim 6, the combination of Zou, the Message Passing Interface Forum, and Iandola teaches The system as in claim 5, wherein the communication pattern is a gather, scatter, allgather, alltoall, reduce, reduce scatter, or all-reduce. (the Message Passing Interface Forum [p. 6 §1.12] "Chapter 5, Collective 

Claim 16 is substantially similar to claim 1; therefore, the rejection applied to claim 1 also applies to claim 16. Dependent claims 17-20 mirror dependent claims 2-5 and are therefore rejected for the same reasons.

	Claims 7, 8, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zou, the Message Passing Interface Forum, and Iandola, and further in view of Li (“MALT: Distributed Data-Parallelism for Existing ML Applications”, 2015).

	Regarding claim 7, the combination of Zou, MPI, and Iandola teaches The system as in claim 1.
	However, the combination of Zou, MPI, and Iandola does not explicitly teach wherein the set of point-to-point primitives includes a remote atomic operation.  

	Li, who teaches a related art of distributed training of neural networks, teaches wherein the set of point-to-point primitives includes a remote atomic operation 
	
Zou, Iandola, and Li are all directed towards distributed training of neural networks. Therefore, Zou, Iandola, and Li are analogous arts since they are in the same field of endeavor. Similarly, Zou, MPI, Iandola, and Li are all directed towards synchronous distributed communications; therefore, Zou, MPI, Iandola, and Li are analogous arts in the same field of endeavor. The intent of distributed training of a neural network as taught by Zou, Iandola, and Li is interpreted as highly analogous to the claimed invention. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Zou, MPI, Iandola, and Li by implementing atomic operations. MPI teaches atomic operations in a distributed parallel system in §11.7.1; therefore, the combination of the MPI standard and Li would have been obvious before the effective filing date. Li teaches that their system was explicitly built for the task of distributed training ([p. 4 §3] “We build dstorm (dstorm stands for DiSTributed Onesided Remote Memory) to facilitate efficient shared memory for ML workloads.”) and provides motivation for combination with existing systems, specifically those implemented using MPI ([p. 9 §5] “GASPI is similar to MPI, and MALT can be implemented over MPI. However, GASPI has superior performance to certain MPI implementations”).

	Regarding claim 8, the combination of Zou, MPI, Iandola, and Li teaches The system as in claim 7, to transmit gradient data associated with the trainable parameters, the network interface is to perform operations associated with a remote atomic store, the remote atomic store performed in response to a primitive in the set of point-to-point primitives (Li [p. 5 Col. 1] "Torn reads: When a model replica sends a model update to another model replica, the sender may overwrite the model update while the receiver is reading it in the case where the replicas operate asynchronously and the receive queue is full. MALT provides an additional atomic gather which reads the shared memory in an atomic fashion").

	Regarding claim 21, the combination of Zou, MPI, and Iandola teaches The method as in claim 16. 
	However, the combination of Zou, MPI, and Iandola does not explicitly teach further comprising transmitting gradient data associated with the trainable parameters via the network interface in response to a remote atomic store primitive of the set of point-to-point primitives.

Li teaches further comprising transmitting gradient data associated with the trainable parameters via the network interface in response to a remote atomic store primitive of the set of point-to-point primitives ([p. 5 Col. 1] "Torn reads: When a model replica sends a model update to another model replica, the sender may overwrite the model update while the receiver is reading it in the case where the 

Zou, Iandola, and Li are all directed towards distributed training of neural networks. Therefore, Zou, Iandola, and Li are analogous arts since they are in the same field of endeavor. Similarly, Zou, MPI, Iandola, and Li are all directed towards synchronous distributed communications; therefore, Zou, MPI, Iandola, and Li are analogous arts in the same field of endeavor. The intent of distributed training of a neural network as taught by Zou, Iandola, and Li is interpreted as highly analogous to the claimed invention. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Zou, Iandola, and Li by implementing atomic operations. MPI teaches atomic operations in a distributed parallel system in §11.7.1; therefore, the combination of the MPI standard and Li would have been obvious before the effective filing date. Li teaches that their system was explicitly built for the task of distributed training ([p. 4 §3] “We build dstorm (dstorm stands for DiSTributed Onesided Remote Memory) to facilitate efficient shared memory for ML workloads.”) and provides motivation for combination with existing systems, specifically those implemented using MPI ([p. 9 §5] “GASPI is similar to MPI, and MALT can be implemented over MPI. However, GASPI has superior performance to certain MPI implementations”).

	Regarding claim 22, the combination of Zou, MPI, Iandola, and Li teaches The method as in claim 21, further comprising: receiving, via the network interface a remote atomic store primitive from a remote worker node of the distributed training network; (Li [p. 5 §3.2] "ML developers can specify gradients or parameters as a VOL vector...creating a vector in turn creates a dstorm segment that allows this vector to be propagated to all machines as described in the dataflow graph. This dataflow describes which machines may send updates to one another (in the simplest case, everyone may send their updates to everyone). Hence, an edge in the graph from node A to nodes B and C implies that when node A pushes a model update, it is received by nodes B and node C")
	and in response to the remote atomic store primitive, atomically updating an address in memory with gradient data received from the remote worker node. (Li [p. 4 §3] " Developers use the MALT API to shard input data across replicas and send/receive gradients" [p. 5 Col. 1] "Torn reads: When a model replica sends a model update to another model replica, the sender may overwrite the model update while the receiver is reading it in the case where the replicas operate asynchronously and the receive queue is full. MALT provides an additional atomic gather which reads the shared memory in an atomic fashion"). 

	Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zou, the Message Passing Interface Forum, Iandola, and Li, and further in view of Nvidia (“NVIDIA® NVLink™ High Speed Interconnect: Application Performance”, 2014).

Regarding claim 9, the combination of Zou, Iandola, MPI, and Li teaches The system as in claim 7. However, the combination of Zou, Iandola, MPI, and Li does not explicitly teach wherein the fabric interface includes a direct interface between the graphics processor and a separate graphics processor.

Nvidia, who teaches a related art of parallelizing distributed compute operations, teaches wherein the fabric interface includes a direct interface between the graphics processor and a separate graphics processor ([p. 5] "2-GPU-NVLink provides a fast NVLink interconnect between the two GPUs, bonding together all four of the NVLink interconnection points for a total peak bandwidth of 80 GB/s (64 GB/s effective) per direction between them. By contrast, 2-GPUPCIe, reflecting a common configuration seen in production today, requires that peer-to-peer communication share the same PCIe links as are used for communication with the CPU" ). 

	Zou, Iandola, MPI, Li, and Nvidia are all directed towards parallelizing distributed compute operations. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Zou, Iandola, MPI, and Li with the teachings of Nvidia. Iandola explicitly teaches the use of Nvidia NVLink for the purpose of accelerating neural network training; therefore, the Nvidia whitepaper is provided to reinforce the combination and to further teach the claimed limitations. Zou and Iandola, similar to the claimed invention, teach using a GPU network for distributed neural network training. Both references teach that PCIe can be used as a network fabric for connecting the distributed nodes. Nvidia explicitly teaches 

	Regarding claim 10, the combination of Zou, Iandola, MPI, and Li teaches The system as in claim 7. However, the combination of Zou, Iandola, MPI, and Li does not explicitly teach wherein the graphics processor includes the fabric interface and the separate graphics processor includes a corresponding fabric interface.

Nvidia, who teaches a related art of parallelizing distributed compute operations, teaches wherein the graphics processor includes the fabric interface and the separate graphics processor includes a corresponding fabric interface ([p. 5] "2-GPU-NVLink provides a fast NVLink interconnect between the two GPUs, bonding together all four of the NVLink interconnection points for a total peak bandwidth of 80 GB/s (64 GB/s effective) per direction between them. By contrast, 2-GPUPCIe, reflecting a common configuration seen in production today, requires that peer-to-peer communication share the same PCIe links as are used for communication with the CPU" The NVLink interface is interpreted as the fabric interface, which is taught as being onboard each compatible GPU.).

Zou, Iandola, MPI, Li, and Nvidia are all directed towards parallelizing distributed compute operations.  It would have been obvious to a person of ordinary skill in the art, .  


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/SB/Examiner, Art Unit 2124      
LUIS A SITIRICHE/Primary Examiner, Art Unit 2126