DETAILED ACTION
This action is written in response to the remarks dated 9/28/21. The Examiner acknowledges election of claims 1-12 and 17-20, as outlined in the Applicant’s remarks. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
The Examiner objects to the following informalities in the claims:
In claim 1, “the second set of instructions facilitate ...” should be amended to read “the second set of instructions to facilitate ...”.

Written Description Objections
The Examiner objects to the following portions of the written description:
In [0003] (last sentence): “wi4der” should be corrected to remove the character “4”.

Claim Interpretation - 35 USC § 112(f)
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. - An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

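As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f):
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;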
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f), is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f), is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in an Office action.
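This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f), because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: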
In claims 1 and 17: “a fabric interface”.
In claim 3: “hardware to accelerate”.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f), applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The following are the references relied upon in the rejections below:
Chetlur
Keuper J, Pfreundt FJ. Asynchronous parallel stochastic gradient descent: A numeric core for scalable distributed machine learning algorithms. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments 2015 Nov 15 (pp. 1-11).
Message Passing Interface Forum (MPIF): MPI: A Message-Passing Interface Standard, ver. 3.1, 4 June 2015, available at: https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, accessed 1/12/22, 868 pages.

Claims 1-3, 7-12, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chetlur and Keuper.
Regarding claim 1, Chetlur discloses a system to compute and distribute data for distributed training of a neural network, the system including:
first memory to store a first set of instructions including a machine learning framework;
Fig. 1, “memory bridge 105”, described at [0028]. Also [0014] “Notably, applications may implement convolutional neural networks (CNNs) based on the disclosed techniques to minimize error rates while optimizing both on-chip memory usage and execution time.” (Emphasis added.) The Examiner notes that a neural network is a machine learning technique. The disclosed system is directed to implementing a CNN. [0030] “Although not shown, the system memory 104 also includes any number of software applications that execute on the CPU 102, may issue commands that control the operation of the PPUs, and may leverage the convolution engine 125 to efficiently execute CNNs.” (Emphasis added.)
a fabric interface to enable transmission and receipt of data associated with a set of trainable machine learning parameters;
Fig. 1, “communication path 113”, described at [0036].
a first set of general-purpose processor cores to execute the first set of instructions, the first set of instructions to provide a training workflow ... for the set of trainable machine learning parameters and to communicate with a second set of instructions, the second set of instructions facilitate transmission and receipt ... via the fabric interface; and
Fig. 1, “CPU 102”, described at [0032]: “The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing Subsystems 112, may be modified as desired.” (Emphasis added.) [0035] “In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating operations of other system components. In particular, CPU 102 issues commands that control the operation of PPU 202. In some embodiments, CPU 102 writes a stream of commands for PPU 202 to a data structure (not explicitly shown in either FIG. 1 or FIG. 2) that may be located in system memory 104, PP memory 204, or another storage location accessible to both CPU 102 and PPU 202.” (Emphasis added.) [0036] “I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a crossbar unit 210.” (Emphasis added.)
a graphics processor to perform compute operations associated with the training workflow ... for the set of trainable machine learning parameters.
[0034] “Although FIG. 2 depicts one PPU 202, as indicated above, parallel processing Subsystem 112 may include any number of PPUs 202. .... In some embodiments, PPU 202 comprises a graphics processing unit (GPU) that may be configured to implement a graphics rendering pipeline to perform various operations related to generating pixel data based on graphics data supplied by CPU 102 and/or system memory 104.”
Keuper discloses the following further limitations which Chetlur does not disclose explicitly:
... the first set of instructions to provide a training workflow for computation of gradients for the set of trainable machine learning parameters and to communicate with a second set of instructions...

a ... processor to perform compute operations associated with the training workflow to generate the gradients for the set of trainable machine learning parameters.
PP. 5-6, sec. 4, The ASGD Algorithm.
At the time of filing, it would have been obvious to a person of ordinary skill to combine the gradient descent techniques described by Keuper with the convolutional neural network processing system disclosed by Chetlur because, as noted by Keuper, “Stochastic Gradient Descent (SGD) methods have long proven to provide good results for ML optimization problems.” (P. 2, first col., internal citation omitted.) Gradient descent may be inherent in the Chetlur system, but the details of the CNN algorithm used therein are beyond the scope of that disclosure. Both disclosures pertain to neural network processing.
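For illustration only, the general kind of computation at issue (computing gradients for a set of trainable parameters and applying them) can be sketched as a plain stochastic gradient descent loop. This sketch is not taken from Keuper or Chetlur and does not reproduce the ASGD algorithm; the linear model, the squared-error loss, the learning rate, and the names `sgd_step` and `train` are assumptions chosen for brevity.

```python
import random


def sgd_step(weights, sample, target, lr):
    """Perform one SGD step for a linear model with squared-error loss.

    Returns the updated weights and the gradients that were applied.
    """
    prediction = sum(w * x for w, x in zip(weights, sample))
    error = prediction - target
    # The gradient of 0.5 * error**2 with respect to each weight is error * x.
    gradients = [error * x for x in sample]
    new_weights = [w - lr * g for w, g in zip(weights, gradients)]
    return new_weights, gradients


def train(data, weights, lr=0.5, epochs=200, seed=0):
    """Repeatedly pick one random (sample, target) pair and apply sgd_step."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    for _ in range(epochs):
        sample, target = rng.choice(data)
        weights, _ = sgd_step(weights, sample, target, lr)
    return weights
```

In a distributed setting, the `gradients` list returned by `sgd_step` is the kind of data that would be exchanged between nodes; how that exchange is performed is outside the scope of this sketch.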

Independent claim 17 recites a method comprising functionality which is substantially identical to that of claim 1. Therefore, the above rejection of claim 1 applies equally to claim 17.

Regarding claim 2, Chetlur discloses its further limitation: the second set of instructions including a set of point-to-point communication primitives to perform a set of pre-defined communication operations via the fabric interface.
[0036] “I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a crossbar unit 210.” (Emphasis added.)

Regarding claim 3, Chetlur discloses its further limitation: the fabric interface including hardware to accelerate at least a portion of the pre-defined communication operations or at least a subset of the point-to-point primitives.
[0038] “In operation, front end 212 transmits processing tasks received from host interface 206 to a work distribution unit (not shown) within task/work unit 207. The work distribution unit receives pointers to processing tasks that are encoded as task metadata (TMD) and stored in memory.”

Regarding claim 7, Chetlur discloses its further limitation: the fabric interface to communicatively couple with multiple compute nodes configured for distributed training of the neural network, at least two of the multiple compute nodes to be indirectly connected via the fabric interface, wherein the fabric interface is to route a message between indirectly connected compute nodes.
Fig. 1: “CPU 102”, “Memory Bridge 105”, “Parallel Processing System 112”, and “Communication Path 113”. [0036] “I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a crossbar unit 210.” (Emphasis added.)

Regarding claim 8, Chetlur discloses its further limitation: the fabric interface is to route the message between the indirectly connected compute nodes based on a target memory address associated with the message.
[0036] “I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a crossbar unit 210.” (Emphasis added.)

Regarding claim 9, Chetlur discloses its further limitation: additionally including second memory coupled with the graphics processor, the second memory to store the gradients for the set of trainable machine learning parameters.
Fig. 1, system memory 104, discussed at [0030]: “Although not shown, the system memory 104 also includes any number of software applications that execute on the CPU 102, may issue commands that control the operation of the PPUs, and may leverage the convolution engine 125 to efficiently execute CNNs.”

Regarding claim 10, Chetlur discloses its further limitation: the fabric interface having a virtual address space mapped to at least a portion of the second memory.
[0050]: “The MMU 320 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile or memory page and optionally a cache line index.”
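For illustration only, a virtual-to-physical mapping of the general kind described in the quoted passage can be sketched as a page-table lookup. The sketch is not drawn from Chetlur: the page size, the dictionary-based table, and the name `translate` are hypothetical, and the optional cache line index mentioned in the quotation is omitted.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


def translate(page_table, virtual_address):
    """Map a virtual address to a physical address via page table entries.

    page_table maps a virtual page number to a physical page number.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise KeyError(f"no page table entry for virtual page {page_number}")
    # The offset within the page is preserved by the translation.
    return page_table[page_number] * PAGE_SIZE + offset
```

For example, with a table `{0: 5, 1: 2}`, virtual address 4100 falls in virtual page 1 at offset 4 and translates to physical address 2 * 4096 + 4 = 8196.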

Regarding claim 11, Chetlur discloses its further limitation: wherein the second memory is physical memory shared between the fabric interface and the graphics processor.
Fig. 1, system memory 104, discussed at [0034]: “In some embodiments, PPU 202 comprises a graphics processing unit (GPU) that may be configured to implement a graphics rendering pipeline to perform various operations related to generating pixel data based on graphics data supplied by CPU 102 and/or system memory 104.”

Regarding claim 12, Chetlur discloses its further limitation: the graphics processor to store the [parameters] to the second memory and the fabric interface to transmit the [parameters] from the second memory.
Fig. 1, system memory 104, discussed at [0030]: “Although not shown, the system memory 104 also includes any number of software applications that execute on the CPU 102, may issue commands that control the operation of the PPUs, and may leverage the convolution engine 125 to efficiently execute CNNs.” Also “Memory bridge 105” and “Communication path 113”.
Although Chetlur does not explicitly disclose an implementation of a gradient descent CNN, Keuper discloses this feature as discussed in the rejection of claim 1 supra.

Regarding claim 18, Chetlur discloses its further limitation: additionally comprising executing at least a portion of the second instructions via the fabric interface, the second set of instructions to cause the fabric interface to transmit the [parameters].
Fig. 1, system memory 104, discussed at [0030]: “Although not shown, the system memory 104 also includes any number of software applications that execute on the CPU 102, may issue commands that control the operation of the PPUs, and may leverage the convolution engine 125 to efficiently execute CNNs.”
Although Chetlur does not explicitly disclose an implementation of a gradient descent CNN, Keuper discloses this feature as discussed in the rejection of claim 1 supra.

Regarding claim 19, Chetlur discloses its further limitation: additionally comprising executing at least a portion of the second instructions via a processor on the fabric interface.
[0036] “I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a crossbar unit 210.” (Emphasis added.)

Regarding claim 20, Chetlur discloses its further limitation: comprising mapping a virtual address space of the fabric interface to a unified address space shared with the graphics processor.
[0050]: “The MMU 320 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile or memory page and optionally a cache line index.”

Claims 4-6 are rejected under 35 U.S.C. 103 as being unpatentable over Chetlur, Keuper and MPIF.
Regarding claim 4, MPIF discloses its further limitation which neither Chetlur nor Keuper discloses explicitly: the pre-defined communication operations including a store-with-notify operation and a remote procedure call.
Store-with-notify: p. 417: “MPI_PUT and MPI_RPUT transfer data from the caller memory (origin) to the target memory .... The transfer is completed, at the origin or both the origin and the target, when a subsequent synchronization call is issued by the caller on the involved window object. These synchronization calls are described in Section 11.5. Transfers can also be completed with calls to flush routines”. See also p. 418, sec. 11.3.1 ‘Put’.
Remote procedure call: p. 420: “Similar to MPI_PUT, except that the direction of data transfer is reversed. Data are copied from the target memory to the origin”.
See MPIF p. 1, Introduction. Chetlur explicitly notes the applicability of “any ... point-to-point communication protocol known in the art”. (See [0028].)

Regarding claim 5, MPIF discloses its further limitation which neither Chetlur nor Keuper discloses explicitly: the pre-defined communication operations additionally including a remote atomic memory operation.
P. 401: “Remote atomic swap operations”.
The obviousness analysis of claim 4 applies equally here.

Regarding claim 6, MPIF discloses its further limitation which neither Chetlur nor Keuper discloses explicitly: the pre-defined communication operations additionally including a load with gather list and store with scatter list.
P. 149 et seq., sec. 5.5: Gather. P. 159 et seq., sec. 5.6: Scatter.
The obviousness analysis of claim 4 applies equally here.

Additional Relevant Prior Art
The following references were identified by the Examiner as being relevant to the disclosed invention, but are not relied upon in any particular prior art rejection:
Strom discloses a distributed deep neural network training system employing stochastic gradient descent during training. See p. 1489, second col. for discussion of peer-to-peer messages. (Strom N., Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association 2015.)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Vincent Gonzales whose telephone number is (571) 270-3837. The examiner can normally be reached on Monday-Friday 7 a.m. to 4 p.m. MT.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Vincent Gonzales/
Primary Examiner, Art Unit 2124