DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 3/3/2021 has been entered.
 
Response to Amendment
Acknowledgement is made of Applicant's claim amendments on 3/3/2021. The claim amendments are entered. Presently, claims 1-20 remain pending. Claims 1 and 11 have been amended.

Response to Arguments
Applicant's arguments filed on 3/3/2021 have been fully considered but they are not persuasive.

Applicant has sufficiently amended claims 1 and 11 to remove the unsupported terms “high speed processing capacity” and “low speed processing capacity” and have amended the 

Applicant has sufficiently amended Figs. 4C and 4D to have reference labels that are supported by the specification. Accordingly, the drawing objections against these figures are withdrawn. 

Applicant argues that the combination of the cited references allegedly fails to cure the deficiencies because they do not teach the newly amended claim limitations (Applicant’s Reply pgs. 5-7). The cited references have been updated and when considered in conjunction with Boesch, which has been incorporated into the rejection of the independent claims as necessitated by Applicant’s amendments, can teach the amended claim limitations. 

Claim Objections
Applicant is advised that should claim 1 and its respective dependent claims 2-10 be found allowable, claim 11 and its respective dependent claims 12-20 will be objected to under 37 CFR 1.75 as being substantial duplicates thereof. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m).
Claim 1 recites an apparatus while claim 11 recites an electronic device; however, the claims do not recite a substantial difference between those two elements, and the bodies of the claims are the same. Accordingly, claims 1 and 11 
Claim 11 is objected to because of the following informality: the limitation “a processor having” on line 2 was erroneously omitted from the claim. This limitation was not canceled in any previous claim sets. Thus, the present claim should recite: “a processor having a plurality of compute engines….” Applicant is advised to review the claim limitations again and make the record clear as to what the limitation should be. For claim interpretation purposes, the present recitation in the RCE filing with the limitation being omitted is used. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1 and 11, and their respective dependent claims 2-10 and 12-20, are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

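The term “fast weight averaging” in claims 1 and 11 is a relative term which renders the claims indefinite. The term “fast weight averaging” is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not have been reasonably apprised of the scope of the invention.
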
Claims 1 and 11, and their respective dependent claims 8 and 18, are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Applicant recites “further comprising a processor” in dependent claims 8 and 18, but it is not clear whether this processor is the same as or different from “a processor” that is recited in the respective independent claims. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 6, 11, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Haruki et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0121806, hereinafter Haruki) in view of Boesch et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0189641, hereinafter Boesch).

Regarding claim 1, Haruki teaches:
An apparatus comprising: 
a plurality of compute engines to train a neural network ([0037]: describing a “system 200 for parallel training of a neural network model. The system 200 includes a computer system 100 as described in FIG. 1. The computer system 100 or host computer includes one or more processors 110 and a memory 120 with a training module 123. The computer system 100 further includes a network model 124 with gradient arrays 125 that are created and modified by the training module 123. The computer system 100 has one or more processors as described above coupled to one or more graphics processing units (GPUs) 210.” See also Figs. 2-3: showing a plurality of GPUs, which can comprise compute engines, and wherein the GPUs and computer processing units (CPUs) as part of the computer system are a type of hardware that can be used for computations. See Fig. 5 and [0040]: showing the training logic used by the computer system as shown in Fig. 2 to train the neural network.) comprising at least a first layer and a second layer ([0017] and [0020]: describing multiple layers in a neural network), 
the plurality of compute engines comprising at least a first set of compute engines coupled to a high-speed interconnect ([0037]: describing that “the GPUs 210 may reside elsewhere and be coupled to the host [computer system] 100 with a high speed communication link”. Wherein the computer system can comprise a computer processing unit (CPU) ([0038]).) and …; and 
a hardware engine to accelerate a weight update process for training the neural network ([0020]: “The use of [computer processing units] CPUs enables the power of advanced processors to be used to accelerate training by overlapping most of the communication overhead in the data parallelism behind the backward phase of training. This is particularly effective for convolutional neural networks.” Similarly, see [0038]-[0042]: describing the use of a CPU to help accelerate updating parameters in the training of a neural network, wherein the parameters comprise weights ([0002], [0003], and [0018]).), 
…; and
([0040]: describing that “the processing described in this section as being performed by the host [computer] 100 is preferably done by the training module 123 (FIG. 1 and FIG. 2). The host 100 gathers gradients from one or more connected GPUs 210. In this example, the host is connected to GPU0 210A and GPUk 210K, where k is a variable that represents that any number GPUs could be used.” Wherein each GPU processes a particular layer of the neural network as determined by the host computer and training module ([0041]). See also [0022]: describing that the computer system used “comprises of one or more processors”.).

While the cited reference teaches the above limitations of claim 1, it does not explicitly teach: “a second set of compute engines having a low power operation” and “the weight update process applying at least one of a fast weight averaging operation or a voting operation to weights in the neural network” on lines 8-9. Boesch discloses the claim limitations, teaching: 
“a second set of compute engines having a low power operation”: describing that “convolution accelerators [CAs] 600 may be arranged as described herein to implement low power (e.g., battery powered) neural networks” (Boesch [0127] and [0148]). Wherein the convolutional accelerators are part of the “configurable accelerator framework (CAF)” of a system-on-a-chip (SOC), which acts as a configurable “high-performance, energy efficient hardware accelerated DCNN [deep convolutional neural network] processor” (Boesch [0124] and [0149]-[0152]). That is, the SOC comprising CAs and CAFs can operate as a hardware accelerator compute engine for the DCNN, wherein the CAs can perform the computations (Boesch [0274]). 
“the weight update process applying at least one of a fast weight averaging operation or a voting operation to weights in the neural network”: describing that “votes” or a “list of votes” in correlation with weights can be used in a neural network (Boesch [0047]-[0048]). Wherein the neural network can have voting layers and can “automatically adjust weighting values that are applied in a voting layer” (Boesch [0050]-[0052]). 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited reference to include the voting operations in Boesch. Doing so would enable “a hardware accelerator engine that supports efficient mapping of convolutional stages of deep neural network algorithms. The hardware accelerator engine includes a plurality of convolution accelerators, and each one of the plurality of convolution accelerators includes a kernel buffer, a feature line buffer, and a plurality of multiply-accumulate (MAC) units.” (Boesch Abstract). 

Regarding claim 6, Haruki teaches:
The apparatus of claim 1, wherein: a decision making routine of the neural network executes on at least two different compute engines ([0038]: describing “FIG. 3 represents a timeline for actions taken on the computer processor (CPU 310) and two graphics processing units (GPU0 312A and GPU1 312B). As described above, training the neural network model typically includes three basic steps. These steps include forward computation 314, backward computation 316A, and parameter updating.” Wherein the training steps comprise a decision making routine and each GPU computes these training steps ([0038] and [0041]-[0044]).).

Regarding claim 11, Haruki teaches:
An electronic device, comprising: 
a plurality of compute engines to train a neural network ([0037]: describing a “system 200 for parallel training of a neural network model. The system 200 includes a computer system 100 as described in FIG. 1. The computer system 100 or host computer includes one or more processors 110 and a memory 120 with a training module 123. The computer system 100 further includes a network model 124 with gradient arrays 125 that are created and modified by the training module 123. The computer system 100 has one or more processors as described above coupled to one or more graphics processing units (GPUs) 210.” See also Figs. 2-3: showing a plurality of GPUs, which can comprise compute engines, and wherein the GPUs and computer processing units (CPUs) as part of the computer system are a type of hardware that can be used for computations. See Fig. 5 and [0040]: showing the training logic used by the computer system as shown in Fig. 2 to train the neural network.) comprising at least a first layer and a second layer ([0017] and [0020]: describing multiple layers in a neural network), 
the plurality of compute engines comprising at least a first set of compute engines coupled to a high-speed interconnect ([0037]: describing that “the GPUs 210 may reside elsewhere and be coupled to the host [computer system] 100 with a high speed communication link”. Wherein the computer system can comprise a computer processing unit (CPU) ([0038]).) and 

a hardware engine to accelerate a weight update process for training the neural network ([0020]: “The use of [computer processing units] CPUs enables the power of advanced processors to be used to accelerate training by overlapping most of the communication overhead in the data parallelism behind the backward phase of training. This is particularly effective for convolutional neural networks.” Similarly, see [0038]-[0042]: describing the use of a CPU to help accelerate updating parameters in the training of a neural network, wherein the parameters comprise weights ([0002], [0003], and [0018]).), 
…; and 
a processor to prioritize execution of the neural network by assigning the first layer to the first set of compute engines and the second layer to the second set of compute engines ([0040]: describing that “the processing described in this section as being performed by the host [computer] 100 is preferably done by the training module 123 (FIG. 1 and FIG. 2). The host 100 gathers gradients from one or more connected GPUs 210. In this example, the host is connected to GPU0 210A and GPUk 210K, where k is a variable that represents that any number GPUs could be used.” Wherein each GPU processes a particular layer of the neural network as determined by the host computer and training module ([0041]). See also [0022]: describing that the computer system used “comprises of one or more processors”.).

While the cited reference teaches the above limitations of claim 11, it does not explicitly teach: “a second set of compute engines having a low power operation” and “the weight update process applying at least one of a fast weight averaging operation or a voting operation to weights in the neural network” on lines 8-9. Boesch discloses the claim limitations, teaching: 
“a second set of compute engines having a low power operation”: describing that “convolution accelerators [CAs] 600 may be arranged as described herein to implement low power (e.g., battery powered) neural networks” (Boesch [0127] and [0148]). Wherein the convolutional accelerators are part of the “configurable accelerator framework (CAF)” of a system-on-a-chip (SOC), which acts as a configurable “high-performance, energy efficient hardware accelerated DCNN [deep convolutional neural network] processor” (Boesch [0124] and [0149]-[0152]). That is, the SOC comprising CAs and CAFs can operate as a hardware accelerator compute engine for the DCNN, wherein the CAs can perform the computations (Boesch [0274]). 
“the weight update process applying at least one of a fast weight averaging operation or a voting operation to weights in the neural network”: describing that “votes” or a “list of votes” in correlation with weights can be used in a neural network (Boesch [0047]-[0048]). Wherein the neural network can have voting layers and can “automatically adjust weighting values that are applied in a voting layer” (Boesch [0050]-[0052]). 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited reference to include the voting operations in Boesch. Doing so would enable “a hardware accelerator engine that supports efficient mapping of convolutional stages of deep neural network algorithms. The hardware accelerator engine includes a plurality of convolution accelerators, and each one of the plurality of convolution accelerators includes a kernel buffer, a feature line buffer, and a plurality of multiply-accumulate (MAC) units.” (Boesch Abstract). 

Regarding claim 16, claim 16 is substantially similar to claim 6 and therefore is rejected on the same ground as claim 6. Claim 16 is a device claim that corresponds to apparatus claim 6.

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Haruki et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0121806, hereinafter Haruki) and Boesch et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0189641, hereinafter Boesch) in view of Chapelle et al. (U.S. Pat. App. Pre-Grant Pub. No. 2013/0290223, hereinafter Chapelle).

Regarding claim 2, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “the hardware engine implements Chapelle discloses the claim limitations, teaching: “AllReduce operation is used to average these weights non-uniformly using the local gradients (local parameters)…. Concretely, node k maintains a local weight vector w and a diagonal matrix G' based on the gradients in the adaptive gradient updates (see Algorithm 1). The following weighted average is calculated over all in nodes….  The algorithm benefits from the fast reduction of error initially that an online algorithm provides, and rapid convergence in a good neighborhood guaranteed by Quasi-Newton algorithms.” (Chapelle [0058]-[0061]). Similarly, see also Algorithms 1 and 2 and [0053]-[0054]. See also Figs. 2-4: showing a plurality of nodes in a machine learning network. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Chapelle. Doing so would allow “for distributed machine learning on a cluster including a plurality of nodes…. A machine learning process is performed in each of the plurality of nodes based on a respective subset of training data to calculate a local parameter. The training data is partitioned over the plurality of nodes. A plurality of operation nodes are determined from the plurality of nodes based on a status of the machine learning process performed in each of the plurality of nodes. The plurality of operation nodes are connected to form a network topology. An aggregated parameter is generated by merging local parameters calculated in each of the plurality of operation nodes in accordance with the network topology.” (Chapelle Abstract).
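
For illustration only, the non-uniform averaging described in the quoted Chapelle passage can be sketched as follows. The quoted passage elides the actual formula, so the per-coordinate form used here (each node's weight scaled by its own diagonal entry, then normalized by the sum of those entries) is an assumption, as are the function name nonuniform_average, its parameters, and the fallback to a plain mean when every diagonal entry is zero. It runs on a single machine and does not reproduce an AllReduce implementation.

```python
from typing import List


def nonuniform_average(local_w: List[List[float]],
                       local_g: List[List[float]]) -> List[float]:
    """Average per-node weight vectors coordinate by coordinate.

    local_w[k][d] is node k's weight for coordinate d (hypothetical layout).
    local_g[k][d] is the matching diagonal entry of node k's scaling matrix,
    assumed to be non-negative.
    """
    if not local_w or len(local_w) != len(local_g):
        raise ValueError("need one weight vector and one diagonal per node")
    dim = len(local_w[0])
    averaged = []
    for d in range(dim):
        total_scale = sum(g[d] for g in local_g)
        if total_scale == 0.0:
            # Assumed fallback: no node carries scale information for this
            # coordinate, so use a plain (uniform) mean instead.
            averaged.append(sum(w[d] for w in local_w) / len(local_w))
        else:
            weighted_sum = sum(g[d] * w[d] for w, g in zip(local_w, local_g))
            averaged.append(weighted_sum / total_scale)
    return averaged
```

In this sketch, a node with a larger diagonal entry for a coordinate pulls the merged weight for that coordinate toward its own local value, which is one way to read "average these weights non-uniformly."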

Regarding claim 12, claim 12 is substantially similar to claim 2 and therefore is rejected on the same ground as claim 2. Claim 12 is a device claim that corresponds to apparatus claim 2.

Claims 3-5 and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Haruki et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0121806, hereinafter Haruki) and Boesch et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0189641, hereinafter Boesch) in view of Chapelle et al. (U.S. Pat. App. Pre-Grant Pub. No. 2013/0290223, hereinafter Chapelle) and Jin et al. (U.S. Pat. App. Pre-Grant Pub. No. 2016/0321777, hereinafter Jin).
Regarding claim 3, the rejection of claim 2 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “the neural network comprises a plurality of sub-neural networks; and each sub-neural network is trained separately.” Jin teaches the claim limitations, teaching: training wherein “for the single-GPU training, only one mini-batch can be trained within each training cycle, and the operation of updating model parameters is completed in passing after training of the mini-batch ends; a plurality of groups of mini-batch data are trained simultaneously at a plurality of GPUs, each data parallel group makes full use of exclusive GPU computing resources assigned to the group, a process of exchanging and updating parameters from various GPUs is further required when the training of the mini-batch ends, and finally each GPU holds the latest model copy, to continue the next training process.” (Jin [0073]). Similarly, see [0074], Table 1, and Figs. 4, 5, and 9-13. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Jin. Doing so would enable “[a] parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including one or more GPUs; binding each worker thread to a corresponding GPU; loading a plurality of batches of training data from a nonvolatile memory to GPU video memories in the plurality of worker groups; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads. The method can enhance efficiency of multi-GPU parallel data processing.” (Jin Abstract). That is, “parallel computing capability of the GPUs is fully used, thereby enhancing the training efficiency” (Jin [0088]). 

Regarding claim 4, the rejection of claim 3 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “further comprising logic, wherein: the plurality of sub-neural networks operate according to a priority.” Jin discloses the claim limitations, teaching: “during computing, the model parallel workers are advanced according to an array order: positively sequenced in the event of forward propagation, negatively sequenced in the event of backward propagation, thereby meeting the requirement for a computing sequence of network sub-models. Synchronization waiting control logic between the workers is controlled by a worker group engine on each worker so as to ensure parallelism and correctness of advances in model computing.” (Jin [0187]). Wherein the worker groups comprise GPUs comprising a “CNN network hierarchical model” (Jin [0120]-[0126]) in correlation with a CPU.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Jin. A motivation to combine the cited references with Jin is previously given.

Regarding claim 5, the rejection of claim 3 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “the output of a first sub-neural network may be provided as an input to a second sub-neural network”. Jin discloses the claim limitations, teaching: “for the intra-group multi-GPU model parallel training manner, the parameter exchange is only performed between correspondingly parts. For example, Worker(0,0) and Worker(1,0) exchange parameters therebetween, while Worker(0,1) and Worker(1,1) exchange parameters therebetween. That is to say, for the same model parts, the parameter exchange can be performed respectively according to the flows shown in FIG. 10 and FIG. 11. After each model part completes the parameter exchange, the model of each worker group is the latest complete model.” (Jin [0118]). Similarly, see Jin [0119]: describing that the convolutional neural network (CNN) model can be split across a plurality of GPUs and that “[t]he inc in FIG. 16 denotes waiting of a lower layer for an upper layer, that is, the result of training of the previous layer serves as input of next layer”. Similarly, see also [0120]-[0128] and Figs. 5, 10, 11, 13-16, and 19-20. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Jin. A motivation to combine the cited references with Jin is previously given.

Regarding claim 13, claim 13 is substantially similar to claim 3 and therefore is rejected on the same ground as claim 3. Claim 13 is a device claim that corresponds to apparatus claim 3.

Regarding claim 14, claim 14 is substantially similar to claim 4 and therefore is rejected on the same ground as claim 4. Claim 14 is a device claim that corresponds to apparatus claim 4. 

Regarding claim 15, claim 15 is substantially similar to claim 5 and therefore is rejected on the same ground as claim 5. Claim 15 is a device claim that corresponds to apparatus claim 5.

Claims 7-9 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Haruki et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0121806, hereinafter Haruki) and Boesch et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0189641, hereinafter Boesch) in view of Jin et al. (U.S. Pat. App. Pre-Grant Pub. No. 2016/0321777, hereinafter Jin).

Regarding claim 7, the rejection of claim 6 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “further comprising a processor to: compare results of the decision making routine executed on the at least two different compute engines.” Jin discloses the claim limitations, teaching: “When mini-batch training begins, an execution engine of each GPU starts at the same time. The execution engine judges whether each layer in a sub-model held by the worker (GPU) meets the requirement for executing forward propagation or backward propagation, and if yes, executes the forward propagation or backward propagation.” (Jin [0131]-[0132]). Similarly, see Jin [0133]-[0135]. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Jin. Doing so would enable “[a] parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including one or more GPUs; binding each worker thread to a corresponding GPU; loading a plurality of batches of training data from a nonvolatile memory to GPU video memories in the plurality of worker groups; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads. The method can enhance efficiency of multi-GPU parallel data processing.” (Jin Abstract). That is, “parallel computing capability of the GPUs is fully used, thereby enhancing the training efficiency” (Jin [0088]).

Regarding claim 8, Haruki teaches:
The apparatus of claim 7, further comprising a processor to: continue processing if the results of the decision making routine executed on the at least two different compute engines match ([0019]: describing a server GPU whereby “every GPU has the same complete neural network and trains it with different inputs. Once they have all finished the backward phase, they exchange gradients and a server GPU updates the parameters. The updated parameters are synchronized among the GPUs at the beginning of the next training iteration to ensure that all GPUs use the same parameters for training.”).

Regarding claim 9, the rejection of claim 8 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “further comprising a thread scheduler to: generate a cyclical redundancy check (CRC) using the results of the decision making routine executed on the at least two different compute engines.” Jin discloses the claim limitations, teaching: “[i]n order [for] parallelism effectiveness when multiple GPUs jointly participate in computing, one GPU may be bound to one CPU thread (worker thread, also referred to as worker), and then scheduling of data parallel training is implemented through the CPU threads in a CPU context. In one example, a binding relationship between CPU threads and GPUs as well as GPU worker groups is as shown in Table 1.” (Jin [0074]). Similarly, see Jin [0087] and [0156]: describing the CPU threads. 
See also Jin [0164]-[0166]: describing a “cycle within a cycle where the sequence number is k (k is an integer and 1 ≤ k ≤ 2N-1), replicating a preset partition in the N partitions from a GPU whose sequence number is i to a GPU whose sequence number is j, and merging the gradients, wherein i = (2m+k+1)% N, j = (2m+k+2)% N, m is an integer and 0 ≤ m ≤ N-1; and for partition owners in the 2N GPUs, updating the model parameters according to gradient merging results in the corresponding partitions, wherein the partition owners are GPUs having gradient merging results in all other GPUs for a preset partition.” Similarly, see Jin [0168]-[0170] and [0096]-[0102]. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the computation in Jin. A motivation to combine the cited references with Jin is previously given.
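
For illustration only, the index arithmetic quoted from Jin [0164]-[0166] can be tabulated with a short sketch. The function name transfer_pairs and its return layout are hypothetical; the sketch only evaluates the quoted expressions i = (2m+k+1) % N and j = (2m+k+2) % N for each m in 0..N-1 and does not reproduce Jin's gradient merging or parameter update steps.

```python
from typing import List, Tuple


def transfer_pairs(n: int, k: int) -> List[Tuple[int, int, int]]:
    """Return (m, i, j) for one cycle k, using the quoted index expressions.

    n is the number of partitions (N in the quoted passage) and k is the
    cycle sequence number, with 1 <= k <= 2N-1 as quoted.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 1 <= k <= 2 * n - 1:
        raise ValueError("k must satisfy 1 <= k <= 2N-1")
    pairs = []
    for m in range(n):  # 0 <= m <= N-1
        i = (2 * m + k + 1) % n  # source sequence number
        j = (2 * m + k + 2) % n  # destination sequence number
        pairs.append((m, i, j))
    return pairs
```

Under these expressions the destination index is always the source index plus one, modulo N, so each step in a cycle passes a partition to the next sequence number in a ring.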

Regarding claim 17, claim 17 is substantially similar to claim 7 and therefore is rejected on the same ground as claim 7. Claim 17 is a device claim that corresponds to apparatus claim 7.

Regarding claim 18, claim 18 is substantially similar to claim 8 and therefore is rejected on the same ground as claim 8. Claim 18 is a device claim that corresponds to apparatus claim 8.

Regarding claim 19, claim 19 is substantially similar to claim 9 and therefore is rejected on the same ground as claim 9. Claim 19 is a device claim that corresponds to apparatus claim 9.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Haruki et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0121806, hereinafter Haruki) and Boesch et al. (U.S. Pat. App. Pre-Grant Pub. No. 2018/0189641, hereinafter Boesch) in view of Intel, “The Compute Architecture of Intel® Processor Graphics Gen9” (hereinafter Intel).

Regarding claim 10, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “wherein the plurality of compute engines are on a single integrated circuit”. Intel discloses the claim limitations, teaching: Section 5.6 and Figs. 6-9: showing various configurations of a plurality of compute engines on a single system-on-a-chip (SoC), i.e., an integrated circuit. Similarly, see also Section 5. Wherein the “gen9 compute architecture is designed for scalability across a wide range of target products. The architecture's modularity enables exact product targeting to a particular market segment or product power envelope. The architecture begins with compute components called execution units. Execution units are clustered into groups called subslices.” (Intel Section 5.2). 
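Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the components in the cited references to include the single-integrated-circuit configuration in Intel. Doing so enables “the entire gen9 compute architecture interfaces to the rest of the SoC components via a dedicated unit called the graphics technology interface”. 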

Regarding claim 20, claim 20 is substantially similar to claim 10 and therefore is rejected on the same ground as claim 10. Claim 20 is a device claim that corresponds to apparatus claim 10.

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure:
Chilimbi et al. (U.S. Pat. App. Pre-Grant Pub. No. 2015/0324690): describing an accelerated training process for large deep neural networks to obtain faster computation. Wherein the accelerated training process can comprise training across a partitioning of the neural network models across a plurality of training machines, as well as fast data serving and fast weight updates. 
Wang et al., “DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family”: describing hardware acceleration for neural networks using accelerators, wherein the accelerators result in lower power consumption and faster computation speed. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SELENE A HAEDI whose telephone number is (571)270-5762.  The examiner can normally be reached on M-F 11 AM - 7 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on (571)272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.H./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121