DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 05/27/2022 has been entered.
 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks submitted with the RCE filed on 05/27/2022, which has been entered. In the current amendments, claims 1, 9, 12 and 15 are amended and claims 4-5 and 18 were previously cancelled. Claims 1-3, 6-17, and 19-20 are pending and have been examined.
In response to the amendments and remarks filed on 05/27/2022, the objection to the Specification, the 35 U.S.C. 112(f) Claim Interpretation, and the 35 U.S.C. 112(a) rejection made in the previous Office Action have been withdrawn.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-3, 6-17 and 19-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation "the same minibatches" in line 16.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same minibatches" has been interpreted as "a same minibatches".
Claim 2 recites the limitation "the worker computing devices" in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the worker computing devices" has been interpreted as "the plurality of worker computing devices".
Claim 9 recites the limitation "the same minibatches" in line 20.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same minibatches" has been interpreted as "a same minibatches".
Claim 14 recites the limitation "the plurality of the worker computing devices" (emphasis added) in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the plurality of the worker computing devices" has been interpreted as "the plurality of worker computing devices".
Claim 15 recites the limitation "the same minibatches" in line 16.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same minibatches" has been interpreted as "a same minibatches".
Claim 16 recites the limitation "the worker computing devices" in line 3.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the worker computing devices" has been interpreted as "the plurality of worker computing devices".

Each dependent claim is rejected based on the same rationale of the claim from which it depends.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6-9, 11-13, 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1) and further in view of Luo et al. (“Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training”).
Regarding Claim 1,
Seide et al. teaches A computer-implemented method, comprising (Fig. 1 and Fig. 5 teach a computer-implemented method for parallelizing training of a DNN model; also see pg. 1 [0012]):
generating a profile of a deep neural network (DNN) model, the DNN model comprising a plurality of layers (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to a profile because the ratio value represents a description of the characteristics of the DNN layers);
...wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches placing the top layer in another grouping (corresponds to another grouping, or another stage)), and 
wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN);
assigning the plurality of stages to a plurality of worker computing devices based upon the partitioning (Fig. 5 Step 506 teaches grouping two layers (corresponds to first grouping, or first stage), which is assigned to a multi-core processor of the plurality of multi-core processors, and Fig. 5 Step 508-510 teaches placing the top layer in another grouping (corresponds to another grouping, or another stage), which is assigned to multiple multi-core processors);
...one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of the same minibatches of training data (pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing same batches of training data).
Seide et al. does not appear to explicitly teach partitioning the layers of the DNN model into a plurality of stages based on the profile...one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different minibatches of training data.
However, Chilimbi et al. teaches partitioning the layers of the DNN model into a plurality of stages based on the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different minibatches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach training the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage.
However, Luo et al. teaches training the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage (Fig. 3 and pg. 2 Section 2: “Modern neural networks can have hundreds of layers making up multi-megabyte-size models. The training process has three phases. In the forward pass, a prediction is generated for an input. In the backward pass, the prediction is compared with a label to calculate prediction error; then, through backpropagation [49], the gradient for each parameter is calculated with respect to this error. The model is then updated using these gradients, often using a variant of the gradient descent optimization algorithm. Computation is often done on GPUs or other accelerators suited to regular data-parallel operations, processing tens to hundreds of samples at once (minibatching). The distributed training process (Figure 3) is different in a few ways. First, a mean gradient is calculated across all minibatches in all the GPUs in each machine. Then, the mean of the gradients from each machine is calculated. Finally, the model is updated based on that mean, new parameters are broadcast to each machine and GPU, and the next batch is trained. This paper focuses on optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” teach distributed training of a deep neural network (DNN) using a one-forward one-backward scheduling policy (see Fig. 3), wherein worker computing devices perform forward processing followed by backward processing for layers of DNN in an assigned iteration (stage), and wherein Fig. 3 shows the distributed training process is in a steady state where all workers are performing processing).
Seide et al., Chilimbi et al., and Luo et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Luo et al. into the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the art would have been motivated to make this modification in order to perform “optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” by leveraging “a high performance, multitenant, rack-scale” parameter server design for cloud-based distributed deep neural network training (Luo et al. pg. 1 last full paragraph & pg. 2 Section 2).
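For illustration only, the one-forward one-backward scheduling limitation discussed above can be sketched in Python as follows. The sketch reflects one common reading of such a policy (a short warm-up of forward passes, then strict alternation of one forward and one backward pass in the steady state) and is an assumption made for clarity; it is not taken from the claims or from any cited reference. The function name one_f_one_b_schedule and its parameters are hypothetical.

```python
def one_f_one_b_schedule(stage, num_stages, num_minibatches):
    """Return the ordered work list for one pipeline stage.

    Hypothetical illustration: stages are numbered from 0 (input side)
    to num_stages - 1 (output side). Each entry is ("F", m) for forward
    processing of minibatch m or ("B", m) for backward processing of it.
    """
    # Warm-up: earlier stages run a few extra forward passes so that
    # later stages have work available.
    warmup = min(num_stages - stage - 1, num_minibatches)
    schedule = [("F", m) for m in range(warmup)]

    next_fwd, next_bwd = warmup, 0
    # Steady state: alternate one forward pass with one backward pass.
    while next_fwd < num_minibatches:
        schedule.append(("F", next_fwd))
        next_fwd += 1
        schedule.append(("B", next_bwd))
        next_bwd += 1

    # Drain: finish the backward passes still outstanding.
    while next_bwd < num_minibatches:
        schedule.append(("B", next_bwd))
        next_bwd += 1
    return schedule


if __name__ == "__main__":
    for s in range(3):
        print(s, one_f_one_b_schedule(s, num_stages=3, num_minibatches=4))
```

Under these assumptions, the last stage alternates forward and backward processing of the same minibatch, while earlier stages interleave forward processing of a later minibatch with backward processing of an earlier one.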
Regarding Claim 3,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).
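For illustration only, the load-balancing concept cited above (assigning groups of layers so that the amount of processing per processor is "equalized or as equalized as possible") can be sketched in Python as follows. The sketch assumes that each layer has a known numeric cost and that groups consist of consecutive layers; the dynamic-programming approach, the function name balanced_partition, and the example costs are hypothetical and are not taken from Seide et al. or from the claims.

```python
from itertools import accumulate


def balanced_partition(layer_costs, num_workers):
    """Split consecutive layers into num_workers groups, minimizing the
    cost of the most heavily loaded group.

    Hypothetical illustration; assumes len(layer_costs) >= num_workers.
    Returns (groups, max_group_cost), where groups holds layer indices.
    """
    n = len(layer_costs)
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[k][i]: smallest achievable maximum group cost when the first
    # i layers are split into k groups; cut[k][i]: start of the last group.
    best = [[inf] * (n + 1) for _ in range(num_workers + 1)]
    cut = [[0] * (n + 1) for _ in range(num_workers + 1)]
    best[0][0] = 0
    for k in range(1, num_workers + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the groups.
    groups, i = [], n
    for k in range(num_workers, 0, -1):
        j = cut[k][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return groups, best[num_workers][n]


if __name__ == "__main__":
    print(balanced_partition([4, 2, 2, 4, 4, 8], num_workers=3))
```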
Regarding Claim 6,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024]).
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).


Regarding Claim 7,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024]).
Chilimbi et al. further teaches whereby the multiple worker computing devices are assigned to process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different minibatches of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Regarding Claim 8,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024])...
and wherein multiple worker computing devices of the plurality of worker computing devices are configured for data parallel processing (Fig. 1 and Fig. 3 teach multiple multi-core processors (worker computing devices) are configured for data parallel processing),
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models)...
whereby the multiple worker computing devices are configured to process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different minibatches of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Regarding Claim 9,
Seide et al. teaches A computing device, comprising: one or more processors; and at least one non-transitory computer storage medium having computer-executable instructions stored thereupon which, when executed by the one or more processors, will cause the computing device to (see pg. 3 [0024], [0028], and Fig. 2):
partition layers of a deep neural network (DNN) model into a plurality of stages, wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches partitioning the layers of a DNN by grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches partitioning the layers of a DNN by placing the top layer in another grouping (corresponds to another grouping, or another stage)), and 
wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN);
assign the plurality of stages to each of a plurality of worker computing devices based upon the partitioning (Fig. 5 Step 506 teaches grouping two layers (corresponds to first grouping, or first stage), which is assigned to a multi-core processor of the plurality of multi-core processors, and Fig. 5 Step 508-510 teaches placing the top layer in another grouping (corresponds to another grouping, or another stage), which is assigned to multiple multi-core processors);
...one or more of the plurality of worker computing devices alternate between forward and backward processing of the same minibatches of training data (pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing same batches of training data).


Seide et al. does not appear to explicitly teach one or more of the plurality of worker computing devices alternate between performing forward and backward processing of different minibatches of training data.
However, Chilimbi et al. teaches one or more of the plurality of worker computing devices alternate between performing forward and backward processing of different minibatches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach train the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage.
However, Luo et al. teaches train the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage (Fig. 3 and pg. 2 Section 2: “Modern neural networks can have hundreds of layers making up multi-megabyte-size models. The training process has three phases. In the forward pass, a prediction is generated for an input. In the backward pass, the prediction is compared with a label to calculate prediction error; then, through backpropagation [49], the gradient for each parameter is calculated with respect to this error. The model is then updated using these gradients, often using a variant of the gradient descent optimization algorithm. Computation is often done on GPUs or other accelerators suited to regular data-parallel operations, processing tens to hundreds of samples at once (minibatching). The distributed training process (Figure 3) is different in a few ways. First, a mean gradient is calculated across all minibatches in all the GPUs in each machine. Then, the mean of the gradients from each machine is calculated. Finally, the model is updated based on that mean, new parameters are broadcast to each machine and GPU, and the next batch is trained. This paper focuses on optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” teach distributed training of a deep neural network (DNN) using a one-forward one-backward scheduling policy (see Fig. 3), wherein worker computing devices perform forward processing followed by backward processing for layers of DNN in an assigned iteration (stage), and wherein Fig. 3 shows the distributed training process is in a steady state where all workers are performing processing).
Seide et al., Chilimbi et al., and Luo et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Luo et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to perform “optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” by leveraging “a high performance, multitenant, rack-scale” parameter server design for cloud-based distributed deep neural network training (Luo et al. pg. 1 last full paragraph & pg. 2 Section 2).
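For illustration only, the distributed training process quoted above from Luo et al. Section 2 (a mean gradient per machine, then a mean across machines, then a model update) can be sketched as follows. This is a minimal sketch and not code from any cited reference: the function names, the plain gradient-descent update, the learning rate, and the sample values are all assumptions, and each machine is assumed to hold the same number of equally sized minibatches so that a mean of means equals the overall mean.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equal-length gradient vectors."""
    return [fmean(column) for column in zip(*vectors)]


def hierarchical_mean_gradient(machines):
    """Average per-GPU gradients within each machine, then across machines.

    `machines` is a list of machines; each machine is a list of per-GPU
    gradient vectors (hypothetical layout chosen for this sketch).
    """
    per_machine = [mean_vector(gpu_grads) for gpu_grads in machines]
    return mean_vector(per_machine)


def sgd_step(params, grad, lr):
    """Plain gradient-descent update (an assumed optimizer for illustration)."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two machines with two GPUs each; every GPU reports a 2-parameter gradient.
machines = [
    [[1.0, 2.0], [3.0, 4.0]],  # machine 0
    [[5.0, 6.0], [7.0, 8.0]],  # machine 1
]
global_grad = hierarchical_mean_gradient(machines)
new_params = sgd_step([0.0, 0.0], global_grad, lr=0.5)
```

In this sketch the updated parameters would then be broadcast to every machine and GPU before the next batch is processed; the broadcast itself is not modeled.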
Regarding Claim 11,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).
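For illustration only, one way to group layers so that the amount of processing per processor is "equalized or as equalized as possible" can be sketched as follows. This is a minimal sketch under stated assumptions and is not the algorithm of the load balance module 220 of Seide et al., which the quoted paragraph does not specify: the per-layer costs are hypothetical numbers, the groups are assumed to be contiguous runs of layers, and the dynamic program shown is a generic technique that minimizes the largest group cost.

```python
def balanced_partition(costs, num_groups):
    """Split `costs` into `num_groups` contiguous, non-empty groups so that
    the largest group sum is as small as possible.

    Returns (groups, max_load), where `groups` holds layer indices.
    """
    n = len(costs)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and the number of layers")

    # prefix[i] is the total cost of the first i layers.
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[k][i]: smallest achievable maximum group sum when the first i
    # layers are split into k groups; cut[k][i]: where the last group starts.
    best = [[inf] * (n + 1) for _ in range(num_groups + 1)]
    cut = [[0] * (n + 1) for _ in range(num_groups + 1)]
    best[0][0] = 0
    for k in range(1, num_groups + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the groups.
    groups = []
    i = n
    for k in range(num_groups, 0, -1):
        j = cut[k][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return groups, best[num_groups][n]


# Hypothetical per-layer costs for eight layers, assigned to four processors.
layer_costs = [4, 2, 2, 4, 3, 1, 5, 3]
groups, max_load = balanced_partition(layer_costs, 4)
```

Each returned group would be assigned to one processor; `max_load` is the cost carried by the most heavily loaded processor under this assumed cost model.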


Regarding Claim 12,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices performs model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; pg. 1 [0012] teaches the multi-core processors can be GPUs; also see pg. 3 [0024])...
and wherein multiple worker computing devices of the plurality of worker computing devices perform data parallel processing (Fig. 1 and Fig. 3 teach multiple multi-core processors (worker computing devices) are configured for data parallel processing),
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models; pg. 2 [0036] teaches the processing cores can be GPUs)...
whereby the multiple worker computing devices process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different minibatches  of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches  of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).

Regarding Claim 13,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein the at least one non-transitory computer storage medium has further computer-executable instructions stored thereupon to (see pg. 3 [0024], [0028], and Fig. 2):
generate a profile of the deep neural network (DNN) model (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to profile since the ratio value represents a description of the characteristics of DNN layers).
Chilimbi et al. further teaches partition the layers of the DNN model into the plurality of stages based upon the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Regarding Claim 15,
Seide et al. teaches A non-transitory computer storage medium having computer-executable instructions stored thereupon which, when executed by one or more processors of a computing device, will cause the computing device to (see pg. 3 [0024], [0028], and Fig. 2):
partition the layers of a deep neural network (DNN) model into a plurality of stages, wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches partitioning the layers of a DNN by grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches partitioning the layers of a DNN by placing the top layer in another grouping (corresponds to another grouping, or another stage)),
and wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN);
assign the plurality of stages to a plurality of worker computing devices based on the partitioning (Fig. 5 Step 506 teaches grouping two layers (corresponds to first grouping, or first stage), which is assigned to a multi-core processor of the plurality of multi-core processors, and Fig. 5 Step 508-510 teaches placing the top layer in another grouping (corresponds to another grouping, or another stage), which is assigned to multiple multi-core processors);
...one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of the same minibatches of training data (pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing same batches of training data).


Seide et al. does not appear to explicitly teach one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different minibatches of training data.
However, Chilimbi et al. teaches one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different minibatches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different minibatches  of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach train the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage.
However, Luo et al. teaches train the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage (Fig. 3 and pg. 2 Section 2: “Modern neural networks can have hundreds of layers making up multi-megabyte-size models. The training process has three phases. In the forward pass, a prediction is generated for an input. In the backward pass, the prediction is compared with a label to calculate prediction error; then, through backpropagation [49], the gradient for each parameter is calculated with respect to this error. The model is then updated using these gradients, often using a variant of the gradient descent optimization algorithm. Computation is often done on GPUs or other accelerators suited to regular data-parallel operations, processing tens to hundreds of samples at once (minibatching). The distributed training process (Figure 3) is different in a few ways. First, a mean gradient is calculated across all minibatches in all the GPUs in each machine. Then, the mean of the gradients from each machine is calculated. Finally, the model is updated based on that mean, new parameters are broadcast to each machine and GPU, and the next batch is trained. This paper focuses on optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” teach distributed training of a deep neural network (DNN) using a one-forward one-backward scheduling policy (see Fig. 3), wherein worker computing devices perform forward processing followed by backward processing for layers of DNN in an assigned iteration (stage), and wherein Fig. 3 shows the distributed training process is in a steady state where all workers are performing processing).
Seide et al., Chilimbi et al., and Luo et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Luo et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to perform “optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” by leveraging “a high performance, multitenant, rack-scale” parameter server design for cloud-based distributed deep neural network training (Luo et al. pg. 1 last full paragraph & pg. 2 Section 2).
Regarding Claim 17,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).


Regarding Claim 19,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. further teaches wherein the at least one non-transitory computer storage medium has further computer-executable instructions stored thereupon to (see pg. 3 [0024], [0028], and Fig. 2):
generate a profile of the deep neural network (DNN) model (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to profile since the ratio value represents a description of the characteristics of DNN layers).
Chilimbi et al. further teaches partition the layers of the DNN model into the plurality of stages based upon the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).

Claims 2, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1) in view of Luo et al. (“Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training”) and further in view of Teerapittayanon et al. (“Distributed Deep Neural Networks over the Cloud, the Edge and End Devices”).
Regarding Claim 2,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computer-implemented method of claim 1.
Seide et al. in view of Chilimbi et al. in view of Luo et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., Luo et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al. in view of Luo et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).
Regarding Claim 10,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computing device of claim 9.
Seide et al. in view of Chilimbi et al. in view of Luo et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., Luo et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al. in view of Luo et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).
Regarding Claim 16,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. in view of Chilimbi et al. in view of Luo et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., Luo et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al. in view of Luo et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).

Claims 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1) in view of Luo et al. (“Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training”) and further in view of Wesolowski et al. (US 2019/0114537 A1).
Regarding Claim 14,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the computing device of claim 13. 
Seide et al. in view of Chilimbi et al. in view of Luo et al. does not appear to explicitly teach wherein the profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time.
However, Wesolowski et al. teaches wherein the profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time (pg. 10 [0066]: “In order to better manage, or schedule, the transferring-of-training of a neural network ML model, each machine (or training group or computing system), may generate checkpoints at different times during the training of a neural network. The generation of checkpoints may be controlled by Master ML controller 21, or may be instigated by the machine (or training group of machines) that is training a neural network, in response to various triggering events, or conditions. A check point may be a record of an execution state in the training of a neural network (or graph-segment) with sufficient information to restart the training of the neural network (or graph-segment)” teaches generating a check point (for example, a record of an execution state) of the neural network, which corresponds to profile of the neural network, by training the neural network on a training group of machines (worker computing devices); pg. 12 [0073]: “For example, if training is being executed on a training group made up of service server from bank 27 during off-peak hours, and it is determined that peak hours are approaching, a check-point may be created in anticipation of transferring training off the service machines because of the peak hours” teaches training the neural network during a specific pre-determined period of time (for example, during off-peak hours; see pg. 10 [0063] for what is considered off-peak and peak hours); pg. 1 [0005] teaches that the neural network can be a deep neural network).
Seide et al., Chilimbi et al., Luo et al., and Wesolowski et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Wesolowski et al. to the disclosed invention of Seide et al. in view of Chilimbi et al. in view of Luo et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to “make use of service computing machines during their off-peak hours for training a machine learning model” through “[providing] for heterogeneous computing for training a machine learning model across different computing systems having different computer architectures/characteristics” (Wesolowski et al. pg. 2 [0021]).
Regarding Claim 20,
Seide et al. in view of Chilimbi et al. in view of Luo et al. teaches the non-transitory computer storage medium of claim 15. 
Seide et al. in view of Chilimbi et al. in view of Luo et al. does not appear to explicitly teach wherein a profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time.
However, Wesolowski et al. teaches wherein a profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time (pg. 10 [0066]: “In order to better manage, or schedule, the transferring-of-training of a neural network ML model, each machine (or training group or computing system), may generate checkpoints at different times during the training of a neural network. The generation of checkpoints may be controlled by Master ML controller 21, or may be instigated by the machine (or training group of machines) that is training a neural network, in response to various triggering events, or conditions. A check point may be a record of an execution state in the training of a neural network (or graph-segment) with sufficient information to restart the training of the neural network (or graph-segment)” teaches generating a check point (for example, a record of an execution state) of the neural network, which corresponds to profile of the neural network, by training the neural network on a training group of machines (worker computing devices); pg. 12 [0073]: “For example, if training is being executed on a training group made up of service server from bank 27 during off-peak hours, and it is determined that peak hours are approaching, a check-point may be created in anticipation of transferring training off the service machines because of the peak hours” teaches training the neural network during a specific pre-determined period of time (for example, during off-peak hours; see pg. 10 [0063] for what is considered off-peak and peak hours); pg. 1 [0005] teaches that the neural network can be a deep neural network).
Seide et al., Chilimbi et al., Luo et al., and Wesolowski et al. are analogous art to the claimed invention because they are directed to distributed machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitation, as taught by Wesolowski et al., into the disclosed invention of Seide et al. in view of Chilimbi et al. in view of Luo et al.
One of ordinary skill in the art would have been motivated to make this modification in order to “make use of service computing machines during their off-peak hours for training a machine learning model” through “[providing] for heterogeneous computing for training a machine learning model across different computing systems having different computer architectures/characteristics” (Wesolowski et al. pg. 2 [0021]).
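For illustration only, the checkpoint concept quoted from Wesolowski et al. [0066] (a record of an execution state with sufficient information to restart training) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of any cited reference: the toy one-weight model, the learning rate, and the names `train_steps` and `make_checkpoint` are hypothetical.

```python
import copy


def train_steps(state, num_steps):
    """Toy training loop (hypothetical): minimizes weight**2 by gradient descent."""
    for _ in range(num_steps):
        gradient = 2.0 * state["weight"]      # derivative of weight**2
        state["weight"] -= 0.25 * gradient    # assumed learning rate of 0.25
        state["step"] += 1
    return state


def make_checkpoint(state):
    """Record the execution state so that training can be restarted from it."""
    return copy.deepcopy(state)


# Uninterrupted run of four steps.
reference = train_steps({"weight": 8.0, "step": 0}, 4)

# Interrupted run: two steps, checkpoint, then resume (possibly on another machine).
partial = train_steps({"weight": 8.0, "step": 0}, 2)
checkpoint = make_checkpoint(partial)
resumed = train_steps(copy.deepcopy(checkpoint), 2)
```

In this sketch the resumed run reaches the same state as the uninterrupted run, which is the property that makes the recorded state sufficient for restarting training.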

Response to Arguments
Applicant's arguments filed on 05/27/2022 with respect to the 35 U.S.C. 112(b) rejection of claims 1-3, 6-17 and 19-20 have been fully considered but they are not persuasive. Applicant asserts “the Applicant has amended claims 1, 9, and 15 hereby to remove the recitation of "the same minibatches." Accordingly, the Applicant respectfully requests that these rejections be withdrawn” (Remarks, pg. 12).

Examiner’s Response: 
The Examiner respectfully disagrees. Independent claims 1, 9, and 15 currently recite “the same minibatches”, which lacks antecedent basis in the claims. Therefore, the 35 U.S.C. 112(b) rejection of claims 1, 9, and 15 is maintained.

Applicant's arguments filed on 05/27/2022 with respect to the 35 U.S.C. 103 rejection of claims 1-3, 6-17 and 19-20 have been fully considered but they are not persuasive. Applicant asserts the following: “With regard to amended independent claim 1, the cited references do not teach, suggest, describe, or otherwise render obvious the recitations of this claim as amended hereby for "training the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage, one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of the same mini batches of training data, and one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different mini batches of training data." The Office alleges that Chilimbi discloses these recitations. The Applicant respectfully disagrees” and “Seide also provides no such teaching or suggestion. The cited references do not, therefore, teach or suggest each and every recitation of amended independent claim 1, even if combined in the manner suggested in the Office Action” (Remarks, pg. 12-13).
Examiner’s Response: 
The Examiner respectfully disagrees. Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references.
The Office Action does not assert that Chilimbi teaches "training the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage, one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of the same mini batches of training data", but rather that Chilimbi teaches that one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of different minibatches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach that multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2; Fig. 3 teaches that training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches that the DNN model can be trained with different minibatches of input training data).
Moreover, Luo et al. teaches training the DNN model using a one-forward one-backward scheduling policy, whereby in a steady state each of the plurality of worker computing devices performs forward processing and backward processing for the layers of the DNN model in an assigned stage (Fig. 3 and pg. 2 Section 2: “Modern neural networks can have hundreds of layers making up multi-megabyte-size models. The training process has three phases. In the forward pass, a prediction is generated for an input. In the backward pass, the prediction is compared with a label to calculate prediction error; then, through backpropagation [49], the gradient for each parameter is calculated with respect to this error. The model is then updated using these gradients, often using a variant of the gradient descent optimization algorithm. Computation is often done on GPUs or other accelerators suited to regular data-parallel operations, processing tens to hundreds of samples at once (minibatching). The distributed training process (Figure 3) is different in a few ways. First, a mean gradient is calculated across all minibatches in all the GPUs in each machine. Then, the mean of the gradients from each machine is calculated. Finally, the model is updated based on that mean, new parameters are broadcast to each machine and GPU, and the next batch is trained. This paper focuses on optimizing calculation for both the mean gradient across machines and subsequent model updates (or parameter exchange)” teaches distributed training of a deep neural network (DNN) using a one-forward one-backward scheduling policy (see Fig. 3), wherein worker computing devices perform forward processing followed by backward processing for layers of the DNN in an assigned iteration (stage), and wherein Fig. 3 shows that the distributed training process is in a steady state where all workers are performing processing).
Furthermore, Seide et al. teaches that one or more of the plurality of worker computing devices alternate between performing forward processing and backward processing of the same minibatches of training data (pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches that the pipelined algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach that the pipelined algorithm (which includes forward propagation and error back propagation) trains the DNN by processing the same batches of training data).
Applicant relies on the above arguments for independent claims 9 and 15 and for the respective dependent claims of each independent claim; the response above therefore applies to those claims as well.
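For illustration only, the distributed training step quoted above from Luo et al. (a mean gradient per machine across its GPUs, then a mean across machines, then a model update) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from any cited reference: the names `mean_vector` and `distributed_step`, the plain gradient-descent update, and the sample gradient values are hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equally sized gradient vectors."""
    return [fmean(column) for column in zip(*vectors)]


def distributed_step(params, grads_by_machine, learning_rate):
    """One update: average within each machine, then across machines, then update."""
    # Step 1: within each machine, average the per-GPU minibatch gradients.
    machine_means = [mean_vector(gpu_grads) for gpu_grads in grads_by_machine]
    # Step 2: average the per-machine means across all machines.
    global_mean = mean_vector(machine_means)
    # Step 3: assumed plain gradient-descent update; the result would then be
    # broadcast back to every machine and GPU before the next batch.
    return [p - learning_rate * g for p, g in zip(params, global_mean)]


# Two machines with two GPUs each; every GPU reports a two-parameter gradient.
grads_by_machine = [
    [[1.0, 2.0], [3.0, 4.0]],  # machine 0
    [[5.0, 6.0], [7.0, 8.0]],  # machine 1
]
new_params = distributed_step([10.0, 10.0], grads_by_machine, learning_rate=0.5)
```

With these sample values the per-machine means are [2.0, 3.0] and [6.0, 7.0], the cross-machine mean is [4.0, 5.0], and the updated parameters are [8.0, 7.5].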


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YING YU CHEN, whose telephone number is (571) 270-1484. The examiner can normally be reached Monday-Friday 7:30 am-5:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YING YU CHEN/               Examiner, Art Unit 2125