DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks filed on 10/26/2021. In the current amendments, claims 4-5 and 18 are cancelled and claims 1-3, 6-9, and 11-20 are amended. Claims 1-3, 6-17 and 19-20 are pending and have been examined.
In response to amendments and remarks filed on 10/26/2021, the objection to claims 1-14, the 35 U.S.C. 101 rejection of claims 9-20, and the 35 U.S.C. 102 rejection of claims 9, 11, 12, 15, 17, and 18 put forth in the previous Office Action have been withdrawn.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 11/17/2021. The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Specification
The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Correction of the following is required: the amended limitation “causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different mini batches of training data” in claims 1-3, 6-9, and 11-20 does not have proper antecedent basis in the Specification. 

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
Claim 12:
the multiple worker computing devices are configured to
multiple worker computing devices of the plurality of worker computing devices are configured for data parallel processing, whereby the multiple worker computing devices are configured to process the layers of the DNN in a stage, each of the multiple worker computing devices processing different minibatches of the DNN training data during training.
Upon a review of the Specification, description of the above limitations is found in at least [0034], [0043], [0048], and Fig. 1, which identify that the structure for “worker computing devices” performing the claimed functions is a GPU.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-3, 6-17 and 19-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. The amended limitation “causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different mini batches of training data” does not appear to be described in the specification.
Each dependent claim is rejected based on the same rationale of the claim from which it depends.




The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-3, 6-17 and 19-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation "the same mini batches" in line 13.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same mini batches" has been interpreted as "a same mini batches".
Claim 9 recites the limitation "the same mini batches" in lines 14-15.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same mini batches" has been interpreted as "a same mini batches".
Claim 15 recites the limitation "the same mini batches" in line 11.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the same mini batches" has been interpreted as "a same mini batches".
Each dependent claim is rejected based on the same rationale of the claim from which it depends.




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6-9, 11-13, 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1).
Regarding Claim 1,
Seide et al. teaches A computer-implemented method, comprising (Fig. 1 and Fig. 5 teach a computer-implemented method for parallelizing training of a DNN model; also see pg. 1 [0012]):
generating a profile of a deep neural network (DNN) model, the DNN model comprising a plurality of layers (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to a profile since the ratio value represents a description of the characteristics of DNN layers);
...wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches placing the top layer in another grouping (corresponds to another grouping, or another stage)), and 
wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN); 
assigning layers of the DNN model to each of a plurality of worker computing devices based upon the partitioning (pg. 7 [0066]: “At block 506, the training engine 102 may group at least two layers of the DNNs 112 for processing on a single multi-core processor. In various embodiments, the training engine 102 may group the layers in the DNNs 112 into multiple sets of two or more layers, in which each of the multiple sets may be processed by a corresponding multi-core processor” teaches assigning layers to multi-core processor (corresponds to worker computing device); Fig. 1 teaches multiple multi-core processors), 
whereby one of the worker computing devices is assigned an output layer of the DNN (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to each of the plurality of multi-core processors 108(1)-108(N));
training the DNN model by causing the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of the same mini batches of training data (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to worker computing device for parallelized processing by the pipelined algorithm 110; pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing batches of training data).
Seide et al. does not appear to explicitly teach partitioning the layers of the DNN model into a plurality of stages based on the profile...causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different mini batches of training data.
However, Chilimbi et al. teaches partitioning the layers of the DNN model into a plurality of stages based on the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different mini batches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
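For illustration only, the distinction drawn in the limitations above (the output-layer device alternating on the same mini batches, the remaining devices alternating on different mini batches) can be sketched as follows. The sketch assumes a one-forward-one-backward style pipeline schedule; the function name one_f_one_b and its parameters are hypothetical and are not taken from Seide et al., Chilimbi et al., or the Specification.

```python
def one_f_one_b(num_stages, num_minibatches):
    """Return, for each pipeline stage, its ordered list of (op, minibatch) steps.

    "F" is forward processing and "B" is backward processing.
    Stage num_stages - 1 is assumed to hold the output layer.
    """
    schedule = []
    for stage in range(num_stages):
        # Earlier stages run extra forwards first so later stages have work.
        warmup = min(num_stages - 1 - stage, num_minibatches)
        ops = [("F", m) for m in range(warmup)]
        next_fwd, next_bwd = warmup, 0
        # Steady state: one forward (if any remain), then one backward.
        while next_bwd < num_minibatches:
            if next_fwd < num_minibatches:
                ops.append(("F", next_fwd))
                next_fwd += 1
            ops.append(("B", next_bwd))
            next_bwd += 1
        schedule.append(ops)
    return schedule


if __name__ == "__main__":
    for stage, ops in enumerate(one_f_one_b(num_stages=3, num_minibatches=4)):
        print(stage, " ".join(f"{op}{m}" for op, m in ops))
```

Under these assumptions, the last stage performs forward and then backward processing of the same mini batch before moving to the next one, while every earlier stage interleaves the forward pass of a later mini batch with the backward pass of an earlier one.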
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
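It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al. 
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al., pg. 4 [0056]).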
Regarding Claim 3,
Seide et al. in view of Chilimbi et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).
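For illustration only, one way to group consecutive layers so that each device receives an amount of work that is "equalized or as equalized as possible" is sketched below. The sketch assumes that a per-layer cost estimate is available and that each stage holds contiguous layers; the function name balance_layers and the cost values are hypothetical, and the sketch is not represented to be the load balance module 220 of Seide et al.

```python
from itertools import accumulate


def balance_layers(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the largest stage cost.

    Returns (stages, bottleneck), where stages is a list of lists of layer
    indices and bottleneck is the cost of the most heavily loaded stage.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(layer_costs) stages")
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[k][i]: smallest possible bottleneck for the first i layers in k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # Last stage holds layers j..i-1; earlier layers use k-1 stages.
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j
    # Walk the recorded cut points backwards to recover the stages.
    stages, i = [], n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return stages, best[num_stages][n]


if __name__ == "__main__":
    print(balance_layers([4, 2, 2, 4, 8], num_stages=3))
```

The dynamic program is only one possible balancing strategy; a greedy or heuristic assignment would also fall within the general idea of equalizing the per-device workload.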
Regarding Claim 6,
Seide et al. in view of Chilimbi et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024]).
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al. 
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al., pg. 4 [0056]).
Regarding Claim 7,
Seide et al. in view of Chilimbi et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024]).
Chilimbi et al. further teaches whereby the multiple worker computing devices are assigned to process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different mini batches of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al. 
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al., pg. 4 [0056]).

Regarding Claim 8,
Seide et al. in view of Chilimbi et al. teaches the computer-implemented method of claim 1.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches multi-core processor (corresponds to worker computing device) perform model parallel processing; also see pg. 3 [0024])...
and wherein multiple worker computing devices of the plurality of worker computing devices are configured for data parallel processing (Fig. 1 and Fig. 3 teach multiple multi-core processors (worker computing devices) are configured for data parallel processing),
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models),
whereby the multiple worker computing devices are configured to process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different mini batches of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
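It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. into the disclosed invention of Seide et al. 
One of ordinary skill in the art would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al., pg. 4 [0056]).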
Regarding Claim 9,
Seide et al. teaches A computing device, comprising: one or more processors; and at least one non-transitory computer storage medium having computer-executable instructions stored thereupon which, when executed by the one or more processors, will cause the computing device to (see pg. 3 [0024], [0028], and Fig. 2):
partition layers of a deep neural network (DNN) model into a plurality of stages, wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches partitioning the layers of a DNN by grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches partitioning the layers of a DNN by placing the top layer in another grouping (corresponds to another grouping, or another stage)), and 
wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN); 
assign layers of the DNN model to each of a plurality of worker computing devices based upon the partitioning (pg. 7 [0066]: “At block 506, the training engine 102 may group at least two layers of the DNNs 112 for processing on a single multi-core processor. In various embodiments, the training engine 102 may group the layers in the DNNs 112 into multiple sets of two or more layers, in which each of the multiple sets may be processed by a corresponding multi-core processor” teaches assigning layers to multi-core processor (corresponds to worker computing device); Fig. 1 teaches multiple multi-core processors), 
whereby one of the worker computing devices is assigned an output layer of the DNN (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to each of the plurality of multi-core processors 108(1)-108(N));
train the DNN model by causing the worker computing device assigned the output layer of the DNN to alternate between forward and backward processing of the same mini batches of training data (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to worker computing device for parallelized processing by the pipelined algorithm 110; pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing batches of training data).
Seide et al. does not appear to explicitly teach causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward and backward processing of different mini batches of training data.
However, Chilimbi et al. teaches causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward and backward processing of different mini batches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).

Regarding Claim 11,
Seide et al. in view of Chilimbi et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).
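For illustration only, the equalization quoted above from Seide et al. pg. 5 [0043] can be sketched as splitting an ordered list of per-layer costs into contiguous groups so that the largest group cost is as small as possible. The following Python sketch is an assumption made for explanatory purposes and is not the method of Seide et al. or of the claimed invention; the function name balance_layers, the integer cost values, and the binary-search approach are all hypothetical.

```python
from typing import List


def balance_layers(costs: List[int], num_devices: int) -> List[List[int]]:
    """Split layer indices into at most num_devices contiguous groups,
    minimizing the largest total cost of any group.

    Hypothetical helper: assumes costs is non-empty and holds
    non-negative integers (e.g., a per-layer amount of data).
    """

    def groups_for(limit: int) -> List[List[int]]:
        # Greedily fill each group until adding the next layer would exceed limit.
        groups, current, total = [], [], 0
        for index, cost in enumerate(costs):
            if current and total + cost > limit:
                groups.append(current)
                current, total = [], 0
            current.append(index)
            total += cost
        if current:
            groups.append(current)
        return groups

    # Binary search for the smallest per-group limit that fits num_devices groups.
    low, high = max(costs), sum(costs)
    while low < high:
        mid = (low + high) // 2
        if len(groups_for(mid)) <= num_devices:
            high = mid
        else:
            low = mid + 1
    return groups_for(low)


if __name__ == "__main__":
    # Six layers with made-up costs, assigned to four devices.
    print(balance_layers([4, 4, 4, 4, 8, 8], 4))
```

With the made-up costs shown, each of the four groups carries a total cost of 8, which corresponds to the "equalized or as equalized as possible" condition in the quoted passage.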
Regarding Claim 12,
Seide et al. in view of Chilimbi et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein at least one of the plurality of worker computing devices is configured for model parallel processing (pg. 1 [0012]: “multiple layers of the DNNs may be processed in parallel on the multiple multi-core processors. Further, the pipelined algorithm may be configured to process input data sample batches having a size that is defined to optimize a tradeoff between computation accuracy and execution efficiency. In other words, the size may maximize both computation accuracy and execution efficiency of the pipelined algorithm 110” teaches the multi-core processors process multiple layers of the DNNs in parallel (corresponds to model parallel processing); pg. 1 [0012] teaches the multi-core processors can be GPUs; also see pg. 3 [0024])...
and wherein multiple worker computing devices of the plurality of worker computing devices are configured for data parallel processing (Fig. 1 and Fig. 3 teach multiple multi-core processors (worker computing devices) are configured for data parallel processing),
Chilimbi et al. further teaches whereby the DNN model is replicated to the at least one of the plurality of worker computing devices for training (Fig. 7 teaches DNN model is replicated; Fig. 5 and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to process layers of the DNN models; pg. 2 [0036] teaches the processing cores can be GPUs)...
whereby the multiple worker computing devices are configured to process the layers of the DNN in a stage (Fig. 6 teaches multiple threads (worker computing devices) are assigned to process layers of the DNN in a stage (the replica stage contains many replica layers)),
each of the multiple worker computing devices processing different mini batches of training data during training (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).
Regarding Claim 13,
Seide et al. in view of Chilimbi et al. teaches the computing device of claim 9.
Seide et al. further teaches wherein the at least one non-transitory computer storage medium has further computer-executable instructions stored thereupon to (see pg. 3 [0024], [0028], and Fig. 2):
generate a profile of the deep neural network (DNN) model (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to profile since the ratio value represents a description of the characteristics of DNN layers).
Chilimbi et al. further teaches partition the layers of the DNN model into the plurality of stages based upon the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).



Regarding Claim 15,
Seide et al. teaches A non-transitory computer storage medium having computer-executable instructions stored thereupon which, when executed by one or more processors of a computing device, will cause the computing device to (see pg. 3 [0024], [0028], and Fig. 2):
partition the layers of a deep neural network (DNN) model into a plurality of stages, wherein each of the plurality of stages comprises one or more of the layers of the DNN model (Fig. 5 Step 506 teaches partitioning the layers of a DNN by grouping two layers (corresponds to first grouping, or first stage) and Fig. 5 Step 508-510 teaches partitioning the layers of a DNN by placing the top layer in another grouping (corresponds to another grouping, or another stage)),
and wherein the partitioning is optimized to minimize a time to train the DNN model (pg. 2 [0020]: “the top layer 114(N) of the DNNs 112 may have a size that is ten times larger than that of the next largest layer in the DNNs 112. Accordingly, the processing of the top layer 114(N) may be paralleled across multiple multi-core processors. In this way, the model striping 122 of the top layer 114(N) may reduce the execution time of the pipelined algorithm 110 for training the DNNs 112” teaches that the partitioning scheme (which includes performing model striping on the top layer) is optimized to reduce execution time for training the DNN);
assign layers of the DNN model to each of a plurality of worker computing devices based upon the partitioning (pg. 7 [0066]: “At block 506, the training engine 102 may group at least two layers of the DNNs 112 for processing on a single multi-core processor. In various embodiments, the training engine 102 may group the layers in the DNNs 112 into multiple sets of two or more layers, in which each of the multiple sets may be processed by a corresponding multi-core processor” teaches assigning layers to multi-core processor (corresponds to worker computing device); Fig. 1 teaches multiple multi-core processors), 
Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to each of the plurality of multi-core processors 108(1)-108(N));
train the DNN model by causing the worker computing device assigned the output layer of the DNN to alternate between forward and backward processing of the same mini batches of training data (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to worker computing device for parallelized processing by the pipelined algorithm 110; pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing batches of training data).
Seide et al. does not appear to explicitly teach causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward and backward processing of different mini batches of training data.
However, Chilimbi et al. teaches causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward and backward processing of different mini batches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5), therefore the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches the DNN model can be trained with different mini batches of input training data).
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
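One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).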
Regarding Claim 17,
Seide et al. in view of Chilimbi et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. further teaches wherein the partitioning is further optimized such that each of the plurality of worker computing devices performs a same amount of processing during training of the DNN model (pg. 5 [0043]: “the load balance module 220 may assign each of four groups of multiple layers from the layers 114(1)-114(N) to a corresponding multi-core processor, such that the amount of data processed by each of the four multicore processors for its respective assigned layers is equalized or as equalized as possible” teaches partitioning is optimized such that each processor (device) processes an equalized amount of data (corresponds to a same amount of processing) during training).
Regarding Claim 19,
Seide et al. in view of Chilimbi et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. further teaches wherein the at least one non-transitory computer storage medium has further computer-executable instructions stored thereupon to (see pg. 3 [0024], [0028], and Fig. 2):
generate a profile of the deep neural network (DNN) model (pg. 6 [0047]: “the model striping module 222 may compare the size of the top layer 114(N) to an average size of the hidden layers, such as the hidden layers 114(2)-114(4), to produce a ratio value, a size of the smallest layer (e.g., input layer 114(1)) of the DNNs 112 to produce a ratio value or a total size of the hidden layers 114(2)-114(4) produce a ratio value” teaches producing (generating) a ratio value, which corresponds to profile since the ratio value represents a description of the characteristics of DNN layers).
Chilimbi et al. further teaches partition the layers of the DNN model into the plurality of stages based upon the profile (Fig. 11 and pg. 10 [0142]: “a DPS solution is defined by input information along three main dimensions. First, the input information includes parameters which define resources to be used, including the number of parameter modules (SP), a number of replica units (RA), a number of worker units (WO) per replica unit, and a maximum number H of threads per worker unit. Second, the input information specifies parameters that define a number partitions and replications at each layer of the DNN model 114. Third, the input information describes the manner in which resources are mapped to the features of the DNN model 114, such as the manner in which segments are mapped to worker units, etc.” teach how the layers of the DNN model are partitioned into stages is based on input information (profile) regarding the DNN (also see Fig. 12 element 1204))...
Seide et al. and Chilimbi et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitations as taught by Chilimbi et al. to the disclosed invention of Seide et al. 
One of ordinary skill in the arts would have been motivated to make this modification in order to “efficiently find an acceptable DPS solution in advance of deploying the solution. This characteristic reduces or eliminates the waste of computing resources that may occur upon deploying an under-performing DPS. This characteristic also reduces the time that is involved in finding an acceptable DPS solution, e.g., by reducing or eliminating the need for successive ad hoc "in-field" testing of DPS solutions” (Chilimbi et al. pg. 4 [0056]).

Claims 2, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1) and further in view of Teerapittayanon et al. (“Distributed Deep Neural Networks over the Cloud, the Edge and End Devices”).
Regarding Claim 2,
Seide et al. in view of Chilimbi et al. teaches the computer-implemented method of claim 1.
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).
Regarding Claim 10,
Seide et al. in view of Chilimbi et al. teaches the computing device of claim 9.
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).
Regarding Claim 16,
Seide et al. in view of Chilimbi et al. teaches the non-transitory computer storage medium of claim 15.
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach wherein the partitioning is further optimized to minimize data communication between the worker computing devices.
However, Teerapittayanon et al. teaches wherein the partitioning is further optimized to minimize data communication between the worker computing devices (Fig. 4 teaches partitioning layers of a DNN; pg. 329 fourth full paragraph: “The contributions of this paper include...A joint training method that minimizes communication and resource usage for devices and maximizes usefulness of extracted features which are utilized in the cloud, while allowing low-latency classification via early exit for a high percentage of input samples” teaches optimizing to minimize data communication between devices).
Seide et al., Chilimbi et al., and Teerapittayanon et al. are analogous art to the claimed invention because they are directed to partitioning of DNNs.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Teerapittayanon et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification because minimizing communication of data between devices can result in benefits such as the following: “the communication cost of DDNN is reduced by a factor of over 20x compared to offloading raw sensor input to a DNN in the cloud which performs all of the inference computation” (Teerapittayanon et al. pg. 338 fourth paragraph).



Claims 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Seide et al. (US 2014/0142929 A1) in view of Chilimbi et al. (US 2016/0092765 A1) and further in view of Wesolowski et al. (US 2019/0114537 A1).
Regarding Claim 14,
Seide et al. in view of Chilimbi et al. teaches the computing device of claim 13. 
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach wherein the profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time.
However, Wesolowski et al. teaches wherein the profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time (pg. 10 [0066]: “In order to better manage, or schedule, the transferring-of-training of a neural network ML model, each machine (or training group or computing system), may generate checkpoints at different times during the training of a neural network. The generation of checkpoints may be controlled by Master ML controller 21, or may be instigated by the machine (or training group of machines) that is training a neural network, in response to various triggering events, or conditions. A check point may be a record of an execution state in the training of a neural network (or graph-segment) with sufficient information to restart the training of the neural network (or graph-segment)” teaches generating a check point (for example, a record of an execution state) of the neural network, which corresponds to profile of the neural network, by training the neural network on a training group of machines (worker computing devices); pg. 12 [0073]: “For example, if training is being executed on a training group made up of service server from bank 27 during off-peak hours, and it is determined that peak hours are approaching, a check-point may be created in anticipation of transferring training off the service machines because of the peak hours” teaches training the neural network during a specific pre-determined period of time (for example, during off-peak hours; see pg. 10 [0063] for what is considered off-peak and peak hours); pg. 1 [0005] teaches that the neural network can be a deep neural network).
Seide et al., Chilimbi et al., and Wesolowski et al. are analogous art to the claimed invention because they are directed to parallelized implementation of neural networks.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Wesolowski et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to “make use of service computing machines during their off-peak hours for training a machine learning model” through “[providing] for heterogeneous computing for training a machine learning model across different computing systems having different computer architectures/characteristics” (Wesolowski et al. pg. 2 [0021]).
Regarding Claim 20,
Seide et al. in view of Chilimbi et al. teaches the non-transitory computer storage medium of claim 15. 
Seide et al. in view of Chilimbi et al. does not appear to explicitly teach wherein a profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time.
However, Wesolowski et al. teaches a profile of the DNN model is generated by training the DNN model on a subset of the plurality of the worker computing devices with a subset of the DNN training data for a predetermined period of time (pg. 10 [0066] “In order to better manage, or schedule, the transferring-of-training of a neural network ML model, each machine (or training group or computing system), may generate checkpoints at different times during the training of a neural network. The generation of checkpoints may be controlled by Master ML controller 21, or may be instigated by the machine (or training group of machines) that is training a neural network, in response to various triggering events, or conditions. A check point may be a record of an execution state in the training of a neural network (or graph-segment) with sufficient information to restart the training of the neural network (or graph-segment)” teaches generating a check point (for example, a record of an execution state) of the neural network, which corresponds to profile of the neural network, by training the neural network on a training group of machines (worker computing devices); pg. 12 [0073]: “For example, if training is being executed on a training group made up of service server from bank 27 during off-peak hours, and it is determined that peak hours are approaching, a check-point may be created in anticipation of transferring training off the service machines because of the peak hours” teaches training the neural network during a specific pre-determined period of time (for example, during off-peak hours; see pg. 10 [0063] for what is considered off-peak and peak hours); pg. 1 [0005] teaches that the neural network can be a deep neural network).
Seide et al., Chilimbi et al., and Wesolowski et al. are analogous art to the claimed invention because they are directed to parallelized implementation of neural networks.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation as taught by Wesolowski et al. to the disclosed invention of Seide et al. in view of Chilimbi et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to “make use of service computing machines during their off-peak hours for training a machine learning model” through “[providing] for heterogeneous computing for training a machine learning model across different computing systems having different computer architectures/characteristics” (Wesolowski et al. pg. 2 [0021]).


Response to Arguments
Applicant's arguments filed on 10/26/2021 with respect to the 35 U.S.C. 112(f) claim interpretation are acknowledged. Applicant asserts that the amended claims do not invoke 35 U.S.C. 112(f) but does not provide any supporting arguments. Amended claim 12 remains under 35 U.S.C. 112(f) claim interpretation because claim 12 recites the generic placeholder “worker computing devices” that is linked to the claimed functions using the linking phrase “configured to.” Please see the current 35 U.S.C. 112(f) claim interpretation section for additional information.

Applicant's arguments filed on 10/26/2021 with respect to the 35 U.S.C. 112(a) and 112(b) rejections are acknowledged. However, the amendments filed on 10/26/2021 necessitated new grounds of 35 U.S.C. 112(a) and 112(b) rejections as stated in the current rejection. Please see the current rejection for additional information.

Applicant's arguments filed on 10/26/2021 with respect to the 35 U.S.C. 103 rejection to claim 1 and its dependent claims have been fully considered but they are not persuasive. Applicant asserts that “the cited references do not teach, suggest, describe, or otherwise render obvious the recitations of this claim as amended hereby for "training the DNN model by causing the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of the same minibatches of training data, and causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different minibatches of training data."” (Remarks, pg. 11-12).


Examiner’s Response:
The Examiner respectfully disagrees. Applicant merely asserts that the cited references do not teach the amended limitations, but does not provide any arguments. Seide et al. teaches training the DNN model by causing the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of the same mini batches of training data (Fig. 1 and pg. 7 [0068]: “At block 510, the training engine 102 may distribute the top layer 114(N) of the DNNs 112 across the multi-core processors 108(1)-108(N) for parallelized processing by the pipelined algorithm 110” teach assigning the top layer (corresponds to output layer) to worker computing device for parallelized processing by the pipelined algorithm 110; pg. 5 [0038]: “Each of the computation iterations performed by the pipelined algorithm 110 may execute the following steps in sequence: forward propagation of input data, error back propagation, and model update” teaches the pipeline algorithm, implemented by worker computing devices (see Fig. 1 multi-core processors 108(1)-108(N)), trains the DNN by performing a plurality of computation iterations that alternate between forward propagation and error back propagation; pg. 7 [0064]: “At block 502, the training engine 102 may allocate the batches 128 of sample frames from the training data 116 (e.g., a speech corpus) for training the DNNs 112. The training may be performed using the pipelined algorithm 110” and pg. 7 [0069]: “At block 512, the training engine 102 may pipeline an execution of the algorithm 110 on a set of multi-core processors to train the DNNs 112 based on the batches 128 of the training data 116” teach the pipeline algorithm (includes forward propagation and error back propagation) trains the DNN through processing batches of training data).
Chilimbi et al. (new reference) teaches causing worker computing devices other than the worker computing device assigned the output layer of the DNN to alternate between performing forward processing and backward processing of different mini batches of training data (pg. 6 [0079]: “The remaining characteristics allow a DPS to process multiple training samples in parallel. For example, a resource allocation architecture may also, or alternatively, entail allocating plural threads to at least one layer of the DNN model 114, such as, as shown in FIG. 5, layer z2” and pg. 6 [0080]: “In terms of physical implementation, a single computing device may have plural processing cores. Each core may run a single thread. In another case, each core may run two or more threads” teach that multiple processing cores containing threads (the multiple threads correspond to worker computing devices) are assigned to layer z2, which is not an output layer (see Fig. 5); therefore, the threads assigned to layer z2 correspond to worker computing devices other than the worker computing device assigned the output layer of the DNN; Fig. 3 teaches that the training operations of layer z2 include alternating between performing forward and backward processing of training data; Fig. 7 teaches that the DNN model can be trained with different mini batches of input training data).
Applicant relies on analogous arguments for independent claims 9 and 15 (and their respective dependent claims); therefore, the response above also applies to claims 9 and 15 (and their respective dependent claims).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YING YU CHEN whose telephone number is (571)270-1484. The examiner can normally be reached Monday-Friday 7:30 am-5:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Y.C./               Examiner, Art Unit 2125         

/KAMRAN AFSHAR/               Supervisory Patent Examiner, Art Unit 2125