DETAILED ACTION
1.	This office action is in response to the Application No. 16418799 filed on 05/21/2019. Claims 1-25 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
3.	Claim 25 is objected to because the limitation recites “… based at least in part on training speedup …”. It should be “…based at least in part on a training speedup …”
	 Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


4.	Claims 1-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
	Claim 1 recites “A processor, comprising: one or more arithmetic logic units (ALUs) to infer information using one or more neural networks”. It is unclear what “one or more arithmetic logic units (ALUs) to infer information using one or more neural networks” means. An ALU simply performs basic arithmetic calculations, for example multiplying inputs by weights and summing the products for each neuron, but an ALU does not appear to infer (or predict) information using a neural network. Furthermore, the Applicant discloses the basic arithmetic function of an ALU in the instant specification: “Each core 1210, in an embodiment, includes a floating point arithmetic logic unit and an integer arithmetic logic unit (US20200372337 [0086]) … In an embodiment, each tensor core operates on a 4x4 matrix and performs a matrix multiply and accumulate operation D=A×B+C, where A, B, C, and D are 4×4 matrices. (US20200372337 [0087]).” The matrix multiply and accumulate function of an ALU as disclosed by the Applicant does not appear to be a prediction or inference using a neural network. Claims 2-5 are rejected due to dependency.
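	For illustration only, the matrix multiply and accumulate operation D=A×B+C quoted above can be sketched as plain arithmetic. The function name and the sample matrices below are hypothetical and do not come from the specification; the sketch only shows that the operation consists of multiplications and additions.

```python
def matmul_accumulate(a, b, c):
    """Return D = A x B + C for square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 4x4 inputs: B is the identity, so D equals A + C element-wise.
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
c = [[1] * 4 for _ in range(4)]
d = matmul_accumulate(a, identity, c)
```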
	Claim 6 recites “one or more neural networks have been trained by reducing training model parallelism in response to slower training times”. It is unclear what reducing training model parallelism in response to slower training times means. It is unclear if the claim implies reducing the number of partitioning model layers across machines during training or if it means adding another technique to the model parallelism to reduce training times. For the purpose of examination, the office has interpreted the claim as adding another technique to model parallelism to reduce training times.
	Claim 7 recites “a second number of portions of the one or more neural networks”. The claim does not recite a first number of portions, so the second number of portions is unclear. Also, claim 7 recites “the one or more neural networks trained in parallel”, which lacks antecedent basis, as the claim previously mentions “one or more processors to train one or more neural networks” but does not explicitly recite performing the training. Claims 8-13 are rejected due to dependency.
	Claim 14 recites “train a second number of portions of the neural network”. The claim does not recite training a first number of portions, so the recitation of training a second number of portions is unclear. Claims 15-19 are rejected due to dependency.
	Claim 20 recites “determining a second number of portions of the neural network”. The claim does not recite determining a first number of portions, so the recitation of determining a second number of portions is unclear. Claims 21-25 are rejected due to dependency.
	

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

5.	Claims 7-10, 12, 14-17 and 20-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Harlap et al. ("Pipedream: Fast and efficient pipeline parallel dnn training." arXiv preprint arXiv:1806.03377 (2018)).

	Regarding claim 7, Harlap teaches a system, comprising: one or more computers having one or more processors to train one or more neural networks (PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines (abstract))
	 using a first number of training data threads (PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. We use the ILSVRC12 dataset to train VGG16 and Inception-v3, pg. 10, right col, Models and Training Methodology Section. The Examiner notes that since Harlap's system trains DNNs (abstract) the worker threads can be considered training threads)
	 to achieve a first level of training efficiency (speedup over 1 machine (y-axis) of about 1 for 4 model parallel on x-axis, Fig. 13, pg. 12, right col, section 5.3. The Examiner notes that speedup over 1 machine (y-axis) of about 1 is the first level efficiency) and 
	 a second number of portions of the one or more neural networks trained in parallel (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7)
	using the first number of training data threads (PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. We use the ILSVRC12 dataset to train VGG16 and Inception-v3, pg. 10, right col, Models and Training Methodology Section. The Examiner notes that since Harlap's system trains DNNs (abstract) the worker threads can be considered training threads)
	to achieve a second level of training efficiency. (speedup over 1 machine (y-axis) of about 3.5 for 4 PipeDream on x-axis, Fig. 13, pg. 12, right col, section 5.3. The Examiner notes that speedup over 1 machine (y-axis) of about 3.5 is the second level efficiency)

	Regarding claim 8, Harlap teaches the system of claim 7, Harlap teaches wherein the first number of training data threads is split into subsets to be distributed among the one or more processors to train the one or more neural networks. (PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The input to our system is a model architecture, the training dataset, and the number of GPUs that will be used for training. PipeDream first profiles the model on a single machine with a subset of minibatches from the training dataset. pg. 9, left col, Implementation Section. The Examiner notes that the worker threads can be considered as a training thread)

	Regarding claim 9, Harlap teaches the system of claim 7, Harlap teaches wherein the second number of portions of the one or more neural networks trained in parallel is split into components to be distributed among the one or more processors to train the one or more neural networks. (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that the DNN has four layers or components in Fig. 7, which implies that the layers are distributed across machines, which comprise processors, for parallel training in Fig. 6)

	Regarding claim 10, Harlap teaches the system of claim 7, Harlap teaches wherein the one or more computers having the one or more processors further train the one or more neural networks by increasing the first number of training data threads until the first level of training efficiency is achieved. (we inject multiple minibatches into the pipeline one after the other, thus enhancing model-parallel training with pipelining. On completing the forward pass for a minibatch, each stage asynchronously sends the output activations to the next stage, while simultaneously starting to process another minibatch. Similarly, after completing the backward pass for a minibatch, each stage asynchronously sends the gradient to the previous stage, while starting computation for another minibatch, pg. 5, left col, last para.; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3; data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row. The Examiner notes that 2.35x is the first level of training efficiency)

	Regarding claim 12, Harlap teaches the system of claim 7, Harlap teaches wherein the second number of portions of the one or more neural networks training in parallel using the first number of training data threads is increased in response to achieving the first level of training efficiency (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7; Figure 13 shows these results for training VGG16 using 4 and 8 machines on Cluster-A, pg. 12, right col, section 5.3; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that 4 machines implies partitioning the neural network into 4 layers and 8 machines implies partitioning the neural network into 8 layers, which means that the number of portions is increased from 4 layers to 8 layers)

	Regarding claim 14, Harlap teaches a machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors to at least: (Cluster-A is a private cluster of NVIDIA Titan X GPUs with 12 GB of GPU device memory. Each machine has a E5-2698Bv3 Xeon CPU with 64 GB of RAM, pg. 10, left col, last para.; PipeDream’s Partitioning Algorithm. Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor for each stage, and 3) optimal number of minibatches to keep the training pipeline busy, pg. 6, right col, third para.)
	train a neural network using a first number of parallel training data threads resulting in a first level of training efficiency; (data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes 2.35x is the first level of training efficiency and since Harlap's system trains DNNs (abstract) the worker threads can be considered training threads.) and
	train a second number of portions of the neural network in parallel (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that partitions layers are interpreted as portions of a neural network)
	 using the first number of parallel training data threads resulting in a second level of training efficiency. (PipeDream’s partitioning algorithm selects a data-parallel configuration to train inception-v3 on Cluster-A, pg. 11, right col, first para.; PipeDream speedup over 1 machine of 7.04x, pg. 11, Table 1, first row; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that the worker threads can be considered as a training thread)
	
	Regarding claim 15, Harlap teaches the machine-readable medium of claim 14, Harlap teaches wherein the first level of training efficiency is based at least in part on training times associated with training the neural network using the first number of parallel training data threads (The first conclusion we draw is that for VGG16, BSP with 8 machines reduces training time by only 2.35x, … PipeDream eliminates 95% of this communication overhead thereby improving performance by 7.04x, pg. 11, left col, last para.; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that since Harlap's system trains DNNs (abstract) the worker threads can be considered training threads and 2.35x is the first level of training efficiency)

	Regarding claim 16, Harlap teaches the machine-readable medium of claim 14, Harlap teaches wherein the second level of training efficiency indicates training times associated with training the neural network using the second number of portions of the neural network in parallel in combination with using the first number of parallel training data threads. (The first conclusion we draw is that for VGG16, BSP with 8 machines reduces training time by only 2.35x, … PipeDream eliminates 95% of this communication overhead thereby improving performance by 7.04x, pg. 11, left col, last para.; pg. 10, right col, Models and Training Methodology; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3; Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that 7.04x is the second level of training efficiency.)

	Regarding claim 17, Harlap teaches the machine-readable medium of claim 14, Harlap teaches wherein the set of instructions further cause the one or more processors to at least train the neural network by adjusting the first number of parallel training data threads until the first level of training efficiency is achieved. (data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that the data-parallel configuration (BSP) reads on the first number of parallel training data threads and the first level of training efficiency is 2.35x)

	Regarding claim 20, Harlap teaches a method comprising: determining a first number of parallel training data threads to train a neural network at or above a first level of training efficiency; (data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that since Harlap's system trains DNNs (abstract) the worker threads can be considered training threads and 2.35x is the first level of training efficiency) and
	determining a second number of portions of the neural network to train in parallel (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that partitions layers is interpreted as portions of a neural network)
	using the first number of parallel training data threads at or above a second level of training efficiency. (PipeDream’s partitioning algorithm selects a data-parallel configuration to train inception-v3 on Cluster-A, pg. 11, right col, first para.; PipeDream speedup over 1 machine of 7.04x, pg. 11, Table 1, first row;  PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that 7.04x is the second level of training efficiency)

	Regarding claim 21, Harlap teaches the method of claim 20, Harlap teaches wherein the first number of parallel training data threads is split into subsets to be distributed among one or more processors of a computer system to train the neural network. (PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The input to our system is a model architecture, the training dataset, and the number of GPUs that will be used for training. PipeDream first profiles the model on a single machine with a subset of minibatches from the training dataset. pg. 9, left col, Implementation Section)

	Regarding claim 22, Harlap teaches the method of claim 20, Harlap teaches wherein the second number of portions of the neural network to train in parallel is split into components to be distributed among one or more processors of a computer system to train the neural network. (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that the DNN has four layers or components in Fig. 7, which implies that the layers are distributed across machines, which comprise processors, for parallel training)

	Regarding claim 23, Harlap teaches the method of claim 20, Harlap teaches wherein the first number of parallel training data threads to train the neural network is increased until the neural network is trained at or above the first level of training efficiency, (we inject multiple minibatches into the pipeline one after the other, thus enhancing model-parallel training with pipelining. On completing the forward pass for a minibatch, each stage asynchronously sends the output activations to the next stage, while simultaneously starting to process another minibatch. Similarly, after completing the backward pass for a minibatch, …, while starting computation for another minibatch, pg. 5, left col, last para.; data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row; PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that 2.35x is the first level of training efficiency)
	wherein the first level of training efficiency is based at least in part on training speedup associated with using the first number of parallel training data threads to train the neural network. (data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row. The Examiner notes that data-parallel reads on the first number of parallel training data and 2.35x is the first level of training efficiency)
	 
	Regarding claim 24, Harlap teaches the method of claim 23, Harlap teaches wherein the second number of portions of the neural network to train in parallel using the first number of parallel training data threads is determined in response to the neural network being trained at or above the first level of training efficiency. (The first conclusion we draw is that for VGG16, BSP with 8 machines reduces training time by only 2.35x, … PipeDream eliminates 95% of this communication overhead thereby improving performance by 7.04x, pg. 11, left col, last para.; pg. 10, right col, Models and Training Methodology; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3; Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7)

	Regarding claim 25, Harlap teaches the method of claim 20, Harlap teaches wherein the second number of portions of the neural network to train in parallel (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7. The Examiner notes that the DNN has four layers or components in Fig. 7, which implies that the layers are distributed across machines, which comprise processors, for parallel training)
	using the first number of parallel training data threads is based at least in part on training speedup associated with using the second number of portions to train the neural network (PipeDream provides the ML worker thread (Caffe) pointers to GPU memory containing layer input data, pg. 9, Fig. 9; a machine may have multiple GPUs each running a worker thread, Footnote 1, pg. 3. The Examiner notes that the worker threads can be considered as a training thread. We use the ILSVRC12 dataset to train VGG16 and Inception-v3, pg. 10, right col, Models and Training Methodology Section.)


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



6.	Claims 1-6 are rejected under 35 U.S.C. 103 as being unpatentable over Harlap et al. ("Pipedream: Fast and efficient pipeline parallel dnn training." arXiv preprint arXiv:1806.03377 (2018)) in view of Nurvitadhi et al. (US20190205746 filed 12/29/2017).

	Regarding claim 1, Harlap teaches a processor, comprising: (PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines, … Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data parallel training, pg. 1, Abstract) 
	wherein the one or more neural networks have been trained using a combination of training data parallelism and model parallelism to achieve a level of training efficiency (PipeDream combines traditional data parallelism with model parallelism enhanced with pipelining, pg. 4 right col, last para.)
	Harlap does not explicitly teach one or more arithmetic logic units (ALUs) to infer information using one or more neural networks
 	Nurvitadhi teaches one or more arithmetic logic units (ALUs) to infer information using one or more neural networks (To perform logic operations, the slices 1401A-1401N can include a set of additional function units integer arithmetic logic units (ALUs 1416-1416N) [0178]; The number of channels may be independent of the number of physical Arithmetic Logic Units (ALUs) … for a particular graphics processor [0110]; In an inferencing configuration the GPGPU 1430 includes fewer of the compute clusters 1436A-1436H relative to the training configuration. Additionally, the memory technology associated with the memory 1434A-1434B may differ between inferencing and training configurations [0184])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Harlap to incorporate the method of Nurvitadhi for the benefit of an inferencing configuration which can provide support for one or more 8-bit integer dot product instructions, which are commonly used during inferencing operations for deployed neural networks (Nurvitadhi, [0184])
	
	Regarding claim 2, Modified Harlap teaches the processor of claim 1, Harlap teaches wherein the one or more neural networks have been trained by increasing data parallelism until an intermediate level of training efficiency is achieved (Figure 1 quantitatively shows the fraction of training time spent in communication stalls for five different DNN models, pg. 3, left col, last para.; … Second, as the number of data parallel workers increases, communication overheads increase for all models. Third, as GPU compute speeds increase (K80s to V100s), communication stalls also increase for all five models, pg. 3, right col, second to the last para.; data-parallel configurations (BSP) has a BSP speedup over 1 machine of 2.35x, pg. 11, Table 1, first row. The Examiner notes that 2.35x is the training efficiency) and
	 increasing model parallelism until the level of training efficiency is achieved (Using these estimates, PipeDream’s optimizer partitions layers across available machines, Fig. 7; Figure 13 shows these results for training VGG16 using 4 and 8 machines on Cluster-A, pg. 12, right col, section 5.3; PipeDream speedup over 1 machine of 7.04x, pg. 11, Table 1, first row. The Examiner notes that 4 machines implies partitioning the neural network into 4 layers and 8 machines implies partitioning the neural network into 8 layers, which means that the number of portions is increased from 4 layers to 8 layers, and 7.04x is the level of training efficiency.)

	Regarding claim 3, Modified Harlap teaches the processor of claim 2, Harlap teaches wherein the intermediate level of training efficiency is measured based at least in part on training times associated with using training data parallelism (The first conclusion we draw is that for VGG16, BSP with 8 machines reduces training time by only 2.35x, … PipeDream eliminates 95% of this communication overhead thereby improving performance by 7.04x, pg. 11, left col, last para. The Examiner notes that 2.35x is the intermediate level of training efficiency) and 
	the level of training efficiency is measured based at least in part on training times associated with using the combination of training data parallelism and model parallelism (Pipeline Parallelism. PipeDream’s pipeline parallelism provides the biggest reductions in training time— 3.14× and 7.04× with 4 and 8 machines compared to single machine training. These results demonstrate that a combination of pipelining, model parallelism and data parallelism achieve faster training than either model parallelism, model parallelism with pipelining, or data parallelism, pg. 12, right col, last para. to first para., pg. 13)

	Regarding claim 4, Modified Harlap teaches the processor of claim 3, Harlap teaches wherein the one or more neural networks have been trained by: comparing training times associated with using training data parallelism and training times associated with using the combination of training data parallelism and model parallelism; (Table 1 summarizes results comparing PipeDream with data-parallel training (BSP). For the three models, the table shows PipeDream’s auto-generated configuration and the corresponding speedup in training time over single machine and data-parallel training (BSP). It also shows the communication reduction achieved by PipeDream compared to data-parallel training, pg. 10, right col, last para., 5.2 PipeDream vs. Data Parallelism) and
	using the combination of training data parallelism and model parallelism based on the comparison (Figure 11 shows accuracy vs. training time for VGG16 and Inception-v3, using 8 machines in Cluster-B, for both BSP and PipeDream. Compared to Cluster-A, Cluster-B employs faster V100 GPUs with 10Gbps interconnect between the machines (as granted by the cloud provider). Thus, models running on Cluster-B have lower computation-to-communication ratios. We note that the faster GPUs result in faster end-to-end training time — e.g., training time for VGG16 reduces from 220 hours on Cluster-A to little less than 100 hours on Cluster-B (Figures 10 and 11). We also observe that the higher communication overhead causes both BSP and PipeDream to scale less effectively to 8 machines, pg. 11, right col, second to the last para.)

	Regarding claim 5, Modified Harlap teaches the processor of claim 1, Harlap teaches wherein the one or more neural networks have been trained by: obtaining configuration parameters associated with the one or more neural networks; (Parameter State. For each stage, PipeDream maintains all parameters associated with the layers assigned to the stage directly in GPU memory. The parameters for each layer are stored separately, and each assigned a unique ID, pg. 9, right col, second para.; PipeDream extracts the layer parameters from the DNN model and computes the size of activations, parameters, and intermediate state that needs to be stored at each stage, across the active minibatches present in the pipeline, pg. 8, right col, last para.) and 
	using the combination of training data parallelism and model parallelism based on the obtained information. (Figure 7 shows PipeDream’s high-level workflow. The input to our system is a model architecture, the training dataset, and the number of GPUs that will be used for training. PipeDream first profiles the model on a single machine with a subset of minibatches from the training dataset, pg. 9, left col, second para.; PipeDream’s automated mechanism to partition DNN layers into stages. PipeDream first profiles the input DNN, to get estimates for each layer’s compute time and output size. Using these estimates, PipeDream’s optimizer partitions layers across available machines, pg. 6, Fig. 7) 
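	For illustration only, the profile-then-partition workflow cited above can be sketched in simplified form. The sketch below is not PipeDream's partitioning algorithm, which per Harlap also computes a replication factor for each stage and the number of minibatches to keep in flight; it only splits hypothetical per-layer compute times into contiguous stages so that the slowest stage is as short as possible. The function name and the sample timings are assumptions introduced for this example.

```python
from functools import lru_cache


def partition_layers(layer_times, num_machines):
    """Split layers into contiguous stages, minimizing the slowest stage's time.

    Returns (bottleneck_time, stages), where stages is a list of lists of
    layer indices, one list per machine.
    """
    n = len(layer_times)
    if n == 0 or num_machines < 1:
        raise ValueError("need at least one layer and one machine")

    # prefix[i] is the total compute time of layers 0..i-1.
    prefix = [0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(end, stages):
        # Best (bottleneck, cut points) for layers[0:end] using `stages` stages.
        if stages == 1:
            return prefix[end], ()
        result = None
        for cut in range(stages - 1, end):
            left_cost, left_cuts = best(cut, stages - 1)
            cost = max(left_cost, prefix[end] - prefix[cut])
            if result is None or cost < result[0]:
                result = (cost, left_cuts + (cut,))
        return result

    stages = min(num_machines, n)
    bottleneck, cuts = best(n, stages)
    bounds = (0,) + cuts + (n,)
    return bottleneck, [list(range(bounds[i], bounds[i + 1])) for i in range(stages)]


# Hypothetical profiled compute times for a four-layer model, split over 2 machines.
bottleneck, stages = partition_layers([2, 2, 3, 3], 2)
```

	With these hypothetical timings, the first two layers are placed on one machine and the last two on the other, so that neither stage takes longer than 6 time units.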

	Regarding claim 6, Modified Harlap teaches the processor of claim 4, Harlap teaches wherein the one or more neural networks have been trained by reducing training model parallelism in response to slower training times. (Combining model parallelism with pipelining results in straight pipeline configurations (no data parallelism compared to PipeDream). Straight pipeline configurations greatly reduce training time compared to single machine training—2.56x and 3.49x with 4 and 8 machines, respectively. In fact, these improvements are better than data parallel training, which achieves corresponding speedups of 1.47x and 2.35x compared to single machine training, pg. 12, right col, last para.)
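	For illustration only, the speedup figures quoted from Harlap in this action can be compared arithmetically. The sketch below assumes that every figure is a speedup over single-machine training for the same model and cluster (VGG16 on Cluster-A, as cited); the dictionary keys and function names are hypothetical labels introduced for this example.

```python
# Speedups over single-machine training with 4 and 8 machines, as quoted from Harlap.
speedup_over_one_machine = {
    "data_parallel_bsp": {4: 1.47, 8: 2.35},
    "straight_pipeline": {4: 2.56, 8: 3.49},
    "pipedream": {4: 3.14, 8: 7.04},
}


def relative_speedup(config, baseline, machines):
    """Ratio of two speedups measured against the same single-machine baseline."""
    table = speedup_over_one_machine
    return table[config][machines] / table[baseline][machines]


def fastest(machines):
    """Name of the configuration with the largest quoted speedup."""
    table = speedup_over_one_machine
    return max(table, key=lambda name: table[name][machines])


# PipeDream relative to data-parallel (BSP) training on 8 machines.
pipedream_vs_bsp = relative_speedup("pipedream", "data_parallel_bsp", 8)
```

	Under that assumption, the ratio 7.04 / 2.35 is roughly 3, meaning the combined configuration trains about three times faster than data-parallel training alone on 8 machines.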

7.	Claims 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Harlap et al. ("Pipedream: Fast and efficient pipeline parallel dnn training." arXiv preprint arXiv:1806.03377 (2018)) in view of Seide et al. ("1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns." Fifteenth annual conference of the international speech communication association. 2014, hereinafter “Seide2”).

	Regarding claim 11, Harlap teaches the system of claim 7, Harlap teaches wherein the one or more computers having the one or more processors further train the one or more neural networks by: (PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines (abstract))
	Harlap does not explicitly teach comparing training times associated with using the first number of training data threads and training times associated with using the second number of portions of the one or more neural networks trained in parallel using the first number of training data threads; and using the second number of portions of the one or more neural networks trained in parallel using the first number of training data threads based on the comparison.
	Seide2 teaches comparing training times associated with using the first number of training data threads and training times associated with using the second number of portions of the one or more neural networks trained in parallel (Table 5: combining data and model parallelism, pg. 1061, left-right col, 5.5 Combination with Model Parallelism; Row “partial gradients” adds data parallelism over K = 4 compute nodes, combined with 2-GPU model parallelism in each compute node, pg. 1061, left col, first para., Table 4. The Examiner notes that Table 5 shows the comparison of the training speed of data parallelism and model parallelism)
	using the first number of training data threads; and using the second number of portions of the one or more neural networks trained in parallel using the first number of training data threads based on the comparison. (Table 3 analyzes where to apply AdaGrad in the process. First, we can see that applying AdaGrad to the raw gradients before momentum, rather than the momentum-smoothed gradient, leads to higher training frame accuracy and 0.3 points better WER. We believe that this is because momentum smoothing reduces the standard deviation and thus the effect of AdaGrad. Row “partial gradients” adds data parallelism over K = 4 compute nodes, combined with 2-GPU model parallelism in each compute node (more on that in Section 5.5). AdaGrad is applied locally before quantization. While it leads to a small WER gain, the training frame accuracy drops a little, pg. 1061, left col, first para.; Comparing the same number of GPUs, MP only helps in one configuration, the communication-bound minibatch size 2880 with 16 GPUs, pg. 1061, left-right col, 5.4. Impact of MB-Size Selection and Double Buffering)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Harlap to incorporate the method of Seide2 for the benefit of reducing end-to-end training time from 35h to 8.1h (Seide2, pg. 1061, fourth para., Table 3).
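	For illustration only, and not as a characterization of either reference, the comparing and selecting steps discussed above can be sketched as follows. The function names `speedup` and `select_configuration`, the configuration labels, and the sample timings are hypothetical; only the 35h and 8.1h figures come from the cited passage of Seide2.

```python
from __future__ import annotations


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Return how many times faster the new configuration is than the baseline."""
    if baseline_hours <= 0 or new_hours <= 0:
        raise ValueError("training times must be positive")
    return baseline_hours / new_hours


def select_configuration(measured_hours: dict[str, float]) -> str:
    """Return the name of the configuration with the shortest measured training time."""
    if not measured_hours:
        raise ValueError("at least one measured configuration is required")
    return min(measured_hours, key=measured_hours.get)


# Figures cited from Seide2: end-to-end training time reduced from 35h to 8.1h.
print(f"speedup: {speedup(35.0, 8.1):.2f}x")

# Hypothetical timings for illustration only (not taken from either reference).
timings = {"data_parallel_only": 10.0, "data_plus_model_parallel": 9.0}
print("selected:", select_configuration(timings))
```

	In this sketch the selection is simply the configuration with the smallest measured time; a real system might also weigh accuracy or resource cost, which the sketch does not model.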

	Regarding claim 18, Harlap teaches the machine-readable medium of claim 14. Harlap further teaches wherein the set of instructions further cause the one or more processors to at least train the neural network by: (PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines (abstract))
	Harlap does not explicitly teach comparing training times associated with using the first number of parallel training data threads and training times associated with using the second number of portions of the neural network in parallel using the first number of parallel training data threads; and using the second number of portions of the neural network in parallel using the first number of parallel training data threads based on the comparison.
	Seide2 teaches comparing training times associated with using the first number of parallel training data threads and training times associated with using the second number of portions of the one or more neural networks trained in parallel (Table 5: combining data and model parallelism, pg. 1061, left-right col, 5.5 Combination with Model Parallelism; Row “partial gradients” adds data parallelism over K = 4 compute nodes, combined with 2-GPU model parallelism in each compute node, pg. 1061, left col, first para., Table 4. The Examiner notes that Table 5 shows the comparison of the training speed of data parallelism and model parallelism)
	using the first number of training data threads; and using the second number of portions of the one or more neural networks trained in parallel using the first number of training data threads based on the comparison. (Table 3 analyzes where to apply AdaGrad in the process. First, we can see that applying AdaGrad to the raw gradients before momentum, rather than the momentum-smoothed gradient, leads to higher training frame accuracy and 0.3 points better WER. We believe that this is because momentum smoothing reduces the standard deviation and thus the effect of AdaGrad. Row “partial gradients” adds data parallelism over K = 4 compute nodes, combined with 2-GPU model parallelism in each compute node (more on that in Section 5.5). AdaGrad is applied locally before quantization. While it leads to a small WER gain, the training frame accuracy drops a little, pg. 1061, left col, first para.; Comparing the same number of GPUs, MP only helps in one configuration, the communication-bound minibatch size 2880 with 16 GPUs, pg. 1061, left-right col, 5.4. Impact of MB-Size Selection and Double Buffering)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Harlap to incorporate the method of Seide2 for the benefit of reducing end-to-end training time from 35h to 8.1h (Seide2, pg. 1061, fourth para., Table 3).

8.	Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Harlap et al. ("Pipedream: Fast and efficient pipeline parallel dnn training." arXiv preprint arXiv:1806.03377 (2018)) and further in view of Li et al. ("Strategies for energy-efficient resource management of hybrid programming models." IEEE Transactions on parallel and distributed Systems 24.1 (2012): 144-157.)

	Regarding claim 13, Harlap teaches the system of claim 7. Harlap does not explicitly teach wherein the first level and second levels of training efficiency are based at least in part on information directed to power consumption of the system.
	Li teaches wherein the first level and second levels of training efficiency are based at least in part on information directed to power consumption of the system (Dynamic Concurrency Throttling (DCT) can reduce dynamic power consumption by putting cores that do not improve application execution time in a low-power state. DCT can also save execution time by alleviating contention for shared resources, pg. 145, right col, Applying DCT)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Harlap to incorporate the method of Li for the benefit of achieving substantial energy savings (8.74 percent on average and up to 13.8 percent) with some performance gain (up to 7.5 percent) or negligible performance loss (Li, abstract)

9.	Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Harlap et al. ("Pipedream: Fast and efficient pipeline parallel dnn training." arXiv preprint arXiv:1806.03377 (2018)) in view of Han (US20200342322 filed 11/06/2018)

	Regarding claim 19, Harlap teaches the machine-readable medium of claim 14. Harlap does not explicitly teach wherein the set of instructions further cause the one or more processors to at least train the neural network using the first number of parallel training data threads prior to using the second number of portions of the neural network in parallel using the first number of parallel training data threads.
	Han teaches wherein the set of instructions further cause the one or more processors to at least train the neural network using the first number of parallel training data threads prior to using the second number of portions of the neural network in parallel using the first number of parallel training data threads. (In step D, an allocation granularity G is determined. The allocation granularity is the minimum number of GPUs required for accommodating one model. [0057]; In step E, data parallelism (DP) is determined. The data parallelism indicates the number of slices into which the overall training data is split. The data parallelism is calculated by using the formula DP=floor(D/G). [0058]; one training job includes multiple replication groups, training is performed between replication groups by using data parallelism, and DP defines the number of replication groups included in one training job. FIG. 3 is a schematic diagram of a parallel algorithm according to an embodiment of the present disclosure. [0061]. The Examiner notes that, because the data parallelism is determined, data parallel training is done before the model training.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Harlap to incorporate the method of Han for the benefit of speeding up the training of a deep learning model (Han, [0047]).
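	For illustration only, the formula DP=floor(D/G) quoted above from Han [0058] can be sketched as below. The quoted passages define G as the minimum number of GPUs required to accommodate one model but do not define D; the sketch assumes D is the number of GPUs available to the training job. The function name `data_parallelism` and its parameter names are hypothetical.

```python
def data_parallelism(d: int, g: int) -> int:
    """Compute DP = floor(D / G).

    d: assumed here to be the number of GPUs available to the job
       (the quoted passages of Han do not define D).
    g: allocation granularity, i.e. the minimum number of GPUs
       required to accommodate one model (Han [0057]).
    """
    if g <= 0:
        raise ValueError("allocation granularity G must be positive")
    if d < 0:
        raise ValueError("D must be non-negative")
    # Integer floor division matches floor(D / G) for non-negative inputs.
    return d // g


# Under the stated assumption, 8 GPUs with a 2-GPU granularity give 4 replication groups.
print(data_parallelism(8, 2))
```

	Under this reading, each of the DP replication groups holds one copy of the model spread over G GPUs, and any GPUs left over after the floor division are unused by the sketch.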
	
Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.G./Examiner, Art Unit 2121        

/DANIEL T PELLETT/Primary Examiner, Art Unit 2121