DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 5-10, and 12-15 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Harlap et al., “PipeDream: Fast and Efficient Pipeline Parallel DNN Training”, arxiv.org, published 8 June 2018 (accessed from https://arxiv.org/pdf/1806.03377, “Harlap”).
Regarding Claim 1:
Harlap teaches:
A method of processing a neural network model using a plurality of processors, the method comprising: allocating at least one slice to each layer from among a plurality of layers included in the neural network model;
{ Harlap, Page 5, Figure 4, Figure description } An example pipeline-parallel assignment [A method of processing a neural network model using a plurality of processors, the method comprising] with four machines [allocating at least one slice to each layer from among a plurality of layers included in the neural network model] and example timeline at one of the machines, highlighting temporal overlap of computation and activation/gradient communication [slices; the examiner notes that the stages are the boxes within the dotted line, and a layer is the dotted-line container that holds the forward work and backward work].

[Image: media_image1.png]

allocating each layer from among the plurality of layers to a processor from among the plurality of processors based on respective processing times of the plurality of processors for processing the at least one slice allocated to each layer; and
{ Harlap, Page 6, Figure 7, Figure description } PipeDream’s automated mechanism to partition DNN layers into stages [allocating each layer from among the plurality of layers to a processor from among the plurality of processors]. PipeDream first profiles the input DNN, to get estimates for each layer’s compute time [based on respective processing times] and output size. Using these estimates, PipeDream’s optimizer partitions layers across available machines [of the plurality of processors for processing the at least one slice allocated to each layer].
processing the neural network model by using the plurality of processors based on a result of the allocating,
{ Harlap, Page 7 - Runtime Analysis ff } Based on the partitioning generated by our algorithm, the optimal number of minibatches admitted per input stage to keep the pipeline full [processing the neural network model by using the plurality of processors] in steady state is given by ceiling((# machines) / (# machines in the input stage)). We refer to this quantity as the NUM_OPT_ACTIVE_MINIBATCHES [The examiner notes this is the “number of optimal active mini batches”] (NOAM) [based on a result of the allocating].
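For illustration only, the NOAM expression quoted above can be written as a one-line computation. This is a minimal sketch of the quoted formula, not code from Harlap; the function and parameter names are hypothetical.

```python
def num_opt_active_minibatches(num_machines, num_input_stage_machines):
    """NOAM as quoted: ceiling(# machines / # machines in the input stage).

    Names are hypothetical; only the formula comes from the cited passage.
    """
    if num_machines < 1 or num_input_stage_machines < 1:
        raise ValueError("machine counts must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-num_machines // num_input_stage_machines)


if __name__ == "__main__":
    # Four machines with an unreplicated input stage.
    print(num_opt_active_minibatches(4, 1))
```

Under this reading, a four-machine pipeline whose input stage runs on a single machine admits four minibatches before reaching steady state.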
wherein the processing times comprise a switching time elapsed for each processor from among the plurality of processors to receive data for processing a current slice from a previous processor from among the plurality of processors processing a previous slice.
{ Harlap, Page 6, PipeDream’s Partitioning Algorithm } Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor for each stage, and 3) optimal number of minibatches to keep the training pipeline busy [to receive data for processing a current slice]. The partitioning algorithm tries to minimize the overall training time of the model. For a pipelined system, this problem is equivalent to minimizing the time taken by the slowest stage of the pipeline [the processing times comprise a switching time elapsed]. { Harlap, Page 7 – Figure 8 – Steady State } [from a previous processor from among the plurality of processors processing a previous slice; the examiner notes that, after the startup state, the blocks shown in the steady state consist of distributed minibatches of forward work and a minibatch of backward work, which are assigned to different machines (i.e., processors)].
[Image: media_image2.png]
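For illustration only, the "minimize the slowest stage" objective quoted above can be sketched as a small dynamic program. This is a simplified sketch, not Harlap's algorithm: it assumes one stage per machine (no replication), and the names `partition_layers`, `compute_times`, and `comm_times` are hypothetical. Here `comm_times[i]` stands for the assumed total cost of passing activations and gradients across the boundary after layer `i`.

```python
from functools import lru_cache


def partition_layers(compute_times, comm_times, num_machines):
    """Split layers into contiguous stages so the slowest stage is as fast as possible.

    Simplifying assumptions (not from the reference): no stage replication,
    exactly one stage per machine, and at least as many layers as machines.
    Returns (bottleneck_time, cut_points), where each cut point is the index
    of the first layer of a new stage.
    """
    n = len(compute_times)
    prefix = [0.0]
    for t in compute_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(j, m):
        # Best bottleneck and cuts for layers [0, j) spread over m machines.
        if m == 1:
            return prefix[j], ()
        result = (float("inf"), ())
        for i in range(m - 1, j):  # the last stage holds layers [i, j)
            head_time, head_cuts = best(i, m - 1)
            bottleneck = max(head_time, comm_times[i - 1], prefix[j] - prefix[i])
            if bottleneck < result[0]:
                result = (bottleneck, head_cuts + (i,))
        return result

    return best(n, num_machines)


if __name__ == "__main__":
    print(partition_layers([1, 2, 3, 4], [0, 0, 0, 0], 2))
```

With per-layer times 1, 2, 3, 4 and free communication, the sketch cuts before the last layer, so the two stages take 6 and 4 time units and the bottleneck is 6.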

Regarding Claim 2:
Harlap teaches the method of claim 1.
Harlap further teaches:
wherein, as at least one layer from among the plurality of layers is determined as a slice point, each layer from among the plurality of layers is allocated to the at least one slice, and
{ Harlap, Page 5, Pipeline Parallelism, Paragraph 1 ff } Pipeline-parallel training partitions the layers of the model [The examiner notes that Harlap distinguishes between the layers of a model and the layers in a stage (of the model) in Figure 4 and Figure 7, shown below]
[Image: media_image3.png]
[Image: media_image4.png]
being trained into multiple stages [slice points] – each stage [wherein a slice point] contains a consecutive set of layers in the model [each layer from among the plurality of layers is allocated to the at least one slice; the examiner notes that in Figure 4 a layer in a stage would be Fn, Bn-x and Cn (in combination), and the term “slice” would be analogous to a stage, since a stage is made up of two consecutive layers that contain work (i.e., forward propagation, backward propagation, and communication)]. Each stage is mapped to a separate GPU that performs both the forward and backward pass for all the layers in that stage. ... Figure 4 shows a simple example of a pipeline-parallel assignment, where the DNN is split [determined as a slice point] across four machines.
wherein the slice point is determined based on at least one of whether each layer of the plurality of layers is a branching point of the plurality of layers, whether each layer of the plurality of layers is a point at which the plurality of layers is combined, whether each layer of the plurality of layers comprises a task able to be processed by a same processor, and whether each layer of the plurality of layers comprises a task that needs high accuracy.
{ Harlap, Page 6 ff } Taking these factors into account, given a DNN with N layers and M available machines, PipeDream first profiles the model on a single machine, and then runs a partitioning algorithm that groups layers [wherein the slice point is determined based on at least one of whether each layer of the plurality of layers is a branching point (i.e., Figure 7, branching point) of the plurality of layers (i.e., Figure 7, layers), whether each layer of the plurality of layers is a point at which the plurality of layers is combined (i.e., Figure 7, output layer), whether each layer of the plurality of layers comprises a task able to be processed by a same processor (i.e., Figure 7, stage #4, machine #8 → output layer), and whether each layer of the plurality of layers comprises a task that needs high accuracy (i.e., Figure 7, stage #4, machine #8 → output layer with softmax)] into stages, while also determining the replication factor for each stage that minimizes the overall training time for the model. Our profiling mechanism exploits the fact that DNN training shows little variance in the computation and communication time across minibatches.
[Image: media_image5.png]
 
Regarding Claim 3:
Harlap teaches the method of claim 1.
Harlap further teaches:
wherein each layer from among the plurality of layers is allocated to the processor from among the plurality of processors based on a path corresponding to a smallest sum of the processing times from among a plurality of paths generated as a plurality of nodes indicating combinations of different slices from among the at least one slice and different processors from among the plurality of processors are connected according to an order in which the at least one slice is arranged.
{ Harlap, Page 7: Work Scheduling, Paragraph 3, and Figure 8 } In the startup phase, the input stage admits NOAM minibatches to keep the pipeline full in steady state. Once in steady state, each stage alternates between performing the forward and backward pass for a minibatch. We call this mechanism one-forward-one-backward (1F1B) [wherein each layer from among the plurality of layers is allocated to the processor from among the plurality of processors based on a path corresponding to the smallest sum of the processing times from among the plurality of paths]. In a balanced pipeline, 1F1B ensures [as a plurality of nodes indicating] that no GPU is idle in steady state [combinations of different slices from among the at least one slice and different processors from among the plurality of processors are connected according to an order in which the at least one slice is arranged] and that we make forward progress in learning from each minibatch. [The examiner notes that Figure 8 is shown again below for reference, with the examiner's annotations mapping the terminology of the prior art.]
[Image: media_image6.png]
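For illustration only, the startup and steady-state behavior quoted above (admit NOAM minibatches, then alternate one forward and one backward pass) can be sketched as the order of operations at the input stage. This is a simplified sketch that ignores idle time and the other stages; the function name and tuple encoding are hypothetical.

```python
def one_f_one_b_order(noam, num_minibatches):
    """Order of forward ("F") and backward ("B") passes at the input stage.

    Simplified model: the stage first admits `noam` forward passes, then
    alternates one backward and one forward pass until all work is done.
    """
    ops = [("F", i) for i in range(1, min(noam, num_minibatches) + 1)]
    next_forward = len(ops) + 1
    for backward_id in range(1, num_minibatches + 1):
        ops.append(("B", backward_id))
        if next_forward <= num_minibatches:
            ops.append(("F", next_forward))
            next_forward += 1
    return ops


if __name__ == "__main__":
    for op in one_f_one_b_order(noam=4, num_minibatches=6):
        print(op)
```

With NOAM equal to 4, the sketch yields four forward passes followed by the alternating pattern B1, F5, B2, F6, and then the remaining backward passes.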

Regarding Claim 5:
Harlap teaches the method of claim 1.
Harlap further teaches:
further comprising: identifying at least one layer included in at least one slice allocated to a first processor of the plurality of processors;
{ Harlap, Page 9, Section 4 – Implementation, Paragraph 1 ff } PipeDream first profiles the model on a single machine with a subset of minibatches from the training dataset. It then runs the optimization algorithm described in Section 3.2 to partition the DNN model into k stages, with some stages replicated [further comprising: identifying at least one layer included in at least one slice]. The PipeDream runtime then assigns each stage to a single GPU [allocated to a first processor of the plurality of processors].
identifying at least one blob indicating at least one of data input to the at least one layer, data output from the at least one layer, and data temporarily stored in the at least one layer; and
{ Harlap, page 9, paragraph 8, Intermediate State } Each layer’s intermediate data is also assigned a unique blob ID [identifying at least one blob]. Upon receiving intermediate data from the prior stage (or from disk in the case of the input stage), PipeDream copies the intermediate data [indicating at least one of ... data temporarily stored in the at least one layer] to GPU memory and places a pointer to the associated buffer in a work queue. Intermediate data from the forward pass is not discarded until the associated minibatch completes that stage’s backward pass. Intermediate data from the backward pass is released as soon as the ML worker finishes using it, and if necessary, after it is sent to the next stage. Due to the differing requirements for intermediate data in the forward and backward pass, stages in PipeDream commonly manage multiple versions of intermediate data from forward passes, and just a single version of intermediate data from the currently running backward pass [data input to the at least one layer, data output from the at least one layer].
allocating memory to store data of the at least one blob.
{ Harlap, page 8, paragraph 8 ff: GPU Memory Management } As minibatches enter and leave the pipeline, the system has to ensure that the inputs, weights, and other intermediate state required [at least one blob] by the GPU for its computation are present [allocating memory to store] in GPU memory.
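For illustration only, the blob handling described in the cited passages (intermediate data keyed by a blob ID, retained from the forward pass until the same minibatch finishes its backward pass) can be sketched as a small keyed store. This is a minimal sketch under stated assumptions, not PipeDream's implementation; the class `IntermediateStore` and its methods are hypothetical, and plain Python objects stand in for GPU buffers.

```python
class IntermediateStore:
    """Hypothetical per-stage store of intermediate data, keyed by blob ID."""

    def __init__(self):
        # (minibatch_id, blob_id) -> data; a dict stands in for GPU memory.
        self._buffers = {}

    def put(self, minibatch_id, blob_id, data):
        """Keep forward-pass data for later use by the backward pass."""
        self._buffers[(minibatch_id, blob_id)] = data

    def get(self, minibatch_id, blob_id):
        """Fetch the data stored for one blob of one minibatch."""
        return self._buffers[(minibatch_id, blob_id)]

    def release(self, minibatch_id):
        """Discard all data for a minibatch once its backward pass completes."""
        stale = [key for key in self._buffers if key[0] == minibatch_id]
        for key in stale:
            del self._buffers[key]
        return len(stale)

    def live_count(self):
        """Number of blobs currently held."""
        return len(self._buffers)


if __name__ == "__main__":
    store = IntermediateStore()
    store.put(1, "conv1", [0.1])
    store.put(2, "conv1", [0.3])
    store.release(1)
    print(store.live_count())
```

Several minibatches can hold forward-pass data at once, which matches the passage's remark that stages commonly manage multiple versions of intermediate data.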
Regarding Claim 6:
Harlap teaches the method of claim 5.
Harlap further teaches:
wherein the allocating comprises: determining an order for processing the at least one layer; and
{ Harlap, page 2, paragraph 2 ff } ... unlike traditional uni-directional pipelines, DNN training is bi-directional—the forward pass is followed by a backward pass through the same layers [wherein the allocating comprises: determining an order for processing the at least one layer] in reverse order.
allocating memory for a current blob by determining whether a period for using a previous blob is terminated before data of the current blob is generated based on the order. 
{ Harlap, page 2, paragraph 2 ff } ... PipeDream interleaves forward and backward minibatch processing on each worker, while making sure to route minibatches through the same workers [allocating memory for a current blob by determining whether a period for using a previous blob is terminated] on the backward pass. { Harlap, Page 8, Weight Stashing, in more detail } Weight Stashing maintains multiple versions of the weights [allocating memory for a current blob], one for each active minibatch. When performing the forward pass, each stage processes a minibatch using the latest version of weights available. After completing the forward pass, PipeDream stores the weights used as part of the intermediate state for that minibatch. When performing the minibatch’s backward pass, the same version of the weights is used to compute the weight gradient. Weight stashing ensures that within a stage, the same version of model parameters are used for the forward and backward pass of a given minibatch. For example, in Figure 8, minibatch 5 uses parameter updates from batch 1 on machine 1 and from 2 on machine 2. Weight stashing says nothing about the consistency of parameter versions used for a given minibatch across stages.
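For illustration only, the weight stashing behavior quoted above (the forward pass uses the latest weights, and the backward pass of the same minibatch reuses that exact version) can be sketched as follows. This is a minimal sketch, not PipeDream's code; the class `WeightStash` and its methods are hypothetical, and strings stand in for weight tensors.

```python
class WeightStash:
    """Hypothetical per-stage stash: one weight version per active minibatch."""

    def __init__(self, initial_weights):
        self.latest = initial_weights
        self.stashed = {}  # minibatch_id -> weights used in its forward pass

    def forward(self, minibatch_id):
        """Use the latest weights and remember them for this minibatch."""
        self.stashed[minibatch_id] = self.latest
        return self.latest

    def backward(self, minibatch_id):
        """Return the version used in the forward pass and drop the stash entry."""
        return self.stashed.pop(minibatch_id)

    def apply_update(self, new_weights):
        """Install a newer weight version for subsequent forward passes."""
        self.latest = new_weights


if __name__ == "__main__":
    stash = WeightStash("v0")
    stash.forward(1)
    stash.apply_update("v1")
    stash.forward(2)
    print(stash.backward(1), stash.backward(2))
```

The stash entry is dropped when the backward pass consumes it, so the number of retained versions tracks the number of active minibatches at that stage.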
{ Harlap, Page 9, Intermediate State, blob-weight stashing in more detail } PipeDream copies the intermediate data to GPU memory and places a pointer to the associated buffer in a work queue. Intermediate data from the forward pass is not discarded until [by determining whether a period for using] the associated minibatch completes that stage’s backward pass. Intermediate data from the backward pass is released [a previous blob is terminated] as soon as the ML worker finishes using it, and if necessary, after it is sent to the next stage. Due to the differing requirements for intermediate data in the forward and backward pass, stages in PipeDream commonly manage multiple versions of intermediate data from forward passes, and just a single version of intermediate data from the currently running backward pass.
Regarding Claim 8:
Harlap teaches:
An electronic device for processing a neural network model, the electronic device comprising: a memory configured to store the neural network model;
{ Harlap, Page 10, 5.1 Experimental Setup: Clusters } Each machine has a E5-2698Bv3 Xeon CPU with 64 GB of RAM [An electronic device for processing a neural network model, the electronic device comprising: a memory configured to store the neural network model].
at least one processor configured to allocate at least one slice to each layer from among a plurality of layers included in the neural network model, allocate each layer from among the plurality of layers to a processor from among a plurality of processors based on respective processing times of the plurality of processors for processing the at least one slice allocated to each layer, process the neural network model based on a result of the allocation; and
{ Harlap, Page 10, 5.1 Experimental Setup: Clusters } Cluster-A is a private cluster of NVIDIA Titan X GPUs [at least one processor configured to allocate] with 12 GB of GPU device memory. { Harlap, Page 6, Figure 7, Figure description } PipeDream’s automated mechanism to partition DNN layers [at least one slice] into stages [to each layer from among a plurality of layers included in the neural network model]. PipeDream first profiles the input DNN, to get estimates for each layer’s compute time [based on respective processing times] and output size. Using these estimates, PipeDream’s optimizer partitions layers across available machines [allocate each layer from among the plurality of layers to a processor based on respective processing times of the plurality of processors]. { Harlap, Page 5, Figure 4, Figure description } An example pipeline-parallel assignment with four machines and example timeline at one of the machines, highlighting temporal overlap of computation and activation/gradient communication [for processing the at least one slice allocated to each layer, process the neural network model based on a result of the allocation; the examiner notes that slices are the boxes within the dotted line in Figure 4, shown below].
[Image: media_image3.png]

{ Harlap, Page 10, Section 5.1 Experimental Setup: Models and Training Methodology ff } For all the experiments, we measure the time taken to train the models until they reach their advertised validation accuracy: top-1 accuracy of 68% for VGG16, top-1 accuracy of 67% for Inception-v3, and METEOR score of 0.294 for S2VT. Guided by prior work, we adjust the learning rate [an outputter] during training to converge to the desired result faster [configured to output a result of processing the neural network model].
wherein the processing times comprises a switching time elapsed for each processor from among the plurality of processors to receive data for processing a current slice from a previous processor from among the plurality of processors processing a previous slice.
{ Harlap, Page 6, PipeDream’s Partitioning Algorithm } Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor for each stage, and 3) optimal number of minibatches to keep the training pipeline busy [for each processor from among the plurality of processors to receive data for processing a current slice]. The partitioning algorithm tries to minimize the overall training time of the model [wherein the processing times comprise a switching time elapsed]. For a pipelined system, this problem is equivalent to minimizing the time taken by the slowest stage of the pipeline. { Harlap, Page 7 – Figure 8 – Startup State }
[Image: media_image7.png]
[from a previous processor from among the plurality of processors processing a previous slice].
Regarding Claim 7:
Harlap teaches the method of claim 5.
Harlap further teaches:
wherein the allocating comprises determining a size of the memory based on a largest data size from among data sizes of the at least one blob to which a same memory is allocated.
{ Harlap, Page 6, Figure 7, Figure description } ... PipeDream first profiles the input DNN, to get estimates for each layer’s compute time and output size ... [based on a largest data size from among data sizes] { Harlap, Page 7, Paragraph 2} The optimal pipeline contains more than one stage. In this case, it can be broken into an optimal sub-pipeline consisting of layers from 1 through i with m - mʹ machines followed by a single stage with layers i + 1 through j replicated over mʹ machines. Then, using the optimal sub-problem property, we have
[Equation image: media_image8.png]
where the first term [determining a size of the memory] inside the max is the time taken by the slowest stage [based on a largest data size from among data sizes] of the optimal sub-pipeline between layers 1 and i with m - mʹ machines, the second term is the time taken to communicate the activations and gradients between layers i and i + 1, and the third term is the time taken by the single stage containing the remaining layers in a data-parallel configuration of mʹ machines. { Harlap, Page 9, Intermediate State ff } Each layer’s intermediate data is also assigned a unique blob ID [of the at least one blob]. Upon receiving intermediate data from the prior stage (or from disk in the case of the input stage), PipeDream copies the intermediate data to GPU memory [wherein the allocating comprises determining a size of the memory] and places a pointer to the associated buffer in a work queue. Intermediate data from the forward pass is not discarded [to which a same memory is allocated] until the associated minibatch completes that stage’s backward pass. Intermediate data from the backward pass is released as soon as the ML worker finishes using it, and if necessary, after it is sent to the next stage. Due to the differing requirements for intermediate data in the forward and backward pass, stages in PipeDream commonly manage multiple versions of intermediate data from forward passes, and just a single version of intermediate data from the currently running backward pass.
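Read together with the term-by-term description in the quoted passage, the cited expression has the general shape of the recurrence sketched below. This is inferred from that description and is not a transcription of the equation image: the symbol names A, C, and T, and the ranges of the two minimizations, are assumptions made here for readability. A(j, m) denotes the time of the slowest stage in the best pipeline over layers 1 through j using m machines, C_{i,i+1} the time to communicate activations and gradients between layers i and i + 1, and T(i+1 → j, mʹ) the time of a single stage holding layers i + 1 through j replicated over mʹ machines.

```latex
A(j, m) \;=\; \min_{1 \le i < j} \;\; \min_{1 \le m' < m} \;
\max \bigl\{\, A(i,\, m - m'),\;\; C_{i,\,i+1},\;\; T(i+1 \to j,\; m') \,\bigr\}
```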
Regarding Claim 9:
Harlap teaches the electronic device of claim 8.
Harlap further teaches:
wherein each layer from among the plurality of layers is allocated to the at least one slice by determining at least one layer of the plurality of layers as a slice point, and
{ Harlap, Page 5, Pipeline Parallelism, paragraph 1 ff } Pipeline-parallel training partitions the layers of the model [The examiner notes that Harlap distinguishes between the layers of a model and the layers in a stage, Figure 6] being trained into multiple stages [slice points] – each stage contains a consecutive set of layers in the model [wherein each layer from among the plurality of layers is allocated to the at least one slice (the examiner notes that in Figure 4 this would be Fn, Bn-x or Cn) by determining at least one layer of the plurality of layers as a slice point, and where a single layer may exist on multiple slices; that a single layer may exist on multiple slices is why there are minibatches, the purpose being to break the work up as evenly as possible]. Each stage is mapped to a separate GPU that performs both the forward and backward pass for all the layers in that stage. ... Figure 4 shows a simple example of a pipeline-parallel assignment, where the DNN is split [is determined as a slice point] across four machines.
wherein the slice point is determined based on at least one of whether each layer of the plurality of layers is a branching point of the plurality of layers, whether each layer of the plurality of layers is a point at which the plurality of layers are combined, whether each layer of the plurality of layers comprises a task able to be processed by a same processor, and whether each layer of the plurality of layers comprises a task that needs high accuracy.
{ Harlap, Page 5, Pipeline Parallelism, Paragraph 1 } Pipeline-parallel training partitions the layers of the model being trained into multiple stages – each stage contains a consecutive set of layers in the model. Each stage is mapped to a separate GPU that performs both the forward and backward pass for all the layers in that stage. We refer to the stage [wherein the slice point is determined based on at least one of] that contains the input layer [whether each layer of the plurality of layers is a branching point of the plurality of layers] as the input stage, and the one that contains the output layer [whether each layer of the plurality of layers is a point at which the plurality of layers is combined] as the output stage. { Harlap, Page 2 ff } Pipeline-parallel training has the potential to provide high DNN training performance when data parallelism struggles. In particular, inter-worker communication can be limited to activations (on the forward pass [whether each layer of the plurality of layers comprises a task (i.e., activation function) able to be processed by a same processor]) and gradients (backward [and whether each layer of the plurality of layers comprises a task (i.e., activation function) that needs high accuracy]) between adjacent layers assigned to different workers.
Regarding Claim 10:
Harlap teaches the electronic device of claim 8.
Harlap further teaches:
wherein each layer from among the plurality of layers is allocated to the processor from among the plurality of processors based on a path corresponding to a smallest sum of the processing times from among a plurality of paths generated as a plurality of nodes indicating combinations of different slices from among the at least one slice and different processors from among the plurality of processors are connected according to an order in which the at least one slice is arranged
{ Harlap, Page 7: Work Scheduling, Paragraph 3, and Figure 8 } In the startup phase, the input stage admits NOAM minibatches to keep the pipeline full in steady state. Once in steady state, each stage alternates between performing the forward and backward pass for a minibatch. We call this mechanism one-forward-one-backward (1F1B) [wherein each layer from among the plurality of layers is allocated to the processor from among the plurality of processors based on a path (i.e., minibatch) corresponding to the smallest sum of the processing times from among the plurality of paths (i.e., minibatches)]. In a balanced pipeline, 1F1B ensures that no GPU is idle in steady state [generated as a plurality of nodes (i.e., stages with blobs) indicating combinations of different slices (i.e., stages in a minibatch) from among the at least one slice (i.e., stage) and different processors (i.e., machines/GPUs) from among the plurality of processors are connected according to an order (i.e., 1F1B) in which the at least one slice (i.e., stage) is arranged] and that we make forward progress in learning from each minibatch.
[Image: media_image9.png]
{ Harlap, Page 2 ff, ‘1F1B’ in more detail } Bad partitionings, where stages have widely skewed amounts of work, can lead to workers spending significant time idle. PipeDream automatically determines how to partition the layers of the DNN based on a short profiling run, using an algorithm that balances computation load [corresponding (i.e., 1F1B)] among the different stages while minimizing communication. { Harlap, Page 6 ff, ‘1F1B’ in more detail } PipeDream’s partitioning algorithm must ensure that each stage roughly performs the same amount of total work. At the same time, the partitioning algorithm must also ensure that the amount of data communicated [of the processing times] across stages is as small [to the smallest sum] as possible, to avoid communication stalls. Load imbalance across machines or excessive communication between machines can lower hardware efficiency (throughput).
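For orientation only, the path-based allocation recited in claims 3 and 10 (nodes that pair a slice with a processor, connected in slice order, with the path of smallest total processing time selected) can be pictured as a shortest-path computation. The sketch below illustrates the claim language and is not taken from Harlap or from the application; the function name, the two input tables, and the treatment of switching time as a per-pair cost are all assumptions.

```python
def allocate_slices(proc_time, switch_time):
    """Pick one processor per slice so that the summed time is smallest.

    proc_time[s][p]   : assumed time for processor p to process slice s
    switch_time[q][p] : assumed time for p to receive data from q
    Returns (total_time, processors), with one processor index per slice.
    """
    num_slices = len(proc_time)
    num_procs = len(proc_time[0])
    best = list(proc_time[0])  # best[p]: cheapest path ending at (slice 0, p)
    back = []                  # back[s - 1][p]: predecessor processor chosen
    for s in range(1, num_slices):
        new_best, choices = [], []
        for p in range(num_procs):
            q = min(range(num_procs), key=lambda r: best[r] + switch_time[r][p])
            new_best.append(best[q] + switch_time[q][p] + proc_time[s][p])
            choices.append(q)
        best = new_best
        back.append(choices)
    p = min(range(num_procs), key=lambda r: best[r])
    total = best[p]
    path = [p]
    for choices in reversed(back):  # walk predecessors back to slice 0
        p = choices[p]
        path.append(p)
    path.reverse()
    return total, path


if __name__ == "__main__":
    print(allocate_slices([[1, 5], [5, 1]], [[0, 1], [1, 0]]))
```

When switching is cheap, the sketch moves the second slice to the faster processor; when switching is expensive, it keeps both slices on one processor.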
Regarding Claim 12:
Harlap teaches the electronic device of claim 8.
Harlap further teaches:
wherein the at least one processor identify at least one layer included in at least one slice allocated to a first processor of the plurality of processors, identify at least one blob indicating at least one of data input to the at least one layer, data output from the at least one layer, and data temporarily stored in the at least one layer, and
{ Harlap, Page 6, Figure 7, Figure description } ... PipeDream first profiles the input DNN, to get estimates for each layer’s compute time and output size ... [based on a largest data size from among data sizes] { Harlap, Page 7, Paragraph 2} The optimal pipeline contains more than one stage. In this case, it can be broken into [wherein the at least one processor] an optimal sub-pipeline consisting of layers from 1 through i with m - mʹ machines followed by a single stage with layers [identify at least one layer included in at least one slice] i + 1 through j replicated [identify at least one blob indicating at least one of data input to the at least one layer, data output from the at least one layer, and data temporarily stored in the at least one layer] over mʹ machines [allocated to a first processor of the plurality of processors]. Then, using the optimal sub-problem property, we have 
[Equation image: media_image8.png]
where the first term inside the max is the time taken by the slowest stage [based on a largest data size from among data sizes of the at least one blob] of the optimal sub-pipeline between layers 1 and i with m - mʹ machines, the second term is the time taken to communicate the activations and gradients between layers i and i + 1, and the third term is the time taken by the single stage containing the remaining layers in a data-parallel configuration of mʹ machines.
allocate memory to store data of the at least one blob.
{ Harlap, Page 9, Intermediate State, Paragraph 1 } Each layer’s intermediate data [at least one blob] is also assigned a unique blob ID. Upon receiving intermediate data from the prior stage (or from disk in the case of the input stage), PipeDream copies the intermediate data to GPU memory [allocate memory to store data of the at least one blob] and places a pointer to the associated buffer in a work queue. Intermediate data from the forward pass is not discarded until the associated minibatch completes that stage’s backward pass. Intermediate data from the backward pass is released as soon as the ML worker finishes using it, and if necessary, after it is sent to the next stage.
Regarding Claim 13:
Harlap teaches the electronic device of claim 12.
Harlap further teaches:
wherein the at least one processor determine an order for processing the at least one layer, and allocate memory for a current blob by determining whether a period for using a previous blob is terminated before data of the current blob is generated based on the order.
{ Harlap, Page 2, Paragraph 2 ff } PipeDream interleaves forward and backward minibatch processing [allocating memory for a current blob] on each worker, while making sure to route minibatches through the same workers [by determining whether a period for using] on the backward pass [a previous blob is terminated].
Regarding Claim 14:
Harlap teaches the electronic device of claim 12.
Harlap further teaches:
wherein the at least one processor determine a size of the memory to store data of the at least one blob based on a largest data size from among data sizes of the at least one blob to which a same memory is allocated.
{ Harlap, Page 6, Figure 7, Figure description } ... PipeDream first profiles the input DNN, to get estimates for each layer’s compute time and output size ... [based on a largest data size from among data sizes] { Harlap, Page 7, Paragraph 2} The optimal pipeline contains more than one stage. In this case, it can be broken into an optimal sub-pipeline consisting of layers from 1 through i with m - mʹ machines followed by a single stage with layers i + 1 through j replicated over mʹ machines. Then, using the optimal sub-problem property, we have 
[Equation image: media_image8.png]
where the first term inside the max is the time taken by the slowest stage [based on a largest data size from among data sizes] of the optimal sub-pipeline between layers 1 and i with m - mʹ machines, the second term is the time taken to communicate the activations and gradients [of the at least one blob] between layers i and i + 1, and the third term is the time taken by the single stage containing the remaining layers in a data-parallel configuration of mʹ machines.
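For orientation only, the memory limitations recited in claims 6-7 and 13-14 (reusing memory for a current blob once the period for using a previous blob has ended, and sizing shared memory by the largest blob assigned to it) can be pictured with the sketch below. It illustrates the claim language and is not taken from Harlap or from the application; the function name, the tuple layout, and the first-fit reuse policy are assumptions.

```python
def plan_buffers(blobs):
    """Assign blobs to reusable buffers based on when each blob is in use.

    blobs: list of (name, size, first_use, last_use), where the use steps
    follow the assumed processing order of the layers.
    Returns (assignment, buffer_sizes).
    """
    buffers = []     # each entry: {"size": int, "free_after": int}
    assignment = {}  # blob name -> buffer index
    for name, size, first_use, last_use in sorted(blobs, key=lambda b: b[2]):
        for index, buf in enumerate(buffers):
            # Reuse only if the previous blob's period of use has ended.
            if buf["free_after"] < first_use:
                # Shared memory is sized by the largest blob placed in it.
                buf["size"] = max(buf["size"], size)
                buf["free_after"] = last_use
                assignment[name] = index
                break
        else:
            buffers.append({"size": size, "free_after": last_use})
            assignment[name] = len(buffers) - 1
    return assignment, [buf["size"] for buf in buffers]


if __name__ == "__main__":
    print(plan_buffers([("a", 100, 0, 1), ("b", 50, 1, 2), ("c", 200, 2, 3)]))
```

In the usage example, blob "c" reuses the memory of blob "a" because "a" is no longer needed when "c" is produced, and that shared buffer is sized for the larger of the two.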
Regarding Claim 15:
Harlap teaches:
A non-transitory computer-readable recording medium having recorded thereon a program for implementing the method of claim 1.
{ Harlap, Page 10, 5.1 Experimental Setup: Clusters } We used two different clusters in our experiments [for implementing the method of claim 1]. Cluster-A is a private cluster of NVIDIA Titan X GPUs with 12 GB of GPU device memory. Each machine has a E5-2698Bv3 Xeon CPU with 64 GB of RAM. The machines are connected via a 25 Gbps Ethernet interface. Cluster-B is public cloud cluster (AWS p3.2xlarge instances) of NVIDIA V100 GPUs, with 16 GB of GPU device memory. Each machine has a E5-2690 Xeon CPU, 64 GB of RAM with a 10 Gbps Ethernet interface. Machines on both clusters [A non-transitory computer-readable recording medium having recorded thereon a program] run 64-bit Ubuntu 16.04 with CUDA toolkit 8.0 and cuDNN v6.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Harlap et al., “PipeDream: Fast and Efficient Pipeline Parallel DNN Training”, arxiv.org, published 8 June 2018 (accessed from https://arxiv.org/abs/1806.03377, “Harlap”) in view of Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”, 13th USENIX Symposium on Operating Systems Design and Implementation, 10 October 2018 (accessed from https://www.usenix.org/conference/osdi18/presentation/chen, “Chen”).
Regarding Claim 4:
Harlap teaches the method of claim 1.
Harlap does not teach wherein a plurality of pieces of input data to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data, and, wherein for the plurality of channels of the layer, as much memory size as needed to process a task in a first channel from among the plurality of channels is allocated.
Chen teaches wherein a plurality of pieces of input data to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data, and, wherein for the plurality of channels of the layer, as much memory size as needed to process a task in a first channel from among the plurality of channels is allocated.
{ Chen, page 6/17, Section 4.1 & 4.2: Paragraph 1 ff } The following code shows an example tensor expression to compute transposed matrix multiplication: 
[Code image: media_image10.png]
Each compute operation specifies both the shape of the output tensor and an expression describing how to compute each element of it [wherein a plurality of pieces of input data (e.g., m, n, h) to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data (e.g., t.compute((m, n), lambda y, x: t.sum(A[k, y] * B[k, x], axis=k))), and, wherein for the plurality of channels of the layer (e.g., A, B)]. Our tensor expression language supports common arithmetic and math operations and covers common DL operator patterns. The language does not specify the loop structure and many other execution details, and it provides flexibility for adding hardware-aware optimizations for various backends. Adopting the decoupled compute/schedule principle from Halide, we use a schedule to denote a specific mapping from a tensor expression to low-level code. ... An alternative to the shared-nothing approach is to fetch data cooperatively. Specifically, groups of threads can cooperatively fetch the data they all need and place it into a shared memory space [as much memory size as needed to process (e.g., C) a task in a first channel (e.g., A), from among the plurality of channels (e.g., A, B) is allocated]. This optimization can take advantage of the GPU memory hierarchy and enable data reuse across threads through shared memory regions. TVM supports this well-known GPU optimization using a schedule primitive to achieve optimal performance.
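For reference, the transposed matrix multiplication that the quoted tensor expression describes (each output element at row y, column x is the sum over k of A[k,y] * B[k,x]) can be written as a plain-Python loop nest. This is a minimal sketch and is not TVM code; the function name `transposed_matmul` and the nested-list representation of A and B are assumptions made for illustration.

```python
def transposed_matmul(A, B):
    """Compute C[y][x] = sum over k of A[k][y] * B[k][x] (hypothetical helper).

    A is an h-by-m nested list and B is an h-by-n nested list, so the
    result C is m-by-n, matching the output shape (m, n) in the quoted text.
    """
    h = len(A)     # length of the reduction axis k
    m = len(A[0])  # number of output rows
    n = len(B[0])  # number of output columns
    return [
        [sum(A[k][y] * B[k][x] for k in range(h)) for x in range(n)]
        for y in range(m)
    ]


# Small illustrative inputs, not data from Chen.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = transposed_matmul(A, B)
```

The sketch fixes one particular loop order. The quoted passage notes that the tensor expression language itself leaves the loop structure unspecified and delegates it to a schedule.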
In view of the teachings of Chen, it would have been obvious for a person of ordinary skill in the art, before the effective filing date, to apply the teachings of Chen to the teachings of Harlap, in order to increase performance of the machine learning model { cf. Chen, Page 6, Section 4.1, Paragraph 3 ff “... Internally, TVM uses a data structure to keep track of the loop structure and other information as we apply schedule transformations. This information can then help generate low-level code for a given final schedule.”}
Regarding Claim 11:
Harlap teaches the electronic device of claim 8.
Harlap does not teach wherein a plurality of pieces of input data to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data.
Chen teaches wherein a plurality of pieces of input data to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data.
{ Chen, page 6/17, Section 4.1 & 4.2: Paragraph 1 ff } The following code shows an example tensor expression to compute transposed matrix multiplication: 
[Code image: media_image10.png]
Each compute operation specifies both the shape of the output tensor and an expression describing how to compute each element of it [wherein a plurality of pieces of input data (e.g., m, n, h) to be input to a layer of the plurality of layers are sequentially processed for each channel from among a plurality of channels of a same number as a number of the plurality of pieces of input data (e.g., t.compute((m, n), lambda y, x: t.sum(A[k, y] * B[k, x], axis=k))), and, wherein for the plurality of channels of the layer (e.g., A, B)]. Our tensor expression language supports common arithmetic and math operations and covers common DL operator patterns.
In view of the teachings of Chen, it would have been obvious for a person of ordinary skill in the art, before the effective filing date, to apply the teachings of Chen to the teachings of Harlap, for the same rationale as in claim 4.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHARD CARL STANLEY whose telephone number is (571)272-2002. The examiner can normally be reached Monday-Friday 8:30am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at michael.huntley@uspto.gov. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/R.C.S./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129