Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The disclosure is objected to because of the following informalities: 
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claims 1-20: the method of determining based on “an estimated data volume” is indefinite.  The instant specification does not disclose estimating a data volume.  Since one of ordinary skill in the art would not be able to determine the method or accuracy of said estimation, it would be impossible to quantify how a 

Regarding claim 1, “Determining by at least one M processor cores” is indefinite.  One of ordinary skill in the art would expect M to represent an integer value, such that if the value were two the limitation would read “Determining by at least one two processor cores”, which is contradictory and indefinite.  In the interest of further examination the claim is interpreted as “Determining by at least one processor core”.

Regarding claim 6, “trained by any two of the at least one M processors” is indefinite.  The limitation suggests that only one processor is necessary while simultaneously requiring training to be performed on any two processors.  This limitation is contradictory.  In the interest of further examination the limitation is interpreted as only requiring a single processor.

Regarding claim 14, “the total duration” lacks antecedent basis.  It is unclear whether “the total duration” refers to the “third total duration” or one of the previously introduced total durations or a fourth unspecified total duration.

Regarding claim 14, “wherein the total duration comprises a smaller value in the first total duration and the second total duration” is indefinite.  The first, second, and third durations are expected to be values, so how the total duration (a value) can comprise a 

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lopes (“STOCHASTIC GPU-BASED MULTITHREAD IMPLEMENTATION OF MULTIPLE BACK-PROPAGATION”, 2010).

Regarding claim 1, Lopes teaches A training method for a neural network model applied to a training system, wherein the training method comprises: ([Abstract] "In this paper, we propose a GPU implementation of the online (stochastic) training mode of the Multiple Back-Propagation (MBP) algorithm").
determining, by each of at least one M processor cores ([p. 273] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version" [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads" Optimizing to the specific training mode by the processor is interpreted as synonymous with determining by each of at least one processor core.  A CUDA thread is an explicit processor-specific implementation, intended to run across multiple processor cores.).
for each layer of L layers of the neural network model, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer").
a model training mode of a layer of the L layers based on an estimated data volume in a model parameter set and an estimated data volume of output data of the layer, wherein the training system comprises the M processor cores, and wherein M and L are integers greater than or equal to 1; and (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Size of the data interpreted as synonymous with estimated data volume.  Adjusting the weights of a given layer interpreted as synonymous with training.).
performing, by each of the M processor cores, training to the layer using a determined model training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode. ([p. 273 Col. 1] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version" Data parallel training mode is interpreted as synonymous with batch training mode.  Model parallel training mode is interpreted as synonymous with online training mode.). 

Regarding claim 2, Lopes teaches The training method of claim 1, wherein the determined model training mode of a (j−1)th layer of the L layers is the data parallel training mode, wherein j is an integer greater than 1 and less than or equal to L ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." j being less than or equal to L and training a (j−1)th layer is interpreted as not performing batch training on the output layer. Layer L is interpreted as synonymous with the output of the main network.  Lopes explicitly teaches that main network outputs cannot be calculated before previous layers are trained.).
wherein the performing comprises performing data parallel training on a model parameter of the (j−1)th layer, ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." Batch mode interpreted as synonymous with data parallel training.).
wherein first output data is used as input data of a jth layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.).
and wherein the first output data is output data obtained by each of the M processor cores training the (j−1)th layer. ([p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores"). 

Regarding claim 3, Lopes teaches The training method of claim 1, wherein the determined model training mode of a (j−1)th layer of the L layers is the model parallel training mode, wherein j is an integer greater than 1 and less than or equal to L, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." j being less than or equal to L 
wherein the performing comprises performing model parallel training on a model parameter of the (j−1)th layer, ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." Online mode interpreted as synonymous with model parallel training.).
wherein second output data is used as input data of a jth layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.).
wherein the second output data is output data obtained by m processor cores training the (j−1)th layer, wherein the m processor cores are one or more of the M processor cores used for training the (j−1)th layer, wherein m is an integer greater than or equal to 1 and less than or equal to M, and wherein a value of m of at least one of the L layers is greater than 1. ([p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores"). 

Regarding claim 4, Lopes teaches The training method of claim 1, wherein when the estimated data volume in the model parameter set is not greater than the estimated data volume of the output data of a layer, the model training mode of the layer is the data parallel training mode. ([p. 274 Sec. 5.1] "a space network with Ni inputs and Nh1 outputs." Table 4 Friedman column shows a mini-batch size of 32 being used on an output set of 40.  Batch size is interpreted as synonymous with estimated data volume in the model parameter set. Data parallel training mode interpreted as synonymous with batch training mode.). 

Regarding claim 5, Lopes teaches The training method of claim 1, wherein when the estimated data volume in the model parameter set is greater than the estimated data volume of the output data of a layer, the model training mode of the layer is the model parallel training mode. ([p. 272 Sec. 3] "ypk is the output of neuron k, mpk the importance of the neuron for the network output that varies accordingly to the pattern (stimulus) presented" [p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern. Therefore in the online mode the kernels must be called Np times (for each layer) in each epoch." Lopes explicitly teaches that batch size of online training is far larger than output data size.  Batch size is interpreted as synonymous with estimated data volume in the model parameter set.  Model parallel training mode is interpreted as synonymous with online training mode.). 
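
For illustration only, the comparison recited in claims 4 and 5 may be read as a per-layer rule of the following kind. This is a minimal sketch under stated assumptions, not the applicant's or Lopes's implementation: the names LayerEstimate, param_volume, output_volume and choose_training_mode are hypothetical, and the numeric volumes are invented example values.

```python
from dataclasses import dataclass

DATA_PARALLEL = "data_parallel"
MODEL_PARALLEL = "model_parallel"


@dataclass
class LayerEstimate:
    """Hypothetical container for the two estimates recited in claims 4 and 5."""
    name: str
    param_volume: int   # estimated data volume in the model parameter set
    output_volume: int  # estimated data volume of the layer's output data


def choose_training_mode(layer: LayerEstimate) -> str:
    # Claim 4: parameters not greater than output data -> data parallel mode.
    if layer.param_volume <= layer.output_volume:
        return DATA_PARALLEL
    # Claim 5: parameters greater than output data -> model parallel mode.
    return MODEL_PARALLEL


# Invented example values: one layer with few parameters and large outputs,
# one layer with many parameters and small outputs.
layers = [
    LayerEstimate("layer_a", param_volume=1_000, output_volume=50_000),
    LayerEstimate("layer_b", param_volume=4_000_000, output_volume=4_096),
]
modes = {layer.name: choose_training_mode(layer) for layer in layers}
```

Under this reading the mode is selected independently for each layer, so a single network may mix both modes.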

Regarding claim 6, Lopes teaches The training method of claim 1, wherein the determined model training mode of a (j−1)th layer of the L layers is the model parallel training mode, wherein j is an integer greater than 1 and less than or equal to L, wherein the performing comprises: ([Abstract] "it is commonly accepted that batch size is an important parameter for offline tuning" [p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." [p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." j being less than or equal to L and training a (j−1)th layer is interpreted as not performing batch training on the output layer. Layer L is interpreted as synonymous with the output of the main network.  Lopes explicitly teaches that main network outputs cannot be calculated before previous layers are trained.).
determining, based on a model parameter set of a jth layer of the L layers, a model parameter subset of the jth layer that is to be trained by each of the M processor cores; and performing the model parallel training on the model parameter subset of the jth layer, ([p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern...they might be used to train the NNs using small batches of patterns (they could also be used to train the networks in batch mode, but they would be inefficient compared to the kernels designed for that purpose). This implementation is sometimes referred as mini-batch where the networks are trained using blocks of Nb patterns (1 < Nb < Np)" Mini-batch interpreted as synonymous with model parameter subset.).
wherein second output data is used as input data of the jth layer and an intersection set between model parameter subsets of the jth layer that are trained by any two of the at least one M processors is empty, (See FIG. 1, layer output is fed as input into next layer. [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores").
wherein the second output data is output data obtained by m processor cores training a (j−1)th layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.).
and wherein a union set of model parameter subsets of the jth layer that are trained by all of the at least one M processor core is equal to a universal set of model parameters of the jth layer. ([p. 273 Sec. 4] "Although in the online mode the kernels process a single pattern, they are actually capable of processing several patterns in parallel. Thus they might be used to train the NNs using small batches of patterns (they could also be used to train the networks in batch mode, but they would be inefficient compared to the kernels designed for that purpose). This implementation is sometimes referred as mini-batch where the networks are trained using blocks of Nb patterns (1 < Nb < Np)." Lopes explicitly teaches that the purpose of the mini-batch is to 
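
For illustration only, the two set conditions recited in claim 6 (an empty intersection between any two subsets, and a union equal to the universal set of model parameters of the layer) describe a partition of the layer's parameters. The sketch below is one possible reading under stated assumptions: split_parameters and is_partition are hypothetical helper names, and the round-robin split is an arbitrary choice that is not taken from the claims or from Lopes.

```python
from typing import Iterable, List, Sequence, Set


def split_parameters(param_ids: Sequence[int], num_cores: int) -> List[Set[int]]:
    """Assign parameter indices to cores round-robin (an assumed scheme)."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return [set(param_ids[core::num_cores]) for core in range(num_cores)]


def is_partition(subsets: Iterable[Set[int]], universe: Iterable[int]) -> bool:
    """Check the claim 6 conditions: pairwise disjoint and covering the universe."""
    seen: Set[int] = set()
    for subset in subsets:
        if seen & subset:      # a shared parameter means a non-empty intersection
            return False
        seen |= subset
    return seen == set(universe)  # union must equal the universal set


param_ids = list(range(10))               # ten hypothetical parameter indices
subsets = split_parameters(param_ids, 3)  # three hypothetical processor cores
partition_ok = is_partition(subsets, param_ids)
```

A split that assigns the same parameter to two cores, or that leaves a parameter unassigned, fails the check.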

Regarding claim 7, Lopes teaches The training method of claim 1, wherein based on the model parallel training mode being used for a jth layer, the method further comprises: dividing second output data into a first input data subblock and a second input data subblock, wherein the second output data is output data obtained by m processor cores training a (j−1)th layer of the L layers; (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Input data subblock interpreted as synonymous with thread block.).
using the second output data as input data of the jth layer of the L layers; (See FIG. 1, layer output is fed as input into next layer.).
performing model parallel training on a model parameter of the jth layer of the L layers, comprising: ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." Online mode interpreted as synonymous with model parallel training.).
receiving the first input data subblock; ([p. 272 Sec. 2] "This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores" Scheduling a thread interpreted as synonymous with receiving the first input data subblock.).
performing in parallel all of the following: performing the model parallel training on the model parameter of the jth layer based on the first input data subblock to obtain first output subdata of the jth layer; ([p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set").
receiving the second input data subblock; and ([p. 272 Sec. 2] "This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores" Scheduling a thread interpreted as synonymous with receiving the second input data subblock.).
performing the model parallel training on the model parameter of the jth layer based on the second input data subblock to obtain second output subdata of the jth layer; and ([p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set").
transmitting the first output subdata of the jth layer to a (j+1)th layer of the L layers. (See FIG. 1, layer output is fed as input into next layer.). 

Regarding claim 8, Lopes teaches A training apparatus for a neural network model, wherein the training apparatus comprises: ([Abstract] "In this paper, we propose a GPU implementation of the online (stochastic) training mode of the Multiple Back-Propagation (MBP) algorithm").
a memory configured to store instructions;  a processor coupled to the memory and configured to execute the instructions, wherein the processor comprises at least one processor core; and ([p. 272 Sec. 2] "Threads within a block can cooperate among themselves by sharing data and synchronizing their execution to coordinate memory accesses.").
a transceiver coupled to the processor and the memory, wherein the training apparatus is applicable to a training system that comprises M processor cores, ([p. 273] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version...In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads" Optimizing to the specific training mode by the processor is interpreted as synonymous with determining by each of at least one processor core.).
wherein the neural network model comprises L layers, wherein M and L are integers greater than or equal to 1, wherein for each layer of the L layers, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer").
the at least one processor core is used to train the layer, wherein the processor is configured to control the transceiver to transmit data to a second processor core in the M processor cores, and wherein the instructions cause each of the at least one processor core to be configured to: ([p. 272 Sec. 2] "Kernels are executed in parallel by different CUDA threads, on a physically separate device (GPU) that operates as a co-processor to the host (CPU) running the program. Threads are organized into blocks, containing up to 512 threads...Threads within a block can cooperate among themselves by sharing data and synchronizing their execution to coordinate memory accesses.").
determining, a model training mode of the layer based on an estimated data volume in a model parameter set and an estimated data volume of output data of the layer, wherein the training system comprises at least one M processor cores; and (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Size of the data interpreted as synonymous with estimated data volume.).
performing, an training to the layer using a determined training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode. ([p. 273 Col. 1] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version" Data parallel training mode is interpreted as synonymous with batch training mode.  Model parallel training mode is interpreted as synonymous with online training mode.). 

Regarding claim 9, claim 9 effectively mirrors claim 2 and is therefore rejected under a similar interpretation.

Regarding claim 10, claim 10 effectively mirrors claim 3 and is therefore rejected under a similar interpretation.

Regarding claim 11, claim 11 effectively mirrors claim 4 and is therefore rejected under a similar interpretation.

Regarding claim 12, claim 12 effectively mirrors claim 5 and is therefore rejected under a similar interpretation.

Regarding claim 13, claim 13 effectively mirrors claim 6 and is therefore rejected under a similar interpretation.

Regarding claim 15, claim 15 effectively mirrors claim 8 and is therefore rejected under a similar interpretation.

Regarding claim 16, claim 16 effectively mirrors claim 2 and is therefore rejected under a similar interpretation.



Regarding claim 18, claim 18 effectively mirrors claim 4 and is therefore rejected under a similar interpretation.

Regarding claim 19, claim 19 effectively mirrors claim 5 and is therefore rejected under a similar interpretation.

Regarding claim 20, claim 20 effectively mirrors claim 6 and is therefore rejected under a similar interpretation.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Lopes in view of Jin (US 10282809 B2).

Regarding claim 14, Lopes teaches The training apparatus of claim 8, wherein when the model parallel training mode is used for a jth layer of the L layers and before the performing, the instructions further cause each of the at least one processor core to: ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer"). However, Lopes does not explicitly teach set a value of i to an integer that is greater than or equal to 1 and less than or equal to M; 
update the value of i, wherein the updated value of i is another integer greater than or equal to 1 and less than or equal to M; 
estimate a second total duration of updated i processor cores on training, wherein the second total duration is an estimated total duration of the updated i processor cores on receiving the second input data and training the model parameter of the jth layer based on the second input data, wherein each value of i corresponds to one total duration; 
determine a third total duration based on a sum of a quantity of first total durations and a quantity of second total durations is equal to a quantity threshold, 
wherein the total duration comprises a smaller value in the first total duration and the second total duration; and 
use a second value of i that corresponds to the total duration with a smaller value as a determined value of a quantity of the at least one processor core used for training the jth layer.  

Jin, who teaches a related art of training a neural network, teaches set a value of i to an integer that is greater than or equal to 1 and less than or equal to M; ([Col. 13 l. 49] "i is the sequence number of a worker group or GPU" [Col. 19 l. 34] "dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N;" M is interpreted as synonymous with 2N).
estimate a first total duration of i processor cores on training, wherein the first total duration is an estimated total duration of all the i processor cores on receiving a second input data and training the model parameter of the jth layer based on the second input data; (See FIG. 21. First total duration is interpreted as simply first relative iteration of the adaptive learning rate.).
update the value of i, wherein the updated value of i is another integer greater than or equal to 1 and less than or equal to M; ([Col. 13 l. 40] "In the training flow shown in FIG. 5, an auxiliary variable is further updated.").
estimate a second total duration of updated i processor cores on training, wherein the second total duration is an estimated total duration of the updated i processor cores on receiving the second input data and training the model parameter of the jth layer based on the second input data, wherein each value of i corresponds to one total duration; (See FIG. 21.  First total duration is interpreted as simply first relative iteration of the adaptive learning rate.).
determine a third total duration based on a sum of a quantity of first total durations and a quantity of second total durations is equal to a quantity threshold, ([Col. 14 l. 14-50] "Therefore, in the scene of data parallel, an adaptive learning rate updating formula for the parameter in the ith position should be expressed as:" See Eqn. on lines 15-50 helper_sum interpreted as a quantity of second total duration. Third total duration interpreted as result of helper_sum.).
wherein the total duration comprises a smaller value in the first total duration and the second total duration; and ([Col. 14 l. 14-50] "Therefore, in the scene of data parallel, an adaptive learning rate updating formula for the parameter in the ith position should be expressed as:" See Eqn. on lines 15-50 helper_sum interpreted as a quantity of second total duration. Third total duration interpreted as result of helper_sum.).
use a second value of i that corresponds to the total duration with a smaller value as a determined value of a quantity of the at least one processor core used for training the jth layer. ([Col. 13 l. 49] "i is the sequence number of a worker group or GPU" [Col. 19 l. 34] "dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N;" The value of i in the second iteration is interpreted as a second value of i that corresponds to the total duration with a smaller value as a determined value of a quantity of the at least one processor core.).
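
For illustration only, the limitations of claim 14 may be read as a search over candidate core counts i (1 ≤ i ≤ M) that estimates one total duration per candidate and keeps the candidate with the smallest estimate, stopping once the number of estimates reaches the quantity threshold. The sketch below is one possible reading under stated assumptions and does not resolve the indefiniteness noted above: pick_core_count and toy_estimate are hypothetical names, and the toy cost model (receive cost growing with i, training cost shrinking with i) is invented for the example.

```python
from typing import Callable


def pick_core_count(
    max_cores: int,
    estimate_total_duration: Callable[[int], float],
    quantity_threshold: int,
) -> int:
    """Return the core count i with the smallest estimated total duration.

    At most quantity_threshold candidates are evaluated (an assumed reading
    of the claimed quantity threshold).
    """
    if max_cores < 1 or quantity_threshold < 1:
        raise ValueError("max_cores and quantity_threshold must be at least 1")
    best_i = 1
    best_duration = float("inf")
    for i in range(1, min(max_cores, quantity_threshold) + 1):
        duration = estimate_total_duration(i)  # one total duration per value of i
        if duration < best_duration:           # keep the smaller total duration
            best_i, best_duration = i, duration
    return best_i


def toy_estimate(i: int) -> float:
    # Invented cost model: receiving input data costs 0.5 per core,
    # training work of 8.0 is divided evenly across i cores.
    return 0.5 * i + 8.0 / i


chosen = pick_core_count(max_cores=8, estimate_total_duration=toy_estimate,
                         quantity_threshold=8)
```

With the invented cost model the estimate falls until i = 4 and rises afterwards, so the search settles on four cores; a smaller quantity threshold cuts the search short.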


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Loshchilov (“ONLINE BATCH SELECTION FOR FASTER TRAINING OF NEURAL NETWORKS”, 2016), Wang (US 10572800 B2), Wang (US 20150019214 A1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124



/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124