Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on April 14, 2022, in which claims 1-6, 8-12, and 15-19 are amended. Claims 1-20 are currently pending.

Specification
Applicant's amendments to the specification are acknowledged. Examiner's objections to the specification are hereby withdrawn, as necessitated by Applicant's amendments to the specification.

Response to Arguments
The original rejections of claims 1 and 6 under 35 U.S.C. § 112(b) are hereby withdrawn, as necessitated by Applicant's amendments and remarks. The rejections of claim 14 have not been addressed and are therefore maintained without traverse.
Applicant's arguments with respect to the rejection of claims 1-20 under 35 U.S.C. 103(a), based on the amendment, have been considered but are not deemed persuasive.
With respect to Applicant's argument that Lopes fails to teach selecting a different model training mode of a layer that is selected from at least two model training modes, Examiner respectfully disagrees. Lopes explicitly teaches that, for the majority of the backpropagation algorithms, training may be selected from and performed in either a batch or an online training mode. It is further unclear what the different model training mode is different from, as no prior model training mode is introduced such that a 'different' mode could be considered synonymous with a 'second' mode. Examiner also asserts that there is insufficient support in the instant specification for a different model training mode such that one of ordinary skill in the art would be able to readily and reasonably understand the utility and novelty of the amended claim limitation. Because of the lack of clarity in the newly introduced claim limitations, Examiner has interpreted a different model training mode to be any training mode selected from either a batch or an online training mode, which is taught by the previously introduced primary reference Lopes and addresses the proposed claim limitations. See the 112(a) and 112(b) rejections below for further explanation of the interpretation of the newly introduced claim limitation. For these reasons the rejection is maintained.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

With respect to claims 1, 8, and 15, the instant specification does not describe a “different model training mode”, how said model training mode is differentiated, or how said model training mode would be selected differently from a plurality of model training modes. The “different model training mode” is interpreted as incorporating new matter into the claim which does not have support in the original disclosure.

The remaining claims are rejected with respect to their dependence on the rejected claims. 

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claims 1, 8, and 15, "a different model training mode" is indefinite. A model training mode is not introduced in the claims prior to a different model training mode; therefore, it is not clear what the different model training mode is different from. Furthermore, there is no support in the instant specification for a "different model training mode" or how it is selected. In the interest of further examination, this is interpreted as synonymous with a model training mode, and further uses of a different model training mode in the claim are interpreted similarly.
Regarding claim 14, “the total duration” lacks antecedent basis.  It is unclear whether “the total duration” refers to the “third total duration” or one of the previously introduced total durations or a fourth unspecified total duration.

Regarding claim 14, “wherein the total duration comprises a smaller value in the first total duration and the second total duration” is indefinite.  The first, second, and third durations are expected to be values, so how the total duration (a value) can comprise a value in the first and second values is indefinite.  In the interest of further examination this is interpreted as the total duration containing a sum of the two values.

The remaining claims are rejected with respect to their dependence on the rejected claims. 

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-13 and 15-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lopes (“STOCHASTIC GPU-BASED MULTITHREAD IMPLEMENTATION OF MULTIPLE BACK-PROPAGATION”, 2010).
	
Regarding claim 1, Lopes teaches A training method for a neural network model applied to a training system, wherein the training method comprises: ([Abstract] "In this paper, we propose a GPU implementation of the online (stochastic) training mode of the Multiple Back-Propagation (MBP) algorithm")
	determining, by each of at least one M processor cores ([p. 273] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version...In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads" Optimizing to the specific training mode by the processor is interpreted as synonymous with determining by each of at least one processor cores.)
	for each layer of L layers of the neural network model, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer")
	a different model training mode of a layer of the L layers that is selected from at least two model training modes and is based on a data volume in a model parameter set and a data volume of output data of the layer, wherein the training system comprises the M processor cores, and wherein M and L are integers greater than or equal to 1; and (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Size of the data is interpreted as synonymous with data volume.)
	performing, by each of the M processor cores, training to the layer using a different one of the at least two model training mode, wherein the model training mode comprise a data parallel training mode or a model parallel training mode. ([p. 273 Col. 1] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version" Data parallel training mode is interpreted as synonymous with batch training mode.  Model parallel training mode is interpreted as synonymous with online training mode.). 
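For illustration only, the batch, online, and mini-batch modes that Lopes describes at p. 273 can be contrasted with a minimal Python sketch. The function name and the pattern count are hypothetical and do not come from Lopes or from the claims; the sketch only shows how many patterns would be processed together under each mode.

```python
def pattern_blocks(num_patterns, block_size):
    """Split pattern indices 0..num_patterns-1 into consecutive blocks."""
    return [
        list(range(start, min(start + block_size, num_patterns)))
        for start in range(0, num_patterns, block_size)
    ]


NP = 6  # hypothetical number of training patterns (Np in Lopes)

# Batch mode: a single block holding all Np patterns.
batch_blocks = pattern_blocks(NP, NP)
# Online mode: Np blocks of one pattern each.
online_blocks = pattern_blocks(NP, 1)
# Mini-batch: blocks of Nb patterns with 1 < Nb < Np (Nb = 2 assumed here).
mini_blocks = pattern_blocks(NP, 2)
```

Under these assumptions, the online mode issues one block per pattern per epoch, while the batch mode issues a single block per epoch.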

	Regarding claim 2, Lopes teaches The training method of claim 1, wherein the model training mode of a (j−1)th layer of the L layers is the data parallel training mode, wherein j is an integer greater than 1 and less than or equal to L ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk ) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." With j greater than 1 and less than or equal to L, training a (j−1)th layer is interpreted as not performing batch training on the output layer. Layer L is interpreted as synonymous with the output of the main network. Lopes explicitly teaches that main network outputs cannot be calculated before previous layers are trained.)
	 wherein the performing comprises performing data parallel training on a model parameter of the (j−1)th layer ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections.")
	wherein first output data is used as input data of a jth layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.)
	and wherein the first output data is output data obtained by each of the M processor cores training the (j−1)th layer. ([p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores"). 

	Regarding claim 3, Lopes teaches The training method of claim 1, wherein the model training mode of a (j−1)th layer of the L layers is the model parallel training mode, wherein j is an integer greater than 1 and less than or equal to L, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk ) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." With j greater than 1 and less than or equal to L, training a (j−1)th layer is interpreted as not performing batch training on the output layer. Layer L is interpreted as synonymous with the output of the main network. Lopes explicitly teaches that main network outputs cannot be calculated before previous layers are trained.)
	wherein the performing comprises performing model parallel training on a model parameter of the (j−1)th layer, ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." Online mode interpreted as synonymous with model parallel training.)
	wherein second output data is used as input data of a jth layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.)
	wherein the second output data is output data obtained by m processor cores training the (j−1)th layer, wherein the m processor cores are one or more of the M processor cores used for training the (j−1)th layer, wherein m is an integer greater than or equal to 1 and less than or equal to M, and wherein a value of m of at least one of the L layers is greater than 1. ([p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores").

	Regarding claim 4, Lopes teaches The training method of claim 1, wherein when the data volume in the model parameter set is not greater than the data volume of the output data of a layer, the model training mode of the layer is the data parallel training mode. ([p. 274 Sec. 5.1] "a space network with Ni inputs and Nh1 outputs." Table 4 Friedman column shows a mini-batch size of 32 being used on an output set of 40. Batch size is interpreted as synonymous with data volume in the model parameter set. Data parallel training mode is interpreted as synonymous with batch training mode.).

	Regarding claim 5, Lopes teaches The training method of claim 1, wherein when the data volume in the model parameter set is greater than the data volume of the output data of a layer, the model training mode of the layer is the model parallel training mode. ([p. 272 Sec. 3] "ypk is the output of neuron k, mpk the importance of the neuron for the network output that varies accordingly to the pattern (stimulus) presented" [p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern. Therefore in the online mode the kernels must be called Np times (for each layer) in each epoch." Lopes explicitly teaches that batch size of online training is far larger than output data size. Batch size is interpreted as synonymous with data volume in the model parameter set. Model parallel training mode is interpreted as synonymous with online training mode.).
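The comparison recited in claims 4 and 5 can be sketched as follows, for illustration only. The `Layer` type, the field names, and the example volumes are hypothetical; neither the claims nor Lopes specify units or a data structure, so both volumes are assumed here to be plain element counts.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    param_volume: int   # assumed: data volume in the model parameter set
    output_volume: int  # assumed: data volume of the layer's output data


def select_training_mode(layer):
    """Return the mode recited in claims 4 and 5 for one layer."""
    if layer.param_volume <= layer.output_volume:
        return "data_parallel"   # claim 4: parameters not greater than output
    return "model_parallel"      # claim 5: parameters greater than output


# Hypothetical layers chosen only to exercise both branches.
layers = [
    Layer("conv1", param_volume=1_000, output_volume=50_000),
    Layer("fc1", param_volume=4_000_000, output_volume=4_096),
]
modes = {layer.name: select_training_mode(layer) for layer in layers}
```

In this sketch the selection is made independently for each layer, so two layers of the same network can end up with different modes.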

	Regarding claim 6, Lopes teaches The training method of claim 1, wherein the model training mode of a (j−1)th layer of the L layers is the model parallel training mode, wherein j is an integer greater than 1 and less than or equal to L, wherein the performing comprises: ([Abstract] "it is commonly accepted that batch size is an important parameter for offline tuning" [p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer" [p. 272 Col. 2] "The main network can only calculate its outputs after knowing the outputs (mpk ) of the space network. Thus the two networks will function in a collaborative manner and must also be trained together." [p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." With j greater than 1 and less than or equal to L, training a (j−1)th layer is interpreted as not performing batch training on the output layer. Layer L is interpreted as synonymous with the output of the main network. Lopes explicitly teaches that main network outputs cannot be calculated before previous layers are trained.)
	determining, based on a model parameter set of a jth layer of the L layers, a model parameter subset of the jth layer that is to be trained by each of the M processor cores; and performing the model parallel training on the model parameter subset of the jth layer, ( [p. 272 §2] "This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores enabling programmers to write code that scales with the number of cores present on the device" [p. 273 § 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set, while in the online mode they process a single pattern...they might be used to train the NNs using small batches of patterns (they could also be used to train the networks in batch mode, but they would be inefficient compared to the kernels designed for that purpose). This implementation is sometimes referred as mini-batch where the networks are trained using blocks of Nb patterns (1 < Nb < Np)" Mini-batch interpreted as synonymous with model parameter subset.)
	wherein second output data is used as input data of the jth layer and an intersection set between model parameter subsets of the jth layer that are trained by any two processor cores of the M processor cores is empty, (See FIG. 1, layer output is fed as input into next layer. [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores")
	wherein the second output data is output data obtained by m processor cores training a (j−1)th layer of the L layers, (See FIG. 1, layer output is fed as input into next layer.)
	and wherein a union set of model parameter subsets of the jth layer that are trained by all of the at least one M processor core is equal to a universal set of model parameters of the jth layer. ([p. 273 Sec. 4] "Although in the online mode the kernels process a single pattern, they are actually capable of processing several patterns in parallel. Thus they might be used to train the NNs using small batches of patterns (they could also be used to train the networks in batch mode, but they would be inefficient compared to the kernels designed for that purpose). This implementation is sometimes referred as mini-batch where the networks are trained using blocks of Nb patterns (1 < Nb < Np)." Lopes explicitly teaches that the purpose of the mini-batch is to split up the universal set of model parameters such that the union set is interpreted as equal to the universal set.). 
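The set conditions recited in claim 6 (pairwise empty intersection, union equal to the universal set) can be checked with a short sketch. The round-robin assignment and the integer parameter identifiers below are assumptions made for illustration; the claim does not prescribe how the subsets are formed.

```python
def partition_parameters(param_ids, num_cores):
    """Assign parameter identifiers to cores round-robin (assumed scheme)."""
    return [param_ids[core::num_cores] for core in range(num_cores)]


param_ids = list(range(10))  # hypothetical parameters of the jth layer
subsets = partition_parameters(param_ids, num_cores=3)

# Intersection of the subsets of any two cores is empty.
pairwise_disjoint = all(
    set(subsets[a]).isdisjoint(subsets[b])
    for a in range(len(subsets))
    for b in range(a + 1, len(subsets))
)
# Union over all cores equals the full parameter set of the layer.
covers_all = set().union(*subsets) == set(param_ids)
```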

	Regarding claim 7, Lopes teaches The training method of claim 1, wherein based on the model parallel training mode being used for a jth layer, the method further comprises: dividing second output data into a first input data subblock and a second input data subblock, wherein the second output data is output data obtained by m processor cores training a (j−1)th layer of the L layers; (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Input data subblock interpreted as synonymous with thread block.)
	using the second output data as input data of the jth layer of the L layers (See FIG. 1, layer output is fed as input into next layer.)
	performing model parallel training on a model parameter of the jth layer of the L layers, comprising: ([p. 273 Col. 1] "Table 1 identifies the purpose of the kernels implemented for the online and batch mode versions. The kernels FireLayer, FireOutputLayer, CalcLocalGradients and CorrectWeights were designed to operate on a generic network layer with Nn neurons, each with Ni inputs (not including the bias) and No output connections." Online mode interpreted as synonymous with model parallel training.)
	receiving the first input data subblock ([p. 272 Sec. 2] "This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores" Scheduling a thread interpreted as synonymous with receiving the first input data subblock.)
	performing in parallel all of the following: performing the model parallel training on the model parameter of the jth layer based on the first input data subblock to obtain first output subdata of the jth layer; ([p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set")
	receiving the second input data subblock; and ([p. 272 Sec. 2] "This requirement allows the set of thread blocks, called a grid, to be scheduled in any order across any number of cores" Scheduling a thread interpreted as synonymous with receiving the second input data subblock.)
	performing the model parallel training on the model parameter of the jth layer based on the second input data subblock to obtain second output subdata of the jth layer; and ([p. 273 Sec. 4] "In the batch mode, those kernels process (in parallel) all the Np patterns contained in the training data set")
	transmitting the first output subdata of the jth layer to a (j+1)th layer of the L layers. (See FIG. 1, layer output is fed as input into next layer.). 

	Regarding claim 8, Lopes teaches A training apparatus for a neural network model, wherein the training apparatus comprises: ([Abstract] "In this paper, we propose a GPU implementation of the online (stochastic) training mode of the Multiple Back-Propagation (MBP) algorithm")
	a memory configured to store instructions;
a processor coupled to the memory and configured to execute the instructions, wherein the processor comprises at least one processor core; and ([p. 272 Sec. 2] "Threads within a block can cooperate among themselves by sharing data and synchronizing their execution to coordinate memory accesses.")
	a transceiver coupled to the processor and the memory, wherein the training apparatus is applicable to a training system that comprises M processor cores, ( Optimizing to the specific training mode by the processor is interpreted as synonymous with determining by each of at least one processor cores.)
	wherein the neural network model comprises L layers, wherein M and L are integers greater than or equal to 1, wherein for each layer of the L layers, ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer")
	the at least one processor core is used to train the layer, wherein the processor is configured to control the transceiver to transmit data to a second processor core in the M processor cores, and wherein the instructions cause each of the at least one processor core to be configured to: ([p. 272 Sec. 2] "Kernels are executed in parallel by different CUDA threads, on a physically separate device (GPU) that operates as a co-processor to the host (CPU) running the program. Threads are organized into blocks, containing up to 512 threads...Threads within a block can cooperate among themselves by sharing data and synchronizing their execution to coordinate memory accesses.")
	determine, a different model training mode of the layer based on a data volume in a model parameter set and a data volume of output data of the layer, wherein the training system comprises at least one M processor cores; and (See Table 1 CorrectWeights "Adjust the weights of a given layer. For the batch mode the step sizes are also updated." [p. 272 Col. 1] "Kernels are executed in parallel by different CUDA threads...The number of thread blocks in a grid is typically dictated by the size of the data being processed rather than by the number of processors in the system, which it can greatly exceed." Size of the data is interpreted as synonymous with data volume.)
	performing, training to the layer using a different one of the at least two model training modes, wherein the model training modes comprise at least one of a data parallel training mode or a model parallel training mode. ([p. 273 Col. 1] "The online implementation shares much of the code of the batch implementation. Nevertheless there are significant differences in the kernel implementations and although they might have similar names, they are optimized to the specific version" Data parallel training mode is interpreted as synonymous with batch training mode. Model parallel training mode is interpreted as synonymous with online training mode.).

	Claims 9-13 and 15 are substantially similar to claims 2-6 and 8, respectively. Therefore, the rejections applied to claims 2-6 and 8 also apply to claims 9-13 and 15.

	Claims 16-20 are substantially similar to claims 2-6. Therefore, the rejections applied to claims 2-6 also apply to claims 16-20.
	
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Lopes in view of Jin (US 10282809 B2).

	Regarding claim 14, Lopes teaches The training apparatus of claim 8, wherein when the model parallel training mode is used for a jth layer of the L layers and before the performing, the instructions further cause each of the at least one processor core to: ([p. 274 Col. 1] "The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with Ni neurons, a hidden layer with Nh1 neurons with selective activation, an optional second hidden layer")
	However, Lopes does not explicitly teach: set a value of i to an integer that is greater than or equal to 1 and less than or equal to M;
	estimate a first total duration of i processor cores on training, wherein the first total duration is a total duration of all the i processor cores on receiving a second input data and training the model parameter of the jth layer based on the second input data;
	update the value of i, wherein the updated value of i is another integer greater than or equal to 1 and less than or equal to M;
	estimate a second total duration of updated i processor cores on training, wherein the second total duration is a total duration of the updated i processor cores on receiving the second input data and training the model parameter of the jth layer based on the second input data, wherein each value of i corresponds to one total duration;
	determine a third total duration based on a sum of a quantity of first total durations and a quantity of second total durations is equal to a quantity threshold,
	wherein the total duration comprises a smaller value in the first total duration and the second total duration; and
	use a second value of i that corresponds to the total duration with a smaller value as a value of a quantity of the at least one processor core used for training the jth layer.

Jin teaches set a value of i to an integer that is greater than or equal to 1 and less than or equal to M; ([Col. 13 l. 49] " i is the sequence number of a worker group or GPU" [Col. 19 l. 34] "dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N;" M is interpreted as synonymous with 2N)
	estimate a first total duration of i processor cores on training, wherein the first total duration is a total duration of all the i processor cores on receiving a second input data and training the model parameter of the jth layer based on the second input data; (See FIG. 21. First total duration is interpreted as simply first relative iteration of the adaptive learning rate.)
	update the value of i, wherein the updated value of i is another integer greater than or equal to 1 and less than or equal to M; ([Col. 13 l. 40] "In the training flow shown in FIG. 5, an auxiliary variable is further updated, In one implementation, the auxiliary variable is the sum of squares of auxiliary gradients (helper_sum) used for computing an adaptive learning rate, and a computational formula thereof is as follows:
helper_sum_i′ = helper_sum_i + Δw_i^2")
	estimate a second total duration of updated i processor cores on training, wherein the second total duration is a total duration of the updated i processor cores on receiving the second input data and training the model parameter of the jth layer based on the second input data, wherein each value of i corresponds to one total duration; (See FIG. 21. Second total duration is interpreted as simply second relative iteration of the adaptive learning rate.)
	determine a third total duration based on a sum of a quantity of first total durations and a quantity of second total durations is equal to a quantity threshold, ([Col. 14 l. 14-50] "Therefore, in the scene of data parallel, an adaptive learning rate updating formula for the parameter in the ith position should be expressed as:" See Eqn. on lines 15-50 helper_sum interpreted as a quantity of second total duration. Third total duration interpreted as result of helper_sum.)
	wherein the total duration comprises a smaller value in the first total duration and the second total duration; and ([Col. 14 l. 14-50] "Therefore, in the scene of data parallel, an adaptive learning rate updating formula for the parameter in the ith position should be expressed as:" See Eqn. on lines 15-50 helper_sum interpreted as a quantity of second total duration. Third total duration interpreted as result of helper_sum.)
	use a second value of i that corresponds to the total duration with a smaller value as a value of a quantity of the at least one processor core used for training the jth layer. ([Col. 13 l. 49] " i is the sequence number of a worker group or GPU" [Col. 19 l. 34] "dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N;" The value of i in the second iteration is interpreted as a second value of i that corresponds to the total duration with a smaller value as a value of a quantity of the at least one processor core.).

	Lopes and Jin are both directed towards methods of training a neural network in a highly multithreaded environment including batch training.  Therefore, Lopes and Jin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network training system in Lopes with Jin.  One of ordinary skill in the art would recognize that, in a system designed to minimize processing time in training a neural network, determining processing time at the core level would be advantageous.  Jin provides an additional benefit as motivation for the combination ([Col. 13 l. 13] “the greater the concurrency value of data parallel is, the more significant the performance benefit of the linear topology is”).  Timing core parallelism is interpreted as essential to increasing the concurrency.
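
	As a non-limiting illustration of the recited selection step, choosing the value of i whose total duration is smallest can be sketched as follows. The function name select_core_count and the duration figures are hypothetical. The sketch assumes that one estimated total duration is already available for each candidate value of i, and it does not represent the implementation of Lopes, Jin, or the instant application.

```python
from typing import Dict


def select_core_count(estimated_durations: Dict[int, float]) -> int:
    """Return the core count i whose estimated total duration is smallest."""
    if not estimated_durations:
        raise ValueError("at least one candidate core count is required")
    # Compare candidate core counts by their estimated duration.
    return min(estimated_durations, key=estimated_durations.get)


# Hypothetical estimates, in seconds, for training one layer on i cores.
durations = {1: 8.0, 2: 4.5, 4: 3.0, 8: 3.6}
best_i = select_core_count(durations)
print(best_i)
```

	In these assumed figures the duration rises again at eight cores, which is why the selection compares durations rather than simply taking the largest core count.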

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126