DETAILED ACTION
This office action is in response to the Application No. 16249279 filed on 08/11/2022. Claims 1-27 are presented for examination and are currently pending. Applicant’s arguments have been carefully and respectfully considered.

Response to Arguments
2.	The remarks made by the Applicant on 08/11/2022 have overcome the 35 U.S.C. 112(a) rejections of 06/09/2022, and those rejections are therefore withdrawn.
	The Applicant argued on page 14 of the Remarks that “neither Strom, Alistarh, the remaining references, nor any combination thereof discloses, teaches, or suggests each and every claimed feature of independent claim 1.”
	The arguments above are not persuasive because Strom teaches a processor-implemented neural network method, the method (when using stochastic gradient descent to train a neural network (e.g., a deep neural network), a computing device may compute a gradient (e.g., a set of elements with a separate update value for each parameter of the model) for each input vector of training data, or for some subset of training data, col 3 lines 20-24) comprising: calculating one or more respective individual gradient values for updating a weight of a neural network; (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60); tuning the one or more respective individual gradient values, (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19), each having respective bit digits, to correspond to bit digits of a residual gradient value; (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48); summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values, and the residual gradient value; (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value); updating the residual gradient value dependent on a result of the summing; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4); wherein the residual gradient value is dependent on an accumulating of one or more previous individual gradient values for updating the weight in a previous time (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2).
	Majumdar teaches selectively updating the weight dependent on whether the intermediate summation value overlaps bit digits of the weight, to train the neural network (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034]). So, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar, because establishing that deep learning tensors conform to these requirements during training may improve results (Majumdar [0017]).
	The Applicant argued on page 16 of the Remarks that “As shown, the Office Action asserts that Strom's quantizing of the update value (block 608) discloses the claimed "tuning the one or more respective individual gradient values" and Strom's adding of the partial gradient to the residual gradient (block 410) discloses the originally claimed "summing the tuned one or more respective individual gradient values, the residual gradient value, and the weight.”
	The arguments above are not persuasive because Strom teaches tuning the one or more respective individual gradient values (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19), … summing the tuned one or more respective individual gradient values, the residual gradient value, and the weight; (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to … the value for the same model parameter (as weight), col 8, lines 5-8).
	The Applicant argued on page 17 of the Remarks that “That is, while the Office Action asserts Strom's quantizing discloses the claimed "tuning the one or more respective individual gradient values," the partial gradient that is added to the residual gradient at block 410 is not quantized”
	The arguments above are not persuasive because Strom clearly teaches the quantized gradient values (as tuned individual gradient values) (col 4, lines 17-18). Strom also teaches that, in some embodiments, the updates that exceed the threshold are quantized (col 4, lines 14-17), and that, in adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value of the residual gradient (col 8, lines 5-8). Since updates that exceed the threshold value are quantized (col 4, lines 14-17), and a particular update value of a partial gradient can exceed a threshold (col 11, lines 27-32), it is obvious that each individual update value from the partial gradient that exceeds a threshold is quantized and added to the corresponding update value of the residual gradient.
	The Applicant argued on page 17 of the Remarks that “… and therefore Strom's adding of the partial gradient to the residual gradient at block 410 fails to disclose the claimed "summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values and the residual gradient value.”
	The arguments above are not persuasive. According to the Applicant’s disclosure, the “intermediate summation value 740” is the residual gradient value added to the individual gradient value (Instant Specification, US20200012936, [0128], Fig. 7). Therefore, Strom teaches summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values and the residual gradient value (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value). That is, the partial gradient, the residual gradient, and the weight are added together in Strom.
	The Applicant argued on page 17 of the Remarks that “Strom fails to disclose quantizing the partial gradient and adding the quantized partial gradient, the residual gradient, and the salient gradient”.
	The arguments above are not persuasive because they are not directed to the claim limitations. The Applicant is arguing what is not claimed.
	The Applicant argued on page 17 of the Remarks that “Accordingly, Strom fails to disclose, teach, or suggest "tuning the one or more respective individual gradient values, each having respective bit digits, to correspond to bit digits of a residual gradient value” and “summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values, and the residual gradient value” as recited in independent claim 1”
	The arguments above are not persuasive because Strom teaches tuning the one or more respective individual gradient values (the quantized gradient values (as tuned individual gradient values), col 4, lines 17-18), each having respective bit digits, to correspond to bit digits of a residual gradient value (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48), summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values, and the residual gradient value (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value). That is, the partial gradient, the residual gradient, and the weight are added together.
	The Applicant argued on page 17 of the Remarks that “For example, for at least the below reasons as with respect to independent claim 22, Swartzlander fails to disclose, teach, or suggest "summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values and the residual gradient value," and "selectively updating the weight dependent on whether the intermediate summation value overlaps bit digits of the weight, to train the neural network," as recited in independent claim 1.”
	The arguments above are not persuasive because the summing of the weight and an intermediate summation value of the tuned one or more respective individual gradient values and the residual gradient value is not claimed in independent claim 22. Also, Swartzlander is not applied in the rejection of independent claim 1 to teach the selective updating of the weight dependent on whether the intermediate summation value overlaps bit digits of the weight, to train the neural network.
	The Applicant argued on page 18 of the Remarks that “Applicant respectfully disagrees and respectfully submits that neither Strom, Swartzlander, the remaining references, nor any combination thereof discloses, teaches, or suggests each and every claimed feature of independent claim 22”.
	The arguments above are not persuasive because Strom teaches a processor-implemented neural network method, the method comprising: (when using stochastic gradient descent to train a neural network (e.g., a deep neural network), a computing device may compute a gradient (e.g., a set of elements with a separate update value for each parameter of the model) for each input vector of training data, or for some subset of training data, col 3 lines 20-24), calculating an individual gradient value for updating a weight of a neural network; accumulating one or more of the calculated individual gradient values for updating the weight; (the gradient computation module 122 may be configured to compute a partial gradient 314 (as individual gradient) that includes a collection of updates to the individual parameters of the model 304, col 7 lines 56- 59; component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, col 7 lines 65-67), determining whether a value, of the accumulated one or more of the calculated individual gradient values, meets a threshold; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold, col 8, lines 32-36, Fig. 4) and in response to the determining of whether the result of the accumulating meets the threshold being that the result of the accumulating meets the threshold, updating the weight dependent on the result of the accumulating, (Rather than update every parameter of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied. 
In some embodiments, a threshold may be chosen such that the number of elements with update values exceeding the threshold, and therefore the number of parameters to be updated, col 3, lines 25-31 and col. 8, line 62-col. 9, line 8: block 414), and Swartzlander teaches wherein the threshold is a least significant bit digit of the weight (the least significant bit of the weights comprising the weight vector W and the threshold, pg. 0723, right col, last para.). So, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Swartzlander for the benefit of implementing parallel counters with up to 1022 inputs as required to realize multi-layer neural networks with up to 1000 neurons per layer (Swartzlander, abstract).
	The Applicant argued on page 20 of the Remarks that “Swartzlander fails to disclose that the threshold is a least significant bit of the weight” as recited in independent claim 22.
	The arguments above are not persuasive because Swartzlander teaches the threshold is a least significant bit of the weight (the least significant bit of the weights comprising the weight vector W and the threshold, pg. 0723, right col, last para.). So, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Swartzlander for the benefit of implementing parallel counters with up to 1022 inputs as required to realize multi-layer neural networks with up to 1000 neurons per layer (Swartzlander, abstract). The Swartzlander reference is good for all it teaches. The Office relies on the relevant part of the Swartzlander reference to teach that the threshold is a least significant bit of the weight (the least significant bit of the weights comprising the weight vector W and the threshold, pg. 0723, right col, last para.).
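	For illustration only, the limitation of claim 22 discussed above (accumulating individual gradient values and updating the weight once the accumulated value meets a threshold equal to the least significant bit digit of the weight) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Strom, Swartzlander, or the instant specification: the fixed-point format with 8 fractional bits, the constant names `FRAC_BITS` and `LSB`, and the function name `accumulate_and_update` are all hypothetical.

```python
# Assumption: the weight is stored as an integer count of LSB-sized steps
# in a hypothetical fixed-point format with 8 fractional bits.
FRAC_BITS = 8
LSB = 2.0 ** -FRAC_BITS  # value of the weight's least significant bit


def accumulate_and_update(weight_fixed: int, accumulated: float, gradient: float):
    """Accumulate one gradient value; update the weight only when the
    accumulated value meets the LSB threshold."""
    accumulated += gradient
    if abs(accumulated) >= LSB:
        steps = int(accumulated / LSB)  # whole LSB steps, truncated toward zero
        weight_fixed += steps
        accumulated -= steps * LSB      # keep the sub-LSB remainder
    return weight_fixed, accumulated


weight_fixed, accumulated = 256, 0.0  # 256 steps of 1/256 represents 1.0
history = []
for _ in range(4):
    weight_fixed, accumulated = accumulate_and_update(
        weight_fixed, accumulated, LSB / 4
    )
    history.append(weight_fixed)

print(history, accumulated)
# prints: [256, 256, 256, 257] 0.0
```

	In this sketch, each gradient value is one quarter of the least significant bit, so the weight is left unchanged for three iterations and is updated on the fourth, when the accumulated value first meets the threshold.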


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


3.	Claims 1, 4, 6, 8, 17, 18 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al. (US20190042945).

	Regarding claim 1, Strom teaches a processor-implemented neural network method, the method (when using stochastic gradient descent to train a neural network (e.g., a deep neural network), a computing device may compute a gradient (e.g., a set of elements with a separate update value for each parameter of the model) for each input vector of training data, or for some subset of training data, col 3 lines 20-24) comprising: 
	calculating one or more respective individual gradient values for updating a weight of a neural network; (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60)  
	tuning the one or more respective individual gradient values, (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19),
	each having respective bit digits, to correspond to bit digits of a residual gradient value; (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48);
	summing the weight and an intermediate summation value of the tuned one or more respective individual gradient values, and the residual gradient value; (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value)  
	updating the residual gradient value dependent on a result of the summing; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	wherein the residual gradient value is dependent on an accumulating of one or more previous individual gradient values for updating the weight in a previous time (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2)
	Strom does not explicitly teach selectively updating the weight dependent on whether the intermediate summation value overlaps bit digits of the weight, to train the neural network.
	Majumdar teaches selectively updating the weight dependent on whether the intermediate summation value overlaps bit digits of the weight, to train the neural network (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034]).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar, because establishing that deep learning tensors conform to these requirements during training may improve results (Majumdar [0017]).
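	For illustration only, the bit-digit overlap condition addressed in the rejection of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Strom, Majumdar, or the instant specification: the function names `overlaps_weight_bits` and `selective_update` are hypothetical, the weight is assumed to be a 64-bit floating point number, and "overlap" is approximated as "adding the value changes the stored weight".

```python
def overlaps_weight_bits(weight: float, value: float) -> bool:
    # Assumption: "overlap" means the value is large enough to change the
    # float64 representation of the weight when the two are added.
    return weight + value != weight


def selective_update(weight: float, residual: float, gradient: float):
    # Intermediate summation value: residual gradient plus individual gradient.
    intermediate = residual + gradient
    if not overlaps_weight_bits(weight, intermediate):
        # Too small to reach the weight's bit digits: keep accumulating.
        return weight, intermediate
    new_weight = weight + intermediate
    # The part of the sum that did not fit in the weight stays in the residual.
    new_residual = intermediate - (new_weight - weight)
    return new_weight, new_residual


weight, residual, naive = 1.0, 0.0, 1.0
for _ in range(100):
    weight, residual = selective_update(weight, residual, 1e-17)
    naive = naive + 1e-17  # direct addition discards every small update

print(weight > 1.0, naive == 1.0)
# prints: True True
```

	In this sketch, a gradient value of 1e-17 is too small to change a weight of 1.0 when added directly, but the accumulated residual eventually overlaps the bit digits of the weight and the weight is then updated.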

	Regarding claim 4, Modified Strom teaches the method of claim 1, wherein the summing (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9) comprises: 
	mapping the tuned one or more respective individual gradient values and the residual gradient value based on a set bit number, and calculating the intermediate summation value based on the mapped tuned one or more respective individual gradient values and the mapped residual gradient value; and mapping the weight based on the set bit number and summing the intermediate summation value and the weight. (the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32-bit structure, col 11 lines 42-52)

	Regarding claim 6, Modified Strom teaches the method of claim 1, wherein the updating of the weight comprises updating a bit digit value of portion of the result of the summing, corresponding to the bit digit representing the weight, to the updated weight, ((the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, col 11 lines 42-52); the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer, col 11 lines 46-50) and 
	wherein the updating of the residual gradient value comprises updating a bit digit value of remaining portion of the result of the summing, not corresponding to the bit digit representing the weight, to the residual gradient value. (When the update value for a particular parameter meets or exceeds the threshold, it can be applied (and sent to the other computing devices), and the residual gradient element for that particular parameter can be cleared (e.g., the update value set to zero or null). In some embodiments, each time a computing device determines a partial gradient for a portion of training data, the partial gradient may be added to the residual gradient. The threshold determination may then be made based on the sum of the partial gradient and the residual gradient, rather than on the newly calculated partial gradient alone. The portions of that sum that do not exceed the threshold (e.g., the individual elements with values close to zero) can then be stored as the new residual gradient, and the process may be repeated as necessary. In this way, updates which may be substantial in aggregate may be retained, while updates which are too small to make a substantial difference to the model, or which may be cancelled by other updates calculated in a subsequent iteration, are not applied, col 3, lines 47-67)  
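	For illustration only, the threshold behavior quoted above from Strom, col 3, lines 47-67 (adding the partial gradient to the residual gradient, applying the elements that meet the threshold, and storing the remainder as the new residual gradient) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `threshold_step`, the use of plain Python lists, the comparison on absolute value, and the convention that update values are added to the parameters are all hypothetical and are not taken from Strom.

```python
def threshold_step(weights, residual, partial_gradient, threshold):
    """One iteration: combine, test against the threshold, apply or retain."""
    applied = {}
    for i, update in enumerate(partial_gradient):
        combined = residual[i] + update   # partial gradient + residual gradient
        if abs(combined) >= threshold:    # assumption: magnitude comparison
            weights[i] += combined        # apply the update to the parameter
            applied[i] = combined         # would also be sent to other nodes
            residual[i] = 0.0             # clear the residual element
        else:
            residual[i] = combined        # retain as the new residual
    return applied


weights = [0.0, 0.0]
residual = [0.0, 0.0]
first = threshold_step(weights, residual, [0.5, 0.125], threshold=1.0)
second = threshold_step(weights, residual, [0.5, 0.125], threshold=1.0)

print(first, second, weights, residual)
# prints: {} {0: 1.0} [1.0, 0.0] [0.0, 0.25]
```

	In this sketch, no update is applied on the first iteration; on the second iteration the first element reaches the threshold in aggregate and is applied, while the second element remains in the residual.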

	Regarding claim 8, Modified Strom teaches the method of claim 1, and Strom further teaches a non-transitory computer-readable recording medium having recorded thereon computer readable instructions, which, when executed by one or more processors, performs the method (The process 600 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A. When the process 600 is initiated, the executable program instructions can be loaded and executed by the one or more computing devices, col 11 lines 15-22).

	Regarding claim 17, Strom teaches a neural network apparatus, the apparatus comprising: one or more processors configured to (a method, process, routine, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor device, or in a combination of the two, col 13, lines 10-14)
	calculate respective individual gradient values (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60)
	to update a weight of a neural network; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	calculate a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values and a bit digit representing the weight; (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2; the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, col 11 lines 42-52)
	tune the respective individual gradient values (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19), 
	to correspond to a bit digit representing the residual gradient value (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48); 
	sum the weight and an intermediate summation value of tuned individual gradient values, and the residual gradient value (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value) 
	and update the weight and the residual gradient value (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	Strom does not explicitly teach based on whether the intermediate summation value overlaps bit digits of the weight to train the neural network.
	Majumdar teaches based on whether the intermediate summation value overlaps bit digits of the weight to train the neural network (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034]).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar, because establishing that deep learning tensors conform to these requirements during training may improve results (Majumdar [0017]).

	Regarding claim 18, Modified Strom teaches the apparatus of claim 21, further comprising a memory storing instructions, which when executed by the one or more processors, configure the one or more processors (The process 500 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A, col 10 lines 12-16)
	to perform the calculation of the one or more respective individual gradient values, (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60) 
	the tuning of the one or more respective individual gradient values, (the collection of updates to the parameters of a model may be referred to as a “gradient” because each update is based on the direction in which the corresponding parameter should be modified (e.g., a value of the parameter is to be increased or decreased by a particular amount) … col 5 lines 29-40; Further aspects of the present, … the updates that exceed the threshold described above are quantized or otherwise compressed in order to further reduce the size (e.g., the number of bits or bytes) of each update value. The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model. In order to retain the entire magnitude of the originally calculated update (e.g., the pre-quantized update values), the quantization error for each of the quantized values is added to the corresponding value of the residual gradient (e.g., the value of the residual gradient that corresponds to the same model parameter) col 4, lines 10-28)
	 the summing, the updating of the residual gradient value, and the selective updating of the weight (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9; At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)  
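	For illustration only, the quantization with error retention quoted above from Strom, col 4, lines 10-28 (quantizing an update value, applying the quantized value, and adding the quantization error to the corresponding residual gradient value) can be sketched as follows. This is a minimal sketch under stated assumptions: the uniform quantizer, the `scale` parameter, and the function names `quantize_int8` and `apply_quantized_update` are hypothetical, and Strom does not prescribe this particular quantizer.

```python
def quantize_int8(value: float, scale: float) -> int:
    """Map a floating point update value to a signed 8-bit code.
    Assumption: uniform quantization with a hypothetical step size `scale`."""
    code = round(value / scale)
    return max(-128, min(127, code))  # clamp to the signed 8-bit range


def apply_quantized_update(weight, residual, combined, scale):
    """Apply the quantized update and keep the quantization error."""
    code = quantize_int8(combined, scale)
    quantized = code * scale           # value actually applied to the parameter
    error = combined - quantized       # magnitude lost by quantizing
    return weight + quantized, residual + error, code


# The residual element is assumed to have been cleared (0.0) before the
# quantization error is added back, as in the threshold step of Strom.
new_weight, new_residual, code = apply_quantized_update(
    weight=2.0, residual=0.0, combined=1.0625, scale=0.25
)

print(new_weight, new_residual, code)
# prints: 3.0 0.0625 4
```

	In this sketch, the sum of the applied quantized value and the retained error equals the original update value, so the entire magnitude of the update is preserved across iterations.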

	Regarding claim 21, Strom teaches a neural network apparatus, the apparatus comprising: one or more processors (The process 500 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A, col 10 lines 12-16)  
	to calculate one or more respective individual gradient values for updating a weight of a neural network; (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60) 
	tune the one or more respective individual gradient values, (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48); 
	each having respective bit digits, to correspond to bit digits of a residual gradient value; (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48); 
	sum the weight and an intermediate summation value of the tuned one or more respective individual gradient values, the residual gradient value, (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8. The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value)
	update the residual gradient value dependent on a result of the summing; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	wherein the residual gradient value is dependent on an accumulating of one or more previous individual gradient values for updating the weight in a previous time (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2)
	Strom does not explicitly teach selectively update the weight dependent on whether the intermediate summation value overlaps bit digit of the weight to train the neural network.
	Majumdar teaches selectively update the weight dependent on whether the intermediate summation value overlaps bit digit of the weight to train the neural network (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034]).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar for the benefit of establishing that deep learning tensors conform to these requirements during training, which may improve results (Majumdar [0017]).
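	For illustration only, the mantissa-overlap point cited from Majumdar [0017] can be sketched in Python. The sketch uses double-precision floats as a stand-in for whatever number format is actually at issue. The function names, the update size of 2^-60, and the flush threshold are hypothetical and are taken from neither Strom nor Majumdar. It shows that an update smaller than the least significant bit of the weight is lost when added directly, but takes effect once it has been accumulated in a residual until it overlaps the bit digits of the weight.

```python
def apply_direct(weight, update, steps):
    """Add a tiny update to the weight on every step, with no residual."""
    for _ in range(steps):
        weight += update  # rounds back to the old weight if update is below half an LSB
    return weight


def apply_with_residual(weight, update, steps, lsb):
    """Accumulate updates in a residual; apply them once they reach the weight's LSB.

    `lsb` is a hypothetical parameter: the spacing between adjacent
    representable values near `weight`.
    """
    residual = 0.0
    for _ in range(steps):
        residual += update
        if abs(residual) >= lsb:
            weight += residual  # the accumulated value now overlaps the weight's bits
            residual = 0.0
    return weight, residual


w0 = 1.0
tiny = 2.0 ** -60   # far below the spacing of doubles near 1.0
lsb = 2.0 ** -52    # spacing of doubles in [1.0, 2.0)

direct = apply_direct(w0, tiny, 1024)
accumulated, leftover = apply_with_residual(w0, tiny, 1024, lsb)

print(direct == w0)        # the direct updates never change the weight
print(accumulated > w0)    # the accumulated updates do
```

	In this sketch the direct path leaves the weight unchanged after 1024 steps, while the residual path moves it by the full accumulated amount.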


6.	Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) and further in view of Strom et al ("Scalable distributed DNN training using commodity GPU cloud computing." Sixteenth Annual Conference of the International Speech Communication Association. 2015., hereinafter “Strom NPL”)	

	Regarding claim 2, Modified Strom teaches the method of claim 1, Strom teaches wherein the updating the residual gradient value (the residual gradient can be the collection of update values from one or more previous iterations of training data processing, col 8 lines 9-11) comprises: 
	determining an effective gradient value dependent on the result of the summing, (A model training node 102A, 102B may determine which individual update values will make a substantial difference in the model. This subset of update values may be referred to as the “salient gradient.” In some embodiments, only those update values that meet or exceed some predetermined or dynamically determined threshold may be included in the salient gradient)
	where the effective gradient value has a value divisible by the least significant bit digit of the weight; (Rather than update every parameter of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied. In some embodiments, a threshold may be chosen such that the number of elements with update values exceeding the threshold, and therefore the number of parameters to be updated, is one or more orders of magnitude smaller than the total number of updates that have been calculated (e.g., 1/100, 1/1000, or 1/10000 of the millions of parameters in the model) col 3 lines 25-34)
	Modified Strom does not explicitly teach updating the residual gradient value by subtracting the effective gradient value from the result of the summing
	Strom NPL teaches updating the residual gradient value (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	by subtracting the effective gradient value from the result of the summing (subtract τ from the residual: gi(r) = gi(r) - τ, pg. 2, right col, Pseudo code. 2.6, step 7, weight delta, τ (as effective gradient value), pg. 2, right col, first para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of modified Strom to incorporate the teachings of Strom NPL for the benefit of reducing bandwidth communication which enables efficient scaling to more parallel GPU nodes (Strom NPL, Abstract)
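	For illustration only, the residual update relied on above (gi(r) = gi(r) - τ) can be sketched in Python as follows. The function name and return convention are hypothetical. The negative-threshold branch is an assumption added for symmetry; only the subtraction of τ is taken from the passage cited in this action.

```python
def threshold_step(residual, gradient, tau):
    """Accumulate one gradient value into the residual and emit +/-tau on a threshold crossing.

    Returns (new_residual, emitted_update). An emitted_update of 0.0 means
    nothing is applied for this parameter on this step.
    """
    residual += gradient           # sum the new gradient with the carried residual
    if residual > tau:
        return residual - tau, tau     # cited step: subtract tau from the residual
    if residual < -tau:
        return residual + tau, -tau    # assumed mirror case for negative values
    return residual, 0.0


# Usage: two small gradients that only cross the threshold in aggregate.
r, sent = threshold_step(0.0, 0.5, tau=1.0)    # below threshold, nothing emitted
r, sent = threshold_step(r, 0.75, tau=1.0)     # 1.25 exceeds tau, so tau is emitted
print(r, sent)
```

	The remainder after the subtraction is carried forward, so no part of the originally computed gradient is discarded.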

7.	Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Chase et al (US5859930) and further in view of Koster et al (US20170316307)  

	Regarding claim 3, Modified Strom teaches the method of claim 1, Strom teaches wherein the tuning of the one or more respective individual gradient values (the collection of updates to the parameters of a model may be referred to as a “gradient” because each update is based on the direction in which the corresponding parameter should be modified (e.g., a value of the parameter is to be increased or decreased by a particular amount) … col 5 lines 29-40; Further aspects of the present, … the updates that exceed the threshold described above are quantized or otherwise compressed in order to further reduce the size (e.g., the number of bits or bytes) of each update value. The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model. In order to retain the entire magnitude of the originally calculated update (e.g., the pre-quantized update values), the quantization error for each of the quantized values is added to the corresponding value of the residual gradient (e.g., the value of the residual gradient that corresponds to the same model parameter) col 4, lines 10-28) comprises: 
	quantizing each of the one or more respective individual gradient values (The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model, col 4 lines17-20; component of the model training node 102A can determine whether a particular update value of a partial gradient (or merged partial and residual gradient) exceeds a threshold or meets some other criteria, col 11 lines 28-32; the update value that was determined above to exceed the threshold or meet other criteria can be quantized to further reduce the amount of data (e.g., the size of the model synchronization data structure) that must be transmitted to other model training nodes 102B-102X, col 11 lines 35- 39)
	including omitting respective values of the one or more respective individual gradient values that are less than a least significant bit digit of the residual gradient value; (In this way, updates which may be substantial in aggregate may be retained, while updates which are too small to make a substantial difference to the model, or which may be cancelled by other updates calculated in a subsequent iteration, are not applied, col 3 lines 66-67 and col 4 lines 1-3) and
	Modified Strom does not explicitly teach padding each of the quantized one or more respective individual gradient values, wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present in each padded quantized one or more respective individual gradient values.
	Chase teaches padding each of the quantized one or more respective individual gradient values; in each padded quantized one or more respective individual gradient values. (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data., col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with Zeros. Col 11, lines 58-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)
	Koster teaches wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present. (while four most significant bits are added [0070])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Koster for the benefit of a result that is computed without causing an overflow (Koster, [0070]) while periodically and dynamically updating weights of the neural network (Koster, [0036])

8.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) and further in view of Chase et al (US5859930)  

	Regarding claim 5, Modified Strom teaches the method of claim 1, Strom teaches wherein the summing (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9) comprises: 
	tuned one or more respective individual gradient values, the residual gradient value, and the weight; (the collection of updates to the parameters of a model may be referred to as a “gradient” because each update is based on the direction in which the corresponding parameter should be modified (e.g., a value of the parameter is to be increased or decreased by a particular amount) … col 5 lines 29-40; Further aspects of the present, … the updates that exceed the threshold described above are quantized or otherwise compressed in order to further reduce the size (e.g., the number of bits or bytes) of each update value. The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model. In order to retain the entire magnitude of the originally calculated update (e.g., the pre-quantized update values), the quantization error for each of the quantized values is added to the corresponding value of the residual gradient (e.g., the value of the residual gradient that corresponds to the same model parameter) col 4, lines 10-28)
	 and summing the padded weight and the padded intermediate summation value of the padded tuned one or more respective individual gradient values, and the padded residual gradient value. (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9, step 410, Fig 4; At block 608, the update value that was determined above to exceed the threshold or meet other criteria can be quantized to further reduce the amount of data (e.g., the size of the model synchronization data structure) that must be transmitted to other model training nodes 102B-102X. The quantization applied by the model training node 102A may include converting the update value to one of a smaller set of values. For example, the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, Fig. 6, col 11, lines 35-53)  
	intermediate summation value (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9, step 410, Fig 4; The Examiner notes that the intermediate summation value is the residual gradient value added to the individual gradient value)
	Modified Strom does not explicitly teach padding.
	Chase teaches padding (padded with zeros, col. 11, line 58-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Chase for the benefit of compensating for the shifting (col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)
	 
9.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) and further in view of Koster et al (US20170316307)  

	Regarding claim 7, Modified Strom teaches the method of claim 1, Modified Strom does not explicitly teach further comprising: obtaining a sign bit that is a Most Significant Bit of the result of the summing; and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit of the updated weight and/or the updated residual gradient value.  
	Koster teaches obtaining a sign bit that is a Most Significant Bit (most significant bits, [0070])
	of the result of the summing (weighted sum [0032]);
	and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit (four most significant bits are added, [0070]) 
	of one of the updated weight and/or the updated residual gradient value. (updating a weighted sum [0032])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Koster for the benefit of a result that is computed without causing an overflow (Koster, [0070]) while periodically and dynamically updating weights of the neural network (Koster, [0036])

10.	Claims 9, 12, 13, 15, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Baker (US20200134451 filed on 06/01/2018) and further in view of Lin et al. (US20160328645)

	Regarding claim 9, Strom teaches a processor implemented neural network method, the method (when using stochastic gradient descent to train a neural network (e.g., a deep neural network), a computing device may compute a gradient (e.g., a set of elements with a separate update value for each parameter of the model) for each input vector of training data, or for some subset of training data, col 3 lines 20-24) comprising: 
	calculating one or more respective individual gradient values for updating a weight of a neural network; (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60)
	tuning the one or more respective individual gradient values, (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19), 
	each having respective bit digits, to correspond to bit digits of a residual gradient value; (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48); 
	summing the tuned one or more respective individual gradient values (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8);
	updating the residual gradient value dependent on a result of the summing; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	wherein the residual gradient value is dependent on an accumulating of one or more previous individual gradient values for updating the weight in a previous time. (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2)
	Strom does not explicitly teach concatenating a remaining value of the residual gradient value, excluding a sign bit, to the weight to calculate an intermediate concatenation value; and the intermediate concatenation value; selectively updating the weight dependent on whether a summation of the tuned one or more respective individual gradient values overlaps bit digits of the weight to train the neural network 
	Majumdar teaches selectively updating the weight dependent on whether a summation of the tuned one or more respective individual gradient values overlaps bit digits of the weight to train the neural network (The computation node described above comprising a decoder which decodes encoded gradients received from other computation nodes, and wherein the processor updates weights of the neural network using the stored gradients and the decoded gradients [0083]) 
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar for the benefit of establishing that deep learning tensors conform to these requirements during training, which may improve results (Majumdar [0017]).
	Modified Strom does not explicitly teach concatenating a remaining value of the residual gradient value, excluding a sign bit to the weight; to calculate an intermediate concatenation value; and the intermediate concatenation value
	Baker teaches concatenating a remaining value of the residual gradient value, to calculate an intermediate concatenation value; (a vector created by first computing a gradient vector for all the arcs leaving each selected node and then forming a longer vector by concatenating the vectors created for each of the selected nodes. [0024])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Baker for the benefit of improving the performance of a network that is already achieving its optimum performance (Baker, [0040])
	Lin teaches a bit digit representing weight (weight represented using 32 bits [0062])
	and excluding a sign bit (removing [0064], most significant bits may be removed by saturation [0064], (most significant bits [0064] as sign bit))
	to the weight (wi,j, represents the weight, [0059], represented using bits, [0062])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Lin for the benefit of reducing software complexity, and/or reduce memory usage (Lin, [0055])

	Regarding claim 12, Modified Strom teaches the method of claim 9, Strom teaches wherein the summing (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9) comprises: 
	mapping the tuned one or more respective individual gradient values; based on a set bit number; and summing the mapped tuned one or more respective individual gradient values and the mapped intermediate concatenation value.  (the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, col 11 lines 42-52) and 
	Modified Strom does not explicitly teach the intermediate concatenation value.
	Baker teaches intermediate concatenation value (a vector created by first computing a gradient vector for all the arcs leaving each selected node and then forming a longer vector by concatenating the vectors created for each of the selected nodes. [0024])
	The motivation to combine set forth above for independent claim 9 applies here.
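	For illustration only, the quantization and storage described in the cited passage of Strom (col 11 lines 42-52: an update value quantized to an 8 bit number, a 24 bit parameter index, and a combined 32 bit structure) can be sketched in Python. The function names, the uniform rounding scheme, the `scale` parameter, and the choice to place the index in the upper 24 bits are all assumptions; Strom states only the bit widths.

```python
def quantize_to_int8(value, scale):
    """Round value to a multiple of `scale` and clamp to the signed 8-bit range.

    `scale` is a hypothetical unit of precision, not a value from Strom.
    """
    q = round(value / scale)
    return max(-128, min(127, q))


def pack_update(index, q):
    """Pack a 24-bit parameter index and a signed 8-bit value into one 32-bit word.

    Assumed layout: index in bits 31..8, two's-complement value in bits 7..0.
    """
    if not 0 <= index < (1 << 24):
        raise ValueError("index does not fit in 24 bits")
    if not -128 <= q <= 127:
        raise ValueError("value does not fit in 8 bits")
    return (index << 8) | (q & 0xFF)


def unpack_update(word):
    """Recover (index, signed 8-bit value) from a packed 32-bit word."""
    index = word >> 8
    q = word & 0xFF
    if q >= 128:          # undo the two's-complement encoding
        q -= 256
    return index, q


# Usage: quantize one update value and round-trip it through the packed form.
q = quantize_to_int8(-0.375, scale=0.125)
word = pack_update(5, q)
print(unpack_update(word))
```

	Any mapping of update values to a smaller predefined set would fit the cited passage equally well; uniform rounding is used here only because it is the simplest to show.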

	Regarding claim 13, Modified Strom teaches the method of claim 9, Strom teaches a non-transitory computer-readable recording medium having recorded thereon computer readable instructions, which, when executed by one or more processors, causes the one or more processors to perform the method (The process 600 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A. When the process 600 is initiated, the executable program instructions can be loaded and executed by the one or more computing devices, col 11 lines 15-22).

	Regarding claim 15, Modified Strom teaches the method of claim 9, Strom teaches wherein the updating of the weight comprises: updating a bit digit value of portion of the result of the summing, corresponding to the bit digit representing the weight, to the updated weight, (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer, col 11 lines 46-50) and
	wherein the updating of the residual gradient value comprises updating a bit digit value of remaining portion of the result of the summing, not corresponding to the bit digit representing the weight, to the residual gradient value.  (When the update value for a particular parameter meets or exceeds the threshold, it can be applied (and sent to the other computing devices), and the residual gradient element for that particular parameter can be cleared (e.g., the update value set to zero or null). In some embodiments, each time a computing device determines a partial gradient for a portion of training data, the partial gradient may be added to the residual gradient. The threshold determination may then be made based on the sum of the partial gradient and the residual gradient, rather than on the newly calculated partial gradient alone. The portions of that sum that do not exceed the threshold (e.g., the individual elements with values close to zero) can then be stored as the new residual gradient, and the process may be repeated as necessary. In this way, updates which may be substantial in aggregate may be retained, while updates which are too small to make a substantial difference to the model, or which may be cancelled by other updates calculated in a subsequent iteration, are not applied, col 3, lines 47-67)  

	Regarding claim 19, Strom teaches a neural network apparatus, the apparatus comprising: one or more processors (The process 500 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A, col 10 lines 12-16) configured  
	to calculate one or more respective individual gradient values for updating a weight of the neural network, (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60) 
	tune the one or more respective individual gradient values, (The quantized gradient values are then applied to the parameters of the local copy of the model, col 4, lines 17-19), 
	 each having respective bit digits, to correspond to bit digits of a residual gradient value, (the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number, col 11, lines 45-48);
	sum the tuned one or more respective individual gradient values (In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter (as weight)) of the residual gradient, col 8, lines 5-8);
	wherein the residual gradient value is dependent on an accumulating of one or more previous individual gradient values for updating the weight in a previous time. (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2)
	update the residual gradient value dependent on a result of the summing (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4)
	Strom does not explicitly teach concatenating a remaining value of the residual gradient value, excluding a sign bit, to the weight to calculate an intermediate concatenation value; and the intermediate concatenation value; selectively updating the weight dependent on whether a summation of the tuned one or more respective individual gradient values overlaps bit digits of the weight.
	Majumdar teaches selectively updating the weight dependent on whether a summation of the tuned one or more respective individual gradient values overlaps bit digits of the weight (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034]).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar for the benefit of establishing that deep learning tensors conform to these requirements during training, which may improve results (Majumdar [0017]).
	Baker teaches concatenating a remaining value of the residual gradient value, to calculate an intermediate concatenation value; (a vector created by first computing a gradient vector for all the arcs leaving each selected node and then forming a longer vector by concatenating the vectors created for each of the selected nodes. [0024])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Baker for the benefit of improving the performance of a network that is already achieving its optimum performance (Baker, [0040])
	Lin teaches a bit digit representing weight (weight represented using 32 bits [0062])
	and excluding a sign bit (removing [0064], most significant bits may be removed by saturation [0064], (most significant bits [0064] as sign bit))
	to the weight (wi,j, represents the weight, [0059], represented using bits, [0062])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Lin for the benefit of reducing software complexity, and/or reduce memory usage (Lin, [0055])

	Regarding claim 20, Modified Strom teaches the apparatus of claim 19, Strom teaches further comprising a memory storing instructions, which when executed by the one or more processors, configure the one or more processors to perform (The process 500 may be embodied in a set of executable program instructions stored on non-transitory computer-readable media, such as short-term or long-term memory of one or more computing devices associated with a model training node 102A, col 10 lines 12-16)
	 the calculation of the one or more respective individual gradient values, (At block 408, the gradient computation module 122 or some other module or component of the model training node 102A can generate a partial gradient based on the output vector 306 …., col 7 lines 45-60) 
	the tuning of the one or more respective individual gradient values, (the collection of updates to the parameters of a model may be referred to as a “gradient” because each update is based on the direction in which the corresponding parameter should be modified (e.g., a value of the parameter is to be increased or decreased by a particular amount) … col 5 lines 29-40; Further aspects of the present, … the updates that exceed the threshold described above are quantized or otherwise compressed in order to further reduce the size (e.g., the number of bits or bytes) of each update value. The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model. In order to retain the entire magnitude of the originally calculated update (e.g., the pre-quantized update values), the quantization error for each of the quantized values is added to the corresponding value of the residual gradient (e.g., the value of the residual gradient that corresponds to the same model parameter) col 4, lines 10-28)
	the summing, (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9, step 410, Fig 4; At block 608, the update value that was determined above to exceed the threshold or meet other criteria can be quantized to further reduce the amount of data (e.g., the size of the model synchronization data structure) that must be transmitted to other model training nodes 102B-102X. The quantization applied by the model training node 102A may include converting the update value to one of a smaller set of values. For example, the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, Fig. 6, col 11, lines 35-53)  
	the updating of the residual gradient value, (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold. If so, the process 400 may proceed to block 414; col 8, lines 32-37, step 412-414, Fig. 4) and 
	Strom does not explicitly teach the selective updating of the weight or the concatenating of the remaining value.
	Majumdar teaches the selective updating of the weight (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training [0017]; tensors in the primary neural network, such as weights [0034])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Majumdar because establishing that deep learning tensors conform to these requirements during training may improve results (Majumdar, [0017]).
	Baker teaches concatenating a remaining value (a vector created by first computing a gradient vector for all the arcs leaving each selected node and then forming a longer vector by concatenating the vectors created for each of the selected nodes. [0024])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Baker for the benefit of improving the performance of a network that is already achieving its optimum performance (Baker, [0040])

11.	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Baker (US20200134451 filed on 06/01/2018) in view of Lin et al (US20160328645) and further in view of Strom et al ("Scalable distributed DNN training using commodity GPU cloud computing." Sixteenth Annual Conference of the International Speech Communication Association. 2015., hereinafter “Strom NPL”)

	Regarding claim 10, Modified Strom teaches the method of claim 9, Strom teaches wherein the updating of the residual gradient value (the residual gradient can be the collection of update values from one or more previous iterations of training data processing, col 8 lines 9-11) comprises: 
	determining an effective gradient value dependent on the result of the summing, (A model training node 102A, 102B may determine which individual update values will make a substantial difference in the model. This subset of update values may be referred to as the “salient gradient.” In some embodiments, only those update values that meet or exceed some predetermined or dynamically determined threshold may be included in the salient gradient)
	 where the effective gradient value has a value divisible by the least significant bit digit of the weight (Rather than update every parameter of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied. In some embodiments, a threshold may be chosen such that the number of elements with update values exceeding the threshold, and therefore the number of parameters to be updated, is one or more orders of magnitude smaller than the total number of updates that have been calculated (e.g., 1/100, 1/1000, or 1/10000 of the millions of parameters in the model) col 3 lines 25-34) and
	Modified Strom does not explicitly teach updating the residual gradient value by subtracting the effective gradient value from the result of the summing.
	Strom NPL teaches updating the residual gradient value (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	 by subtracting the effective gradient value from the result of the summing (subtract τ to the residual: gi(r) = gi(r) - τ, pg. 2, right col, Pseudo code. 2.6, step 7, weight delta, τ (as effective gradient value), pg. 2, right col, first para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Strom NPL for the benefit of reducing communication bandwidth, which enables efficient scaling to more parallel GPU nodes (Strom NPL, Abstract).
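
	For illustration only, the residual update cited from Strom NPL (accumulate into the residual, then subtract τ when the residual exceeds it) can be sketched as follows. This is a minimal sketch and not code from the reference: the function name `threshold_step`, the list-based storage, and the mirrored handling of negative values are assumptions made here.

```python
def threshold_step(residual, gradient, tau):
    """One accumulate-and-threshold pass over per-parameter values.

    residual, gradient: lists of floats of equal length (hypothetical layout).
    tau: positive threshold, also used as the transmitted weight delta.
    Returns (updates, new_residual), where updates maps index -> +tau or -tau.
    """
    updates = {}
    new_residual = []
    for i, (r, g) in enumerate(zip(residual, gradient)):
        r = r + g                  # add the new gradient to the residual
        if r > tau:
            updates[i] = tau       # send +tau for this parameter
            r -= tau               # subtract tau from the residual
        elif r < -tau:             # assumed mirror case for negative values
            updates[i] = -tau
            r += tau
        new_residual.append(r)
    return updates, new_residual


updates, new_residual = threshold_step([0.0, 0.0, 0.0], [3.0, -0.5, -2.5], 2.0)
print(updates, new_residual)
```

	In this sketch the portion of each value that is not sent stays in the residual, so no part of the accumulated magnitude is discarded.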

12.	Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Baker (US20200134451 filed on 06/01/2018) in view of Lin et al (US20160328645) in view of Chase et al (US5859930) and further in view of Koster et al (US20170316307)  

	Regarding claim 11, Modified Strom teaches the method of claim 9, Strom teaches wherein the tuning of the one or more respective individual gradient values (the collection of updates to the parameters of a model may be referred to as a “gradient” because each update is based on the direction in which the corresponding parameter should be modified (e.g., a value of the parameter is to be increased or decreased by a particular amount) … col 5 lines 29-40; Further aspects of the present, … the updates that exceed the threshold described above are quantized or otherwise compressed in order to further reduce the size (e.g., the number of bits or bytes) of each update value. The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model. In order to retain the entire magnitude of the originally calculated update (e.g., the pre-quantized update values), the quantization error for each of the quantized values is added to the corresponding value of the residual gradient (e.g., the value of the residual gradient that corresponds to the same model parameter) col 4, lines 10-28) comprises: 
	quantizing each of the one or more respective individual gradient values, (The quantized gradient values are then applied to the parameters of the local copy of the model and are also transmitted to the other computing devices for application to the respective copies of model, col 4 lines 17-20; component of the model training node 102A can determine whether a particular update value of a partial gradient (or merged partial and residual gradient) exceeds a threshold or meets some other criteria, col 11 lines 28-32; the update value that was determined above to exceed the threshold or meet other criteria can be quantized to further reduce the amount of data (e.g., the size of the model synchronization data structure) that must be transmitted to other model training nodes 102B-102X, col 11 lines 35-39)
	including omitting respective values of the one or more respective individual gradient values that are less than a least significant bit digit of the residual gradient value; (In this way, updates which may be substantial in aggregate may be retained, while updates which are too small to make a substantial difference to the model, or which may be cancelled by other updates calculated in a subsequent iteration, are not applied, col 3 lines 66-67 and col 4 lines 1-3) and
	Modified Strom does not explicitly teach padding each of the quantized one or more respective individual gradient values, wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present in each padded quantized one or more respective individual gradient values.
	Chase teaches padding each of the quantized one or more respective individual gradient values; in each padded quantized one or more respective individual gradient values. (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col 11, lines 58-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, lines 56-57) in a robust fast pattern recognizer (Chase, col 1, lines 5-6).
	Koster teaches wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present. (while four most significant bits are added [0070])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and of periodically and dynamically updating weights of the neural network (Koster, [0036]).

13.	Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Baker (US20200134451 filed on 06/01/2018) in view of Lin et al (US20160328645) and further in view of Chase et al (US5859930) 

	Regarding claim 14, Modified Strom teaches the method of claim 9, Strom teaches wherein the summing (the partial gradient 314 may be stored in the residual data store 132 or otherwise combined with the residual gradient. In adding the partial gradient 314 to the residual gradient, each individual update value from the partial gradient 314 can be added to the corresponding update value (e.g. the value for the same model parameter) of the residual gradient col 8 lines 4-9) comprises:
	Modified Strom does not explicitly teach padding the tuned one or more respective individual gradient values and the intermediate concatenation value; and summing the padded tuned one or more respective individual gradient values and the padded intermediate concatenation value.  
	Chase teaches padding the tuned one or more respective individual gradient values (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col 11, lines 58-60)
	and summing (summing circuit which sums the signals, col 8, lines 58-61)
	the padded tuned one or more respective individual gradient values (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col 11, lines 44-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, lines 56-57) in a robust fast pattern recognizer (Chase, col 1, lines 5-6).
	Regarding the intermediate concatenation value and the padded intermediate concatenation value, Baker teaches the intermediate concatenation value and the padded intermediate concatenation value (a vector created by first computing a gradient vector for all the arcs leaving each selected node and then forming a longer vector by concatenating the vectors created for each of the selected nodes. [0024])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Baker for the benefit of improving the performance of a network that is already achieving its optimum performance (Baker, [0040])

14.	Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Majumdar et al (US20190042945) in view of Baker (US20200134451 filed on 06/01/2018) in view of Lin et al (US20160328645) and further in view of Koster et al (US20170316307)  

	Regarding claim 16, Modified Strom teaches the method of claim 9. Modified Strom does not explicitly teach the method further comprising: obtaining a sign bit that is a Most Significant Bit of the result of the summing; and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit of the updated weight and/or the updated residual gradient value.
	Koster teaches obtaining a sign bit that is a Most Significant Bit (most significant bits, [0070])
	of the result of the summing (weighted sum [0032]);
	and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit (four most significant bits are added, [0070]) 
	of one of the updated weight and/or the updated residual gradient value. (updating a weighted sum [0032])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and of periodically and dynamically updating weights of the neural network (Koster, [0036]).

15.	Claims 22, 24-26 are rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Swartzlander et al ("Digital neural network implementation." Eleventh Annual International Phoenix Conference on Computers and Communication [1992 Conference Proceedings]. IEEE, 1992.)

	Regarding claim 22, Strom teaches a processor-implemented neural network method, the method (when using stochastic gradient descent to train a neural network (e.g., a deep neural network), a computing device may compute a gradient (e.g., a set of elements with a separate update value for each parameter of the model) for each input vector of training data, or for some subset of training data, col 3 lines 20-24) comprising: 
	calculating an individual gradient value for updating a weight of a neural network; accumulating one or more of the calculated individual gradient values for updating the weight; (the gradient computation module 122 may be configured to compute a partial gradient 314 (as individual gradient) that includes a collection of updates to the individual parameters of the model 304, col 7 lines 56- 59; component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, col 7 lines 65-67) 
	determining whether a value, of the accumulated one or more of the calculated individual gradient values, meets a threshold; (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold, col 8, lines 32-36, Fig. 4) and 
	in response to the determining of whether the result of the accumulating meets the threshold being that the result of the accumulating meets the threshold, updating the weight dependent on the result of the accumulating, (Rather than update every parameter of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied. In some embodiments, a threshold may be chosen such that the number of elements with update values exceeding the threshold, and therefore the number of parameters to be updated, col 3, lines 25-31 and col. 8, line 62-col. 9, line 8: block 414)
	Strom does not explicitly teach wherein the threshold is a least significant bit digit of the weight.
	Swartzlander teaches wherein the threshold is a least significant bit digit of the weight (the least significant bit of the weights comprising the weight vector W and the threshold, pg. 0723, right col, last para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Strom to incorporate the teachings of Swartzlander for the benefit of implementing parallel counters with up to 1022 inputs as required to realize multi-layer neural networks with up to 1000 neurons per layer (Swartzlander, abstract).

	Regarding claim 24, Modified Strom teaches the method of claim 22, Strom teaches wherein, in response to the determining of whether the result of the accumulating meets the threshold being that the result of the accumulating does not meet the threshold, calculating at least one additional individual gradient value for updating the weight, summing the result of the accumulating and the calculated at least one additional individual gradient value, (When the update value for a particular parameter meets or exceeds the threshold, it can be applied (and sent to the other computing devices), and the residual gradient element for that particular parameter can be cleared (e.g., the update value set to zero or null). In some embodiments, each time a computing device determines a partial gradient for a portion of training data, the partial gradient may be added to the residual gradient. The threshold determination may then be made based on the sum of the partial gradient and the residual gradient, rather than on the newly calculated partial gradient alone. The portions of that sum that do not exceed the threshold (e.g., the individual elements with values close to zero) can then be stored as the new residual gradient, and the process may be repeated as necessary. In this way, updates which may be substantial in aggregate may be retained, while updates which are too small to make a substantial difference to the model, or which may be cancelled by other updates calculated in a subsequent iteration, are not applied, col 3, lines 47-67) and 
	selectively, dependent on whether the result of the summing meets the threshold, updating the weight to add the result of the summing to the weight. (Rather than update every parameter (weight) of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied., col 3 lines 25-28) 

	Regarding claim 25, Modified Strom teaches the method of claim 22, wherein, when the result of the accumulating meets the threshold, the updating of the weight dependent on the result of the accumulating includes updating the weight based on a portion of the result of the accumulating that is equal to or greater than the threshold, (At decision block 412, the model synchronization module 124 or some other module or component can determine, for a given model parameter, whether a particular update value (the combined update value from the partial gradient and the residual gradient) meets or exceeds a threshold.  If so, the process 400 may proceed to block 414, col 8, lines 32-36, Fig. 4) and
	 setting a residual gradient value to include a remaining portion of the result of the accumulating that is less than the threshold, and for when the result of the accumulating does not meet the threshold, ( In some cases, updates from many iterations may continue to be stored in the residual gradient when, e.g., they are so small that they do not meet the necessary threshold, even in aggregate, col 8, lines 17-20) 
	the method further comprises: calculating another individual gradient value for updating the weight of the neural network; additionally accumulating another one or more of the calculated other individual gradient values for updating the weight and the residual gradient value; (In still other cases, many small updates that are stored in the residual gradient may eventually exceed the necessary threshold in aggregate and may be included in the salient gradient accordingly. In such cases, the residual gradient value that corresponds to that particular parameter may be set to zero, null, or the like, col 8, lines 25-31) and
	 selectively, dependent on whether a result of the additional accumulating meets the threshold, updating the weight dependent on a corresponding portion of the result of the additional accumulating that is equal to or greater than the threshold (Rather than update every parameter (weight) of the model based on the gradient, only those elements with update values meeting or exceeding a threshold, or meeting some other criteria, may be applied., col 3 lines 25-28)
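
	For illustration only, the accumulate, threshold, and residual behavior described in the cited passages of Strom (col 3, lines 25-67; col 8, lines 17-37) can be sketched as follows. This is a minimal sketch under stated assumptions, not code from the reference: the function name `apply_salient_updates`, the use of plain lists, the comparison on magnitude, and the choice to add the update to the weight are all assumptions made here.

```python
def apply_salient_updates(weights, residual, partial, threshold):
    """Combine a partial gradient with the residual and apply salient updates.

    Elements whose combined value meets the threshold (by magnitude, an
    assumption) are applied to the weight and their residual is cleared.
    All other combined values are kept as the new residual.
    """
    new_weights = list(weights)
    new_residual = []
    for i, (r, p) in enumerate(zip(residual, partial)):
        combined = r + p                  # partial gradient plus residual
        if abs(combined) >= threshold:
            new_weights[i] += combined    # update applied to the weight
            new_residual.append(0.0)      # residual element cleared
        else:
            new_residual.append(combined) # too small: retained for later
    return new_weights, new_residual


w, r = apply_salient_updates([1.0, 1.0], [0.25, 0.25], [0.5, 0.125], 0.5)
w, r = apply_salient_updates(w, r, [0.0, 0.125], 0.5)
print(w, r)
```

	The second call shows a small update that did not meet the threshold on its own being applied once the accumulated value reaches it.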
	
	Regarding claim 26, Strom teaches the method of claim 22, wherein the calculating of the individual gradient value for updating the weight of the neural network is performed in a first-precision number system that generates at least one gradient value less than the threshold, while the weight corresponds to a second-precision number system different from the first-precision number system. (the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number (first-precision number system). The index of the parameter (as weight) to which the quantized update is to be applied may remain a 24 bit integer (as second-precision number system), col 11, lines 42-50) 
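
	For illustration only, the 32 bit structure described in the cited passage of Strom (a 24 bit parameter index combined with an 8 bit quantized value, col 11, lines 45-52) can be sketched as follows. The bit layout (index in the upper 24 bits, value in the lower 8 bits), the unsigned encoding, and the names `pack_update` and `unpack_update` are assumptions made here; the reference does not specify them.

```python
def pack_update(index, quantized):
    """Pack a 24-bit index and an 8-bit quantized value into one 32-bit word."""
    if not 0 <= index < (1 << 24):
        raise ValueError("index must fit in 24 bits")
    if not 0 <= quantized < (1 << 8):
        raise ValueError("quantized value must fit in 8 bits")
    return (index << 8) | quantized   # assumed layout: index high, value low


def unpack_update(word):
    """Recover (index, quantized) from a packed 32-bit word."""
    return word >> 8, word & 0xFF


word = pack_update(5, 200)
print(word, unpack_update(word))
```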

16.	Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Swartzlander et al ("Digital neural network implementation." Eleventh Annual International Phoenix Conference on Computers and Communication [1992 Conference Proceedings]. IEEE, 1992.) and further in view of Majumdar et al (US20190042945)

	Regarding claim 23, Strom teaches the method of claim 22, wherein the updating of the weight includes updating the weight using all bit values of the result of the accumulating (For example, an integer value may be associated with each update value. The integer value can indicate the index or identifier of the model parameter (as weight) to which the update value is to be applied, col 9, lines 24-27) and 
	wherein the method further comprises setting a residual gradient value, for consideration in a subsequent determination of whether to update the weight, dependent on remaining bit values of the result of the accumulating that do not overlap the bit digits of the weight. (At block 410, the model synchronization module 124 or some other module or component of the model training node 102A can add the partial gradient to the residual gradient, aggregate values of individual elements of the partial gradient and residual gradient, or otherwise compute values based on the partial gradient and residual gradient, col 7, lines 64-67, col 8, lines 1-2; the model training node 102 may use a mapping of ranges of update values to a predefined set of values; the model training node 102 may round each value to some unit of precision; etc. In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number, and it may be quantized to an 8 bit number. The index of the parameter to which the quantized update is to be applied may remain a 24 bit integer. Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, col 11 lines 42-52)
	Modified Strom does not explicitly teach updating the weight using all bit values of the result of the accumulating that overlap bit digits of the weight, 
	Majumdar teaches updating the weight using all bit values of the result of the accumulating that overlap bit digits of the weight (For performing operations such as adding gradient updates to weights, there may be sufficient mantissa overlap between tensors, putting additional requirements on number of bits needed to represent values in training, as compared to inference [0017])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Majumdar because establishing that deep learning tensors conform to these requirements during training may improve results (Majumdar, [0017]).

17.	Claim 27 is rejected under 35 U.S.C. 103 as being unpatentable over Strom (US10152676) in view of Swartzlander et al ("Digital neural network implementation." Eleventh Annual International Phoenix Conference on Computers and Communication [1992 Conference Proceedings]. IEEE, 1992.) and further in view of Bigioi et al (WO2017129325)

	Regarding claim 27, Strom teaches the method of claim 26, wherein the weight is a floating point value (In one specific, non-limiting embodiment, the un-quantized update value may be a 32 or 64 bit floating point number (first-precision number system), and it may be quantized to an 8 bit number. The index (exponent) of the parameter (as weight) to which the quantized update is to be applied may remain a 24 bit integer (as second-precision number system). Therefore, the combination of index and quantized value for a given parameter may be stored as a 32 bit structure, col 11, lines 45-52) and 
	Modified Strom does not explicitly teach the least significant bit of the weight is a least significant bit of an exponent part of the weight and dependent on a bias.
	Bigioi teaches the least significant bit of the weight is a least significant bit of an exponent part of the weight and dependent on a bias (a weight compression technique can be used to reduce the size of the fully connected layer weights and so the memory access requirements for transferring to and/or storing weight values in the weights cache 37 … This pruning of small valued weights has the effect of removing (pruning) a corresponding connection from the neural network. Furthermore, such encoding can take advantage of floating point values of minus zero and subnormal which can all be zeroed, whereas NaN (not a number) and positively/negative infinite values can be saturated to the largest positive/negative valid value as per the table below: 

    [Table image from Bigioi (media_image1.png) not reproduced]

pg. 9, lines 26-34, pg. 10, lines 1-5; In a standard FP representation, the default exponent bias is computed as 2^(exp-1) - 1 where exp is the number of bits used for exponent representation (4 in this case); this bias is subtracted from the binary representation of the exponent, leading to the actual exponent; so for 4 bits for exponent, the range of values for the exponent value is from 0 to 15D; subtracting the bias (7D in this case) leads to actual exponent values from -7D to 8D; an exponent equal with -7D means a subnormal weight; an exponent of 8D is not a number (NaN) in FP representation; therefore, the actual range of possible exponents is from -6D to 7D (i.e. 2^-6 : 2^7). This is a symmetric representation which may be used to create a balance between representing small and large numbered weight values, pg. 9, lines 6-15)
(Examiner notes: the least significant bit of the weight value above is 0 and the least significant bit of the exponent part is also 0)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Strom to incorporate the teachings of Bigioi for the benefit of using a weight compression technique which can be used to reduce the size of the fully connected layer weights (Bigioi, pg. 9, lines 26-27).
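
	For illustration only, the exponent-bias arithmetic quoted from Bigioi (pg. 9, lines 6-15) can be reproduced as follows. This is a minimal sketch of that arithmetic alone; the function name `exponent_range` and its return layout are assumptions made here and do not appear in the reference.

```python
def exponent_range(exp_bits):
    """Return (bias, raw_range, usable_range) for an exponent field.

    bias         = 2^(exp_bits - 1) - 1, as in the quoted passage
    raw_range    = stored exponent values (0 .. 2^exp_bits - 1) minus the bias
    usable_range = raw_range with the lowest value (subnormal) and the
                   highest value (NaN) excluded, per the quoted passage
    """
    bias = 2 ** (exp_bits - 1) - 1
    stored_min, stored_max = 0, 2 ** exp_bits - 1
    raw_min, raw_max = stored_min - bias, stored_max - bias
    return bias, (raw_min, raw_max), (raw_min + 1, raw_max - 1)


print(exponent_range(4))  # 4 exponent bits, as in the quoted passage
```

	For 4 exponent bits this yields a bias of 7, actual exponent values from -7 to 8, and a usable range from -6 to 7, matching the values stated in the quoted passage.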

Conclusion
	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/M.G./Examiner, Art Unit 2121                                                                                                                                                                                                        



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121