DETAILED ACTION
1.	This office action is in response to the Application No. 16249279 filed on 7/04/2018. Claims 1-20 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


3.	Claims 1-8 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
	Claim 1 recites “calculating a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values and a bit digit representing the weight”. It is unclear whether the bit digit representing the weight is also accumulated together with the respective individual gradient values, or whether it is a second basis for calculating the residual gradient value. For the purpose of examination, the examiner will interpret the limitation as calculating the residual gradient value based on the accumulated gradient value and on a bit digit representing the weight.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


4.	Claims 1, 6, 8, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199).

	Regarding claim 1, Chen teaches a processor-implemented neural network method, the method comprising (the Adaptive Residual Gradient Compression (AdaComp) scheme, abstract) 
	calculating respective individual gradient values (we plot the values of the gradient (dW), pg. 6, right col, second para.);
 	to update a weight of a neural network; (weight-update step, pg. 4, left col, first para.)
	calculating a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	and a bit digit representing the weight;  (16-bits of representation would be needed for larger LT sizes, pg. 6, left col, second to the last para.)
	tuning the respective individual gradient values (adaptively adjusts compression ratios that provide automatic tuning of the compression ratio, pg. 3, right col, second para., gradient (dW) as individual gradient, pg. 6 right col, last full para.) 
	to correspond to a bit digit (LT is a representation of bits, pg. 6, left col, second to the last para., and Residual Gradients (RG) LT = 200 corresponds to individual gradient LT = 200, Fig. 5, pg. 7)
	of the residual gradient value; (Residual Gradient (RG) and dW is accumulated into RG, pg. 6, right col, last full para.)
	summing the tuned respective individual gradient values, the residual gradient value, (the residue is computed as the sum of the previous residue and the latest gradient value, pg. 3, right col, first para.)
	and updating the weight and the residual gradient value based on a result of the summing to train the neural network (weight update in deep multi-layer perceptrons pg. 2, left col, last para., and additional residues (as residual gradients) in the set of values to be sent is centrally updated (pg. 3, right col, first para.))
	Chen did not explicitly teach summing the tuned respective individual gradient values, the residual gradient value, and the weight;
	Patel teaches summing the tuned respective individual gradient values, the residual gradient value, and the weight; (gradients (individual gradients) may be aggregated on model server 210, which may add the aggregated values to the base set of weights [0074])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen to incorporate the teachings of Patel for the benefit of maximizing the usefulness of such parameter updates which are based on the aggregate of many device parameters that result in some known statistic (Patel, [0050])

	Regarding claim 6, Chen modified by Patel teaches the method of claim 1, Chen teaches wherein the updating comprises: updating (centrally updated, pg. 2, right col, first para.)
	a bit digit value (8-bits could be used effectively, pg. 6, left col, second to the last para.)
	of the result of the summing corresponding to the bit digit representing the weight to the updated weight, (weight-update, pg. 4, left col, first para.) 
	and updating a bit digit value (16-bits of representation would be needed for larger sizes, pg. 6, left col, second to the last para.)
	of the result of the summing not corresponding to the bit digit representing the weight to the residual gradient value. (comprising of gradients that have not yet been updated centrally, pg. 2, right col, first para.)
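
	As an illustrative sketch only, the split recited in claim 6 between bit digits that the weight can represent and bit digits that it cannot may be pictured with integer fixed-point arithmetic. The function name `sum_and_split`, the parameter `shift`, and the convention that all values are integers in units of the finest (residual) step are assumptions made for illustration; they are not taken from Chen, Patel, or the application.

```python
def sum_and_split(weight, residual, grads, shift):
    """Sum gradients, residual, and weight, then split the result.

    All values are integers in units of the finest representable step.
    The weight is assumed to hold only multiples of 2**shift (hypothetical layout).
    """
    total = weight + residual + sum(grads)
    # Bit digits at or above `shift` go to the updated weight.
    new_weight = (total >> shift) << shift
    # Bit digits below `shift` are carried forward as the updated residual.
    new_residual = total - new_weight
    return new_weight, new_residual


print(sum_and_split(64, 3, [5, 9], 4))  # (80, 1)
```

	Under these assumptions the total of 81 is divided into 80, which the weight can hold, and 1, which remains in the residual for a later update.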

	Regarding claim 8, Chen modified by Patel teaches the method of claim 1, Chen teaches a non-transitory computer-readable recording medium having recorded thereon (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)

	Regarding claim 17, Chen teaches a neural network apparatus, the apparatus comprising: one or more processors configured to (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)
	calculate respective individual gradient values (we plot the values of the gradient (dW), pg. 6, right col, second para.);
	to update a weight of a neural network; (weight-update step, pg. 4, left col, first para.)
	calculate a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	and a bit digit representing the weight;  (16-bits of representation would be needed for larger LT sizes, pg. 6, left col, second to the last para.)
	tune the respective individual gradient values (adaptively adjusts compression ratios that provide automatic tuning of the compression ratio, pg. 3, right col, second para., gradient (dW) as individual gradient, pg. 6 right col, last full para.)
	to correspond to a bit digit (LT is a representation of bits, pg. 6, left col, second to the last para., and Residual Gradients (RG) LT = 200 corresponds to individual gradient LT = 200, Fig. 5, pg. 7)
	of the residual gradient value; (Residual Gradient (RG) and dW is accumulated into RG, pg. 6, right col, last full para.)
	sum the tuned individual gradient values, the residual gradient value, (the residue is computed as the sum of the previous residue and the latest gradient value, pg. 3, right col, first para.)
	and update the weight and the residual gradient value based on a result of the summing to train the neural network (weight update in deep multi-layer perceptrons pg. 2, left col, last para., and additional residues (as residual gradients) in the set of values to be sent is centrally updated (pg. 3, right col, first para.))
	Chen did not explicitly teach sum the tuned respective individual gradient values, the residual gradient value, and the weight;
	Patel teaches summing the tuned respective individual gradient values, the residual gradient value, and the weight; (gradients (individual gradients) may be aggregated on model server 210, which may add the aggregated values to the base set of weights [0074])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen to incorporate the teachings of Patel for the benefit of maximizing the usefulness of such parameter updates which are based on the aggregate of many device parameters that result in some known statistic (Patel, [0050])

	Regarding claim 18, Chen modified by Patel teaches the apparatus of claim 17, Chen teaches a memory storing instructions, which, when executed by one or more processors to perform (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)
	the calculating of the respective individual gradient values, (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg. 2, right col, first para.)
	the calculating of the residual gradient value, (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	the tuning of the respective individual gradient values (adaptively adjusts compression ratios that provide automatic tuning of the compression ratio, pg. 3, right col, second para., gradient (dW) as individual gradient, pg. 6 right col, last full para.)
 	the summing, and the updating of the weight and the residual gradient value. (weight update in deep multi-layer perceptrons pg. 2, left col, last para., and additional residues (as residual gradients) in the set of values to be sent is centrally updated (pg. 3, right col, first para.))

5.	Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) and further in view of Strom et al ("Scalable distributed DNN training using commodity GPU cloud computing." Sixteenth Annual Conference of the International Speech Communication Association. 2015.)

	Regarding claim 2, Chen modified by Patel teaches the method of claim 1, Chen teaches wherein the calculating of the residual gradient value (each learner maintains an accumulated gradient (that we refer to as residual gradients) pg. 2, right col, first para.) comprises:
	However, they do not explicitly teach determining a value of the accumulated gradient value summable to the bit digit representing the weight as an effective gradient value; and calculating the residual gradient value by subtracting the effective gradient value from the accumulated gradient value.
	Strom teaches determining a value of the accumulated gradient value summable (gi(r) as accumulated gradient value; add τ to the residual: gi(r) = gi(r) + τ, pg. 2, right col, Pseudo code. 2.6, step 7)
	to the bit digit (+ τ , pg. 2, right col, Pseudo code. 2.6, step 7) as 1 bit (pg. 2, right col, first para.)
	representing the weight as an effective gradient value; (weight delta, τ(as effective gradient value), pg. 2, right col, first para.)
	and calculating the residual gradient value  (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	by subtracting the effective gradient value (subtract τ from the residual: gi(r) = gi(r) - τ, pg. 2, right col, Pseudo code. 2.6, step 7; weight delta, τ (as effective gradient value), pg. 2, right col, first para.)
	from the accumulated gradient value. (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel to incorporate the teachings of Strom for the benefit of empirical results that found that 1-bit quantization is sufficient and carries no significant degradation in either accuracy or convergence speed (Strom, pg. 2, right col, first para.)
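
	As an illustrative sketch only, a threshold update of the kind cited from Strom's pseudocode can be written as follows: a new gradient is accumulated into the residual, a fixed delta of +τ or -τ is emitted once the residual passes the threshold, and the emitted delta is removed from the residual. The function name `threshold_step`, the parameter name `tau`, and the integer example values are assumptions for illustration, and the sketch is not a reproduction of Strom's code.

```python
def threshold_step(residual, grad, tau):
    """Accumulate one gradient into the residual and emit a delta if warranted.

    Returns (emitted_delta, new_residual). All names are hypothetical.
    """
    residual = residual + grad          # accumulated gradient value
    if residual > tau:
        return tau, residual - tau      # emit +tau, subtract it from the residual
    if residual < -tau:
        return -tau, residual + tau     # emit -tau, add tau back to the residual
    return 0, residual                  # below threshold: keep accumulating


print(threshold_step(3, 2, 4))    # (4, 1)
print(threshold_step(-3, -2, 4))  # (-4, -1)
print(threshold_step(1, 1, 4))    # (0, 2)
```

	In each case the emitted delta plus the new residual equals the accumulated value, which is the subtraction relationship relied upon above.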

6.	Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) in view of Koster et al (US20170316307)  and further in view of Chase et al (US5859930)

	Regarding claim 3, Chen modified by Patel teaches the method of claim 1, Chen teaches wherein the tuning of the respective individual gradient values (adaptively adjusts compression ratios that provide automatic tuning, pg. 3, right col, second to the last para.) comprises: 
	quantizing the respective individual gradient values, (gradient values that exceed a given threshold are quantized, pg. 2, left col, last para.)
	Chen modified by Patel does not explicitly teach wherein a value of an individual gradient value less than a least significant bit digit of the residual gradient value is omitted; and padding the quantized respective individual gradient values, wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present.
	Koster teaches wherein a value of an individual gradient value less than a least significant bit digit of the residual gradient value is omitted; (four least significant bits are removed from the result [0070]) 
	wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present. (while four most significant bits are added [0070])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and of periodically and dynamically updating weights of the neural network (Koster, [0036])
	Chen modified by Patel did not explicitly teach padding the quantized respective individual gradient values.
	Chase teaches padding the quantized respective individual gradient values (each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 58-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel and further modified by Koster to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)
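
	As an illustrative sketch only, the two operations recited in claim 3 (omitting the part of a gradient below the residual's least significant bit digit, then padding up to the residual's most significant bit digit) may be pictured on an unsigned fixed-point magnitude. The function name `tune_gradient`, the exponent parameters, and the bit-string output are assumptions for illustration and are not taken from Chen, Koster, Chase, or the application.

```python
def tune_gradient(grad, grad_lsb, res_lsb, res_msb):
    """Quantize and zero-pad an unsigned gradient magnitude.

    grad     -- non-negative integer whose lowest bit has weight 2**grad_lsb
    res_lsb  -- exponent of the residual's least significant bit digit
    res_msb  -- exponent of the residual's most significant bit digit
    Assumes grad_lsb <= res_lsb and that the quantized value fits the width.
    """
    drop = res_lsb - grad_lsb            # bits lying below the residual's LSB
    quantized = grad >> drop             # those bits are omitted
    width = res_msb - res_lsb + 1        # bit digits spanned by the residual
    return format(quantized, "b").zfill(width)  # zero-pad up to the MSB


print(tune_gradient(0b101101, -6, -4, 3))  # 00001011
```

	Here the two lowest bits of the gradient are dropped and four leading zeros are added so that the gradient occupies the same eight bit positions as the residual.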

7.	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) and further in view of Alistarh et al (US20180075347)

	Regarding claim 4, Chen modified by Patel teaches the method of claim 1, Chen teaches wherein the summing comprises: and the residual gradient value (each learner maintains an accumulated gradient (that we refer to as residual gradients) pg. 2, right col, first para.)
	However, they do not teach mapping the tuned respective individual gradient values for the summing based on a set bit number and calculating an intermediate summation value; and mapping the weight based on the bit number and summing the intermediate summation value and the weight.
	Alistarh teaches mapping (decides which stochastic gradients to set to zero and which to map to non-zero values [0017]) 
	the tuned (tuning parameter is being used [0017])
	respective individual gradient values (individual ones of the gradient [0071])
	for the summing (summing the gradients [0054])
	based on a set bit number (setting individual ones of the gradients to zero (which means the bit is 0) [0071]) 
	and calculating (calculated for individual ones of the stochastic gradients [0017])
	an intermediate summation value; (intermediate values of the stochastic gradient [0041])
	and mapping the weight (loss function which is a set of weights of the neural network that enable the output of the neural network to match the ground truth data [0019])
	based on the bit number (bit is zero [0058])
	and summing (summing the gradients [0054])
	the intermediate summation value (intermediate values of the stochastic gradient [0041])
	and the weight. (update the weights by summing the gradients [0054])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel to incorporate the teachings of Alistarh for the benefit of a tuning parameter that controls a tradeoff between compression and training time (Alistarh, [0032])

8.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) in view of Alistarh et al (US20180075347) and further in view of Chase et al (US5859930)

	Regarding claim 5, Chen modified by Patel and further modified by Alistarh teaches the method of claim 4, Chen teaches wherein the summing comprises: the residual gradient value, (each learner maintains an accumulated gradient (that we refer to as residual gradients) pg. 2, right col, first para.)
	Alistarh teaches intermediate summation value (intermediate values of the stochastic gradient [0041]; update the weights by summing the gradients [0054])
	However, they do not explicitly teach padding the tuned respective individual gradient values, and the weight, wherein a value is mapped to all bit digits; and summing the padded tuned respective individual gradient values, the padded intermediate summation value, and the padded weight.
	Chase teaches padding (padded with zeros, col. 11, line 58-60)
	the tuned (tuned to a predetermined match, col. 2, line 24-25)
	respective individual gradient values, (each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 58-60)
	and the weight, (bits, (as weights), col 6, line 33-35)
	wherein a value is mapped (signals values will be mapped into, col. 6, lines 24-38)
	to all bit digits; (from four bits to two bits, col. 6, line 24-38)
	and summing (summing circuit which sums the signals, col. 8, lines 58-61)
	the padded tuned respective individual gradient values, (each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 44-60)
	the padded intermediate summation value, (are padded with zeros, col. 11, lines 44-60)
	and the padded weight. (bits, (as weights), col 6, line 33-35)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel and further modified by Alistarh to incorporate the teachings of Chase for the benefit of compensating for the shifting (col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)
	
9.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) and further in view of Koster et al (US20170316307)  

	Regarding claim 7, Chen modified by Patel teaches the method of claim 1. However, they did not explicitly teach obtaining a sign bit that is a Most Significant Bit of the result of the summing; and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit of one of the updated weight and/or the updated residual gradient value.
	Koster teaches obtaining a sign bit that is a Most Significant Bit (most significant bits, [0070])
	of the result of the summing (weighted sum [0032]);
	and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit (four most significant bits are added, [0070]) 
	of one of the updated weight and/or the updated residual gradient value. (updating a weighted sum [0032])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and of periodically and dynamically updating weights of the neural network (Koster, [0036])

10.	Claims 9, 13, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) and further in view of Lin et al (US20160328645)

	Regarding claim 9, Chen teaches a processor-implemented neural network method, the method comprising (the Adaptive Residual Gradient Compression (AdaComp) scheme, abstract) 
	calculating respective individual gradient values (we plot the values of the gradient (dW), pg. 6, right col, second para.);
 	for updating a weight of a neural network; (weight-update step, pg. 4, left col, first para.)
	calculating a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	and a bit digit representing the weight; (16-bits of representation would be needed for larger LT sizes, pg. 6, left col, second to the last para.)
	tuning (automatic tuning, pg. 3, right col, second para)
	the respective individual gradient values (gradient (dW) as individual gradient, pg. 6 right col, last full para.)
	to correspond to a bit digit (LT is a representation of bits, pg. 6, left col, second to the last para., and Residual Gradients (RG) LT = 200 corresponds to individual gradient LT = 200, Fig. 5, pg. 7)
	of the residual gradient value; (Residual Gradient (RG) and dW is accumulated into RG, pg. 6, right col, last full para.)
	summing the tuned respective individual gradient values, and the residual gradient value, (the residue is computed as the sum of the previous residue and the latest gradient value, pg. 3, right col, first para.)
	and updating the weight and the residual gradient value based on a result of the summing to train the neural network (weight update in deep multi-layer perceptrons pg. 2, left col, last para., and additional residues (as residual gradients) in the set of values to be sent is centrally updated (pg. 3, right col, first para.))
	They did not explicitly teach concatenating a remaining value of the residual gradient value excluding a sign bit to the weight and calculating an intermediate concatenation value;
	Pleiss teaches concatenating a remaining value of the residual gradient value (concatenated feature gradients, pg. 4, second para.)
	and calculating (computing, pg. 2, second para.)
	an intermediate concatenation value; (intermediate feature maps are the outputs of concatenation operations, pg. 2, first para.; computing intermediate results, pg. 2, second para., which are intermediate concatenation values)
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen to incorporate the teachings of Pleiss for the benefit of reducing memory consumption using intermediate results that are relatively cheap to compute (Pleiss, pg. 2, second para.)
	They do not explicitly teach excluding a sign bit to the weight 
	Lin teaches a bit digit representing weight (weight represented using 32 bits [0062])
	and excluding (removing [0064], most significant bits may be removed by saturation [0064])
	a sign bit (most significant bits [0064] as sign bit)
	to the weight (wi,j represents the weight, [0059], represented using 32 bits, [0062])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss to incorporate the teachings of Lin for the benefit of reducing software complexity, and/or reduce memory usage (Lin, [0055])
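
	As an illustrative sketch only, the concatenation recited in claim 9 (the residual gradient value, excluding its sign bit, appended to the weight) may be pictured on bit strings. The function name `concat_weight_residual` and the convention that the first character of the residual's bit string is its sign bit are assumptions for illustration; they are not taken from Chen, Pleiss, Lin, or the application.

```python
def concat_weight_residual(weight_bits, residual_bits):
    """Append the residual's magnitude bits to the weight's bits.

    Both arguments are bit strings; residual_bits[0] is assumed to be the sign bit.
    """
    magnitude = residual_bits[1:]   # remaining value, excluding the sign bit
    return weight_bits + magnitude  # intermediate concatenation value


print(concat_weight_residual("0101", "0011"))  # 0101011
```

	Under this convention the concatenated value carries the weight in its upper bit digits and the residual's magnitude in its lower bit digits.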

	Regarding claim 13, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9, Chen teaches a non-transitory computer-readable recording medium having recorded thereon computer readable instructions, which, when executed by one or more processors, performs the method (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)

	Regarding claim 15, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9, Chen teaches wherein the updating comprises: updating (centrally updated, pg. 2, right col, first para.)
	a bit digit value (8-bits could be used effectively, pg. 6, left col, second to the last para.)
	of the result of the summing corresponding to the bit digit representing the weight to the updated weight, (weight-update, pg. 4, left col, first para.) 
	and updating a bit digit value (16-bits of representation would be needed for larger sizes, pg. 6, left col, second to the last para.)
	of the result of the summing not corresponding to the bit digit representing the weight to the residual gradient value. (comprising of gradients that have not yet been updated centrally, pg. 2, right col, first para.)

11.	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) in view of Lin et al (US20160328645) and further in view of Strom et al ("Scalable distributed DNN training using commodity GPU cloud computing." Sixteenth Annual Conference of the International Speech Communication Association. 2015.)

	Regarding claim 10, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9, Chen teaches wherein the calculating of the residual gradient value (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg. 2, right col, first para.) comprises:
	However, they do not explicitly teach determining a value of the accumulated gradient value summable to the bit digit representing the weight as an effective gradient value; and calculating the residual gradient value by subtracting the effective gradient value from the accumulated gradient value.
	Strom teaches determining a value of the accumulated gradient value summable (gi(r) as accumulated gradient value; add τ to the residual: gi(r) = gi(r) + τ, pg. 2, right col, Pseudo code. 2.6, step 7)
	to the bit digit (+ τ , pg. 2, right col, Pseudo code. 2.6, step 7) as 1 bit (pg. 2, right col, first para.)
	representing the weight as an effective gradient value; (weight delta, τ(as effective gradient value), pg. 2, right col, first para.)
	and calculating the residual gradient value  (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	by subtracting the effective gradient value (subtract τ from the residual: gi(r) = gi(r) - τ, pg. 2, right col, Pseudo code. 2.6, step 7; weight delta, τ (as effective gradient value), pg. 2, right col, first para.)
	from the accumulated gradient value. (gi(r), pg. 2, right col, Pseudo code. 2.6, step 7)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss and further modified by Lin to incorporate the teachings of Strom for the benefit of empirical results that found that 1-bit quantization is sufficient and carries no significant degradation in either accuracy or convergence speed (Strom, pg. 2, right col, first para.)

12.	Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) in view of Lin et al (US20160328645) in view of Koster et al (US20170316307) and further in view of Chase et al (US5859930)

	Regarding claim 11, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9, Chen teaches wherein the tuning of the respective individual gradient values (adaptively adjusts compression ratios that provide automatic tuning, pg. 3, right col, second to the last para.) comprises: 
	quantizing the respective individual gradient values, (gradient values that exceed a given threshold are quantized, pg. 2, left col, last para.)
	Chen modified by Pleiss and further modified by Lin does not explicitly teach wherein a value of an individual gradient value less than a least significant bit digit of the residual gradient value is omitted; and padding the quantized respective individual gradient values, wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present.
	Koster teaches wherein a value of an individual gradient value less than a least significant bit digit of the residual gradient value is omitted; (four least significant bits are removed from the result [0070]) 
	wherein a value up to a bit digit corresponding to a most significant bit digit of the residual gradient value is present. (while four most significant bits are added [0070])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss modified by Lin to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and of periodically and dynamically updating weights of the neural network (Koster, [0036])
	They do not explicitly teach padding the quantized respective individual gradient values.
	Chase teaches padding the quantized respective individual gradient values (each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 58-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss modified by Lin and further modified by Koster to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)

13.	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) in view of Lin et al (US20160328645) and further in view of Alistarh et al (US20180075347)

	Regarding claim 12, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9, Chen teaches wherein the summing comprises: (each learner maintains an accumulated gradient (that we refer to as residual gradients) pg. 2, right col, first para.)
	Pleiss teaches the intermediate concatenated value (intermediate feature maps are the outputs of concatenation operations, pg. 2, first para.) generating intermediate results (computing, pg. 2, second para., which are intermediate concatenated values)
	However, they do not teach mapping the tuned respective individual gradient values based on a set bit number and summing the respective tuned individual gradient values and the intermediate concatenated value.
	Alistarh teaches mapping (decides which stochastic gradients to set to zero and which to map to non-zero values [0017]) 
	the tuned (tuning parameter is being used)
	respective individual gradient values (individual ones of the gradient [0071])
	based on a set bit number (setting individual ones of the gradients to zero (which means the bit is 0) [0071]) 
	and summing (summing the gradients [0054])
	the respective tuned (tuning parameter is being used)
	individual gradient values (individual ones of the gradient [0071])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss and further modified by Lin to incorporate the teachings of Alistarh for the benefit of a tuning parameter that controls a tradeoff between compression and training time (Alistarh, [0032]).

14.	Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) in view of Lin et al (US20160328645) in view of Alistarh et al (US20180075347) and further in view of Chase et al (US5859930)

	Regarding claim 14, Chen modified by Pleiss modified by Lin and further modified by Alistarh teaches the method of claim 12, Chen teaches wherein the summing (each learner maintains an accumulated gradient (that we refer to as residual gradients) pg. 2, right col, first para.) comprises: 
	Pleiss teaches the intermediate concatenation value; (intermediate feature maps are the outputs of concatenation operations, pg. 2, first para.) generating intermediate results (computing, pg. 2, second para, which are intermediate concatenate values)
	However, they do not explicitly teach padding the tuned respective individual gradient values, wherein a value is mapped to all bit digits; and summing the padded tuned respective individual gradient values, the padded
	Chase teaches padding (padded with zeros, col. 11, line 58-60)
	the tuned (tuned to a predetermined match, col. 2, line 24-25)
	respective individual gradient values, (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 58-60)
	wherein a value is mapped (signals values will be mapped into, col. 6, lines 24-38)
	to all bit digits; (from four bits to two bits, col. 6, line 24-38) 
	and summing (summing circuit which sums the signals, col. 8, lines 58-61)
	the padded tuned respective individual gradient values, (Each sub-matrix 232 includes 2M rows and 2,048 columns of gradient data, col. 11, lines 51-52, and the first two-hundred fifty-six (256) columns of sub-matrix 232-1 are padded with zeros, col. 11, lines 44-60)
	the padded (are padded with zeros. Col 11, lines 44-60)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss modified by Lin and further modified by Alistarh to incorporate the teachings of Chase for the benefit of compensating for the shifting (Chase, col 11, line 56-57) in a robust fast pattern recognizer (Chase, col 1, line 5-6)

15.	Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) in view of Lin et al (US20160328645) in view of Koster et al (US20170316307)

	Regarding claim 16, Chen modified by Pleiss and further modified by Lin teaches the method of claim 9. However, they did not explicitly teach obtaining a sign bit that is a Most Significant Bit of the result of the summing; and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit of one of the updated weight and/or the updated residual gradient value.
	Koster teaches obtaining a sign bit that is a Most Significant Bit (most significant bits, [0070])
	of the result of the summing (weighted sum [0032]);
	and adding the obtained sign bit such that the obtained sign bit is a Most Significant Bit (four most significant bits are added, [0070]) 
	of one of the updated weight and/or the updated residual gradient value. (updating a weighted sum [0032])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Pleiss and further modified by Lin to incorporate the teachings of Koster for the benefit of computing a result without causing an overflow (Koster, [0070]) and periodically and dynamically updating weights of the neural network (Koster, [0036]).

16.	Claims 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al ("AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training." arXiv preprint arXiv:1712.02679 (2017)) in view of Patel et al (US20160307199) in view of Pleiss et al. ("Memory-efficient implementation of densenets." arXiv preprint arXiv:1707.06990 (2017)) and further in view of Lin et al (US20160328645)

	Regarding claim 19, Chen teaches a neural network apparatus, the apparatus comprising: one or more processors configured to (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)
	calculate respective individual gradient values (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.) 
	for updating a weight of the neural network (weight update in deep multi-layer perceptrons, pg. 2, left col, last para.)
	calculate a residual gradient value based on an accumulated gradient value obtained by accumulating the respective individual gradient values (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	a bit digit representing the weight, (16-bits of representation would be needed for larger LT sizes, pg. 6, left col, second to the last para.)
	tuning (automatic tuning, pg. 3, right col, second para)
	the respective individual gradient values (gradient (dW) as individual gradient, pg. 6 right col, last full para.)
	to correspond to a bit digit (LT is a representation of bits, pg. 6, left col, second to the last para., and Residual Gradients (RG) LT = 200 corresponds to individual gradient LT = 200, Fig. 5, pg. 7)
	representing the residual gradient value; (Residual Gradient (RG) and dW is accumulated into RG, pg. 6, right col, last full para.)
	Chen did not explicitly teach sum the tuned individual gradient values, the residual gradient value, and the weight; and update the weight and the residual gradient value based on a result of the summing.
	Patel teaches sum the tuned individual gradient values, the residual gradient value, and the weight; and update the weight and the residual gradient value based on a result of the summing (gradients (individual gradients) may be aggregated on model server 210, which may add the aggregated values to the base set of weights [0074])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen to incorporate the teachings of Patel for the benefit of maximizing the usefulness of such parameter updates which are based on the aggregate of many device parameters that result in some known statistic (Patel, [0050])
	Chen did not explicitly teach concatenate a remaining value of the residual gradient value and calculating an intermediate concatenation value;
	Pleiss teaches concatenate a remaining value of the residual gradient value (concatenated feature gradients, pg. 4, second para.)
	and calculating (computing pg. 2, second para.)
	an intermediate concatenation value; (intermediate feature maps are the outputs of concatenation operations, pg. 2, first para.) generating intermediate results (computing, pg. 2, second para, which are intermediate concatenate values)
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel to incorporate the teachings of Pleiss for the benefit of reducing memory consumption using intermediate results that are relatively cheap to compute (Pleiss, pg. 2, second para.).
	They do not explicitly teach excluding a sign bit to the weight 
	Lin teaches a bit digit representing weight (weight represented using 32 bits [0062])
	and excluding (removing [0064], most significant bits may be removed by saturation [0064])
	a sign bit (most significant bits [0064] as sign bit)
	to the weight (wi,j represents the weight, [0059], represented using bits, [0062])
	It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Chen modified by Patel modified by Pleiss to incorporate the teachings of Lin for the benefit of reducing software complexity, and/or reduce memory usage (Lin, [0055])

	Regarding claim 20, Chen modified by Patel modified by Pleiss and further modified by Lin teaches the apparatus of claim 19, Chen teaches a memory storing instructions, which, when executed by one or more processors to perform (Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory, pg. 4 left col, second para.)
	the calculating of the respective individual gradient values, (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	the calculating of the residual gradient value, (each learner maintains an accumulated gradient (that we refer to as residual gradients), pg 2, right col, first para.)
	the tuning (automatic tuning, pg. 3, right col, second para)
	the respective individual gradient values (gradient (dW) as individual gradient, pg. 6 right col, last full para.)
	the summing, and the updating of the weight and the residual gradient value. (weight update in deep multi-layer perceptrons pg. 2, left col, last para., and additional residues (as residual gradients) in the set of values to be sent is centrally updated (pg. 3, right col, first para.))
	Pleiss teaches concatenating a remaining value (concatenated feature gradients, pg. 4, second para.)
	and calculating (computing pg. 2, second para.)
	of the intermediate concatenating value; (intermediate feature maps are the outputs of concatenation operations, pg. 2, first para.) generating intermediate results (computing, pg. 2, second para, which are intermediate concatenate values)

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/M.G./Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121