



P.O. Box 1450, Alexandria, Virginia 22313-1450 – WWW.USPTO.GOV



   Examiner’s Detailed Office Action   

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Michael Powell, Registration No. 61,942, on 12/30/2021.

Proposed Claims:
1. (Currently Amended)	A computer-implemented method for adaptive residual gradient compression for training of a deep learning neural network (DNN), the computer-implemented method comprising:
obtaining, by a processor of a first learner of a plurality of learners, a current gradient vector for a neural network layer of the DNN, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer that are calculated by training the neural network layer of the DNN using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises:
	receiving the training data comprising a plurality of input samples; 
	determining a mini-batch from the training data;
	performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and
	updating one or more gradient weights for the DNN based on the current gradient vector;
generating, by the processor, a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector;
generating, by the processor, a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN; 
transmitting, by the processor, the compressed current residue vector to a second learner of the plurality of learners; and
updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.
2. (Original)	The computer-implemented method of claim 1, wherein generating the compressed current residue vector comprises:
generating, by the processor, a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector;
dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size;
identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; 
determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and 
upon identifying, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantized value for the given residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.
3. (Original)	The computer-implemented method of claim 2, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.
4. (Original) 	The computer-implemented method of claim 2, wherein: 
the DNN includes one or more convolution network layers; and  
the size of the plurality of bins is set to 50 for the one or more convolution layers.
5. (Original) 	The computer-implemented method of claim 2, wherein: 
the DNN includes one or more fully connected layers; and
the size of the bins is set to 500 for the one or more fully connected layers.
6. (Currently Amended)	A system for adaptive residual gradient compression for training of a deep learning neural network (DNN), the system comprising a plurality of learners, wherein at least one learner of the plurality of learners is configured to perform a method comprising:
obtaining a current gradient vector for a neural network layer of the DNN, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer that are calculated by training the neural network layer of the DNN using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises:  
	receiving the training data comprising a plurality of input samples; 
	determining a mini-batch from the training data;
	performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and
	updating one or more gradient weights for the DNN based on the current gradient vector;
generating a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector;
generating a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN; 
transmitting the compressed current residue vector to a second learner of the plurality of learners; and
updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.
7. (Original)	The system of claim 6, wherein generating the compressed current residue vector comprises:
generating, by the processor, a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector;
dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size;
identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; 
determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and 
upon identifying, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantized value for the given residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.
8. (Original)	The system of claim 7, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.
9. (Original)	The system of claim 7, wherein:
the DNN includes one or more convolution network layers; and 
the size of the plurality of bins is set to 50 for the one or more convolution layers.
10. (Original)	The system of claim 7, wherein:
the DNN includes one or more fully connected layers; and
the size of the bins is set to 500 for the one or more fully connected layers.
11. (Currently Amended)	A computer program product for adaptive residual gradient compression for training of a deep learning neural network (DNN), the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of at least a first learner of a plurality of learners to cause the first learner to perform a method comprising:
obtaining a current gradient vector for a neural network layer of the DNN, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer that are calculated by training the neural network layer of the DNN using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises:  
	receiving the training data comprising a plurality of input samples; 
	determining a mini-batch from the training data;
	performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and
	updating one or more gradient weights for the DNN based on the current gradient vector;
generating a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector;
generating a compressed current residue vector based, at least in part, on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN;
transmitting the compressed current residue vector to a second learner of the plurality of learners; and
updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.
12. (Original)	The computer program product of claim 11, wherein generating the compressed current residue vector comprises:
generating, by the processor, a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector;
dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size;
identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; 
determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and
upon determining, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantized value for the residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.
13. (Original)	The computer program product of claim 12, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.
14. (Original)	The computer program product of claim 12, wherein:
the DNN includes one or more convolution network layers; and 
the size of the plurality of bins is set to 50 for the one or more convolution layers.
15. (Original)	The computer program product of claim 12, wherein:
the DNN includes one or more fully connected layers; and
the size of the bins is set to 500 for the one or more fully connected layers.
16. (Currently Amended)	A computer-implemented method for training a deep learning neural network (DNN) via adaptive residual gradient compression, the computer-implemented method comprising:
receiving, by a system comprising a plurality of learners, training data for training of the DNN using one or more neural network layers;
generating, at each learner of the plurality of learners, a current gradient vector for a neural network layer, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer, wherein the gradient weights of parameters of the neural network layer are calculated by training the neural network layer using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises:  
	receiving the training data comprising a plurality of input samples; 
	determining a mini-batch from the training data;
	performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and
	updating one or more gradient weights for the DNN based on the current gradient vector;
generating, at each learner of the plurality of learners, a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector;
generating, at each learner of the plurality of learners, a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN;
exchanging the compressed current residue vectors among the plurality of learners;
decompressing, at each of the plurality of learners, the compressed current residue vectors; and
updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.
17. (Original)	The computer-implemented method of claim 16, wherein generating the compressed current residue vector comprises:
generating a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector;
dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size;
identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; 
determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and
upon determining, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantized value for the residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.
18. (Original)	The computer-implemented method of claim 17, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.
19. (Original)	The computer-implemented method of claim 17, wherein:
the DNN includes one or more convolution network layers; and 
the size of the plurality of bins is set to 50 for the one or more convolution layers.
20. (Original)	The computer-implemented method of claim 17, wherein:
the DNN includes one or more fully connected layers; and
the size of the bins is set to 500 for the one or more fully connected layers.
21. (Currently Amended)	A system for training a deep learning neural network (DNN) via adaptive residual gradient compression, the system comprising a plurality of learners, wherein the system is configured to perform a method comprising:
receiving training data for training of the DNN using one or more neural network layers;
generating, at each learner of the plurality of learners, a current gradient vector for a neural network layer, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer, wherein the gradient weights of parameters of the neural network layer are calculated by training the neural network layer using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises:  
	receiving the training data comprising a plurality of input samples; 
	determining a mini-batch from the training data;
	performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and
	updating one or more gradient weights for the DNN based on the current gradient vector;
generating, at each learner of the plurality of learners, a current residue vector comprising residual gradient weights for the mini-batch, wherein computing the current residue vector comprises summing a prior residue vector and the current gradient vector;
generating, at each learner of the plurality of learners, a compressed current residue vector based, at least in part, on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN;
exchanging the compressed current residue vectors among the plurality of learners;
decompressing, at each of the plurality of learners, the compressed current residue vectors; and
updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.
22. (Original)	The system of claim 21, wherein generating the compressed current residue vector comprises:
generating a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector;
dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size;
identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; 
determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and 
upon determining, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantized value for the residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.
23. (Original)	The system of claim 22, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.
24. (Original) 	The system of claim 22, wherein:
the DNN includes one or more convolution network layers; and 
the size of the plurality of bins is set to 50 for the one or more convolution layers.
25. (Original)	The system of claim 22, wherein:
the DNN includes one or more fully connected layers; and
the size of the bins is set to 500 for the one or more fully connected layers.








END of claims’ amendment
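For illustration only, the compression steps recited in claims 1 and 2 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `compress_residue`, the default scaling parameter of 2.0, the sign-based quantizer with a single shared magnitude, and the rule for carrying the untransmitted remainder forward are all hypothetical choices made here. The claims themselves specify only bins of a uniform size (for example 50 or 500 per claims 4 and 5), a per-bin local maximum, and a scaling parameter.

```python
from typing import List, Tuple


def compress_residue(
    prior_residue: List[float],
    gradient: List[float],
    bin_size: int,
    scale: float = 2.0,  # assumed value; the claims only recite "a scaling parameter"
) -> Tuple[List[int], List[float], List[float]]:
    """Return (selected indices, quantized values, residue carried forward)."""
    n = len(gradient)
    # Current residue: prior residue plus the current gradient (claim 1).
    current = [r + g for r, g in zip(prior_residue, gradient)]
    # Scaled residue: prior residue plus the scaled gradient (claim 2).
    scaled = [r + scale * g for r, g in zip(prior_residue, gradient)]

    selected: List[int] = []
    for start in range(0, n, bin_size):
        end = min(start + bin_size, n)
        # Local maximum of the absolute residue within this bin.
        local_max = max(abs(v) for v in current[start:end])
        for i in range(start, end):
            # Keep an entry only if its scaled residue exceeds the bin's local maximum.
            if abs(scaled[i]) > local_max:
                selected.append(i)

    # Assumed quantizer: one shared magnitude carrying the sign of each entry.
    # The mean absolute value minimizes the squared error of such a quantizer.
    magnitude = sum(abs(current[i]) for i in selected) / len(selected) if selected else 0.0
    quantized = [magnitude if current[i] > 0 else -magnitude for i in selected]

    # Assumed carry-forward: unsent entries stay whole, sent entries keep their error.
    next_residue = list(current)
    for i, q in zip(selected, quantized):
        next_residue[i] = current[i] - q
    return selected, quantized, next_residue
```

Only the selected indices and their quantized values would need to be transmitted to the second learner in this sketch, which is where the reduction in communication volume comes from.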
1.	Claims 1-25 are allowed.	

REASONS FOR ALLOWANCE
2.	The following is an Examiner’s statement for reasons for allowance: 

3.	Claims 1-25 are considered allowable because, when the claims are read in light of the specification, as per MPEP §2111.01 or Toro Co. v. White Consolidated Industries Inc., 199 F.3d 1295, 1301, 53 USPQ2d 1065, 1069 (Fed. Cir. 1999), none of the references of record, alone or in combination, discloses or suggests the combination of limitations specified in the independent claims.
4.	The limitations at issue are recited in independent claims 1, 6, and 11: “…generating a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN.”
5.	The corresponding limitations are recited in independent claims 16 and 21: “…generating, at each learner of the plurality of learners, a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN.”
6.  For claims 1-25, the closest prior art, Dryden et al. ("Communication Quantization for Data-parallel Training of Deep Neural Networks"), teaches dividing a data set into mini-batches. However, no prior art of record covers the claim limitations recited above.
7.	When taken in context, the claims as a whole were not uncovered in the prior art; i.e., the dependent claims are allowed as they depend upon an allowable independent claim.
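As an illustration of the exchanging, decompressing, and updating steps recited in independent claims 16 and 21, the following Python sketch shows one way a learner might consume the compressed residue vectors it receives. It is a sketch under stated assumptions only: the sparse (indices, values) packet format, the averaging across learners, the plain gradient-descent step, and the names `decompress` and `apply_exchanged_updates` are hypothetical and are not taken from the claims or the record.

```python
from typing import List, Sequence, Tuple

# Assumed wire format: indices of the transmitted entries and their quantized values.
Compressed = Tuple[List[int], List[float]]


def decompress(packet: Compressed, length: int) -> List[float]:
    """Expand a sparse packet into a dense vector that is zero elsewhere."""
    indices, values = packet
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def apply_exchanged_updates(
    weights: List[float],
    packets: Sequence[Compressed],
    learning_rate: float,
) -> List[float]:
    """Average the decompressed vectors from all learners and take one descent step."""
    total = [0.0] * len(weights)
    for packet in packets:
        for i, v in enumerate(decompress(packet, len(weights))):
            total[i] += v
    # Assumed update rule: plain gradient descent on the averaged vector.
    return [w - learning_rate * t / len(packets) for w, t in zip(weights, total)]
```

Because every learner applies the same function to the same set of exchanged packets, each learner would arrive at the same updated parameters in this sketch.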

8.	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments regarding Statement of Reasons for Allowance.”


Correspondence Information
9.  Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABABACAR SECK, whose telephone number is (571) 270-7146.  The examiner can normally be reached Monday-Friday, 8:00 A.M.-6:00 P.M.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at 571-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.









/ABABACAR SECK/Examiner, Art Unit 2122