Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to the application filed on May 4, 2018. Claims 1-32 are pending in the application and have been examined.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/12/2018 and 12/09/2019 have been filed. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Drawings
The drawings are objected to because FIG. 5A is illegible (see [0062]: “The left-most (yellow) vertical line in the FIG. 5A diagram”; “the values in the band between the two (yellow and blue) vertical lines”; “there are many values to the left of the right-hand (blue) vertical line”; “a significant number of values to the left of the left-hand (yellow) vertical line”). The drawings are further objected to because there is no block 680 in FIG. 6A. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 5, 6, 7, 8, 22, 24, and 25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 5 recites the limitation “the scaling factor.”
There is insufficient antecedent basis for this limitation in this claim.
Claim 6 recites “log2(2^15 - x)”, but x is not defined.

Dependent claims 7 and 8 are rejected as they do not cure the deficiencies of the claims they depend from.

Claim 22 recites the limitation “the scale factor.”
There is insufficient antecedent basis for this limitation in this claim.

Claim 24 recites “The method of claim 22”
There is insufficient antecedent basis for this limitation in this claim as claim 22 does not recite a method, but is directed to a computer arrangement.

Claim 24 recites the limitations “the compensating” and “said second hyperparameter.”
There is insufficient antecedent basis for these limitations in this claim as claim 22 does not recite “compensating” or any “hyperparameter”.

Claim 25 recites “The method of claim 22”
There is insufficient antecedent basis for this limitation in this claim as claim 22 does not recite a method, but is directed to a computer arrangement.

Claim 25 recites the limitations “the compensating” and “said second hyperparameter.”
There is insufficient antecedent basis for these limitations in this claim as claim 22 does not recite “compensating” or any “hyperparameter”.

Claim 24 is interpreted as “The method of claim 23 wherein the compensating comprises multiplying or dividing the computed gradients by said second hyperparameter.”

Claim 25 is interpreted as “The method of claim 23 wherein the compensating comprises modifying a learning rate in accordance with said second hyperparameter.”
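
For illustration only, the two interpreted compensation options can be sketched as follows. This is a minimal sketch under assumptions that are not in the claims: plain gradient descent is used as the weight update, and the hypothetical parameter `compensation` stands in for “said second hyperparameter” and is assumed to equal the factor by which the gradients were scaled. All function names are hypothetical.

```python
def update_by_dividing_gradient(weight, scaled_grad, learning_rate, compensation):
    # Claim 24 as interpreted: divide the computed gradient by the hyperparameter.
    return weight - learning_rate * (scaled_grad / compensation)


def update_by_modifying_learning_rate(weight, scaled_grad, learning_rate, compensation):
    # Claim 25 as interpreted: fold the hyperparameter into the learning rate.
    return weight - (learning_rate / compensation) * scaled_grad


# Toy values (assumed): a gradient of 0.5 that was scaled by 512.
w_a = update_by_dividing_gradient(1.0, 256.0, 0.5, 512.0)
w_b = update_by_modifying_learning_rate(1.0, 256.0, 0.5, 512.0)
```

Under these assumptions both routes produce the same updated weight, which is why the two claims are read as alternative forms of the same compensation.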

Dependent claims are also rejected as they do not cure the deficiencies of claims they depend from.



Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-32 are rejected under 35 USC 101 because the claimed invention is directed to an abstract idea without significantly more. 

Regarding claim 1,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… compute gradients based on a loss value;
… scale said loss value 
… adjusting weights based on gradients computed using said reduced precision mode;
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
… selectively operate in reduced precision computation mode
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
for training a neural network
 “For training a neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
The claim further recites the following additional elements:
numerical computation circuit
reduced precision selector 
The circuit and selector are generic computer components, amounting to merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, or that provide an inventive concept, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.). Mere instructions to apply an exception using a generic computer component (circuit and selector) cannot provide an inventive concept in Step 2B. Similarly, with regard to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept (see MPEP 2106.05(h)).
	The claim is not patent eligible.

Regarding claim 2,
Claim 2 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… compensate for said scaling 
… reducing the weight gradient contribution
… inversely proportional to said scaling
	These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
	numerical computation circuit
Merely using a generic computer component as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Thus, the additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 3,
Claim 3 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… scale said loss value
	These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
	numerical computation circuit
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Thus, the additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 4,
Claim 4 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… scale said loss value
	These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
… automatically-selected scaling factor
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
	numerical computation circuit
Merely using a generic computer component as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Thus, the additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 5,
Claim 5 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… automatically select the scaling factor
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
a processor
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 6,
Claim 6 incorporates the rejection of claim 5.
Further, claim 6 recites only more specific details of the judicial exception recited in claim 5, and does not recite any further additional elements.
Therefore, this claim is not patent eligible for the reasons set forth in claim 5 above.  
  
Regarding claim 7,
Claim 7 incorporates the rejection of claim 6.
Further, claim 7 recites only more specific details of the judicial exception recited in claim 6, and does not recite any further additional elements.
Therefore, this claim is not patent eligible for the reasons set forth in claim 6 above.  

Regarding claim 8,
Claim 8 incorporates the rejection of claim 6.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… test weight gradients
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
a training iteration of the neural networks
Training of the neural networks generally links the use of the judicial exception to a particular technological environment or field of use (see MPEP 2106.05(h)).
Thus, the additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
repeating the iteration …
Repeating only adds insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), and does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: These additional elements are not sufficient to amount to significantly more than the judicial exception, and do not provide an inventive concept, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.).
With regard to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept in Step 2B (see MPEP 2106.05(h)).
As for the repeating limitation, this step is considered extra-solution activity in Step 2A, and is therefore re-evaluated in Step 2B to determine whether it is more than well-understood, routine, conventional activity in the field.
However, the courts have recognized performing repetitive calculations as a well-understood, routine, and conventional function when it is claimed in a merely generic manner at a high level of generality or as insignificant extra-solution activity. See Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values); Bancorp Services v. Sun Life, 687 F.3d 1266, 1278, 103 USPQ2d 1425, 1433 (Fed. Cir. 2012), cited in MPEP 2106.05(d)(II).
The claim is not patent eligible.
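
For illustration only, the test-and-repeat pattern referred to above can be sketched as follows. The claim language is elided in this action, so everything specific here is an assumption rather than the claimed method: the test is assumed to check that every gradient is finite, the scale factor is assumed to be halved on failure, and the retry limit and all function names are hypothetical.

```python
import math


def gradients_are_usable(grads):
    # Assumed test: every gradient must be finite (no inf or NaN).
    return all(math.isfinite(g) for g in grads)


def iterate_with_scale_search(compute_scaled_grads, scale, backoff=2.0, max_tries=10):
    # Repeat the same training iteration with a smaller scale factor
    # until the gradient test passes or the retry budget is exhausted.
    for _ in range(max_tries):
        grads = compute_scaled_grads(scale)
        if gradients_are_usable(grads):
            return grads, scale
        scale /= backoff
    raise RuntimeError("no usable scale factor found")


def toy_scaled_grads(scale):
    # Hypothetical stand-in for back propagation: overflows above 1000.
    return [float("inf")] if scale > 1000 else [0.5 * scale]


grads, chosen_scale = iterate_with_scale_search(toy_scaled_grads, 4096.0)
```

In this toy run the iteration is repeated with scale factors 4096, 2048, and 1024 before the test passes at 512, which is the kind of repeated calculation the Step 2B analysis addresses.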

Regarding claim 9,
Claim 9 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… modifies each weight gradient value
… using the weight gradient value
	Each of these limitations is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
numerical computation circuit
The circuit is a generic computer component, amounting to merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component (circuit) cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 10,
Claim 10 incorporates the rejection of claim 1.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… performs weight updates by combining the gradients with a further parameter
… adjusted based on the scaling
	Each of these limitations is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
numerical computation circuit
The circuit is a generic computer component, amounting to merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component (circuit) cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 11,
Claim 11 incorporates the rejection of claim 10.
The claim further recites: the parameter comprises learning rate 
This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 10 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 12,
Claim 12 incorporates the rejection of claim 10.
The claim further recites: the parameter comprises gradient clipping threshold 
This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 10 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 13,
Claim 13 incorporates the rejection of claim 10.
The claim further recites: the parameter comprises weight decay
This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 10 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 14,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
forward processing data
… develop a loss value
scaling the loss value by a scale factor
back propagating the scaled loss value
compute gradients
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
adjust weights …
	This limitation is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
deep neural network
“through the deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)) cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.
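
For illustration only, the sequence of limitations identified for claim 14 can be sketched as plain arithmetic. This is a minimal sketch under stated assumptions, not the applicant's implementation: a one-weight linear model stands in for the deep neural network, the loss is a squared error, half precision is simulated with Python's `struct` format code `"e"`, and the step that divides the gradient by the scale factor before the weight adjustment is assumed. All names are hypothetical.

```python
import struct


def to_half(x):
    # Round a Python float to IEEE 754 half precision and back.
    return struct.unpack("e", struct.pack("e", x))[0]


def train_step(weight, x, target, learning_rate, scale):
    # Forward processing: a one-weight linear model stands in for the network.
    y = weight * x
    loss = 0.5 * (y - target) ** 2
    # Scaling the loss by `scale` multiplies every gradient by `scale`,
    # so back propagation starts from d(scale * loss)/dy.
    dscaled_dy = scale * (y - target)
    # Back propagation: gradient w.r.t. the weight, held in half precision.
    scaled_grad = to_half(dscaled_dy * x)
    # Assumed compensation step: undo the scaling before adjusting the weight.
    grad = scaled_grad / scale
    return weight - learning_rate * grad, loss


# Toy inputs (assumed) whose true gradient is about -1e-8.
w_unscaled, _ = train_step(0.0, 1e-4, 1e-4, 1.0, 1.0)
w_scaled, _ = train_step(0.0, 1e-4, 1e-4, 1.0, 1024.0)
```

With these toy inputs the gradient rounds to zero in half precision, so the step with a scale factor of 1 leaves the weight unchanged, while the step with a scale factor of 1024 moves it by roughly 1e-8. Each step is a multiplication, division, or subtraction, consistent with the characterization above.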

Regarding claim 15,
Claim 15 incorporates the rejection of claim 14.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… reducing the computed gradients
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
adjust weights …
	This limitation is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 16,
Claim 16 incorporates the rejection of claim 14.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… computing the gradients 
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
… using reduced precision
	This limitation is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 17,
Claim 17 incorporates the rejection of claim 14.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… computing the gradients at a lower precision than is used 
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 18,
Claim 18 incorporates the rejection of claim 14.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… computing gradients using half precision
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 19,
Step 1: The claim recites a manufacture, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
processing data
… develop a loss value
scaling the loss value by a scale factor
back propagating the scaled loss value
compute gradients
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
adjust weights …
	This limitation is a mental process, which can reasonably be performed in the mind with the aid of pen and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
deep neural network
“through the deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)) cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.

Regarding claim 20,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… operating in a reduced precision mode
… processing data
… develop a loss value
… backpropagating the loss value
… compute gradients
… scale the loss value
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
numerical computation circuit
Merely using a generic computer component as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
deep neural network
“through a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., the computation circuit amounts to mere instructions to apply an exception using a generic computer component and cannot provide an inventive concept in Step 2B; similarly, “through a deep neural network”, generally recited, only indicates a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
	The claim is not patent eligible.

Regarding claim 21,
Claim 21 incorporates the rejection of claim 20.
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… develop a loss value
… compute weight updates 
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
numerical computation circuit
Merely using a generic computer component as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., mere instructions to apply an exception using a generic computer component cannot provide an inventive concept in Step 2B.
	The claim is not patent eligible.


Regarding claim 22,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… develop a loss value
… perform computations associated with back propagating the loss value
… compute gradients for weight updates
These limitations, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
computer
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. 
deep neural network
 “through a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., computer only indicates mere instructions to apply an exception using a generic computer component and cannot provide an inventive concept in Step 2B; similarly, “through a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
	The claim is not patent eligible.

Regarding claim 23,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… scaling computed gradients
… compensates a weight update
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
… specifying a scale factor
inputting a first hyperparameter
inputting a second hyperparameter
These limitations recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional element:
deep neural network
“For training a neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, or to provide an inventive concept in Step 2B, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.). 
With regards to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept (see MPEP 2106.05(h)) in Step 2B.
	The claim is not patent eligible.

Regarding claim 24,
Claim 24 incorporates the rejection of claim 2[3].
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… multiplying or dividing the computed gradients
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 25,
Claim 25 incorporates the rejection of claim 2[3].
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… modifying a learning rate
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 26,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… scales at least one value 
… back propagation computation of said gradients
… compensates a gradient-based weight update for said scaling.
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
inserting a first code instruction(s) 
inserting a second code instruction(s) 
These limitations recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
deep neural network
 “for training a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, or to provide an inventive concept in Step 2B, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.). 
With regards to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept (see MPEP 2106.05(h)) in Step 2B.
	The claim is not patent eligible.

Regarding claim 27,
Claim 27 incorporates the rejection of claim 26.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… develop a scale factor
… scale the loss value
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 28,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… scales at least one value
… back propagation computation of said gradients
… compensates a gradient-based weight update
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
inputting a first hyperparameter
inputting a second hyperparameter
These limitations recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pencil and paper. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
deep neural network
 “for training a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, or to provide an inventive concept in Step 2B, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.). 
With regards to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept (see MPEP 2106.05(h)) in Step 2B.
	The claim is not patent eligible.

Regarding claim 29,
Step 1: The claim recites an apparatus, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
develop a loss value
back propagating the loss value
compute, at reduced precision, a gradient
the trained weight having been adjusted
compensate for a scale factor
computation of the gradient at said reduced precision
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. 
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
deep neural network
“for training a deep neural network”, generally recited, is only indicating a technological environment in which to apply the judicial exception (see MPEP 2106.05(h)).
Thus, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, or to provide an inventive concept in Step 2B, for the reasons given in Step 2A Prong 2 (see MPEP § 2106.05.I.A.). 
With regards to “for training a neural network”, specifying a particular technological environment in which to apply the judicial exception does not provide an inventive concept (see MPEP 2106.05(h)) in Step 2B.
	The claim is not patent eligible.

Regarding claim 30,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
	at a first precision to develop loss;
… develop loss
scaling the loss
computing gradients at a second precision lower than the first precision;
reducing the magnitude of the computed gradients
compensate for scaling of the loss
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
training 
“Training”, generically recited, does not integrate the abstract idea into a practical application because it does not rely upon or make use of the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The same analysis applies here in 2B, i.e., generic “training” a neural network is well-understood, routine, conventional activity in the field.
Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) states, “This framework captures many fundamental tasks, such as neural network training.” (pg. 1, section 1, para. 2). This indicates that “training” a neural network is a well-understood, routine, conventional activity in the field when it is claimed in a merely generic manner (as it is here).
The claim is not patent eligible.

Regarding claim 31,
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
forward propagating training data
… develop a loss value
back propagating the loss value
… develop weight gradients
recover zeros and normalize denormals
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.

Regarding claim 32,
Claim 32 incorporates the rejection of claim 20.
Step 1: The claim recites a process, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
… compensates for a magnitude component of the computed gradients due to said scaling
These limitations are processes that, under the broadest reasonable interpretation, cover the recitation of mathematical relationships, which fall within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. The claim recites no additional elements. The claim is directed to an abstract idea.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. The claim is not patent eligible.


Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 23-25 and 30 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”).

Regarding Claim 23
Alistarh teaches a method for training a deep neural network comprising: (Alistarh pg. 1, Abstract, para. 3, “training of deep neural networks.”).
inputting a first hyperparameter specifying a scale factor for scaling computed gradients; (Alistarh pg. 4, para. 4; pg. 4, section 2.1, para. 3).
([equation image: media_image1.png], ƞt is the first hyperparameter. Each gradient is scaled by ƞt.)
inputting a second hyperparameter that compensates a weight update for said scaling (Alistarh pg. 4, section 2.1, para. 1 and para. 3).
([equation image: media_image1.png], K is the second hyperparameter. Each gradient that is scaled by ƞt is divided by K, i.e., compensated by K.)
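For illustration only, the reading given above can be sketched in code. The sketch assumes that the cited expression has the form of a parallel SGD step in which the summed per-processor gradients are multiplied by ƞt and divided by K. The function name, argument names, and numeric values are hypothetical and are not taken from Alistarh or from the application.

```python
from typing import List


def parallel_sgd_step(x: List[float],
                      worker_grads: List[List[float]],
                      eta_t: float) -> List[float]:
    """One assumed update: x_next = x - (eta_t / K) * sum of worker gradients.

    eta_t plays the role of the first hyperparameter (it scales each gradient);
    K, the number of workers, plays the role of the second hyperparameter
    (the scaled gradients are divided by it).
    """
    K = len(worker_grads)
    # Sum the gradients reported by all K workers, component by component.
    summed = [sum(g[i] for g in worker_grads) for i in range(len(x))]
    # Scale by eta_t and compensate by dividing by K.
    return [xi - (eta_t / K) * si for xi, si in zip(x, summed)]


x_next = parallel_sgd_step([1.0, 2.0], [[0.2, 0.4], [0.6, 0.8]], eta_t=0.5)
```

In this sketch, changing K changes the effective step size ƞt / K, which is the sense in which the explanation above treats K as compensating the scaled gradients.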

Regarding Claim 24
Alistarh teaches the method of claim 23
wherein the compensating comprises multiplying or dividing the computed gradients by said second hyperparameter. (Alistarh pg. 4, section 2.1, para. 1 and para. 3).
([equation image: media_image1.png], K is the second hyperparameter. Each gradient that is scaled by ƞt is divided by K, i.e., compensated by K.)

Regarding Claim 25
Alistarh teaches the method of claim 23
wherein the compensating comprises modifying a learning rate in accordance with the second hyperparameter. (Alistarh pg. 4, para. 4; pg. 4, section 2.1, para. 3).
([equation image: media_image1.png], ƞt is a learning rate. The learning rate is modified (divided) by the second hyperparameter K.) 

Regarding Claim 30
Alistarh teaches a method comprising iteratively:  (Alistarh pg. 1, Introduction 1, para. 2, "SGD", (stochastic gradient descent updates the gradients iteratively.)). 
training at a first precision to develop loss; (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates") (this implies that the computations to generate gradients are performed at a first precision); (Alistarh pg. 1, Introduction, para. 1, "computations in the context of machine learning") (this indicates that those "computations" are training); (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n → R be a function which we want to minimize.”)
scaling the loss; (Alistarh pg. 8, section 3.3, para. 1, [equation image: media_image2.png]). (“1/m” scales the loss fi.)
computing gradients at a second precision lower than the first precision; (Alistarh pg. 7, section 3.2, para. 2, equation (4).)  and
reducing the magnitude of the computed gradients to compensate for scaling of the loss.   (Alistarh pg. 4, para. 5, “1/m”) (Multiplying the sum of the gradients by 1/m reduces their magnitude). 
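As a minimal sketch of the sequence of steps recited in claim 30 (scale the loss, compute gradients at a lower precision, then reduce the gradient magnitude to compensate), the following uses a one-weight squared-error model. It is not drawn from Alistarh or from the application: the helper names, the scale factor of 1024, and the numeric values are hypothetical, and half precision is emulated with Python's `struct` format `e`.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def weight_gradient_fp16(error: float, x: float, loss_scale: float) -> float:
    """Gradient of 0.5 * error**2 w.r.t. the weight, where error = w * x - y.

    The backward pass is rounded to half precision at every step; the final
    division by loss_scale (the compensation) is done in full precision.
    """
    upstream = to_half(loss_scale)       # d(scaled loss) / d(loss)
    d_pred = to_half(upstream * error)   # d(scaled loss) / d(prediction)
    d_weight = to_half(d_pred * x)       # d(scaled loss) / d(weight)
    return d_weight / loss_scale         # undo the loss scaling


# Without scaling, the true gradient 1e-8 underflows to zero in half precision.
unscaled = weight_gradient_fp16(error=1e-4, x=1e-4, loss_scale=1.0)
# With scaling, the gradient survives and is recovered after compensation.
scaled = weight_gradient_fp16(error=1e-4, x=1e-4, loss_scale=1024.0)
```

Here `unscaled` is zero while `scaled` is close to the true gradient of 1e-8, which is the effect the scaling and compensating steps are directed to.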


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


Claims 1-5, 9 are rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Ould-Ahmed-Vall et al. (US20180315157A1) 

Regarding Claim 1
Alistarh teaches a system for training a neural network comprising:
at least one numerical computation circuit (Alistarh pg. 9 section 4, para. 1, “We now empirically validate our approach on data-parallel GPU training of deep neural networks. Setup. We performed experiments on Amazon EC2 p2.16xlarge instances, using up to 16 NVIDIA K80 GPUs”). (Alistarh performs their method on a GPU/"numerical computation circuit") 
configurable to compute gradients on a loss value; (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n → R be a function which we want to minimize.”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ℓ(X,Ɵ)], the expected loss to the data.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient.) 
wherein the at least one numerical computation circuit is further configured to scale said loss value (Alistarh pg. 8, section 3.3, para. 1, [equation image: media_image2.png]). (Loss value fi is scaled by “1/m”, where m is the minibatch size of SGD.)
when adjusting weights based on gradients computed using said reduced precision mode.   (Alistarh pg. 7, section 3.2, para. 2, equation (4).)  
Alistarh does not explicitly teach a reduced precision selector coupled to said at least one numerical computation circuit, said reduced precision selector controlling said at least one numerical computation circuit to selectively operate in reduced precision computation mode; but Ould-Ahmed-Vall teaches this limitation. (Ould-Ahmed-Vall [0191] the dynamic precision floating point unit supports 32-bit, 16-bit, and 8-bit integer operations.; [0197] In one embodiment the required and resulting precision can be represented within the precision register 1508 (precision mode); [0205] “The reduced precision results can then be compared to the full precision results. If the precision loss is less than the threshold, the logic 1700 can output the result at the second precision, as shown at block 1712. If the precision loss is not less than the threshold at block 1709, the logic 1700 can compute the remaining bits of the result at block 1710 and output the result at the first precision, as shown at block 1714.” ) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Alistarh with a reduced precision selector taught in Ould-Ahmed-Vall to control the reduced precision mode. The motivation to do so is that "the dynamic precision floating point unit supports 32-bit, 16-bit, and 8-bit integer operations" (Ould-Ahmed-Vall, [0191]).
	
Regarding Claim 2
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh further teaches wherein the at least one numerical computation circuit is configured to compensate for said scaling by reducing the weight gradient contribution (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image4.png])
in a way that is inversely proportional to said scaling. (Alistarh pg. 5, algorithm 1, lines 3 and 7) (Decoding and encoding are inverse.) 

Regarding Claim 3
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh further teaches wherein said at least one numerical computation circuit is configured to scale said loss value based at least in part on a hyperparameter. (Alistarh pg. 8, section 3.3, para. 1, [equation image: media_image2.png]). (m is the minibatch size, which is a hyperparameter. Loss value fi is scaled by “1/m”, which is the multiplicative inverse of m.)

Regarding Claim 4
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh further teaches wherein said at least one numerical computation circuit is configured to scale said loss value by an automatically-selected scaling factor. (Alistarh pg. 8, section 3.3, para. 1, [equation image: media_image2.png]).
(Loss value fi is scaled by “1/m”. m is the minibatch size, which is a hyperparameter. Once the value of m is determined, the scaling factor “1/m” is automatically selected as the multiplicative inverse of m.)

Regarding Claim 5
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh further teaches the system further including at least one processor configured to automatically select [a] scaling factor for each iteration of training of said neural network based on a largest magnitude weight gradient determined in a last iteration. (Alistarh pg. 9, section 4, para. 2, “When quantizing, we scale by the max, which simplifies computation and reduces variance.”). (“Scale by the max” means scaling by the maximum value of the gradient vector in the current iteration, which is interpreted as a last iteration.)
(Alistarh pg. 9, section 4, para. 2, “To control variance, we quantize buckets of d consecutive vector components, using stochastic quantization. … d = n corresponds to full quantization.”). (The automatically selected scaling factor is the max value in the bucket of “d consecutive components”, which for d = n is the whole gradient vector of an iteration. SGD produces a new gradient in each training iteration, so the system automatically selects the max as the scaling factor for each iteration.)
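The “scale by the max” reading relied on above can be sketched as follows. This is an illustrative sketch only: it shows per-bucket selection of the largest-magnitude component as the scale, and it omits the stochastic quantization step that the reference describes. The function name and example values are hypothetical.

```python
from typing import List, Tuple


def scale_buckets_by_max(grad: List[float],
                         d: int) -> Tuple[List[float], List[List[float]]]:
    """Split grad into buckets of d consecutive components and scale each
    bucket by its largest-magnitude component.

    Returns the per-bucket scales and the normalized buckets, whose entries
    all lie in [-1, 1]. With d == len(grad) there is a single bucket, i.e.
    the whole gradient vector of the iteration.
    """
    scales: List[float] = []
    normalized: List[List[float]] = []
    for start in range(0, len(grad), d):
        bucket = grad[start:start + d]
        scale = max(abs(v) for v in bucket)  # selected automatically from the data
        scales.append(scale)
        # Guard against an all-zero bucket to avoid dividing by zero.
        normalized.append([v / scale if scale else 0.0 for v in bucket])
    return scales, normalized


scales, buckets = scale_buckets_by_max([0.5, -2.0, 1.0, 0.25], d=2)
```

Because the scale is recomputed from each new gradient, it changes from one training iteration to the next without any manual selection.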

Regarding Claim 9
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh further teaches wherein the at least one numerical computation circuit modifies each weight gradient value (Alistarh pg. 4, para. 5, “1/m”).
by an amount inversely proportional to the scaling factor before using the weight gradient value for a weight update. (Alistarh pg. 5, algorithm 1, lines 3 and 7) (Decoding and encoding are inverse.) 


Claims 10, 11, 13 are rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Ould-Ahmed-Vall et al. (US20180315157A1) in view of Krizhevsky et al. ("One weird trick for parallelizing convolutional neural networks"). 

Regarding Claim 10
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 1 
Alistarh/ Ould-Ahmed-Vall does not explicitly teach wherein the at least one numerical computation circuit performs weight updates by combining the gradients with a further parameter that is adjusted based on the scaling, but Krizhevsky teaches this limitation. (Krizhevsky, pg. 5, col. 1, para. 3, [equation image: media_image5.png]; pg. 5, col. 1, para. 4, “When experimenting with different batch sizes, one must decide how to adjust the hyperparameters μ, ω, and ϵ”). (The hyperparameters correspond to the “further parameter”. The scaling is based on “batch sizes”: as explained for claim 1, loss value fi is scaled by “1/m”, where m is the minibatch size of SGD.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to perform weight updates by adjusting the hyperparameters with respect to changing batch size, which affects the scaling.
The motivation to do so is “When experimenting with different batch sizes, one must decide how to adjust the hyperparameters.” (Krizhevsky, pg. 5, para. 4).

Regarding Claim 11
Alistarh/ Ould-Ahmed-Vall/ Krizhevsky teaches the system of claim 10 
Krizhevsky further teaches wherein the parameter comprises learning rate. (Krizhevsky, pg. 5, col. 1, para. 3, [equation image: media_image5.png]; pg. 5, col. 1, para. 4, “When experimenting with different batch sizes, one must decide how to adjust the hyperparameters μ, ω, and ϵ”). (Learning rate ϵ is one of the hyperparameters.)
The rejection of Claim 10 already demonstrates that the parameter comprises learning rate.

Regarding Claim 13
Alistarh/ Ould-Ahmed-Vall/ Krizhevsky teaches the system of claim 10 
Krizhevsky further teaches wherein the parameter comprises weight decay. (Krizhevsky, pg. 5, col. 1, para. 3, [equation image: media_image5.png]; pg. 5, col. 1, para. 4, “When experimenting with different batch sizes, one must decide how to adjust the hyperparameters μ, ω, and ϵ”). (ω is one of the hyperparameters.)
The rejection of Claim 10 already demonstrates that the parameter comprises weight decay.


Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Ould-Ahmed-Vall et al. (US20180315157A1) in view of Krizhevsky et al. ("One weird trick for parallelizing convolutional neural networks") in view of Achanta et al. ("An investigation of recurrent neural network architectures for statistical parametric speech synthesis"). 

Regarding Claim 12
Alistarh/ Ould-Ahmed-Vall/ Krizhevsky teaches the system of claim 10 
Alistarh/ Ould-Ahmed-Vall/ Krizhevsky does not explicitly teach wherein the parameter comprises gradient clipping threshold, but Achanta teaches a gradient clipping threshold parameter used in weight updates that is adjusted based on the scaling. (Achanta, pg. 2, col. 2, section 3.1.3, para. 1, “In gradient clipping, average length of gradients is computed over one pass of the training data and a scaled version of this average length is used as threshold (th)”). ("Average length of gradients" means summing and dividing by the number of gradients, i.e., the minibatch size, which affects the scaling.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a gradient clipping threshold parameter that is adjusted based on the batch size, which affects the scaling factor as taught in the Alistarh/ Ould-Ahmed-Vall/ Krizhevsky combination of Claim 10.  
The motivation to do so is that “gradient clipping was introduced in [15] to avoid the gradient explosion problem.” (Achanta, pg. 2, col. 2, section. 3.1.3, para. 1).

  
Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Ould-Ahmed-Vall et al. (US20180315157A1) in view of De Sa et al. (“Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms”).

Regarding Claim 6
Alistarh/ Ould-Ahmed-Vall teaches the system of claim 5 
However, Alistarh/ Ould-Ahmed-Vall does not explicitly teach wherein said at least one processor computes an upper bound on the scaling factor u = log2(2^15 - … ), but De Sa teaches this limitation. (De Sa pg. 11, section 4, para. 1, “BUCKWILD! uses limited-precision arithmetic by rounding the input data to 8-bit or 16-bit integers.”). 
(For a 16-bit signed integer, log2(2^15) is the maximum number of binary bits, which is the upper bound.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have a maximum scaling value, such that the scaled product does not exceed 16 bits. The motivation to do so is to avoid overflowing the maximum value of the 16-bit variable. “This not only decreases the memory usage, but also allows us to take advantage of single-instruction-multiple-data (SIMD) instructions for integers on modern CPUs.” (De Sa, pg. 11, section 4, para. 1).
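The overflow rationale above can be illustrated with a short sketch. It is not the claimed formula, whose remaining terms are elided in the limitation as quoted, and it is not taken from De Sa. It only assumes a power-of-two scale factor and a signed 16-bit integer range; the function name and example values are hypothetical.

```python
import math


def max_power_of_two_exponent(max_abs_value: float, bits: int = 16) -> int:
    """Largest integer e such that max_abs_value * 2**e still fits in a
    signed integer of the given width (at most 2**(bits - 1) - 1)."""
    if max_abs_value <= 0:
        raise ValueError("max_abs_value must be positive")
    limit = 2 ** (bits - 1) - 1  # 32767 for a signed 16-bit integer
    return math.floor(math.log2(limit / max_abs_value))


e = max_power_of_two_exponent(3.0)  # 3.0 * 2**13 = 24576 fits; 3.0 * 2**14 does not
```

The sketch shows only that an upper bound on the scale follows from the largest value being scaled and the width of the target type, which is the point of the motivation stated above.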

Regarding Claim 7
Alistarh/ Ould-Ahmed-Vall/ De Sa teaches the system of claim 6 
Alistarh further teaches wherein said at least one processor further reduces the scaling factor by a constant to prevent overflow. (Alistarh pg. 7, ||v|| in equation 4.) (each element of the vector is scaled by the length of the entire vector so that no value is greater than 1, i.e., to prevent overflow).

Regarding Claim 8
Alistarh/ Ould-Ahmed-Vall/ De Sa teaches the system of claim 6.
Alistarh further teaches wherein the at least one processor is further configured to test weight gradients determined by a training iteration of the neural network, repeating the iteration with a reduced scaling factor conditioned on the results of the test. (Alistarh pg. 3, section 2, last para., "We assume repeated access to stochastic gradients, which on (possibly random) input x, outputs a direction which is in expectation the correct direction to move in."; pg. 4, section 2.1, para. 3, “it is a simple calculation to see that at each processor, if xt was the value of x that the processors held before iteration t, then the updated value of x by the end of this iteration is xt+1 = … is a stochastic gradient.”; pg. 9, para. 2, “When quantizing, we scale by the max, which simplifies computation and reduces variance.”). (Taking the max of the gradients is a test and scaling by the max is repeating the iteration with a reduced scaling factor).
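
For orientation, the claim 8 limitation (testing the gradients and repeating the iteration with a reduced scaling factor) might be sketched as follows. This is a hypothetical illustration only and is not code from the application or from Alistarh: the function names, the halving policy, and the use of float16 overflow as the test are all assumptions.

```python
import numpy as np

def scaled_gradient_fp16(grad_fp32: np.ndarray, scale: float) -> np.ndarray:
    """Stand-in for a backward pass: returns scale * gradient stored in float16."""
    with np.errstate(over="ignore"):
        return (grad_fp32 * np.float32(scale)).astype(np.float16)

def step_with_retry(grad_fp32: np.ndarray, scale: float, max_retries: int = 20):
    """Repeat the (stand-in) iteration with a halved scale until no overflow occurs."""
    for _ in range(max_retries):
        g16 = scaled_gradient_fp16(grad_fp32, scale)
        if np.all(np.isfinite(g16)):            # the "test" on the weight gradients
            unscaled = g16.astype(np.float32) / np.float32(scale)
            return unscaled, scale
        scale /= 2.0                            # reduce the scaling factor and retry
    raise RuntimeError("no overflow-free scale found")

# Assumed inputs: true gradients [1.0, 0.5] and a deliberately large starting scale.
grads, final_scale = step_with_retry(np.array([1.0, 0.5], dtype=np.float32), 2.0**20)
```

In this sketch the starting scale overflows float16 (whose largest finite value is 65504), so the loop halves it until the scaled gradients are finite.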


Claims 14-17, 19, 22, 26-29, 31 are rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Yu et al.  (US20170330068A1).

Regarding Claim 14
Alistarh teaches a process of training a deep neural network comprising:
    (a) forward processing data through the deep neural network to develop a loss value; (Alistarh pg. 1, Abstract, para. 3, “experiments show that gradient quantization applied to training of deep neural networks”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ɻ(X,Ɵ)], the expected loss to the data.”). (Computing the loss involves forward processing the input through the network).
    (b) scaling the loss value by a scale factor; (Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image2.png]).
(c) … the scaled loss value through the deep neural network to compute gradients; (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"). (It implies that the computations to generate gradients are performed at a first precision.) (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n -> R be a function which we want to minimize.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient).
(d) adjusting weights of the deep neural network based on the computed gradients (Alistarh pg. 4, para. 3, [equation image: media_image6.png]).
Alistarh does not explicitly teach (c) back propagating …; but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”; [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to backpropagate the gradient of the loss function through the first neural network. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).

Regarding Claim 15
Alistarh/ Yu teaches the process of claim 14.
Alistarh further teaches wherein the adjusting includes reducing the computed gradients by the scale factor before using them to adjust the weights. (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image7.png]).
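
The claim language of claims 14 and 15 (scaling the loss value, then reducing the computed gradients by the same scale factor before the weight update) can be sketched as below. The sketch illustrates the claim language only; it is not taken from Alistarh or Yu. The toy two-factor chain rule, the helper name `backprop_fp16`, and the factor 1024 are assumptions chosen so that the unscaled product underflows in float16.

```python
import numpy as np

def backprop_fp16(loss_scale: float, a: float, b: float) -> np.float16:
    """Toy chain rule d(scaled loss)/dw = loss_scale * a * b, kept in float16."""
    upstream = np.float16(loss_scale)            # gradient seeded by the loss scale
    return (upstream * np.float16(a)) * np.float16(b)

# Two local derivatives that are each representable in float16,
# but whose product (about 1e-8) is below the smallest float16 subnormal.
a, b = 1e-4, 1e-4

unscaled = backprop_fp16(1.0, a, b)              # underflows to zero

S = 1024.0                                       # assumed scale factor
scaled = backprop_fp16(S, a, b)                  # survives in float16
recovered = np.float32(scaled) / np.float32(S)   # reduce by S before the weight update
```

Because the scale factor multiplies every term of the chain rule, dividing the final gradient by the same factor in higher precision recovers approximately the value that was lost without scaling.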

Regarding Claim 16
Alistarh/ Yu teaches the process of claim 14 
Alistarh further teaches further including computing the gradients using reduced precision. (Alistarh pg. 7, section 3.2, para. 2, equation (4).)   

Regarding Claim 17
Alistarh/ Yu teaches the process of claim 14 
Alistarh further teaches further including computing the gradients at a lower precision than is used for at least some computations associated with the forward processing of training data. (Alistarh pg. 7, section 3.2, para. 2, equation (4).)   

Regarding Claim 19
Alistarh teaches a non-transitory memory storing instructions that when executed by at least one processor control the at least one processor to perform steps (Alistarh pg. 9 section 4, para. 1, “We now empirically validate our approach on data-parallel GPU training of deep neural networks. Setup. We performed experiments on Amazon EC2 p2.16xlarge instances, using up to 16 NVIDIA K80 GPUs”). (Memory storing instructions is inherent when executing on a computer/GPU). 
comprising: (a) processing data with a deep neural network to develop a loss value; (Alistarh pg. 1, Abstract, para. 3, “experiments show that gradient quantization applied to training of deep neural networks”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ɻ(X,Ɵ)], the expected loss to the data.”). (Computing the loss involves processing the input through the network).
(b) scaling the loss value by a scale factor; (Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image2.png]).
(c) … the scaled loss value through the deep neural network to compute gradients; (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"). (It implies that the computations to generate gradients are performed at a first precision.) (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n -> R be a function which we want to minimize.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient).
(d) adjusting weights of the deep neural network based on the computed gradients (Alistarh pg. 4, para. 3, [equation image: media_image6.png]).
Alistarh does not explicitly teach (c) back propagating …; but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”; [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to backpropagate the gradient of the loss function through the first neural network. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).

Regarding Claim 22
Alistarh teaches a computing arrangement comprising: 
a computer configured to perform computations associated with (Alistarh pg. 9 section 4, para. 1, “We now empirically validate our approach on data-parallel GPU training of deep neural networks. Setup. We performed experiments on Amazon EC2 p2.16xlarge instances, using up to 16 NVIDIA K80 GPUs”). (Alistarh performs their method on a GPU/computer). 
processing data through a deep neural network to develop a loss value (Alistarh pg. 1, Abstract, para. 3, “experiments show that gradient quantization applied to training of deep neural networks”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ɻ(X,Ɵ)], the expected loss to the data.”). (Computing the loss involves processing the input through the network). and 
further configured to perform computations associated with … the loss value through the deep neural network to compute gradients for weight updates, (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"). (It implies that the computations to generate gradients are performed at a first precision.) (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n -> R be a function which we want to minimize.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient).
the computer being further configured to scale the loss value (Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image2.png]). and
to modify the computed gradients and/or at least one process that operates on said computed gradients to take […] scale factor into account. (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image7.png]).
Alistarh does not explicitly teach back propagating but Yu teaches this limitation. (Yu Fig. 2, [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233) 
further configured to perform computations associated with back propagating the loss value through the deep neural network to compute gradients for weight updates, (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”
[0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Alistarh to update the neural network by backpropagating the gradient of the loss function through the neural network as taught in Yu. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).

Regarding Claim 26
Alistarh teaches a method of modifying a deep neural network training system to permit lower precision computation of gradients (Alistarh, pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"; Abstract, para. 3, "gradient quantization applied to the training of deep neural networks"; pg. 3, para. 2, “QSGD converges even at 2-bit precision.”).
while avoiding numerical computation problems associated with zeros and denormals due to use of said lower precision gradient computation, (Alistarh pg. 2 para. 7, “we quantize each component by randomized rounding to a discrete set of values, in a principled way which preserves the statistical properties of the original.”).
comprising: inserting a first code instruction(s) (Alistarh performs their method on a GPU, thus teaching "inserting code instructions" to perform the method) that scales at least one value used in a … computation of said gradients; (Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image8.png]).
inserting a second code instruction(s) (Alistarh performs their method on a GPU, thus teaching "inserting code instructions" to perform the method) that compensates a gradient-based weight update for said scaling. (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image7.png]). (Multiplying the sum of the gradients by 1/m reduces their magnitude).
Alistarh does not explicitly teach back propagating computation but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”; [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to backpropagate the gradient of the loss function through the first neural network. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).

Regarding Claim 27
Alistarh/ Yu teaches the method of claim 26 
Alistarh further teaches wherein the at least one value comprises a loss value, and 
the first code instruction(s) automatically develop a scale factor used to scale the loss value. (Alistarh pg. 8 section 3.3 para. 1, [equation image: media_image2.png]).
(Loss value fi is scaled by “1/m”. m is minibatch size, which is a hyperparameter. A scale factor “1/m” is automatically determined by the multiplicative inverse of m).
 
Regarding Claim 28
Alistarh teaches a method of controlling a deep neural network training system (Alistarh pg. 1, Abstract, para. 3, “training of deep neural networks.”). 
to permit lower precision computation of gradients (Alistarh pg. 3, para. 2, “QSGD converges even at 2-bit precision.”).
while avoiding numerical computation problems associated with zeros and denormals due to use of said lower precision gradient computation, (Alistarh pg. 2 para. 7, “we quantize each component by randomized rounding to a discrete set of values, in a principled way which preserves the statistical properties of the original.”).
comprising: inputting a first hyperparameter that scales at least one value used in a … computation of said gradients; 
(Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image8.png]) (where m is batch size, i.e., an inputted hyperparameter). and
inputting a second hyperparameter that compensates a gradient-based weight update for said scaling. (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image7.png]). (Multiplying the sum of the gradients by 1/m reduces their magnitude, where 1/m in the gradient computation is a second hyperparameter).
Alistarh does not explicitly teach back propagating computation but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”; [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to backpropagate the gradient of the loss function through the first neural network. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).

Regarding Claim 29
Alistarh teaches a deep neural network comprising layers each comprising at least one artificial neuron, each layer having a weight associated therewith, the weight having been trained by (Alistarh pg. 1, Abstract, para. 3, “training of deep neural networks.”).
performing computations associated with processing of training data through the deep neural network to develop a loss value (Alistarh pg. 1, Abstract, para. 3, “experiments show that gradient quantization applied to training of deep neural networks”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ɻ(X,Ɵ)], the expected loss to the data.”). (Computing the loss involves forward processing the input through the network). and 
… the loss value through the deep neural network to compute, (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"). (It implies that the computations to generate gradients are performed at a first precision.) (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n -> R be a function which we want to minimize.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient).
at reduced precision, (Alistarh pg. 3, para. 2, “QSGD converges even at 2-bit precision.”).
a gradient used to update the weight, the contribution of the computed gradient to the trained weight having been adjusted to compensate for a scale factor used to enable computation of the gradient at said reduced precision (Alistarh pg. 4, para. 5, “1/m” in [equation image: media_image4.png]).
while normalizing denormals and recovering zeros that would otherwise have occurred due to the reduced precision. (Alistarh pg. 7, see pasted Figure, [figure image: media_image9.png]). (Any point between 0 and 0.25, i.e., a denormal smaller than the smallest quantization precision of the system gets probabilistically mapped to 0, i.e., "recover zeros" or mapped to 0.25, i.e., "normalizing the denormal").
Alistarh does not explicitly teach back propagating but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”; [0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to backpropagate the gradient of the loss function through the first neural network. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).
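
The randomized rounding relied on for the "normalizing denormals and recovering zeros" limitation of claim 29 (a point between 0 and 0.25 mapped probabilistically to 0 or to 0.25) can be sketched as follows. This is a minimal sketch assuming a uniform grid with spacing 0.25; the helper name `stochastic_round`, the grid, and the sample value 0.1 are illustrative assumptions and not Alistarh's implementation, which per the equation (4) citation also scales by the vector norm.

```python
import numpy as np

def stochastic_round(values, step, rng):
    """Round each value to a neighbouring multiple of `step`.

    A value rounds up with probability equal to its fractional position
    between the two grid points, so the result is unbiased in expectation.
    """
    values = np.asarray(values, dtype=np.float64)
    lower = np.floor(values / step)           # index of the grid point below
    prob_up = values / step - lower           # fractional position in [0, 1)
    go_up = rng.random(values.shape) < prob_up
    return (lower + go_up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.1)                     # assumed input between 0 and 0.25
q = stochastic_round(x, 0.25, rng)            # each entry becomes 0.0 or 0.25
```

Each individual value lands on 0 or 0.25, while the average over many draws stays close to the original 0.1, which is the sense in which the quoted passage says the rounding "preserves the statistical properties of the original."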

Regarding Claim 31
Alistarh teaches a method comprising iteratively:  (Alistarh pg. 1, Introduction 1, para. 2, "SGD", (stochastic gradient descent updates the gradients iteratively)). forward propagating training data through a deep neural network to develop a loss value; (Alistarh pg. 1, Abstract, para. 3, “experiments show that gradient quantization applied to training of deep neural networks”; pg. 1, section 1, para. 3, “We wish to find a model Ɵ* which minimizes f(Ɵ) = E_X~D[ɻ(X,Ɵ)], the expected loss to the data.”). (Computing the loss involves forward processing the input through the network).
… the loss value through the deep neural network to develop weight gradients; (Alistarh pg. 1, Abstract, para. 2, "compression schemes which allow the compression of gradient updates"). (It implies that the computations to generate gradients are performed at a first precision.) (Alistarh pg. 1, Introduction, para. 2, “Let f : R^n -> R be a function which we want to minimize.”; [equation image: media_image3.png]). (f is the expected loss, minimized by taking its gradient). and
configuring … to recover zeros and normalize denormals (Alistarh pg. 7, see pasted Figure, [figure image: media_image9.png]). (Any point between 0 and 0.25, i.e., a denormal smaller than the smallest quantization precision of the system gets probabilistically mapped to 0, i.e., "recover zeros" or mapped to 0.25, i.e., "normalizing the denormal"). without adversely affecting a subsequent weight update based on the weight gradients. (Alistarh pg. 2 para. 7, “we quantize each component by randomized rounding to a discrete set of values, in a principled way which preserves the statistical properties of the original.”).
Alistarh does not teach back propagating but Yu teaches this limitation. (Yu [0005] “calculate a gradient of a loss function based on the first network outputs and on the loss function; backpropagate the gradient of the loss function through the first neural network”
[0030] When training the neural network 200 using backpropagation (e.g., with gradient descent), the gradient 234 of a loss function 230 can be calculated based on the output layer (the second layer 211B of the SSNN 220, in this example) and a training target 233, and then the gradient 234 can be backpropagated through the neural network 200.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Alistarh to update the neural network by backpropagating the gradient of the loss function through the neural network as taught in Yu. The motivation to do so is to “update the first neural network based on the backpropagation of the gradient.” (Yu [0005]).


Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”) in view of Yu et al.  (US20170330068A1) in view of De Sa et al. (“Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms”)
 
Regarding Claim 18
Alistarh/ Yu teaches the process of claim 14.
Alistarh further teaches further including computing gradients using [lower] precision while recovering zeros and normalizing denormals due to said [lower] precision. (Alistarh pg. 7, see pasted Figure, [figure image: media_image9.png]). (Any point between 0 and 0.25, i.e., a denormal smaller than the smallest quantization precision of the system gets probabilistically mapped to 0, i.e., "recover zeros" or mapped to 0.25, i.e., "normalizing the denormal").
However, Alistarh does not explicitly teach using half precision but De Sa teaches this limitation. (De Sa, pg. 11, Section 4, para. 1, “Compared with Hogwild!, which uses 32-bit floating point numbers to represent input data, Buckwild! uses limited-precision arithmetic by rounding the input data to 8-bit or 16-bit integers.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use specifically half precision because half precision is a well-known low precision method.
 

Claims 20, 21, 32 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (“Fixed-Point Feedforward Deep Neural Network Design Using Weights +1, 0, and -1”) in view of Alistarh et al. (“QSGD: Communication-Optimal Stochastic Gradient Descent with Applications to Training Neural Networks”). 

Regarding Claim 20
Hwang teaches a mixed precision computing component comprising: 
at least one numerical computation circuit capable of operating in a reduced precision mode; (Hwang, pg. 3, col. 1, para. 1, “we maintain both the high-precision and low-precision weights”).
the at least one numerical computation circuit being configured to perform computations associated with processing data through a deep neural network to develop a loss value; (Hwang, Fig. 1; pg. 2, col.1, section II, para. 3, “Once the number of quantization points for each weight matrix is given, the goal is to minimize the output error of the network.”). and when operating in the reduced precision mode, 
the at least one numerical computation circuit being configured to perform computations associated with back propagating the loss value through the deep neural network to compute gradients useful for weight updates, (Hwang, Fig. 3; pg. 1, Abstract, “Feedforward deep neural networks that employ multiple hidden layers show high performance in many applications, but they demand complex hardware for implementation. The hardware complexity can be much lowered by minimizing the word-length of weights and signals, but direct quantization for fixed-point network design does not yield good results. We optimize the fixed-point design by employing backpropagation based retraining.”; pg. 3, col. 1, para. 3, “We further modify the backpropagation algorithm to quantize the signals or the outputs of the units.”). 
However, Hwang does not explicitly teach wherein the at least one numerical computation circuit is further configured to scale the loss value but Alistarh teaches this limitation. (Alistarh pg. 8 section 3.3 para. 1, “1/m” in [equation image: media_image2.png]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to scale the loss function. The motivation to do so is that “smooth convex functions.” (Alistarh pg. 8, section 3.3 para. 1).

Regarding Claim 21
Hwang/ Alistarh teaches the component of claim 20 
Hwang further teaches wherein the numerical computation circuit is configured to develop the loss value using said reduced precision mode, (Hwang, pg. 3, col. 1, para. 2, “The low-precision weights are obtained by quantizing the high-precision weights and used in the forward and backward steps of the backpropagation algorithm.”) and 
to compute weight updates using other than said reduced precision mode.   (Hwang, pg. 3, col. 1, para. 3, "Therefore, the derivative should be calculated using high-precision signal values").

Regarding Claim 32
Hwang/ Alistarh teaches the component of claim 20 
Alistarh further teaches wherein the scaling compensates for a magnitude component of the computed gradients due to said scaling. (Alistarh pg. 4, para. 5, [equation image: media_image7.png]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to scale the gradients. The motivation to do so is to compensate for variable batch sizes, as the loss function changes for variable batch sizes.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Deuk K Lee whose telephone number is 571-272-8440.  The examiner can normally be reached on Monday-Friday 8:30am-5:30pm CDT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/D. L./Examiner, Art Unit 2122


/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122