DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
In reference to desig. ID 23, the correct title is "Acceleration of Stochastic Approximation by Averaging".
In reference to desig. ID 7, the document was not considered as it was not furnished.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference character “100” has been used to designate both “Training System” and “Quantized Inference System” in Fig. 1.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: 120.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Correction of the following is required: 
The terms “long term variance” and “batch variance” recited in claims 4 and 15 do not appear in the specification.
The terms “epsilon” and “upsilon” recited in claims 5 and 16 do not appear in the specification.
The specification does not have adequate support for “determining that sufficient training has occurred prior to receiving the first batch of training data” and “determining that sufficient training has not occurred prior to receiving the first batch of training data” which is recited in claims 6, 7, 8, 11, 17, 18, 19, and 22.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.


Claims 11 and 22 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. The claims recite the limitation “determining that sufficient training has occurred prior to receiving the first batch of training data”; however, the specification does not adequately support a determination that training is sufficient prior to receiving a first batch of training data.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-23 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 10, 11, 12, 15, 21, 22, and 23 recite the limitation “…the long-term moving averages of the batch normalization statistics” or “…freezing the long-term moving averages” or “updating long term moving averages” or “long term variance”. Specifically, the term “long term” is considered unclear and ambiguous, thereby rendering the claims indefinite. The specification at lines 7-10 on page 6 states: “long term moving statistics, e.g., overall batch normalization statistics from all of the training data or statistics from processing a large number of new network inputs post-training, to perform the normalization during inference.” However, the specification only gives examples and does not explicitly define “long term.” The examiner cannot determine a clear and unique definition for this term as it is considered subjective.
Claims 2, 3, 5, 6, 7, 8, 9, 13, 14, 16, 17, 18, 19, and 20 are rejected as they depend from rejected claims.
Claims 6, 7, 11, 17, 19, and 20 recite the limitation “determining that sufficient training has…” Specifically, the term “sufficient” is considered unclear and ambiguous, thereby rendering the claims indefinite. The specification at lines 4-9 on page 11 states: “The system can determine that sufficient training has occurred once the batch normalization statistics start to stabilize. For example, the system can determine that sufficient training has occurred once a threshold number of training iterations, i.e., a threshold number of iterations of the process 200, have been performed or once the batch normalization statistics for a threshold number of consecutive batches remain within a certain threshold of the long-term moving averages.” However, the specification does not explicitly explain what the threshold is. The examiner cannot determine a clear and unique definition for this term as it is considered subjective.
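For illustration only, the two criteria quoted from page 11 of the specification can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's method: the function name and the parameters `iteration_threshold`, `tolerance`, and `consecutive` are hypothetical, and their default values are arbitrary placeholders, since the specification does not state any threshold values.

```python
def sufficient_training(num_iterations, recent_batch_means, moving_mean,
                        iteration_threshold=1000, tolerance=0.05, consecutive=3):
    """Return True if either criterion quoted from the specification is met.

    All three thresholds are hypothetical; the specification leaves them open.
    """
    # Criterion 1: a threshold number of training iterations has been performed.
    if num_iterations >= iteration_threshold:
        return True
    # Criterion 2: the statistics of the last `consecutive` batches all lie
    # within `tolerance` of the long-term moving average.
    if len(recent_batch_means) < consecutive:
        return False
    window = recent_batch_means[-consecutive:]
    return all(abs(m - moving_mean) <= tolerance for m in window)


# Same training history, different (assumed) tolerance, different outcome.
history = [0.50, 0.04, 0.02, -0.03]
print(sufficient_training(10, history, moving_mean=0.0, tolerance=0.05))
print(sufficient_training(10, history, moving_mean=0.0, tolerance=0.01))
```

The outcome of the determination changes with the chosen values, so the result depends on parameters that the quoted passage does not fix.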
Claims 8, 9, 19, and 20 recite equations; however, the variables in the equations are not explicitly declared. There is insufficient antecedent basis for these limitations as the variables are unclear. All the variables used in the equations need to be declared.
Claims 5 and 16 recite the limitation “by a ratio of upsilon to batch standard deviation to generate a product and multiplying weights by the product, wherein epsilon is a constant value.” Specifically, the terms “upsilon” and “epsilon” are considered unclear and ambiguous, thereby rendering the claims indefinite. It is unclear what these terms are supposed to represent, as the claims lack adequate support in the specification. It is also unclear whether they represent the same value, as the limitation recites “upsilon” at the beginning but ends by reciting “epsilon”. The examiner cannot determine a clear and unique definition for these terms as they are ambiguous.
Claims 11 and 22 recite the limitation “freezing the long-term moving averages.” Specifically, the term “freezing” is considered unclear and ambiguous as to what the term specifically means, as there is inadequate support in the specification. The specification at lines 6-13 on page 11 states: “For example, the system can determine that sufficient training has occurred once a threshold number of training iterations, i.e., a threshold number of iterations of the process 200, have been performed or once the batch normalization statistics for a threshold number of consecutive batches remain within a certain threshold of the long-term moving averages. In some cases, the first time that the system determines that sufficient training has occurred, the system freezes the long-term moving averages, i.e., keeps the moving averages constant for the remainder of the training instead of continuing to update the averages after each new batch of training data has been processed.” However, the specification does not explicitly explain what the threshold is. The examiner cannot determine a clear and unique definition for this term as it is considered subjective.
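As a reading aid, the behavior described in the quoted passage (keeping the moving averages constant instead of continuing to update them after each batch) might be sketched as below. The class name, the `momentum` parameter, and the use of an exponential moving average are assumptions made for illustration; the specification does not state which kind of moving average is used.

```python
class MovingAverage:
    """Hypothetical long-term moving average of one batch statistic."""

    def __init__(self, momentum=0.9, initial=0.0):
        self.momentum = momentum  # assumed exponential-decay weight
        self.value = initial
        self.frozen = False

    def update(self, batch_statistic):
        # While not frozen, blend the new batch statistic into the average.
        if not self.frozen:
            self.value = (self.momentum * self.value
                          + (1.0 - self.momentum) * batch_statistic)
        return self.value

    def freeze(self):
        # After this call, later batches no longer change the average.
        self.frozen = True


avg = MovingAverage(momentum=0.5, initial=0.0)
avg.update(2.0)         # average moves toward the batch statistic
avg.freeze()
avg.update(100.0)       # ignored: the average stays constant
print(avg.value)
```

When `freeze()` is called still depends on the "sufficient training" determination, whose thresholds the quoted passage leaves unspecified.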
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 3, 4, 7, 9, 10, 12, 13, 14, 15, 18, 20, 21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Ioffe et al. (hereinafter Ioffe), US 20190325315 A1, in view of El-Yaniv et al. (hereinafter El-Yaniv), US 20170286830 A1.
In reference to Claim 1, Ioffe teaches:
“maintaining long-term moving averages of batch normalization statistics for the batch normalized first neural network layer and floating point weights for the batch normalized first neural network layer” (Ioffe teaches in paragraph 79, “the batch renormalization layer maintains moving normalization statistics for each component. The moving normalization statistics for a component are moving averages of the normalization statistics for the component (specifically, the mean and standard deviation normalization statistics). The moving averages are computed with respect to the normalization statistics determined for the component for batches of training examples processed during previous training iterations. The batch renormalization layer updates the moving normalization statistics for each component at each training iteration, as described further with reference to 210.” The citation demonstrates maintaining moving averages of batch normalization statistics throughout each training iteration. Ioffe teaches broadly that the neural network has parameters (Ioffe paragraph 6), and El-Yaniv teaches that the weights are floating point values (El-Yaniv paragraph 6). The examiner notes that the broadest reasonable interpretation of the limitation is maintaining moving averages of batch normalization statistics and floating point weights);
“receiving a first batch of training data” (Ioffe teaches in paragraph 55, “the neural network 120 can be trained on multiple batches of training examples in order to determine trained values of the parameters of the neural network layers.” The citation demonstrates a neural network receiving data. The examiner notes that the broadest reasonable interpretation of the limitation is receiving data);
“determining batch normalization statistics for the first batch of training data” (Ioffe teaches in paragraph 4, “compute respective current batch normalization statistics for each of the plurality of components from the first layer outputs for the training examples in the current batch”. The citation demonstrates batch normalization statistics being determined. The examiner notes that the broadest reasonable interpretation of the limitation is determining batch normalization statistics);
“determining a correction factor from the batch normalization statistics for the first batch of training data and the long-term moving averages of the batch normalization statistics” (Ioffe teaches in paragraph 32, “A neural network system as described in this specification includes batch renormalization layers that, during training, generate batch renormalization layer outputs which include a correction factor to adjust for differences between the normalization statistics for the batch of training examples currently being processed and the normalization statistics for the set of training data as a whole.” The citation referenced demonstrates a correction factor being created. The examiner notes that the broadest reasonable interpretation of the limitation is determining a correction factor);
“determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer” (Ioffe teaches in paragraph 60, “Backpropagating through the normalization statistics for the batch refers to performing gradient descent by computing the gradient of a loss function with respect to parameters of the neural network 120 including the normalization statistics.” The citation referenced demonstrates determining the gradient of a loss function, which is an objective function to be minimized. The examiner notes that the broadest reasonable interpretation of the limitation is determining the gradient of an objective function);
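For context, the correction factor described in Ioffe paragraph 32 (adjusting for the difference between the current batch statistics and the statistics of the training data as a whole) can be sketched as follows. The specific form r = batch_std / moving_std and d = (batch_mean - moving_mean) / moving_std is an assumption taken from the commonly published batch renormalization formulation, not from the quoted paragraphs; the function names are hypothetical, and clipping of r and d and any numerical-stability constant are omitted.

```python
import math


def batch_statistics(values):
    """Return the mean and population standard deviation of one batch."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def correction_factors(batch_mean, batch_std, moving_mean, moving_std):
    """Assumed form of the correction: a scale r and a shift d."""
    r = batch_std / moving_std
    d = (batch_mean - moving_mean) / moving_std
    return r, d


def renormalize(values, moving_mean, moving_std):
    """Normalize with batch statistics, then apply the correction factors."""
    batch_mean, batch_std = batch_statistics(values)  # assumes batch_std > 0
    r, d = correction_factors(batch_mean, batch_std, moving_mean, moving_std)
    return [((v - batch_mean) / batch_std) * r + d for v in values]


outputs = renormalize([1.0, 2.0, 3.0, 4.0], moving_mean=2.0, moving_std=2.0)
print(outputs)
```

Under this assumed form, the corrected output equals what normalization with the moving statistics alone would produce, which is the sense in which the factor "adjusts for differences" between the two sets of statistics.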

Ioffe does not explicitly disclose:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer”
“quantizing the batch normalized weights”
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights”

However, El-Yaniv discloses:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer” (El-Yaniv teaches at [0067]: “A normalization function, referred to herein as BatchNorm( ), batch-normalizes floating point activation values of neurons, by a batch normalization (BN). The BN accelerates the training and reduces the overall impact of a weight scale. In particular, at train-time training a BN requires many multiplications as the standard deviation is calculated and the BN is divided by a running variance (e.g. a weighted mean of a training set activation variance). The number of scaling calculations is optionally the same as the number of neurons of the QNN or the BNN.” In addition, as previously explained, Ioffe teaches generating batch renormalization layer outputs which include a correction factor, as can be seen in Ioffe’s paragraph [0032]; therefore, the combination of Ioffe’s batch renormalization including a correction factor with El-Yaniv’s normalization function, referred to as BatchNorm( ), teaches the claim limitation);
“quantizing the batch normalized weights” (El-Yaniv teaches in paragraphs 83 and 84, “First, during the training phase, each floating point connection weight value is optionally constrained between −1 and 1, for instance by projecting w^r to −1 or 1 when a connection weight value update brings w^r outside of [−1; 1]. This is done for example by clipping connection weight values during training as indicated above. The floating point connection weight values would otherwise grow without any impact on the binary weights and increase computation without need. [0084] Second, when a weight w^r is used, w^r is quantized using w^b=Sign(w^r).” The citation referenced demonstrates a weight being normalized and the normalized weight being quantized. The examiner notes that the broadest reasonable interpretation of the limitation is quantizing a normalized weight);
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights” (El-Yaniv teaches in paragraph 19, “using a training set dataset to train the neural network model according to respective the quantized connection weight values, the training includes computing a plurality of weight gradients for backpropagation sub-processes by: computing a plurality of neuron gradients, each the neuron gradient is of an output of a respective the quantized activation function in one layer of the plurality of layers with respect to an input of the respective quantized activation function and is calculated such that when an absolute value of the input is smaller than a positive constant threshold value, the respective neuron gradient is set as a positive constant value and when the absolute value of the input is smaller than the positive constant threshold value the neuron gradient is set to zero, and updating a plurality of floating point connection weight values according to the plurality of weight gradients”. The citation referenced demonstrates updating floating point weights with respect to quantized weights. The examiner notes that the broadest reasonable interpretation of the limitation is updating floating point weights with respect to quantized weights).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv paragraph 52).
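For illustration, the quantization and weight update quoted from El-Yaniv paragraphs 83, 84, and 19 (clipping floating point weights to [−1; 1], quantizing with Sign, and updating the floating point weights from the weight gradients) can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the learning rate, the example gradient values, and the choice Sign(0) = +1 are hypothetical, and applying the gradient computed for the quantized weights directly to the floating point weights (often called a straight-through estimator) is assumed rather than quoted.

```python
def sign(w):
    # Assumption: zero maps to +1; the quoted passage does not say.
    return 1.0 if w >= 0 else -1.0


def quantize(float_weights):
    """Binary quantization w_b = Sign(w_r) of each floating point weight."""
    return [sign(w) for w in float_weights]


def clip(w, low=-1.0, high=1.0):
    """Keep a floating point weight inside [-1, 1] after an update."""
    return max(low, min(high, w))


def update_float_weights(float_weights, grads_wrt_quantized, learning_rate=0.5):
    """Apply gradients taken w.r.t. the quantized weights to the float weights."""
    return [clip(w - learning_rate * g)
            for w, g in zip(float_weights, grads_wrt_quantized)]


weights = [0.25, -0.5, 0.95]
print(quantize(weights))                  # forward pass uses binary weights
grads = [1.0, -1.0, -1.0]                 # hypothetical gradient values
weights = update_float_weights(weights, grads)
print(weights, quantize(weights))
```

The floating point copy accumulates small updates between steps, while only its sign is used when the weights are applied.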

In reference to Claim 12, Ioffe teaches:
“maintaining long-term moving averages of batch normalization statistics for the batch normalized first neural network layer and floating point weights for the batch normalized first neural network layer” (Ioffe teaches in paragraph 79, “the batch renormalization layer maintains moving normalization statistics for each component. The moving normalization statistics for a component are moving averages of the normalization statistics for the component (specifically, the mean and standard deviation normalization statistics). The moving averages are computed with respect to the normalization statistics determined for the component for batches of training examples processed during previous training iterations. The batch renormalization layer updates the moving normalization statistics for each component at each training iteration, as described further with reference to 210.” The citation demonstrates maintaining moving averages of batch normalization statistics throughout each training iteration. Ioffe teaches broadly that the neural network has parameters (Ioffe paragraph 6), and El-Yaniv teaches that the weights are floating point values (El-Yaniv paragraph 6). The examiner notes that the broadest reasonable interpretation of the limitation is maintaining moving averages of batch normalization statistics and floating point weights);
“receiving a first batch of training data” (Ioffe teaches in paragraph 55, “the neural network 120 can be trained on multiple batches of training examples in order to determine trained values of the parameters of the neural network layers.” The citation demonstrates a neural network receiving data. The examiner notes that the broadest reasonable interpretation of the limitation is receiving data);
“determining batch normalization statistics for the first batch of training data” (Ioffe teaches in paragraph 4, “compute respective current batch normalization statistics for each of the plurality of components from the first layer outputs for the training examples in the current batch”. The citation demonstrates batch normalization statistics being determined. The examiner notes that the broadest reasonable interpretation of the limitation is determining batch normalization statistics);
“determining a correction factor from the batch normalization statistics for the first batch of training data and the long-term moving averages of the batch normalization statistics” (Ioffe teaches in paragraph 32, “A neural network system as described in this specification includes batch renormalization layers that, during training, generate batch renormalization layer outputs which include a correction factor to adjust for differences between the normalization statistics for the batch of training examples currently being processed and the normalization statistics for the set of training data as a whole.” The citation referenced demonstrates a correction factor being created. The examiner notes that the broadest reasonable interpretation of the limitation is determining a correction factor);
“determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer” (Ioffe teaches in paragraph 60, “Backpropagating through the normalization statistics for the batch refers to performing gradient descent by computing the gradient of a loss function with respect to parameters of the neural network 120 including the normalization statistics.” The citation referenced demonstrates determining the gradient of a loss function, which is an objective function to be minimized. The examiner notes that the broadest reasonable interpretation of the limitation is determining the gradient of an objective function);

Ioffe does not explicitly disclose:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer”
“quantizing the batch normalized weights”
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights”

However, El-Yaniv discloses:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer” (El-Yaniv teaches at [0067]: “A normalization function, referred to herein as BatchNorm( ), batch-normalizes floating point activation values of neurons, by a batch normalization (BN). The BN accelerates the training and reduces the overall impact of a weight scale. In particular, at train-time training a BN requires many multiplications as the standard deviation is calculated and the BN is divided by a running variance (e.g. a weighted mean of a training set activation variance). The number of scaling calculations is optionally the same as the number of neurons of the QNN or the BNN.” In addition, as previously explained, Ioffe teaches generating batch renormalization layer outputs which include a correction factor, as can be seen in Ioffe’s paragraph [0032]; therefore, the combination of Ioffe’s batch renormalization including a correction factor with El-Yaniv’s normalization function, referred to as BatchNorm( ), teaches the claim limitation);
“quantizing the batch normalized weights” (El-Yaniv teaches in paragraphs 83 and 84, “First, during the training phase, each floating point connection weight value is optionally constrained between −1 and 1, for instance by projecting w^r to −1 or 1 when a connection weight value update brings w^r outside of [−1; 1]. This is done for example by clipping connection weight values during training as indicated above. The floating point connection weight values would otherwise grow without any impact on the binary weights and increase computation without need. [0084] Second, when a weight w^r is used, w^r is quantized using w^b=Sign(w^r).” The citation referenced demonstrates a weight being normalized and the normalized weight being quantized. The examiner notes that the broadest reasonable interpretation of the limitation is quantizing a normalized weight);
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights” (El-Yaniv teaches in paragraph 19, “using a training set dataset to train the neural network model according to respective the quantized connection weight values, the training includes computing a plurality of weight gradients for backpropagation sub-processes by: computing a plurality of neuron gradients, each the neuron gradient is of an output of a respective the quantized activation function in one layer of the plurality of layers with respect to an input of the respective quantized activation function and is calculated such that when an absolute value of the input is smaller than a positive constant threshold value, the respective neuron gradient is set as a positive constant value and when the absolute value of the input is smaller than the positive constant threshold value the neuron gradient is set to zero, and updating a plurality of floating point connection weight values according to the plurality of weight gradients”. The citation referenced demonstrates updating floating point weights with respect to quantized weights. The examiner notes that the broadest reasonable interpretation of the limitation is updating floating point weights with respect to quantized weights).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv paragraph 52).

In reference to Claim 23, Ioffe teaches:
“maintaining long-term moving averages of batch normalization statistics for the batch normalized first neural network layer and floating point weights for the batch normalized first neural network layer” (Ioffe teaches in paragraph 79, “the batch renormalization layer maintains moving normalization statistics for each component. The moving normalization statistics for a component are moving averages of the normalization statistics for the component (specifically, the mean and standard deviation normalization statistics). The moving averages are computed with respect to the normalization statistics determined for the component for batches of training examples processed during previous training iterations. The batch renormalization layer updates the moving normalization statistics for each component at each training iteration, as described further with reference to 210.” The citation demonstrates maintaining moving averages of batch normalization statistics throughout each training iteration. Ioffe teaches broadly that the neural network has parameters (Ioffe paragraph 6), and El-Yaniv teaches that the weights are floating point values (El-Yaniv paragraph 6). The examiner notes that the broadest reasonable interpretation of the limitation is maintaining moving averages of batch normalization statistics and floating point weights);
“receiving a first batch of training data” (Ioffe teaches in paragraph 55, “the neural network 120 can be trained on multiple batches of training examples in order to determine trained values of the parameters of the neural network layers.” The citation demonstrates a neural network receiving data. The examiner notes that the broadest reasonable interpretation of the limitation is receiving data);
“determining batch normalization statistics for the first batch of training data” (Ioffe teaches in paragraph 4, “compute respective current batch normalization statistics for each of the plurality of components from the first layer outputs for the training examples in the current batch”. The citation demonstrates batch normalization statistics being determined. The examiner notes that the broadest reasonable interpretation of the limitation is determining batch normalization statistics);
“determining a correction factor from the batch normalization statistics for the first batch of training data and the long-term moving averages of the batch normalization statistics” (Ioffe teaches in paragraph 32, “A neural network system as described in this specification includes batch renormalization layers that, during training, generate batch renormalization layer outputs which include a correction factor to adjust for differences between the normalization statistics for the batch of training examples currently being processed and the normalization statistics for the set of training data as a whole.” The citation referenced demonstrates a correction factor being created. The examiner notes that the broadest reasonable interpretation of the limitation is determining a correction factor);
“determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer” (Ioffe teaches in paragraph 60, “Backpropagating through the normalization statistics for the batch refers to performing gradient descent by computing the gradient of a loss function with respect to parameters of the neural network 120 including the normalization statistics.” The citation referenced demonstrates determining the gradient of a loss function, which is an objective function to be minimized. The examiner notes that the broadest reasonable interpretation of the limitation is determining the gradient of an objective function);

Ioffe does not explicitly disclose:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer”
“quantizing the batch normalized weights”
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights”

However, El-Yaniv discloses:
“generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer” (El-Yaniv teaches at [0067]: “A normalization function, referred to herein as BatchNorm( ), batch-normalizes floating point activation values of neurons, by a batch normalization (BN). The BN accelerates the training and reduces the overall impact of a weight scale. In particular, at train-time training a BN requires many multiplications as the standard deviation is calculated and the BN is divided by a running variance (e.g. a weighted mean of a training set activation variance. The number of scaling calculations is optionally the same as the number of neurons of the QNN or the BNN.”. In addition, as previously explained, Ioffe teaches generating batch renormalization layer outputs which include a correction factor, as it can be seen at Ioffe’s paragraph [0032]; therefore, the combination of Ioffe’s batch renormalization including a correction factor with El-Yaniv’s normalization function referred at BatchNorm teaches the claim limitation)
“quantizing the batch normalized weights” (El-Yaniv teaches in paragraphs 83 and 84, “First, during the training phase, each floating point connection weight value is optionally constrained between −1 and 1, for instance by projecting w.sup.r to −1 or 1 when a connection weight value update brings w.sup.r outside of [−1; 1]. This is done for example by clipping connection weight values during training as indicated above. The floating point connection weight values would otherwise grow without any impact on the binary weights and increase computation without need. [0084] Second, when a weight w.sup.r is used, w.sup.r is quantized using w.sup.b=Sign(w.sup.r).” The citation referenced demonstrates a weight being normalized and the normalized weight being quantized. The examiner notes that the broadest reasonable interpretation for the limitation is quantizing a normalized weight.)
“updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights” (El-Yaniv teaches in paragraph 19, “using a training set dataset to train the neural network model according to respective the quantized connection weight values, the training includes computing a plurality of weight gradients for backpropagation sub-processes by: computing a plurality of neuron gradients, each the neuron gradient is of an output of a respective the quantized activation function in one layer of the plurality of layers with respect to an input of the respective quantized activation function and is calculated such that when an absolute value of the input is smaller than a positive constant threshold value, the respective neuron gradient is set as a positive constant value and when the absolute value of the input is smaller than the positive constant threshold value the neuron gradient is set to zero, and updating a plurality of floating point connection weight values according to the plurality of weight gradients” The citation referenced demonstrates updating floating point weights with respect to quantized weights. The examiner notes that the broadest reasonable interpretation of the limitation is updating floating point weights with respect to quantized weights.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
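For illustration only, the weight handling quoted above from El-Yaniv paragraphs 19, 83, and 84 (clipping the floating point weights to [−1, 1], quantizing with Sign, and updating the floating point weights from gradients taken with respect to the quantized weights) may be sketched in Python as follows. The function names, the learning rate, the example gradient values, and the choice to map zero to +1 are assumptions of this sketch and do not appear in either reference.

```python
def clip(weight, low=-1.0, high=1.0):
    # Constrain a floating point weight to [-1, 1] (El-Yaniv paragraph 83).
    return max(low, min(high, weight))


def quantize(weight):
    # Sign quantizer w_b = Sign(w_r) (El-Yaniv paragraph 84).
    # Mapping zero to +1 is an assumption of this sketch.
    return 1.0 if weight >= 0 else -1.0


def update_float_weights(float_weights, grads_wrt_quantized, learning_rate=0.1):
    # The gradient is taken with respect to the quantized weights, but the
    # update is applied to the floating point copies, which are then clipped.
    return [clip(w - learning_rate * g)
            for w, g in zip(float_weights, grads_wrt_quantized)]


float_weights = [0.3, -0.95, 0.05]
quantized_weights = [quantize(w) for w in float_weights]  # used in the forward pass
example_grads = [0.5, 2.0, 1.0]  # hypothetical gradients w.r.t. quantized weights
new_float_weights = update_float_weights(float_weights, example_grads)
```

In this example the second weight is clipped at −1 after the update, and the third weight changes sign, so its quantized value would flip on the next forward pass.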

In reference to Claim 2, Ioffe teaches:
“wherein determining the batch normalization statistics comprises: for each training example in the batch: receiving a layer input for the batch normalized layer” (Ioffe teaches in paragraph 4, “ wherein the batch renormalization layer is configured to, during training of the neural network on a current batch of training examples: obtain respective current moving normalization statistics for each of the plurality of components that are based on previous first layer outputs generated by the first neural network layer during training of the neural network on previous batches of training examples; receive a respective first layer output for each training example in the current batch” The citation referenced demonstrates the renormalization layer receiving the output of a previous layer. The examiner notes that the broadest reasonable interpretation of the limitation is receiving an input.)
 “determining the batch normalization statistics for the first batch from the initial outputs for the layer inputs in the batch.” (Ioffe teaches in paragraph 4, “ receive a respective first layer output for each training example in the current batch; compute respective current batch normalization statistics for each of the plurality of components from the first layer outputs for the training examples in the current batch” The citation referenced above demonstrates the determining of batch normalization statistics from an output.  The examiner notes that the broadest reasonable interpretation of the limitation is determining normalization statistics from a layer’s output.)
Ioffe does not explicitly teach:
“applying the floating-point weights to the layer input for the batch normalized layer to generate an initial output for the layer input” 
However, El-Yaniv teaches:
“applying the floating-point weights to the layer input for the batch normalized layer to generate an initial output for the layer input” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates using floating point weights to determine the output for an input. When viewed in light of El-Yaniv’s batch normalization of the neurons in paragraph 67, it is clear that the layers are batch normalized. The examiner notes that the broadest reasonable interpretation of the limitation is utilizing floating point weights to determine an output for a batch normalized layer.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
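For illustration only, the sequence addressed in this claim (applying floating point weights to each layer input to generate an initial output, then determining batch statistics from those initial outputs) may be sketched in Python as follows. The function names, the example values, and the small constant eps added under the square root are assumptions of this sketch; neither reference is quoted as specifying them.

```python
import math


def initial_output(weights, layer_input):
    # Apply floating point weights to one layer input (a simple dot product).
    return sum(w * x for w, x in zip(weights, layer_input))


def batch_statistics(values, eps=1e-5):
    # Mean and an approximated standard deviation over the batch.
    # The constant eps is an assumption added for numerical stability.
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance + eps)


weights = [0.5, -1.0]
batch = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]  # hypothetical training examples
outputs = [initial_output(weights, x) for x in batch]
batch_mean, batch_std = batch_statistics(outputs)
```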
In reference to Claim 13, Ioffe teaches:
“wherein determining the batch normalization statistics comprises: for each training example in the batch: receiving a layer input for the batch normalized layer” (Ioffe teaches in paragraph 4, “ wherein the batch renormalization layer is configured to, during training of the neural network on a current batch of training examples: obtain respective current moving normalization statistics for each of the plurality of components that are based on previous first layer outputs generated by the first neural network layer during training of the neural network on previous batches of training examples; receive a respective first layer output for each training example in the current batch” The citation referenced demonstrates the renormalization layer receiving the output of a previous layer. The examiner notes that the broadest reasonable interpretation of the limitation is receiving an input.)
 “determining the batch normalization statistics for the first batch from the initial outputs for the layer inputs in the batch.” (Ioffe teaches in paragraph 4, “receive a respective first layer output for each training example in the current batch; compute respective current batch normalization statistics for each of the plurality of components from the first layer outputs for the training examples in the current batch” The citation referenced above demonstrates the determining of batch normalization statistics from an output.  The examiner notes that the broadest reasonable interpretation of the limitation is determining normalization statistics from a layer’s output.)
Ioffe does not explicitly teach:
“applying the floating-point weights to the layer input for the batch normalized layer to generate an initial output for the layer input” 
However, El-Yaniv teaches:
“applying the floating-point weights to the layer input for the batch normalized layer to generate an initial output for the layer input” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates using floating point weights to determine the output for an input. When viewed in light of El-Yaniv’s batch normalization of the neurons in paragraph 67, it is clear that the layers are batch normalized. The examiner notes that the broadest reasonable interpretation of the limitation is utilizing floating point weights to determine an output for a batch normalized layer.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to Claim 3, Ioffe teaches:
“wherein the batch normalization statistics include a variance for the batch and a mean for the batch” (Ioffe teaches in paragraph 9, “wherein computing a plurality of current batch normalization statistics for the first layer outputs comprises, for each of the components: computing a mean of the component for the first layer outputs in the current batch; and computing an approximated standard deviation for the component of the first layer outputs in the current batch.” The citation referenced demonstrates batch normalization statistics including a mean and a standard deviation. The examiner notes that the broadest reasonable interpretation of the limitation is batch normalization statistics that include a variance and a mean. The examiner also notes that the term “variance” is being interpreted as standard deviation as variance does not have sufficient support in the specification.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to Claim 14, Ioffe teaches:
“wherein the batch normalization statistics include a variance for the batch and a mean for the batch” (Ioffe teaches in paragraph 9, “wherein computing a plurality of current batch normalization statistics for the first layer outputs comprises, for each of the components: computing a mean of the component for the first layer outputs in the current batch; and computing an approximated standard deviation for the component of the first layer outputs in the current batch.” The citation referenced demonstrates batch normalization statistics including a mean and a standard deviation. The examiner notes that the broadest reasonable interpretation of the limitation is batch normalization statistics that include a variance and a mean. The examiner also notes that the term “variance” is being interpreted as standard deviation as variance does not have sufficient support in the specification.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to claim 4, Ioffe teaches:
“wherein the correction factor is based on a ratio of batch variance to long term variance” (Ioffe teaches in paragraph 11, “determining a first parameter for the component from a ratio between (i) a difference between the mean for the component and the moving mean for the component and (ii) the moving approximated standard deviation for the component; and determining a second parameter for the component from a ratio between the approximated standard deviation for the component and the moving approximated standard deviation for the component.” The citation referenced demonstrates determining a factor based on approximate standard deviation and moving standard deviation. The examiner notes that the broadest reasonable interpretation of the limitation is creating a factor based on a ratio as the spec does not mention batch or long-term variance.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
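For illustration only, the two ratios recited in the quoted portion of Ioffe paragraph 11 may be sketched in Python as follows. The function name, the variable names r and d, and the numeric values are assumptions of this sketch; the arithmetic follows the quoted text, with the first parameter computed from the difference of the means over the moving standard deviation and the second from the batch standard deviation over the moving standard deviation.

```python
def renorm_parameters(batch_mean, batch_std, moving_mean, moving_std):
    # First parameter: (mean - moving mean) / moving standard deviation.
    d = (batch_mean - moving_mean) / moving_std
    # Second parameter: standard deviation / moving standard deviation.
    r = batch_std / moving_std
    return r, d


# Hypothetical statistics for a single component.
r, d = renorm_parameters(batch_mean=2.0, batch_std=3.0,
                         moving_mean=1.0, moving_std=2.0)
```

With these example values the second parameter r is 1.5 and the first parameter d is 0.5.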

In reference to claim 15, Ioffe teaches:
“wherein the correction factor is based on a ratio of batch variance to long term variance” (Ioffe teaches in paragraph 11, “determining a first parameter for the component from a ratio between (i) a difference between the mean for the component and the moving mean for the component and (ii) the moving approximated standard deviation for the component; and determining a second parameter for the component from a ratio between the approximated standard deviation for the component and the moving approximated standard deviation for the component.” The citation referenced demonstrates determining a factor based on approximate standard deviation and moving standard deviation. The examiner notes that the broadest reasonable interpretation of the limitation is creating a factor based on a ratio as the spec does not mention batch or long-term variance.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).

In reference to claim 7, Ioffe teaches:
“generating a layer output for the batch normalized neural network layer” (Ioffe teaches in paragraph 4, “for each of the first layer outputs for each of the training examples in the current batch: normalize each component of the first layer output using the current batch normalization statistics for the component to generate a normalized layer output for the training example” The citation referenced demonstrates generating an output for batch normalized layer. The examiner notes that the broadest reasonable interpretation is generating an output for a batch normalized layer.)
“refraining from applying a bias correction to the initial output based on determining that sufficient training has not occurred prior to receiving the first batch of training data.” (Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations” The citation referenced demonstrates determining that training has not been done sufficiently and sets the bias value to zero which when added does not add any bias. The examiner notes that the broadest reasonable interpretation is determining if an amount of training is less than sufficient and refraining from adding a bias.)
“wherein determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer comprises: determining that sufficient training has not occurred prior to receiving the first batch of training data” ( Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations.” The citation referenced demonstrates determining whether or not training has reached a threshold. The examiner notes that the broadest reasonable interpretation of the limitation is determining whether or not training has reached a certain threshold.)
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
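For illustration only, the threshold behavior quoted from Ioffe paragraph 19 (scale parameter set to one and bias parameter set to zero while the number of completed training iterations is below a predetermined threshold) may be sketched in Python as follows. The function names and numeric values are hypothetical, and the form of the renormalized output in renormalize is an assumption of this sketch rather than a reproduction of Ioffe's formulas.

```python
def select_parameters(r, d, completed_iterations, threshold):
    # Before the threshold number of iterations, scale is one and bias is zero,
    # so no correction is effectively applied.
    if completed_iterations < threshold:
        return 1.0, 0.0
    return r, d


def renormalize(x, batch_mean, batch_std, scale, bias):
    # Assumed form of the renormalized output: normalize with the batch
    # statistics, then apply the scale and add the bias.
    return (x - batch_mean) / batch_std * scale + bias


scale, bias = select_parameters(r=1.5, d=0.5, completed_iterations=10, threshold=100)
early_output = renormalize(4.0, batch_mean=2.0, batch_std=2.0, scale=scale, bias=bias)

scale, bias = select_parameters(r=1.5, d=0.5, completed_iterations=100, threshold=100)
later_output = renormalize(4.0, batch_mean=2.0, batch_std=2.0, scale=scale, bias=bias)
```

Under these assumptions, adding a bias of zero during the early iterations leaves the normalized output unchanged, which is the sense in which the examiner reads the citation as refraining from applying a bias.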

In reference to claim 18, Ioffe teaches:
“generating a layer output for the batch normalized neural network layer” (Ioffe teaches in paragraph 4, “for each of the first layer outputs for each of the training examples in the current batch: normalize each component of the first layer output using the current batch normalization statistics for the component to generate a normalized layer output for the training example” The citation referenced demonstrates generating an output for batch normalized layer. The examiner notes that the broadest reasonable interpretation is generating an output for a batch normalized layer.)
“refraining from applying a bias correction to the initial output based on determining that sufficient training has not occurred prior to receiving the first batch of training data.” (Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations” The citation referenced demonstrates determining that training has not been done sufficiently and sets the bias value to zero which when added does not add any bias. The examiner notes that the broadest reasonable interpretation is determining if an amount of training is less than sufficient and refraining from adding a bias.)
“wherein determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer comprises: determining that sufficient training has not occurred prior to receiving the first batch of training data” ( Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations.” The citation referenced demonstrates determining whether or not training has reached a threshold. The examiner notes that the broadest reasonable interpretation of the limitation is determining whether or not training has reached a certain threshold.)
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to claim 10, Ioffe teaches:
“further comprising updating the long-term moving averages based on the batch normalization statistics for the first batch” (Ioffe teaches in paragraph 5, “update the current moving normalization statistics for each component using the current batch normalization statistics for the component to generate updated moving normalization statistics for the component.” The citation referenced demonstrates updating current moving normalization statistics which would include average (Ioffe paragraph 79). The examiner notes that the broadest reasonable interpretation of the limitation is updating moving averages)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
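For illustration only, updating a moving normalization statistic from the current batch statistic may be sketched in Python as follows. The quoted portion of Ioffe paragraph 5 does not state the update rule; the exponential moving average used here, the function name, and the rate alpha are assumptions of this sketch.

```python
def update_moving_statistic(moving_value, batch_value, alpha=0.1):
    # Assumed exponential moving average: move the long-term value a
    # fraction alpha of the way toward the current batch value.
    return moving_value + alpha * (batch_value - moving_value)


moving_mean = 1.0
for batch_mean in [2.0, 2.0, 2.0]:  # hypothetical per-batch means
    moving_mean = update_moving_statistic(moving_mean, batch_mean, alpha=0.5)
```

With alpha set to 0.5, the moving mean in this example steps from 1.0 to 1.5, 1.75, and 1.875 as it approaches the batch mean of 2.0.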

In reference to claim 21, Ioffe teaches:
“further comprising updating the long-term moving averages based on the batch normalization statistics for the first batch” (Ioffe teaches in paragraph 5, “update the current moving normalization statistics for each component using the current batch normalization statistics for the component to generate updated moving normalization statistics for the component.” The citation referenced demonstrates updating current moving normalization statistics which would include average (Ioffe paragraph 79). The examiner notes that the broadest reasonable interpretation of the limitation is updating moving averages)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to claim 9, Ioffe teaches:
“generating a layer output for the batch normalized neural network layer. ” (Ioffe teaches in paragraph 4, “for each of the first layer outputs for each of the training examples in the current batch: normalize each component of the first layer output using the current batch normalization statistics for the component to generate a normalized layer output for the training example” The citation referenced demonstrates generating an output for batch normalized layer. The examiner notes that the broadest reasonable interpretation of the limitation is generating an output for a batch normalized layer.)

adding the bias to the initial output, wherein the bias is:
[Equation image media_image1.png not reproduced]
(Ioffe teaches in paragraphs 80 and 83 the following formulas:
[Equation images media_image2.png and media_image3.png not reproduced]
Where r is a scale factor and d is a bias parameter for the output and y hat is a renormalized output based on input and the scale factor and bias parameter. The formulas perform equivalent functions of applying bias to an equation.)
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).
In reference to claim 20, Ioffe teaches:
“generating a layer output for the batch normalized neural network layer. ” (Ioffe teaches in paragraph 4, “for each of the first layer outputs for each of the training examples in the current batch: normalize each component of the first layer output using the current batch normalization statistics for the component to generate a normalized layer output for the training example” The citation referenced demonstrates generating an output for batch normalized layer. The examiner notes that the broadest reasonable interpretation of the limitation is generating an output for a batch normalized layer.)

adding the bias to the initial output, wherein the bias is:
[Equation image media_image1.png not reproduced]
(Ioffe teaches in paragraphs 80 and 83 the following formulas:
[Equation images media_image2.png and media_image3.png not reproduced]
Where r is a scale factor and d is a bias parameter for the output and y hat is a renormalized output based on input and the scale factor and bias parameter. The formulas perform equivalent functions of applying bias to an equation.)
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe and El-Yaniv. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. One of ordinary skill would be motivated to combine Ioffe and El-Yaniv to reduce power consumption and complexity (El-Yaniv Paragraph 52).

Claims 5, 8, 11, 16, 19, and 22 are rejected under 35 USC § 103 as being unpatentable over Ioffe et al. (hereinafter Ioffe) US20190325315A1, in view of El-Yaniv et al. (hereinafter El-Yaniv) US 20170286830 A1, and further in view of Harrer et al. (hereinafter Harrer) US20180107451A1.
In reference to claim 5, Ioffe and El-Yaniv do not teach:
“applying the correction factor comprises multiplying the correction factor by a ratio of upsilon to batch standard deviation to generate a product and multiplying weights by the product, wherein epsilon is a constant value.”
However, Harrer teaches:
“applying the correction factor comprises multiplying the correction factor by a ratio of upsilon to batch standard deviation to generate a product and multiplying weights by the product, wherein epsilon is a constant value.” (Harrer teaches in paragraph 40, an equation in which two ratios are multiplied together to form a product which is then used as a scaling factor. The examiner notes that the broadest reasonable interpretation of the limitation is having the scaling factor be multiplied by a ratio.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv, and Harrer. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Harrer teaches automatic scaling of floating points for DNNs to fixed points. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Harrer to reduce power consumption and provide stable accuracy performance (Harrer paragraph 24).
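For illustration only, the arithmetic recited in the claim limitation (multiplying the correction factor by a ratio of upsilon to batch standard deviation to generate a product, then multiplying the weights by the product) may be sketched in Python as follows. The sketch follows the claim language as quoted and does not reproduce the equation in Harrer paragraph 40; the function name and all numeric values are hypothetical.

```python
def scale_weights(weights, correction_factor, upsilon, batch_std):
    # Product of the correction factor and the ratio upsilon / batch std.
    product = correction_factor * (upsilon / batch_std)
    # Each weight is multiplied by the same product.
    return [w * product for w in weights]


scaled_weights = scale_weights([1.0, -2.0], correction_factor=1.5,
                               upsilon=2.0, batch_std=4.0)
```

With these example values the product is 0.75, so the weights 1.0 and −2.0 become 0.75 and −1.5.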
In reference to claim 16, Ioffe and El-Yaniv do not teach:
“applying the correction factor comprises multiplying the correction factor by a ratio of upsilon to batch standard deviation to generate a product and multiplying weights by the product, wherein epsilon is a constant value.”
However, Harrer teaches:
“applying the correction factor comprises multiplying the correction factor by a ratio of upsilon to batch standard deviation to generate a product and multiplying weights by the product, wherein epsilon is a constant value.” (Harrer teaches in paragraph 40, an equation in which two ratios are multiplied together to form a product which is then used as a scaling factor. The examiner notes that the broadest reasonable interpretation of the limitation is having the scaling factor be multiplied by a ratio.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Harrer. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Harrer teaches automatic scaling of floating points for DNNs to fixed points. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Harrer to reduce power consumption and provide stable accuracy performance (Harrer paragraph 24).
In reference to claim 8, Ioffe teaches:
“applying a bias correction to the initial output based on determining that sufficient training has occurred prior to receiving the first batch of training data, wherein the bias correction is:” [formula image media_image4.png, not reproduced]
(Ioffe teaches in paragraphs 80 and 83 the following formulas:
[formula image media_image2.png, not reproduced]
[formula image media_image3.png, not reproduced]
where r is a scale factor, d is a bias parameter for the output, and y hat is a renormalized output based on the input, the scale factor, and the bias parameter. The formulas perform equivalent functions of scaling.)
“determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer” (Ioffe teaches in paragraph 60, “Backpropagating through the normalization statistics for the batch refers to performing gradient descent by computing the gradient of a loss function with respect to parameters of the neural network 120 including the normalization statistics.” The citation referenced demonstrates determining the gradient of a loss function which is an objective function that you want to minimize. The examiner notes that the broadest reasonable interpretation of the limitation is determining the gradient of an objective function.);
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
El-Yaniv does not explicitly teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, Harrer teaches:
“determining that sufficient training has occurred prior to receiving the first batch of training data” (Harrer teaches in paragraph 31, “In block 230, it is determined whether the training is finished. If not (block 230=No), the process 120 proceeds to block 215. If the training is finished (block 230=Yes)”. The citation referenced demonstrates determining whether or not training has been completed. The examiner notes that the broadest reasonable interpretation of the limitation is determining if training has been completed, as the term “sufficient” is not defined.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Harrer. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Harrer teaches automatic scaling of floating points for DNNs to fixed points. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Harrer to reduce power consumption and provide stable accuracy performance (Harrer paragraph 24).
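For illustration only, the interpretation applied above to the limitation “applying the quantized weights to a layer input to generate an initial output” (generating an output based on quantized weights) can be sketched as follows. The sign-based quantizer is an assumption made purely for the example and is not asserted to be the quantization function of El-Yaniv or of the application; all names are hypothetical.

```python
def binarize(weights):
    """Hypothetical quantizer: map each weight to +1.0 or -1.0 by sign.

    Zero is mapped to +1.0 by convention in this sketch.
    """
    return [1.0 if w >= 0 else -1.0 for w in weights]


def initial_output(weights, layer_input):
    """Apply the quantized weights to the layer input (a dot product)."""
    if len(weights) != len(layer_input):
        raise ValueError("weights and layer_input must have equal length")
    quantized = binarize(weights)
    return sum(q * x for q, x in zip(quantized, layer_input))


# Weights quantize to [+1, -1, +1], so the output is 2.0 - 1.0 + 4.0.
out = initial_output([0.3, -0.7, 0.1], [2.0, 1.0, 4.0])
```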
In reference to claim 19, Ioffe teaches:
“applying a bias correction to the initial output based on determining that sufficient training has occurred prior to receiving the first batch of training data, wherein the bias correction is:” [formula image media_image4.png, not reproduced]
(Ioffe teaches in paragraphs 80 and 83 the following formulas:
[formula image media_image2.png, not reproduced]
[formula image media_image3.png, not reproduced]
where r is a scale factor, d is a bias parameter for the output, and y hat is a renormalized output based on the input, the scale factor, and the bias parameter. The formulas perform equivalent functions of scaling.)
“determining a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer” (Ioffe teaches in paragraph 60, “Backpropagating through the normalization statistics for the batch refers to performing gradient descent by computing the gradient of a loss function with respect to parameters of the neural network 120 including the normalization statistics.” The citation referenced demonstrates determining the gradient of a loss function which is an objective function that you want to minimize. The examiner notes that the broadest reasonable interpretation of the limitation is determining the gradient of an objective function.);
Ioffe does not explicitly teach:
“applying the quantized weights to a layer input to generate an initial output”
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, El-Yaniv teaches:
“applying the quantized weights to a layer input to generate an initial output” (El-Yaniv teaches in paragraph 58, “The quantized values are based on the outputs of the quantized weight functions and/or floating point values of the connection weights and the outputs of the quantized activation functions.” The citation referenced demonstrates applying a quantized weight to generate an output. The examiner notes that the broadest reasonable interpretation for the limitation is generating an output based on quantized weights.)
El-Yaniv does not explicitly teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, Harrer teaches:
“determining that sufficient training has occurred prior to receiving the first batch of training data” (Harrer teaches in paragraph 31, “In block 230, it is determined whether the training is finished. If not (block 230=No), the process 120 proceeds to block 215. If the training is finished (block 230=Yes)”. The citation referenced demonstrates determining whether or not training has been completed. The examiner notes that the broadest reasonable interpretation of the limitation is determining if training has been completed, as the term “sufficient” is not defined.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Harrer. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Harrer teaches automatic scaling of floating points for DNNs to fixed points. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Harrer to reduce power consumption and provide stable accuracy performance (Harrer paragraph 24).
In reference to claim 11, Ioffe teaches,
“freezing the long-term moving averages” (Ioffe teaches this in paragraph 17, “In some implementations, the pre-computed normalization statistics for the components are final moving normalization statistics after training of the neural network”. The referenced citation demonstrates a moving average not being updated. The examiner notes that the broadest reasonable interpretation of the limitation is to stop updating the moving averages.)
Ioffe does not explicitly teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, Harrer does teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data” (Harrer teaches in paragraph 31, “In block 230, it is determined whether the training is finished. If not (block 230=No), the process 120 proceeds to block 215. If the training is finished (block 230=Yes)”. The citation referenced demonstrates determining whether or not training has been completed. The examiner notes that the broadest reasonable interpretation of the limitation is determining if training has been completed, as the term “sufficient” is not defined.)
In reference to claim 22, Ioffe teaches,
“freezing the long-term moving averages” (Ioffe teaches this in paragraph 17, “In some implementations, the pre-computed normalization statistics for the components are final moving normalization statistics after training of the neural network”. The referenced citation demonstrates a moving average not being updated. The examiner notes that the broadest reasonable interpretation of the limitation is to stop updating the moving averages.)
Ioffe does not explicitly teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data”
However, Harrer does teach:
“determining that sufficient training has occurred prior to receiving the first batch of training data” (Harrer teaches in paragraph 31, “In block 230, it is determined whether the training is finished. If not (block 230=No), the process 120 proceeds to block 215. If the training is finished (block 230=Yes)”. The citation referenced demonstrates determining whether or not training has been completed. The examiner notes that the broadest reasonable interpretation of the limitation is determining if training has been completed, as the term “sufficient” is not defined.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Harrer. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Harrer teaches automatic scaling of floating points for DNNs to fixed points. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Harrer to reduce power consumption and provide stable accuracy performance (Harrer paragraph 24).
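For illustration only, the interpretation of “freezing the long-term moving averages” applied to claims 11 and 22 above (to stop updating the moving averages) can be sketched as follows. This is a minimal sketch under assumed conventions: the exponential update rule, the class name, and the `momentum` parameter are hypothetical and are not taken from Ioffe or from the application.

```python
class MovingAverage:
    """Hypothetical exponential moving average of a batch statistic."""

    def __init__(self, momentum=0.9, value=0.0):
        self.momentum = momentum
        self.value = value
        self.frozen = False

    def update(self, batch_stat):
        # While not frozen, blend the new batch statistic into the average.
        if not self.frozen:
            self.value = (self.momentum * self.value
                          + (1.0 - self.momentum) * batch_stat)
        return self.value

    def freeze(self):
        # After freezing, later calls to update() leave the value unchanged.
        self.frozen = True


avg = MovingAverage(momentum=0.5, value=0.0)
before = avg.update(4.0)    # 0.5 * 0.0 + 0.5 * 4.0 = 2.0
avg.freeze()
after = avg.update(100.0)   # ignored because the average is frozen
```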

Claims 6 and 17 are rejected under 35 USC § 103 under Ioffe et al. (hereinafter Ioffe) US20190325315A1, in view of El-Yaniv et al. (hereinafter El-Yaniv) US 20170286830 A1, in view of Wu et al. (hereinafter Wu) US 20180322391 A1.
In reference to claim 6, Ioffe teaches:
“generating batch normalized weights from the weights for the batch normalized first neural network layer further comprises: determining that sufficient training has not occurred prior to receiving the first batch of training data” (Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations.” The citation referenced demonstrates determining whether or not training has reached a threshold. The examiner notes that the broadest reasonable interpretation of the limitation is determining whether or not training has reached a certain threshold.)
Ioffe and El-Yaniv do not teach:
“and in response: undoing the application of the correction factor.”
However, Wu teaches:
“and in response: undoing the application of the correction factor.” (Wu teaches in paragraph 144, “[T]he implementation may further include functionally that multiplies each computed weight gradient by the inverse of the scaling factor to undo the scaling that was performed previously based upon the default (or user-modified) scaling factor constant (block 662). The scaling factor may be stated as a separate hyperparameter, or it might involve modification of an existing hyperparameter. For example, if the algorithm does not use gradient clipping, then it may be possible to modify any number hyperparameters including for example learning rate, and loss weighting.” The citation referenced demonstrates undoing a scaling factor by applying its inverse. The examiner notes that the broadest reasonable interpretation is undoing a correction factor.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Wu. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Wu teaches a method of training a neural network with reduced precision using loss scaling. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Wu to increase training efficiency and reduce instability (Wu paragraphs 51-52 and 75-78).
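For illustration only, the passage of Wu quoted above (multiplying each computed weight gradient by the inverse of the scaling factor to undo the earlier scaling) can be sketched as follows. This is a minimal sketch; the function name and sample values are hypothetical, and the scaling factor is chosen as a power of two only so that the floating-point round trip is exact.

```python
def unscale_gradients(scaled_gradients, scaling_factor):
    """Undo an earlier scaling by multiplying by the inverse of the factor."""
    if scaling_factor == 0:
        raise ValueError("scaling_factor must be non-zero")
    inverse = 1.0 / scaling_factor
    return [g * inverse for g in scaled_gradients]


# Hypothetical values: gradients scaled up by 1024 and then restored.
scaling_factor = 1024.0
gradients = [0.5, -0.25, 0.125]
scaled = [g * scaling_factor for g in gradients]
restored = unscale_gradients(scaled, scaling_factor)
```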
In reference to claim 17, Ioffe teaches:
“generating batch normalized weights from the weights for the batch normalized first neural network layer further comprises: determining that sufficient training has not occurred prior to receiving the first batch of training data” (Ioffe teaches in paragraph 19, “determining the scale parameter value to be one and the bias parameter value to be zero if a number of completed training iterations is less than a predetermined threshold number of training iterations.” The citation referenced demonstrates determining whether or not training has reached a threshold. The examiner notes that the broadest reasonable interpretation of the limitation is determining whether or not training has reached a certain threshold.)
Ioffe and El-Yaniv do not teach:
“and in response: undoing the application of the correction factor.”
However, Wu teaches:
“and in response: undoing the application of the correction factor.” (Wu teaches in paragraph 144, “[T]he implementation may further include functionally that multiplies each computed weight gradient by the inverse of the scaling factor to undo the scaling that was performed previously based upon the default (or user-modified) scaling factor constant (block 662). The scaling factor may be stated as a separate hyperparameter, or it might involve modification of an existing hyperparameter. For example, if the algorithm does not use gradient clipping, then it may be possible to modify any number hyperparameters including for example learning rate, and loss weighting.” The citation referenced demonstrates undoing a scaling factor by applying its inverse. The examiner notes that the broadest reasonable interpretation is undoing a correction factor.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ioffe, El-Yaniv and Wu. Ioffe teaches a method and system which includes a renormalization layer that is configured to determine batch normalization statistics and determine affine transform parameters. El-Yaniv teaches a method of training neural networks by having neurons associated with a quantized activation function and quantized weight values. Wu teaches a method of training a neural network with reduced precision using loss scaling. One of ordinary skill would be motivated to combine Ioffe, El-Yaniv, and Wu to increase training efficiency and reduce instability (Wu paragraphs 51-52 and 75-78).
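For illustration only, the passage of Ioffe paragraph 19 relied upon for claims 6 and 17 (scale parameter set to one and bias parameter set to zero while the number of completed training iterations is below a predetermined threshold) can be sketched as follows. The function name and arguments are hypothetical; how the scale and bias are obtained once the threshold is reached is not modeled here and is simply passed in by the caller.

```python
def renorm_parameters(completed_iterations, threshold, scale, bias):
    """Return the (scale, bias) pair to use for the current iteration.

    Below the threshold the pair is fixed at (1.0, 0.0); at or above the
    threshold the caller-supplied values are returned unchanged.
    """
    if completed_iterations < threshold:
        return 1.0, 0.0
    return scale, bias


early = renorm_parameters(10, 100, scale=1.3, bias=-0.2)
late = renorm_parameters(100, 100, scale=1.3, bias=-0.2)
```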



Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS GERALD RICCARDI whose telephone number is (571)272-3931. The examiner can normally be reached Week 1: M-F 7:30-5; Week 2: M-W 7:30-5, Th 7:30-4, F off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amir Mehrmanesh, can be reached on (571)270-3351. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NICHOLAS GERALD RICCARDI/Examiner, Art Unit 4163
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126