Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on August 2, 2022, in which claims 1, 11, and 15 are currently amended. Claim 16 is canceled. Claims 1-15 and 17-20 are currently pending. 

Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-20 under 35 U.S.C. § 101, based on the amendment, have been considered and are persuasive. The rejections of claims 1-15 and 17-20 under 35 U.S.C. § 101 are hereby withdrawn, as necessitated by Applicant's amendments and remarks.
Applicant’s arguments with respect to the rejection of claims 1-20 under 35 U.S.C. 102/103, based on the amendment, have been considered but are not persuasive.
With respect to Applicant's arguments that neither Dexu Lin nor Hou discloses "instructions that cause the system to compute at least one metric representing a ratio of quantization noise present in the values represented in the quantized-precision format to the values represented in the quantized precision format", Examiner respectfully disagrees.  As cited in the Non-Final Office Action mailed 2/2/2022, Dexu Lin clearly states ([¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)").  One of ordinary skill in the art would recognize that SQNR, or signal to quantization noise ratio, is synonymous with a ratio of quantization noise present in the values represented in a regular-precision format to values represented in the quantized precision format.  In the new grounds of rejection presented, neither Hou nor Dexu Lin is relied upon alone; rather, they are relied upon in combination.
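For illustration only, the relationship between SQNR and a noise-to-signal ratio discussed above can be sketched as follows. This is a minimal sketch and is not taken from Dexu Lin or Hou: the uniform quantizer `quantize_uniform`, the sample values, and the bit widths are assumptions chosen for the example. It treats the quantization noise as the difference between the regular-precision values and their quantized counterparts, and the noise-to-signal ratio as the reciprocal of the SQNR.

```python
import math

def quantize_uniform(values, num_bits, max_abs):
    """Hypothetical symmetric uniform quantizer, used only for illustration."""
    levels = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    step = max_abs / levels                    # width of one quantization bin
    return [max(-levels, min(levels, round(v / step))) * step for v in values]

def sqnr(full_precision, quantized):
    """Signal power divided by quantization-noise power (a plain ratio, not dB)."""
    signal_power = sum(v * v for v in full_precision)
    noise_power = sum((v - q) ** 2 for v, q in zip(full_precision, quantized))
    return math.inf if noise_power == 0 else signal_power / noise_power

# Assumed sample weights in a regular-precision format.
weights = [0.5, -0.25, 0.125, 0.8, -0.6]
q8 = quantize_uniform(weights, 8, 1.0)
q4 = quantize_uniform(weights, 4, 1.0)

sqnr_8 = sqnr(weights, q8)
sqnr_4 = sqnr(weights, q4)
noise_to_signal_4 = 1.0 / sqnr_4               # reciprocal of the SQNR
```

Under these assumptions, the narrower bit width yields more quantization noise and therefore a lower SQNR (equivalently, a higher noise-to-signal ratio).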
	With regard to Applicant's arguments regarding claim 15, while Examiner concedes that Dexu Lin is silent on a training rate hyperparameter, the claim language broadens the potential hyperparameter to include a hyperparameter comprising a number of layers in the neural network, on which Examiner asserts Dexu Lin is not silent.  As mentioned in the combination statement for the claims reliant on the combination of Hou and Dexu Lin, it would be obvious, based on the disclosure of Hou, whose learning rate is adjusted based on quantization, that one could use the SQNR disclosed in Dexu Lin to analyze the loss in order to further refine the learning rate.  Examiner further asserts that the instant claim language is silent with regard to how the learning rate is 'based on' the signal to noise metric, such that it would be reasonable to interpret adjusting the learning rate based on the iterative quantization process, in which a quantization loss is determined, as synonymous with adjusting the learning rate based on a signal to noise metric.  The combination with Dexu Lin would merely rely on using the SQNR as said signal to noise metric, which is consistent with the motivation described in the Non-Final Office Action mailed 2/2/2022.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 11-14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

	Regarding claim 11, "a ratio of quantization noise present in the values represented in the quantized-precision format to the values represented in the quantized precision format" is indefinite.  It is unclear whether the values being compared are the same (in which case the ratio would always be 1, which would be non-limiting) or whether different quantized values are being compared to determine a ratio.  In the interest of further examination, the limitation is interpreted as "a ratio of quantization noise present in the values represented in a regular-precision format to values represented in the quantized precision format".

The remaining claims are rejected with respect to their dependence on the rejected claims.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 15 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Dexu Lin (US20160328647A1).

Regarding claim 15, Dexu Lin teaches A method for compensating for noise during training of a neural network, comprising: ([¶0087] "In some aspects, where the model performance is below the threshold, the noise level may be reduced and the model performance may be reevaluated")
	computing at least one noise-to-signal ratio representing noise present in the neural network; ([¶0027] " In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise. Similar to other communication systems, when quantization noise increases, the model performance decreases. Accordingly, the SQNR observed at the output may provide an indication of model performance or accuracy.")
	adjusting a hyper-parameter of the neural network based on the at least one noise-to-signal ratio ([¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3) where αi is the noise amplification or reduction factor from layer i to the output, C is a constant that constrains the α factors, and ρi is a scaling factor of the bit width at layer i...In some aspects, the constant C may be computed based on the SQNR limit")
	the hyper-parameter comprising at least one of: a learning rate, a learning rate schedule, a bias, a stochastic gradient descent batch size, a number of neurons in the neural network, or a number of layers in the neural network; ([¶0030] "ρi is a scaling factor of the bit width at layer i" Layer i is interpreted as a number of layers in the neural network, which the updated hyper-parameter comprises.)
	training the neural network using the adjusted hyper-parameter. ([¶0035] "In some aspects, additional safety factors may be added to account for non-Gaussian distribution of activations and weights and/or variations for different training and test sets"). 

	Regarding claim 19, Dexu Lin teaches The method of claim 15, wherein adjusting the hyper-parameter based on the at least one noise-to-signal ratio comprises: ([¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3) where αi is the noise amplification or reduction factor from layer i to the output, C is a constant that constrains the α factors, and ρi is a scaling factor of the bit width at layer i...In some aspects, the constant C may be computed based on the SQNR limit")
	computing a scaling factor based on the at least one noise-to-signal ratio; and (See scaling factor α.)
	scaling the hyper-parameter using the scaling factor. ([¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3)" Eqn. 3 shows that the hyper-parameter is scaled relative to the scaling factor.). 
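For illustration only, the constrained minimization quoted from Dexu Lin [¶0030] as Eqn. 3 can be worked through numerically. The sketch below is not taken from Dexu Lin: it assumes all ρi, αi, and xi are positive and applies a standard Lagrange-multiplier argument, which gives xi = C·ρi / (αi·Σρj). The function names and the sample values are hypothetical.

```python
import math

def solve_eqn3(rho, alpha, c):
    """Closed-form minimizer of -sum(rho_i * log(x_i)) s.t. sum(alpha_i * x_i) = c.

    Assumes every rho_i and alpha_i is positive. Setting the derivative of the
    Lagrangian to zero gives x_i = rho_i / (lambda * alpha_i), and the
    constraint then fixes lambda = sum(rho) / c.
    """
    total = sum(rho)
    return [c * r / (a * total) for r, a in zip(rho, alpha)]

def objective(rho, x):
    """Value of the Eqn. 3 objective for a candidate x."""
    return -sum(r * math.log(v) for r, v in zip(rho, x))

# Hypothetical per-layer factors and constant.
rho = [1.0, 2.0, 1.0]
alpha = [0.5, 1.0, 2.0]
c = 4.0

x_opt = solve_eqn3(rho, alpha, c)
constraint_value = sum(a * v for a, v in zip(alpha, x_opt))

# Another point that also satisfies the constraint, for comparison.
x_other = [4.0, 1.0, 0.5]
```

Under these assumptions, each xi scales with ρi and inversely with αi, and the constant C (which Dexu Lin states may be computed from the SQNR limit) sets the overall scale.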

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Hou (“LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS”, 2018) in view of Dexu Lin.

	Regarding claim 1, Hou teaches A method for training a neural network implemented with a quantization-enabled system, the method comprising: with the quantization-enabled system: ([p. 1 Sec. 1] "Another effective approach to compress the network and accelerate training is by quantizing each full-precision weight to a small number of bits")
	obtaining a tensor comprising values of one or more parameters of the neural network represented in a quantized-precision format, ([p. 2 Sec. 2] "Let the full-precision weights from all L layers be w = [w_1^T, w_2^T, ..., w_L^T]^T, where w_l = vec(W_l), and W_l is the weight matrix at layer l. The corresponding quantized weights will be denoted ŵ" Weights are interpreted as parameters of the neural network.  Ternarization is interpreted as a form of quantization. See also Wi.)
	the parameters comprising at least one of activation weights or edge weights; ([p. 15 §D] "apply batch-normalization and nonlinear activation to z_l^t to obtain x_l^t" Algorithm 3 clearly shows that the weights are calculated using activation functions and are therefore interpreted as synonymous with activation weights.)
	generating a scaled learning rate based on the at least one noise-to-signal metric; and ([p. 3 Sec. 3.1] "we consider the loss explicitly during quantization and obtain the quantization thresholds and scaling parameter by solving an optimization problem" [p. 4] "Obviously, this objective can be minimized layer by layer. Each proximal Newton iteration thus consists of two steps: (i) Obtain w_l^t in (7) by gradient descent along ∇_l ℓ(ŵ^(t−1)), which is preconditioned by the adaptive learning rate...so that the rescaled dimensions have similar curvature" Adaptive learning rate is interpreted as synonymous with scaled learning rate. Hou teaches that the adaptive learning rate is scaled relative to the loss or error in response to quantization.)
	performing an epoch of training of the neural network using the values of the tensor, including computing one or more gradient updates using the scaled learning rate. ([p. 16 Sec. D.2] "We use a one-layer LSTM with 512 cells. The maximum number of epochs is 200, and the number of time steps is 100. The initial learning rate is 0.002. After 10 epochs, it is decayed by a factor of 0.98 after each epoch. The weights are initialized uniformly in [0.08, 0.08]. After each iteration, the gradients are clipped to the range [−5, 5]. All the updated weights are clipped to [−1, 1] for binarization and ternarization methods")
	However, Hou does not explicitly teach generating at least one noise-to-signal metric representing a ratio of quantization noise present in the tensor to the activation weights or edge weights;  

Dexu Lin, in the same field of endeavor, teaches generating at least one noise-to-signal metric representing a ratio of quantization noise present in the tensor to the activation weights or edge weights; ([¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)"). 

	Hou and Dexu Lin are both directed towards accelerating neural networks by quantization.  Therefore, Hou and Dexu Lin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the quantized neural network in Hou with the signal-to-quantization-noise ratio taught in Dexu Lin by using a signal to quantization noise ratio as an analysis metric. Hou determines the learning rate as a function of the quantization loss or noise.  Similarly, Dexu Lin teaches ([¶0027] “In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)”).  Therefore, one of ordinary skill in the art would be able to determine that in order to optimize the learning rate as a function of noise, the SQNR would be a valuable mathematical tool.  Dexu Lin also provides as an additional motivation for combination ([¶0027] “Similar to other communication systems, when quantization noise increases, the model performance decreases. Accordingly, the SQNR observed at the output may provide an indication of model performance or accuracy.”).  This motivation for combination also applies to the remaining claims which depend on this combination.  

	Regarding claim 2, the combination of Hou and Dexu Lin teaches The method of claim 1, wherein: the tensor is a second tensor obtained by converting values of a first tensor from a normal-precision floating-point format to the quantized-precision format, and (Hou [p. 3 Sec. 3.1] "In weight ternarization, TWN simply finds the closest ternary approximation of the full precision weight at each iteration" See also Eqn. 3 where a vector (tensor) of weights from 1 to L is disclosed.)
	the one or more parameters are weights used in a forward-propagation phase of a training epoch of the neural network. (Hou [p. 4 Sec. 3.1] "LBNN uses full-precision weights in the forward pass, while all other quantization methods including ours use quantized weights (which eliminates most of the multiplications and thus faster training)."). 

	Regarding claim 3, the combination of Hou and Dexu Lin teaches The method of claim 2, wherein: the one or more parameters represent edge weights (Dexu Lin [¶0058] "A deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges" Dexu Lin explicitly teaches that weights may be edge weights.)
	and activation weights of the neural network, and generating the at least one noise-to-signal metric comprises, for each of a plurality of layers of the neural network, generating a noise-to-signal ratio for the activation weights of the layer and generating a noise-to-signal ratio for the edge weights of the layer. (Dexu Lin [¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise. Similar to other communication systems, when quantization noise increases, the model performance decreases. Accordingly, the SQNR observed at the output may provide an indication of model performance or accuracy." Dexu Lin teaches quantizing weights and/or activations; the inclusive-or nature of "weights and/or activations" suggests that the SQNR can be generated for either or both of the weights and activations.). 

	Regarding claim 4, the combination of Hou and Dexu Lin teaches The method of claim 3, wherein: generating the noise-to-signal ratio for the activation weights of each of the plurality of layers comprises computing the difference between the activation weights of the second tensor for that layer and the activation weights of the first tensor for that layer, and dividing the difference by the absolute value of the activation weights of the first tensor for that layer; and (Dexu Lin [¶0028] "In some aspects, the signal variance (or power) of each stage may be assumed to be normalized to 1 for simplicity of notation. The bit width selection may be subject to certain constraints. For example, in some aspects, the bit width selection may be subject to a threshold of SQNR at the output of the model, which may be expressed as:" See Eqn. 2.   Second tensor is quantized tensor, therefore the claim amounts to dividing quantized weights by full precision weights.  Dexu Lin teaches normalizing or quantizing the weight to 1 and dividing by the sum of the signal variance (weights or activations) at each layer.  One of ordinary skill in the art would recognize that the difference between the activation weights of the first and second (quantized) tensors is the quantization noise.)
	generating the noise-to-signal ratio for the edge weights of each of the plurality of layers comprises computing the difference between the edge weights of the second tensor for that layer and the edge weights of the first tensor for that layer, and dividing the difference by the absolute value of the edge weights of the first tensor for that layer. (Dexu Lin [¶0028] "In some aspects, the signal variance (or power) of each stage may be assumed to be normalized to 1 for simplicity of notation. The bit width selection may be subject to certain constraints. For example, in some aspects, the bit width selection may be subject to a threshold of SQNR at the output of the model, which may be expressed as:" See Eqn. 2.   Second tensor is quantized tensor, therefore the claim amounts to dividing quantized weights by full precision weights.  Dexu Lin teaches normalizing or quantizing the weight to 1 and dividing by the sum of the signal variance (weights or activations) at each layer.  One of ordinary skill in the art would recognize that the difference between the edge weights of the first and second (quantized) tensors is the quantization noise.). 
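For illustration only, the per-layer computation recited in claim 4 (the difference between the second, quantized tensor and the first, normal-precision tensor, divided by the absolute value of the first tensor) can be sketched as follows. This sketch is not taken from the application or the cited references: the function name, the layer data, and the choice to skip zero-valued entries are assumptions made for the example.

```python
def noise_to_signal(first_tensor, second_tensor):
    """Element-wise (quantized - full precision) / |full precision|.

    first_tensor:  values in the normal-precision format
    second_tensor: the same values after quantization
    Zero-valued entries of first_tensor are skipped to avoid division by
    zero (an assumption of this sketch).
    """
    return [(q - x) / abs(x)
            for x, q in zip(first_tensor, second_tensor) if x != 0.0]

# Hypothetical data: layer index -> (first tensor, second tensor) per kind.
layers = {
    1: {
        "activation": ([1.0, 2.0, 4.0], [1.0, 2.5, 4.0]),
        "edge": ([0.5, -0.5], [0.5, -0.75]),
    },
}

# One ratio list for the activation weights and one for the edge weights
# of each layer.
ratios = {
    layer: {kind: noise_to_signal(fp, q) for kind, (fp, q) in tensors.items()}
    for layer, tensors in layers.items()
}
```

In this reading, the numerator is the quantization noise and the denominator is the magnitude of the unquantized signal.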

	Regarding claim 10, the combination of Hou and Dexu Lin teaches The method of claim 1, wherein the epoch of training of the neural network is a second epoch performed after a first epoch of training of the neural network, the method further comprising: (Hou [p. 14 Sec. D.1] "Batch normalization with a minibatch size 100, is used to accelerate learning. The maximum number of epochs is 50. The learning rate starts at 0.01, and decays by a factor of 0.1 at epochs 15 and 25" Hou explicitly teaches, in one trial, using 50 epochs; therefore a second epoch is necessarily taught.  See also t in Algorithm 3.)
	prior to generating the scaled learning rate, performing the first epoch of training using the values of the tensor, including computing one or more gradient updates using a predetermined learning rate of the neural network, (Hou See Algorithm 3 lines 15-24 on p. 15. The scaled learning rate is calculated at the end of the epoch after the gradient updates using the predetermined rate.)
	wherein generating the scaled learning rate based on the at least one noise-to-signal metric comprises scaling the predetermined learning rate based on the at least one noise-to-signal metric. (Hou [p. 4 Sec. 3.1] See Eqn. 7. Hou teaches what the binarized weight is; rearrangement of Eqn. 7 allows one to solve for the scaled learning rate based on the noise-to-signal metric.  Even without rearrangement, one of ordinary skill in the art could see the dependency.). 

	Claims 5-8 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Hou and Dexu Lin, and in further view of Darryl Lin (“Fixed Point Quantization of Deep Convolutional Networks”, 2016).

	Regarding claim 5, the combination of Hou and Dexu Lin teaches The method of claim 3, further comprising generating a scaling factor based on the at least one noise-to-signal metric, wherein: the neural network comprises a total of L layers; and (Dexu Lin [¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3) where αi is the noise amplification or reduction factor from layer i to the output, C is a constant that constrains the α factors, and ρi is a scaling factor of the bit width at layer i...In some aspects, the constant C may be computed based on the SQNR limit").
	the scaling factor for a layer l of the neural network is generated based on an average value of the noise-to-signal ratio for the activation weights of the layer l as well as a sum of average values of the noise-to-signal ratios (Dexu Lin [¶0028] "Equations 1 and 2 may be considered an SQNR budget for a machine learning model." [¶0030] See Eqn. 1 and 2 "the constant C may be computed based on the SQNR limit." Equation 3 can be easily rearranged to solve for the scaling factor such that the scaling factor would necessarily be a factor of C which Dexu Lin teaches may be computed based on the SQNR.  Equation 3 shows both a summation of the signal variance (weight or activation values) which would then be divided by the average SQNR or SQNR limit). 
	However, the combination of Hou and Dexu Lin does not explicitly teach for the edge weights of layers l+1 through L of the neural network.  

Darryl Lin, in the same field of endeavor, teaches for the edge weights of layers l+1 through L of the neural network. ([p. 4-5 Sec. 4.1.3] Eqn. 7 and 8.  "Ignoring the bias term for the time being, since a_i^(l+1) is simply a sum of terms like w_{i,j}^(l+1) a_j^(l), which when quantized all have the same SQNR γ_{w^(l+1)·a^(l)}. Assuming the product terms w_{i,j}^(l+1) a_j^(l) are independent, it follows that the value of a_i^(l+1), before further quantization, has inverse SQNR that equals [Eqn. 8]"   Darryl Lin explicitly teaches calculating SQNR from layers l+1.).  Darryl Lin also provides additional motivation for combination ([Abstract] “Our experiments show that in comparison to equal bitwidth settings, the fixed point DCNs with optimized bit width allocation offer > 20% reduction in the model size without any loss in accuracy on CIFAR-10 benchmark. We also demonstrate that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model. In doing so, we report a new state-of-the-art fixed point performance of 6.78% error-rate on CIFAR-10 benchmark”).  This motivation for combination also applies to the remaining claims which depend on this combination.  

	Hou, Dexu Lin, and Darryl Lin are all directed towards accelerating neural networks through quantization.  Therefore, Hou, Dexu Lin, and Darryl Lin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the quantized neural network methods of Hou and Dexu Lin with that taught by Darryl Lin. Darryl Lin teaches the rationale of iterating from an index of the layer number plus one in a zero indexed array of layer indices ([p. 4,5 Sec. 4.1.3] “In a DCN with multiple layers, computation of the ith activation in layer l+1 of the DCN can be expressed as follows [Eqn. 7]”).  Darryl Lin shows that specific activations for all layers after the current layer can be selectively processed in this manner.

	Regarding claim 6, the combination of Hou, Dexu Lin, and Darryl Lin teaches The method of claim 5, wherein: training the neural network comprises training the neural network via stochastic gradient descent; and (Hou See line 16 of Algorithm 3 on p. 15.)
	the scaled learning rate for the layer l of the neural network is computed by the equation: ε_q = ε / (1 + E[ξ^(l) / X^(l)] + Σ_{k=l+1}^{L} E[γ^(k) / w^(k)]) (Hou [p. 2 Sec. 2.1] "By minimizing the difference between w_l and a_l b_l, the optimal a_l, b_l have the simple form: [See Eqn.]" See also Eqns. on p. 3. Hou teaches a scaling factor a based on binarization.  Hou explicitly teaches taking a summation of weights whose magnitude is greater than the gradient, which Hou teaches may be substituted by E(|W_l^t|), which is highly relevant to the disclosure of the instant application.  In equation 6, Hou teaches the substitution of the weight in the learning rate equation with the calculation of the noise; equations 6 and 7 of Hou are interpreted as synonymous with the method of calculating a signal-to-noise ratio as is well known in the art.  Therefore, the equation in the instant claim is seen as simply a mathematical manipulation of the scaled learning rate equation taught in Hou.)
	wherein ε_q represents the scaled learning rate, ε represents a predetermined learning rate of the neural network, (Hou teaches scaling factor a [p. 2 Sec. 2.1], and adaptive learning rate d [p. 4 Sec. 3.1] and the relationship between the two in Proposition 3.2.)
	E[ξ^(l) / X^(l)] represents the average value of the noise-to-signal ratio for the activation weights of the layer l over a stochastic gradient descent batch size, in the form of a vector, and (Dexu Lin [¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise. Similar to other communication systems, when quantization noise increases, the model performance decreases. Accordingly, the SQNR observed at the output may provide an indication of model performance or accuracy." Dexu Lin explicitly teaches the noise-to-signal ratio for the weights.  One of ordinary skill in the art would understand that the expected value represents an average value of the ratio.)
	E[γ^(k) / w^(k)] represents the average value of the noise-to-signal ratio for the edge weights of a layer k of the neural network, per sample, in the form of a matrix. (Dexu Lin [¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise. Similar to other communication systems, when quantization noise increases, the model performance decreases. Accordingly, the SQNR observed at the output may provide an indication of model performance or accuracy." Dexu Lin explicitly teaches the noise-to-signal ratio for the weights.  One of ordinary skill in the art would understand that the expected value represents an average value of the ratio.). 
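For illustration only, the scaled learning rate recited in claim 6 can be sketched numerically. The sketch rests on two assumptions that are not confirmed by the record: first, that the recited equation is read as the predetermined learning rate divided by one plus the averaged noise-to-signal terms; second, that each expectation is collapsed to a single scalar average, whereas the claim recites a vector and a matrix. The function name and the sample ratios are hypothetical.

```python
from statistics import mean

def scaled_learning_rate(base_lr, activation_nsr_l, edge_nsr_later_layers):
    """Scale base_lr by 1 / (1 + E[activation NSR of layer l]
                               + sum over k = l+1..L of E[edge NSR of layer k]).

    activation_nsr_l:      noise-to-signal ratios for layer l's activations
    edge_nsr_later_layers: one list of edge-weight ratios per layer l+1..L
    Each expectation is reduced to a scalar mean (an assumption of this sketch).
    """
    denominator = (1.0
                   + mean(activation_nsr_l)
                   + sum(mean(layer) for layer in edge_nsr_later_layers))
    return base_lr / denominator

# Hypothetical ratios for a layer l followed by two later layers.
eps_q = scaled_learning_rate(
    base_lr=0.01,
    activation_nsr_l=[0.25, 0.25],
    edge_nsr_later_layers=[[0.5, 0.5], [0.25, 0.25]],
)
```

Under this reading, more quantization noise in layer l or in the layers after it produces a smaller learning rate, and zero noise leaves the predetermined rate unchanged.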

	Regarding claim 7, the combination of Hou, Dexu Lin, and Darryl Lin teaches The method of claim 6, wherein computing the one or more gradient updates using the scaled learning rate comprises computing gradient updates for one or more parameters of the layer l using the scaled learning rate. (Hou See Algorithm 3 lines 15-24 on p. 15. Hou shows that the gradient of the weight is calculated in line 16, learning rate is then updated at the end of the epoch, therefore any subsequent epochs will use the parameters of the layer using the scaled learning rate.). 

	Regarding claim 8, the combination of Hou, Dexu Lin, and Darryl Lin teaches The method of claim 7, wherein computing the one or more gradient updates using the scaled learning rate further comprises computing gradient updates for one or more parameters of one or more other layers of the neural network using the same scaled learning rate generated for the layer l. (Hou [p. 14 Sec. D.1] "Batch normalization with a minibatch size 100, is used to accelerate learning. The maximum number of epochs is 50. The learning rate starts at 0.01, and decays by a factor of 0.1 at epochs 15 and 25." See also Algorithm 3. Each layer at each epoch is taught as using the same learning rate eta.). 

	Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Hou and Dexu Lin, and in further view of Jacob (Intel Lab Distiller Github Repository, 2018).

	Regarding claim 9, the combination of Hou and Dexu Lin teaches The method of claim 2, further comprising generating a scaling factor based on the at least one noise-to-signal metric, (Dexu Lin [¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3) where αi is the noise amplification or reduction factor from layer i to the output, C is a constant that constrains the α factors, and ρi is a scaling factor of the bit width at layer i...In some aspects, the constant C may be computed based on the SQNR limit")
	wherein: the normal-precision floating-point format represents the values with a first bit width; (Dexu Lin [¶0025] "in a fixed point representation, a fixed position of the decimal point is chosen such that there are a fixed number of bits to the right and/or the left of the decimal point and used to represent the elements").
	the quantized-precision format represents the values with a second bit width, the second bit width being lower than the first bit width; and (Dexu Lin [¶0026] "aspects of the present disclosure are directed to changing the bit widths based on performance specifications and system resources." One of ordinary skill in the art would expect that a quantized-precision bit width would be lower than a normal-precision bit width.)
	computing gradient updates for one or more other parameters of the neural network represented with the second bit width (Dexu Lin [¶0064] "To adjust the weights, a learning algorithm may compute a gradient vector for the weights. The gradient may indicate an amount that an error would increase or decrease if the weight were adjusted slightly.")
	computing the gradient updates for the one or more other parameters using the scaling factor for the second bit width. (Dexu Lin [¶0030] "In some aspects, the bit width selection may be simplified as: min−Σ ρi log(x i), s.t. Σ α i x i =C, (3) where αi is the noise amplification or reduction factor from layer i to the output, C is a constant that constrains the α factors, and ρi is a scaling factor of the bit width at layer i...In some aspects, the constant C may be computed based on the SQNR limit" [¶0064] "To adjust the weights, a learning algorithm may compute a gradient vector for the weights." Dexu Lin teaches that the bit widths for the weights are scaled with the scaling factor and that similarly the gradient update is a function of the weights.). 
	However, the combination of Hou and Dexu Lin does not explicitly teach the method further comprises: storing the scaling factor in an entry for the second bit width in a lookup table; 
	by accessing the entry for the second bit width in the lookup table to obtain the scaling factor for the second bit width; and  

Jacob, in the same field of endeavor, teaches the method further comprises: storing the scaling factor in an entry for the second bit width in a lookup table; ([l. 420-494] "def linear_quantize_param(param_fp, param_meta):
            perch = per_channel_wts and param_fp.dim() in [2, 4]
            with torch.no_grad():
                scale, zero_point = _get_tensor_quantization_params(param_fp, param_meta.num_bits, mode,  per_channel=perch)" Function shows setting and accessing the map (lookup table) of second bit width to scaling factor for network quantization.).
	by accessing the entry for the second bit width in the lookup table to obtain the scaling factor for the second bit width; and ([l. 49-69] "if mode == LinearQuantMode.SYMMETRIC:
        sat_fn = get_tensor_avg_max_abs if clip else get_tensor_max_abs
        sat_val = sat_fn(tensor, dim)
        scale, zp = symmetric_linear_quantization_params(num_bits, sat_val)" Function shows accessing the scale given second bit width via lookup table.). 

	Hou, Dexu Lin, and Jacob are all directed towards accelerating neural networks through quantization.  Therefore, Hou, Dexu Lin, and Jacob are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network quantization in Hou and Dexu Lin with the mapping of the scaling factor and number of bits as shown in Jacob. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Jacob that a mapping is a logical way to implement the method in an instruction set that could be performed by a processor.  The combination would be further obvious since Jacob implements the neural network quantization methods described in Hou and Dexu Lin in code which can be compiled into instructions for a variety of processors.  
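
For illustration only, the mapping of bit width to scaling factor discussed above can be sketched as a dictionary keyed by bit width. The sketch below is not Jacob's code: `symmetric_scale`, `build_scale_table`, and the sample weights are hypothetical names and values, and the scale formula (largest signed integer divided by the saturation value) is an assumed symmetric quantization rule.

```python
from typing import Dict, Sequence


def symmetric_scale(num_bits: int, sat_val: float) -> float:
    """Assumed rule: map [-sat_val, sat_val] onto the signed range for num_bits."""
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for a signed range")
    if sat_val <= 0:
        raise ValueError("sat_val must be positive")
    q_max = 2 ** (num_bits - 1) - 1  # largest positive signed integer
    return q_max / sat_val


def build_scale_table(weights: Sequence[float],
                      bit_widths: Sequence[int]) -> Dict[int, float]:
    """Store one scaling factor per bit width (the lookup table entries)."""
    sat_val = max(abs(w) for w in weights)  # saturation value: max |w|
    return {bits: symmetric_scale(bits, sat_val) for bits in bit_widths}


# Hypothetical usage: store the entries, then access the entry for 8 bits.
weights = [0.5, -1.0, 0.25]
scale_table = build_scale_table(weights, [4, 8])
scale_for_8_bits = scale_table[8]
```

Under these assumptions, storing the factor is a dictionary write and obtaining it is a keyed read, which is the sense in which a mapping serves as a lookup table.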

	Claims 11-13, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dexu Lin in view of Hou. 

	Regarding claim 11, Dexu Lin teaches A system for training a neural network implemented with a quantization-enabled system, the system comprising: ([Abstract] "A method for selecting bit widths for a fixed point machine learning model includes evaluating a sensitivity of model accuracy to bit widths at each computational stage of the model. The method also includes selecting a bit width for parameters, and/or intermediate calculations in the computational stages of the mode" Selecting a bit width interpreted as synonymous with quantizing.)
	memory; one or more processors coupled to the memory and adapted to perform quantized-precision operations; ([¶0011] "The apparatus includes a memory and at least one processor coupled to the memory")
	one or more computer-readable storage media storing computer-readable instructions that, when executed by the one or more processors, cause the system to perform a method of training a neural network, the instructions comprising: ([¶007] "Deep neural networks may be trained to recognize a hierarchy of features and so they have increasingly been used in object recognition applications" [¶0053] "Instructions executed at the general-purpose processor 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from a dedicated memory block 118.")
	instructions that cause the system to represent values of one or more parameters of the neural network in a quantized-precision format; ([¶0037] "For instance, a DCN with 3 convolutional layers, for the purpose of the SQNR calculation, may have 6 quantization “layers”, or steps as follows: Quantize weights and biases of convolution layer (conv) 1, Quantize activations of conv1, Quantize weights and biases of conv2, Quantize activations of conv2, Quantize weights and biases of conv3, Quantize activations of conv3.")
	instructions that cause the system to compute at least one metric representing a ratio of quantization noise present in the values represented in the quantized-precision format to the values represented in the quantized precision format; and ([¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)")
	However, Dexu Lin does not explicitly teach instructions that cause the system to adjust a learning rate of the neural network based on the at least one metric.  

Hou, in the same field of endeavor, teaches instructions that cause the system to adjust a learning rate of the neural network based on the at least one metric. ([p. 3 Sec. 3.1] "we consider the loss explicitly during quantization and obtain the quantization thresholds and scaling parameter by solving an optimization problem" [p. 4] "Obviously, this objective can be minimized layer by layer. Each proximal Newton iteration thus consists of two steps: (i) Obtain w_l^t in (7) by gradient descent along ∇_l ℓ(ŵ^{t−1}), which is preconditioned by the adaptive learning rate...so that the rescaled dimensions have similar curvature" Adaptive learning rate is interpreted as synonymous with scaled learning rate. Hou teaches that adaptive learning rate is scaled relative to the loss or error in response to quantization.). 

	Dexu Lin and Hou are both directed towards accelerating neural networks through quantization.  Therefore, Dexu Lin and Hou are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Dexu Lin with the teachings of Hou by adjusting a learning rate. Hou determines the learning rate as a function of the quantization loss or noise.  Similarly, Dexu Lin teaches ([¶0027] “In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)”).  Therefore, one of ordinary skill in the art would be able to determine that in order to optimize the learning rate as a function of noise, the SQNR would be a valuable mathematical tool.  Hou provides as an additional motivation for combination ([p. 2 §1] “In this paper, we propose an efficient and disciplined ternarization scheme for network compression.  Experiments on both feedforward and recurrent neural networks show that the proposed quantization scheme outperforms state-of-the-art algorithms.”).  This motivation for combination also applies to the remaining claims which depend on this combination.  
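
For illustration only, one way a learning rate could be made a function of an SQNR value is sketched below. Neither Dexu Lin nor Hou states this formula: the decibel form of SQNR, the proportional shrink rule, the `target_db` threshold, and all function names are assumptions chosen for the example.

```python
import math
from typing import Sequence


def sqnr_db(signal: Sequence[float], quantized: Sequence[float]) -> float:
    """Signal-to-quantization-noise ratio in decibels (assumed definition)."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    if noise_power == 0:
        return math.inf  # no quantization noise at all
    return 10.0 * math.log10(signal_power / noise_power)


def adjust_learning_rate(base_lr: float, sqnr: float,
                         target_db: float = 30.0) -> float:
    """Hypothetical rule: shrink the rate in proportion to the SQNR shortfall."""
    if sqnr >= target_db:
        return base_lr  # noise is low enough; keep the base rate
    return base_lr * max(sqnr, 0.0) / target_db


# Hypothetical usage with made-up full-precision and quantized values.
full_precision = [1.0, -1.0]
quantized = [0.9, -0.9]
measured = sqnr_db(full_precision, quantized)   # about 20 dB
new_lr = adjust_learning_rate(0.1, measured)    # about 0.0667
```

The sketch shows only that a scalar noise metric can drive a scalar learning rate; any monotonic rule could be substituted for the proportional one assumed here.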

	Regarding claim 12, the combination of Dexu Lin and Hou teaches The system of claim 11, wherein: the one or more parameters of the neural network comprise a plurality of weights of a layer of the neural network; and (Dexu Lin [¶0026] "For example, different bit widths may be selected for bias values, activation values, and/or weights of each layer of the neural network")
	the at least one metric comprises a noise-to-signal ratio computed by computing a difference between values of the weights represented in the quantized-precision format and values of the weights represented in a normal-precision floating-point format, and dividing the difference by an absolute value of the values of the weights represented in the normal-precision floating-point format. (Dexu Lin [¶0028] " In some aspects, the signal variance (or power) of each stage may be assumed to be normalized to 1 for simplicity of notation. The bit width selection may be subject to certain constraints. For example, in some aspects, the bit width selection may be subject to a threshold of SQNR at the output of the model, which may be expressed as:" See Eqn. 2.   Second tensor is quantized tensor, therefore the claim amounts to dividing quantized weights by full precision weights.  Dexu Lin teaches normalizing or quantizing the weight to 1 and dividing by the sum of the signal variance (weights or activations) at each layer.  One of ordinary skill in the art would recognize that the difference between the activation weight of the first and second (quantized) tensor is the quantization noise.). 
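
For illustration only, the computation recited in this limitation (difference between quantized and normal-precision weights, divided by the absolute value of the normal-precision weights) can be sketched as follows. The uniform symmetric quantizer, the averaging over elements, the skipping of zero weights, and the function names are assumptions for the example and are not taken from Dexu Lin or the claims.

```python
from typing import List, Sequence


def quantize(weights: Sequence[float], num_bits: int) -> List[float]:
    """Assumed uniform symmetric quantizer, returning de-quantized values."""
    sat_val = max(abs(w) for w in weights)
    q_max = 2 ** (num_bits - 1) - 1
    scale = q_max / sat_val
    return [round(w * scale) / scale for w in weights]


def noise_to_signal(weights: Sequence[float],
                    quantized: Sequence[float]) -> float:
    """Mean of |quantized - weight| / |weight| over the nonzero weights."""
    ratios = [abs(q - w) / abs(w)
              for w, q in zip(weights, quantized) if w != 0]
    if not ratios:
        raise ValueError("at least one nonzero weight is required")
    return sum(ratios) / len(ratios)


# Hypothetical usage: a coarser format yields a larger noise-to-signal ratio.
layer_weights = [1.0, 0.3]
nsr_2_bit = noise_to_signal(layer_weights, quantize(layer_weights, 2))
nsr_8_bit = noise_to_signal(layer_weights, quantize(layer_weights, 8))
```

With these sample weights the 2-bit format rounds 0.3 to zero, so its ratio is much larger than the 8-bit ratio.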

	Regarding claim 13, the combination of Dexu Lin and Hou teaches The system of claim 11, wherein: the one or more parameters comprise activation weights and edge weights of a first layer of the neural network; (Dexu Lin [¶0026] "For example, different bit widths may be selected for bias values, activation values, and/or weights of each layer of the neural network")
	computing the at least one metric comprises computing a first noise-to-signal ratio for the activation weights of the first layer and a second noise-to-signal ratio for the edge weights of the first layer; and (Dexu Lin [¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise")
	the system further comprises instructions that cause the system to train the neural network with at least some values of the parameters represented in the quantized-precision format, including instructions that cause the system to compute gradient updates for the first layer and at least one other layer of the neural network using the adjusted learning rate. (Dexu Lin [¶0027] "In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR). That is, in a machine learning model such as a deep convolutional network, the effect of quantizing weights and/or activations is the introduction of quantization noise."). 

	Regarding claim 17, Dexu Lin teaches introducing noise to the neural network; ([¶0083] "In some aspects, the process may also inject noise into one or more computational stages of the model." [¶0086] "In block 504, the process determines a model performance. In some aspects, the model performance may comprise a classification accuracy, classification speed, SQNR, other model performance metric or a combination thereof. The model performance may be evaluated by comparing the performance to a threshold").
	computing a difference between one or more values of the second tensor and one or more corresponding values of the first tensor; and dividing the difference by the absolute value of the one or more corresponding values of the first tensor. ([¶0028] " In some aspects, the signal variance (or power) of each stage may be assumed to be normalized to 1 for simplicity of notation. The bit width selection may be subject to certain constraints. For example, in some aspects, the bit width selection may be subject to a threshold of SQNR at the output of the model, which may be expressed as:" See Eqn. 2.   Second tensor is quantized tensor, therefore the claim amounts to dividing quantized weights by full precision weights.  Dexu Lin teaches normalizing or quantizing the weight to 1 and dividing by the sum of the signal variance (weights or activations) at each layer.  One of ordinary skill in the art would recognize that the difference between the activation weight of the first and second (quantized) tensor is the quantization noise.).
	However, Dexu Lin does not explicitly teach The method of claim 15, wherein computing the at least one noise-to-signal ratio comprises: obtaining a first tensor comprising values of one or more parameters of the neural network before introducing noise to the neural network; 
	obtaining a second tensor comprising values of the one or more parameters after the introduction of noise to the neural network;  

Hou, in the same field of endeavor, teaches The method of claim 15, wherein computing the at least one noise-to-signal ratio comprises: obtaining a first tensor comprising values of one or more parameters of the neural network before introducing noise to the neural network; ([p. 2 Sec. 2] "Let the full-precision weights from all L layers be w = [w_1^T, w_2^T, ..., w_L^T]^T, where w_l = vec(W_l), and W_l is the weight matrix at layer l. The corresponding quantized weights will be denoted ŵ" Weights interpreted as parameters of the neural network.  Ternarization interpreted as a form of quantization. See also Wi.)
	obtaining a second tensor comprising values of the one or more parameters after the introduction of noise to the neural network; ([p. 2 Sec. 2] "Let the full-precision weights from all L layers be w = [w_1^T, w_2^T, ..., w_L^T]^T, where w_l = vec(W_l), and W_l is the weight matrix at layer l. The corresponding quantized weights will be denoted ŵ" ŵ interpreted as second tensor after the introduction of noise to the neural network.). 

	Dexu Lin and Hou are both directed towards accelerating neural networks through quantization.  Therefore, Dexu Lin and Hou are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Dexu Lin with the teachings of Hou by adjusting a learning rate. Hou determines the learning rate as a function of the quantization loss or noise.  Similarly, Dexu Lin teaches ([¶0027] “In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)”).  Therefore, one of ordinary skill in the art would be able to determine that in order to optimize the learning rate as a function of noise, the SQNR would be a valuable mathematical tool.  Hou provides as an additional motivation for combination ([p. 2 §1] “In this paper, we propose an efficient and disciplined ternarization scheme for network compression.  Experiments on both feedforward and recurrent neural networks show that the proposed quantization scheme outperforms state-of-the-art algorithms.”).  This motivation for combination also applies to the remaining claims which depend on this combination.  

	Regarding claim 20, Dexu Lin teaches The method of claim 15.
	However, Dexu Lin does not explicitly teach the hyper-parameter is adjusted to compensate for the effect of the noise present in the neural network on the accuracy of gradient updates computed during the training of the neural network.  

Hou, in the same field of endeavor, teaches The method of claim 15, wherein the hyper-parameter is adjusted to compensate for the effect of the noise present in the neural network on the accuracy of gradient updates computed during the training of the neural network. ([p. 3 Sec. 3.1] "we consider the loss explicitly during quantization and obtain the quantization thresholds and scaling parameter by solving an optimization problem" [p. 4] "Obviously, this objective can be minimized layer by layer. Each proximal Newton iteration thus consists of two steps: (i) Obtain w_l^t in (7) by gradient descent along ∇_l ℓ(ŵ^{t−1}), which is preconditioned by the adaptive learning rate...so that the rescaled dimensions have similar curvature" With respect to the instant specification, scaling a learning rate is interpreted as capable of improving the gradient update accuracy.  Hou teaches adjusting the learning rate with respect to the quantization loss (noise).). 

	Dexu Lin and Hou are both directed towards accelerating neural networks through quantization.  Therefore, Dexu Lin and Hou are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Dexu Lin with the teachings of Hou by adjusting a learning rate. Hou determines the learning rate as a function of the quantization loss or noise.  Similarly, Dexu Lin teaches ([¶0027] “In some aspects, model performance may be evaluated using a signal to quantization noise ratio (SQNR)”).  Therefore, one of ordinary skill in the art would be able to determine that in order to optimize the learning rate as a function of noise, the SQNR would be a valuable mathematical tool.  Hou provides as an additional motivation for combination ([p. 2 §1] “In this paper, we propose an efficient and disciplined ternarization scheme for network compression.  Experiments on both feedforward and recurrent neural networks show that the proposed quantization scheme outperforms state-of-the-art algorithms.”).  This motivation for combination also applies to the remaining claims which depend on this combination.  

	Claims 14 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Dexu Lin and Hou, and in further view of Yim (US 2019/0130255 A1).

	Regarding claim 14, the combination of Dexu Lin and Hou teaches The system of claim 11, wherein the one or more processors comprise a neural network accelerator (Dexu Lin [¶0091] "The various illustrative logical blocks, modules and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array signal (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components or any combination thereof designed to perform the functions described herein" With respect to the instant specification an FPGA is given as an example of a neural network accelerator.).
	However, the combination of Dexu Lin and Hou does not explicitly teach a neural network accelerator having a tensor processing unit.  

Yim, in the same field of endeavor, teaches a neural network accelerator having a tensor processing unit. ([¶0051] "The hardware accelerator may be, for example, but is not limited to, a neural processing unit (NPU), a tensor processing unit (TPU), a neural engine, or the like, which is a dedicated module for driving a neural network."). 

Dexu Lin, Hou, and Yim are all directed towards accelerating neural networks through quantization.  Therefore, Dexu Lin, Hou, and Yim are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network hardware in Dexu Lin with that in Yim by having a dedicated tensor processing unit. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that neural networks often take immense processing power and significant amounts of time; therefore, any method of accelerating the training process would be beneficial.  Yim reiterates this ([¶0004] “In order for the neural network device to analyze high-quality input in real time and extract information, technology capable of efficiently processing neural network operations may be used.”).  Yim further teaches that a tensor processing unit is a dedicated module for driving the neural network ([¶0051]).

	Regarding claim 18, the combination of Dexu Lin and Hou teaches The method of claim 17.
	However, the combination of Dexu Lin and Hou does not explicitly teach introducing noise to the neural network comprises one or more of the following: changing a data type of values of one or more parameters of the neural network, decreasing a stochastic gradient descent batch size for one or more layers of the neural network, reducing a voltage supplied to hardware implementing the neural network, implementing analog-based training of the neural network, or storing values of one or more parameters of the neural network in DRAM.  

Yim, in the same field of endeavor, teaches introducing noise to the neural network comprises one or more of the following: changing a data type of values of one or more parameters of the neural network, decreasing a stochastic gradient descent batch size for one or more layers of the neural network, reducing a voltage supplied to hardware implementing the neural network, implementing analog-based training of the neural network, or storing values of one or more parameters of the neural network in DRAM. ([¶0034] "The memory 120 is hardware for storing various data processed in the neural network quantization device" [¶0035] "The memory 120 may be dynamic random access memory (DRAM), but is not limited thereto"). 

Dexu Lin, Hou, and Yim are all directed towards accelerating neural networks through quantization.  Therefore, Dexu Lin, Hou, and Yim are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network hardware in Dexu Lin with that in Yim by having a dedicated tensor processing unit. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that neural networks often take immense processing power and significant amounts of time; therefore, any method of accelerating the training process would be beneficial.  Yim reiterates this ([¶0004] “In order for the neural network device to analyze high-quality input in real time and extract information, technology capable of efficiently processing neural network operations may be used.”).  Yim further teaches that a tensor processing unit is a dedicated module for driving the neural network ([¶0051]).


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126