DETAILED ACTION
This action is in response to the claims filed June 16, 2021. Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4, 9, 10, 12, 13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mohassel et al., US Document ID US 20200242466 A1, hereinafter Mohassel, further in view of Choi et al., “Compression of Deep …”, hereinafter Choi.

Regarding Claim 1
Mohassel teaches, In a computer system, a method of training a machine learning model using a machine learning tool, the method comprising: receiving training data; (Abstract “privacy-preserving machine learning training [training a machine learning model] (e.g., for linear regression, logistic regression and neural network using the stochastic gradient descent method). A protocols can use the two-server model, where data owners distribute their private data [receiving training data] among two non-colluding servers [computer system], which train various models”) initializing parameters of a machine learning model; (¶0079 “The SGD algorithm works as follows: w is initialized as a vector [parameters] of random values” SGD is stochastic gradient descent, in the context of machine learning, it is an algorithm for training a machine learning model.) training at least some of the parameters in multiple iterations based on the training data, including, in a given iteration of multiple iterations: during the training (¶0078 "SGD is the most commonly used approach to train" ¶0079 "In each [multiple] iteration [ of the SGD training ], a sample [of the training data] (x.sub.i, y.sub.i) is selected randomly and a coefficient w.sub.j is updated") applying the machine learning model to at least some of the training data; (¶0014 "Using embodiments, the weights of the model can be efficiently determined in the training, e.g., by performing iterative updates of the weights based on error differences in a current predicted output and the known outputs of the data samples [training data] ” Data samples that consist of known output correspond to training data)  and outputting the parameters. (¶0037 “the values of the parameters can be determined in a training process” Examiner notes that the determined corresponds to outputting)
Mohassel does not explicitly teach, and a dithered quantizer function.
Choi, however, when addressing issues related to compressing neural networks teaches, and a dithered quantizer function; (Section 4 ¶01 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval”, pg 7 ¶03 “After the codebook [parameters] is updated [trained], individual weights [parameters] are updated [trained] by following their shared quantized value” Examiner notes that the dithered quantizer function corresponds to “quantize the dithered weights”)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for dithering weights for neural network compression as taught by Choi to the disclosed invention of Mohassel. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate “randomized uniform quantization, where uniform random dithering makes the distortion independent of the source… guaranteed to achieve near-optimal performance” (Choi, Section 2.3).

Regarding Claim 2
	Mohassel/Choi teach the method of claim 1
	Further, Mohassel teaches, wherein the machine learning model is a neural network having multiple layers, each of the multiple layers including one or more nodes, and wherein the parameters include one or more of: weights for connections between the nodes;  a count of the multiple layers; for each of the multiple layers, a count of nodes in the layer; (¶0092 “FIG. 3B shows an example of a neural network with m−1 hidden layers" Further claim 21 states “initializing values for M sets of d weights W for a machine learning model, each set of d weights W corresponding to one of M nodes of a layer") biases for at least some of the nodes; (¶0074 “Usually, a bias b is introduced…this can be easily achieved by appending a dummy feature equal to 1 for each x.sub.i” The dummy feature means that the node that is presented with X.sub.i is a bias node) (¶0244 “as depicted in step 10 of FIG. 5. A normalization factor |B| [control parameter] can be used when more than one training sample is used per iteration, e.g., a described herein for a mini-batch mode")

Regarding Claim 3
	Mohassel/Choi teach the method of claim 1
Further Choi teaches, wherein the updating the at least some of the parameters includes, (Section 4 “We compress the jointly sparse CNN model from Section 3.3 by universal compression in the spatial domain for universal deployment” Compression of the sparse CNN model corresponds to updating at least some of the parameters) for the given iteration t: calculating final parameters wt+1 for the given iteration t based upon starting parameters wt for the given iteration t and parameter updates Δwt in the given iteration t, as wt+1 = dither(wt + Δwt), wherein dither() is the dithered quantizer function. (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
 Examiner notes that prior to quantization and dithering, (wt +Δwt) corresponds to the last update made to the now trained CNN model, which corresponds to ai in the above equation. The equation above is the dithered quantizer function which computes the final parameter qi which corresponds to wt+1)

Regarding Claim 4
	Mohassel/Choi teach the method of claim 3
Further Choi teaches, wherein the updating the at least some of the parameters further includes determining dithering values d associated with the parameters wt, respectively, and wherein the dithered quantizer function dither() is implemented as: [1] round(wt + Δwt + d), wherein round() is a rounding function; [2] round( (wt + Δwt + d) / p ) × p, wherein p is a level of precision; [3] floor(wt + Δwt + d), wherein floor() is a floor function; [4] floor( (wt + Δwt + d) / p ) × p, wherein p is a level of precision; or [5] quant(wt + Δwt + d), wherein quant() is a quantizer function; (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
… and U1, . . . , Unsd are independent and identically distributed uniform random variables” Examiner notes that U corresponds to d, which is associated with the weight parameters wt because both are used together in a single calculation, the dithered quantizer function depicted above. The dithered quantizer function corresponds to quant(wt + Δwt + d), wherein quant() is a quantizer function, where (wt + Δwt + d) is mathematically equivalent to (ai + Ui). The limitation recites five alternatives joined by “or”; the prior art need only teach one of the five alternatives for the limitation to be met.)
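For illustration only, alternative [2] of the limitation can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's or Choi's implementation: the function name `dither_quantize`, its arguments, and the choice of a dither drawn uniformly from [−p/2, p/2] are hypothetical, and rounding is done as round-half-up via `floor(x + 0.5)`.

```python
import math
import random

def dither_quantize(w, delta_w, p=1.0, rng=random):
    """Illustrative dithered quantizer: round((w + delta_w + d) / p) * p.

    Assumption: d is drawn uniformly from [-p/2, p/2], i.e. half a
    quantization step on either side. All names here are hypothetical.
    """
    d = rng.uniform(-p / 2, p / 2)           # dithering value d
    scaled = (w + delta_w + d) / p           # express the sum in units of p
    return math.floor(scaled + 0.5) * p      # round half up, then rescale

# Example: update a weight of 0.30 by 0.04 at precision p = 0.25.
rng = random.Random(0)
q = dither_quantize(0.30, 0.04, p=0.25, rng=rng)
# q is a multiple of 0.25; averaged over many draws the result
# approaches the unquantized value 0.34.
```

The point of the dither is visible in the average: without d, 0.34 would always round to 0.25, whereas with d the result is sometimes 0.25 and sometimes 0.50 so that the expected value equals 0.34.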

Regarding Claim 9
	Mohassel/Choi teach the method of claim 3
Further Mohassel teaches, wherein the determining the parameter updates Δwt includes multiplying unscaled parameter updates Δw't by a learning rate η, as Δwt = η × Δw't. (¶0079 "where α is a learning rate defining the magnitude to move towards the minimum in each iteration… The SGD algorithm works as follows: 
[Equation image: media_image2.png]
” Examiner notes that the equation corresponds to the SGD algorithm for weight updates. The learning rate α corresponds to η, and the partial derivative corresponds to Δw't.)
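As a hedged illustration of the mapping, the scaling Δwt = η × Δw't can be written out for a plain SGD step. The function names and the convention that the unscaled update is the negative gradient are assumptions made for this sketch; they are not taken from Mohassel or from the claims.

```python
def scale_update(unscaled_update, eta):
    """Return delta_w = eta * delta_w_prime, element by element."""
    return [eta * u for u in unscaled_update]

def sgd_step(w, grad, eta):
    """One SGD step; assumes the unscaled update is the negative gradient."""
    unscaled = [-g for g in grad]            # delta_w_prime (assumed convention)
    delta_w = scale_update(unscaled, eta)    # delta_w = eta * delta_w_prime
    return [wi + dwi for wi, dwi in zip(w, delta_w)]

# Example: two weights, learning rate 0.1.
new_w = sgd_step([1.0, 2.0], [0.5, -1.0], eta=0.1)
```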

Regarding Claim 10
	Mohassel/Choi teach the method of claim 1
	Further Mohassel teaches, wherein the machine learning model is a neural network, wherein the applying the machine learning model to at least some of the training data includes forward propagating input values for the given iteration through multiple layers of the neural network, and wherein the determining the parameter updates to the at least some of the parameters includes: calculating a measure of loss based on output of the neural network after the forward propagation and expected output for the at least some of the training data; and backward propagating the measure of loss through the multiple layers of the neural network. (¶0094 “In the forward propagation for each iteration, the matrix X.sub.i of the ith layer [multiple layers of the neural network] is computed as X.sub.i=ƒ(X.sub.i−1×W.sub.i). In the backward propagation, given a cost function such as the cross entropy function [measure of loss based on output], the update function for each coefficient in each neuron can be expressed in a closed form” Examiner notes that in order to calculate the cross entropy in the backward propagation step, the expected outputs must be known, which implies that the two-step method of forward then backward propagation is performed using training data. Further, the following quote discusses utilizing the cross entropy function on training data: ¶0233 “a cost function is identified that provides an accuracy of the set of d weights in predicting the outputs Y of the set of training samples. The type of cost function can depend on the machine learning model used. Examples of cost functions are provided in section II, and include a sum of the squared error and a cross entropy function”)

Regarding Claim 12
	Mohassel/Choi teach the method of claim 1
	Further Mohassel teaches, the machine learning model is a deep neural network, a support vector machine, a Bayesian network, a decision tree, or a linear classifier. (Mohassel ¶0092 “FIG. 3B shows an example of a neural network with m−1 hidden layers” The depicted machine learning model consists of multiple hidden layers. By definition, the model depicted is a deep neural network because it consists of more than 1 hidden layer.)

Regarding Claim 13
	Mohassel/Choi teach the method of claim 1
	Further Choi teaches, the dithered quantizer function uses dithering values d based at least in part on output of a random number generator, and wherein the dithering values are random values having a power spectrum of white noise or blue noise. (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval…
[Equation image: media_image3.png]
… and U1, . . . , Unsd are independent and identically distributed uniform random variables with the support of [−∆/2, ∆/2]” Examiner notes that Ui corresponds to “dithering values d.” Ui is a random variable, so its selection is based on a random number generator. Because the values are independent and identically distributed, their power spectrum is flat, which corresponds to white noise.)

Regarding Claim 15
Mohassel/Choi teach the method of claim 1
Further Mohassel teaches, the training data includes multiple examples, each of the multiple examples having one or more attributes and a label, and wherein each of the multiple iterations uses a single example or mini-batch of examples randomly selected from among any remaining examples, for an epoch, of the multiple examples. (¶0081 “Instead of selecting one sample of data per iteration, a small batch of samples can be selected randomly… |B| denotes the mini-batch size, usually ranging from 2 to 200” ¶0083 “all the training samples and select the mini-batch in each iteration sequentially, until all the samples are used once. This is referred to as one epoch (e.g., after all training samples are used once)… a testing dataset can be used to test the accuracy of the current w…each data sample in the testing dataset can be calculated as the prediction  and is compared to the corresponding output label")
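For illustration, the limitation "randomly selected from among any remaining examples, for an epoch" can be sketched as one shuffled pass over the examples. This is a minimal sketch under stated assumptions; the function name `epoch_minibatches` and the shuffle-then-slice strategy are hypothetical and are not drawn from Mohassel.

```python
import random

def epoch_minibatches(examples, batch_size, rng=random):
    """Yield mini-batches so that each example is used exactly once per epoch.

    Assumption: random selection from the remaining examples is realized by
    shuffling the index order once and then slicing it sequentially.
    """
    order = list(range(len(examples)))
    rng.shuffle(order)                        # random order for this epoch
    for start in range(0, len(order), batch_size):
        batch_ids = order[start:start + batch_size]
        yield [examples[i] for i in batch_ids]

# Example: ten (attributes, label) pairs split into batches of four.
data = [((i, i + 1), i % 2) for i in range(10)]
batches = list(epoch_minibatches(data, batch_size=4, rng=random.Random(1)))
```

With ten examples and a batch size of four, the epoch produces batches of sizes 4, 4, and 2, and no example appears twice.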

Regarding Claim 16
Mohassel/Choi teach the method of claim 1
Further Mohassel teaches, the initializing the parameters includes: for an initial iteration of an initial epoch, setting the parameters to random values. (¶0141 “The weighting coefficients w also be secret shared between the training computers (e.g., the two servers). The weighting coefficients w can be initialized to random values…The weighting coefficients w can be updated and remain secret shared after each iteration in the SGD, until the end of training when it is reconstructed” Examiner notes that epoch corresponds to the end of training)

Regarding Claim 17
Mohassel teaches, A computer system comprising a buffer, in memory of the computer system, configured to receive training data; and a machine learning tool, implemented with one or more processors of the (Abstract “privacy-preserving machine learning training [training a machine learning model] (e.g., for linear regression, logistic regression and neural network using the stochastic gradient descent method). A protocols can use the two-server model, where data owners distribute their private data [receiving training data] among two non-colluding servers [computer system], which train various models” Examiner notes that a computing system that utilizes a two server model includes a buffer or memory on each server in order to store the received training data as well as the model parameters) initializing parameters of a machine learning model;  (¶0079 “The SGD algorithm works as follows: w is initialized as a vector [parameters] of random values” SGD is stochastic gradient descent, in the context of machine learning, it is an algorithm for training a machine learning model.) 
training at least some of the parameters in multiple iterations based on the training data, including, in a given iteration of multiple iterations during the training: (¶0078 "SGD is the most commonly used approach to train" ¶0079 "In each iteration [ of the SGD training ], a sample [of the training data] (x.sub.i, y.sub.i) is selected randomly and a coefficient w.sub.j is updated") applying the machine learning model to at least some of the training data; based at least in part on results of the applying the machine learning model, determining parameter updates to the at least some of the parameters; and updating the at least some of the parameters (¶0014 "Using embodiments, the weights of the model can be efficiently determined in the training, e.g., by performing iterative updates of the weights based on error differences in a current predicted output and the known outputs of the data samples [training data] ” Data samples that consist of known output correspond to training data)
Mohassel does not explicitly teach, wherein the updating uses the parameter updates and a dithered quantizer function; and outputting the parameters;
Choi, however, when addressing issues related to compressing neural networks teaches, wherein the updating uses the parameter updates and a dithered quantizer function; and outputting the parameters; (Section 4 ¶01 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" Examiner notes that “parameter updates and a dithered quantizer function” corresponds to “quantize the dithered weights” where the weights are the parameter updates)
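It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for dithering weights for neural network compression as taught by Choi to the disclosed invention of Mohassel. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate “randomized uniform quantization, where uniform random dithering makes the distortion independent of the source… guaranteed to achieve near-optimal performance” (Choi, Section 2.3).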

Regarding Claim 18
	Mohassel/Choi teach the computer system of claim 17
Further Choi teaches, wherein the updating the at least some of the parameters includes, (Section 4 “We compress the jointly sparse CNN model from Section 3.3 by universal compression in the spatial domain for universal deployment” Compression of the sparse CNN model corresponds to updating at least some of the parameters) for the given iteration t: calculating final parameters wt+1 for the given iteration t based upon starting parameters wt for the given iteration t and parameter updates Δwt in the given iteration t, as wt+1 = dither(wt + Δwt), wherein dither() is the dithered quantizer function. (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
 Examiner notes that prior to quantization and dithering, (wt +Δwt) corresponds to the last update made to the now trained CNN model, which corresponds to ai in the above equation. The equation above is the dithered quantizer function which computes the final parameter qi which corresponds to wt+1)

Regarding Claim 19
	Mohassel/Choi teach the computer system of claim 17
	Further Mohassel teaches, wherein the machine learning model is a neural network, wherein the applying the machine learning model to at least some of the training data includes forward propagating input values for the given iteration through multiple layers of the neural network, and wherein the determining the parameter updates to the at least some of the parameters includes: calculating a measure of loss based on output of the neural network after the forward propagation and expected output for the at least some of the training data; and backward propagating the measure of loss through the multiple layers of the neural network. (¶0094 “In the forward propagation for each iteration, the matrix X.sub.i of the ith layer [multiple layers of the neural network] is computed as X.sub.i=ƒ(X.sub.i−1×W.sub.i). In the backward propagation, given a cost function such as the cross entropy function [measure of loss based on output], the update function for each coefficient in each neuron can be expressed in a closed form” Examiner notes that in order to calculate the cross entropy in the backward propagation step, the expected outputs must be known, which implies that the two-step method of forward then backward propagation is performed using training data. Further, the following quote discusses utilizing the cross entropy function on training data: ¶0233 “a cost function is identified that provides an accuracy of the set of d weights in predicting the outputs Y of the set of training samples. The type of cost function can depend on the machine learning model used. Examples of cost functions are provided in section II, and include a sum of the squared error and a cross entropy function”)

Regarding Claim 20
Mohassel teaches, One or more computer-readable media having stored thereon computer- executable instructions for causing one or more processing units, when programmed thereby, to perform operations comprising: receiving training data; (Abstract “privacy-preserving machine learning training [training a machine learning model] (e.g., for linear regression, logistic regression and neural network using the stochastic gradient descent method). A protocols can use the two-server model, where data owners distribute their private data [receiving training data] among two non-colluding servers [computer system], which train various models [executable instructions]” Examiner notes that a computing system that utilizes a two server model includes a buffer or memory on each server in order to store the received training data as well as the model parameters. In order for models to be trained according to an algorithm corresponding executable instructions must be stored on computer readable media) initializing parameters of a machine learning model;  (¶0079 “The SGD algorithm works as follows: w is initialized as a vector [parameters] of random values” SGD is stochastic gradient descent, in the context of machine learning, it is an algorithm for training a machine learning model.) 
training at least some of the parameters in multiple iterations based on the training data, including, in a given iteration of multiple iterations: during the training (¶0078 "SGD is the most commonly used approach to train" ¶0079 "In each iteration [ of the SGD training ], a sample [of the training data] (x.sub.i, y.sub.i) is selected randomly and a coefficient w.sub.j is updated") applying the machine learning model to at least some of the training data; based at least in part on results of the applying the machine learning model, determining parameter updates to the at least some of the parameters; and updating the at least some of the parameters (¶0014 "Using embodiments, the weights of the model can be efficiently determined in the training, e.g., by performing iterative updates of the weights based on error differences in a current predicted output and the known outputs of the data samples [training data] ” Data samples that consist of known output correspond to training data)
Mohassel does not explicitly teach, wherein the updating uses the parameter updates and a dithered quantizer function; and outputting the parameters;
Choi, however, when addressing issues related to compressing neural networks teaches, wherein the updating uses the parameter updates and a dithered quantizer function; and outputting the parameters; (Section 4 ¶01 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" Examiner notes that “parameter updates and a dithered quantizer function” corresponds to “quantize the dithered weights” where the weights are the parameter updates)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method for dithering weights for neural network compression as taught by Choi to the disclosed invention of Mohassel. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate “randomized uniform quantization, where uniform random dithering makes the distortion independent of the source… guaranteed to achieve near-optimal performance” (Choi, Section 2.3).


Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Mohassel/Choi, further in view of Nika Aldrich, “Exploring Dither in Floating-Point Systems”, hereinafter Aldrich.

Regarding Claim 5
	Mohassel/Choi teach the method of claim 3
Further Choi teaches, before rounding or truncating to a nearest mantissa value for the given level of precision. (Section 4 “the rounding function satisfies round(x) = sign(x)⌊ |x| + 0.5⌋, where ⌊x⌋ is the largest integer smaller than or equal to x” Rounding to a nearest mantissa corresponds to rounding to nearest integer when the mantissa presents 0 bits of precision) and wherein the dithered quantizer function applies dithering values d associated with the parameters wt, respectively (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
… and U1, . . . , Unsd are independent and identically distributed uniform random variables” Examiner notes that U corresponds to d, which is associated with the weight parameters wt because both are used together in a single calculation, the dithered quantizer function depicted above.)
Mohassel/Choi does not explicitly teach, wherein the parameters wt are in a floating-point format having a given level of precision for mantissa values, by adding mantissa values at a higher level of precision
Aldrich, however, when addressing issues related to adding dither to floating point numbers teaches, wherein the parameters wt are in a floating-point format having a given level of precision for mantissa values, by adding mantissa values at a higher level of precision (See Fig 2.3A A graphical depiction of adding dither of a higher level of precision, a mantissa with a larger bit width, to floating point values of a given level of precision. 
[Figure image: media_image4.png]
The dither is represented by the grey shading, while the pink data represents the floating point precision. The bold vertical bar represents the decimal in floating point format. Last ¶ of page 8 “In this example we clearly did add dither that is uncorrelated noise”)
 It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate adding dither to floating point numbers as taught by Aldrich to the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to “perform this quantization without producing distortion. As we know, however, simply truncating off excess bits yields distortion, as this is the same as quantization. The error will inherently be correlated to the original waveform. We must apply dither in this situation in order to prevent the quantization error from being quantization distortion” (Aldrich, Page 3 ¶1 Dithering Fixed Point Data)

Regarding Claim 6
Mohassel/Choi teach the method of claim 3
Further Mohassel teaches, wherein the parameters wt are in an integer format or fixed-point format (¶0015 “the training can involve multiplying these integers (and other intermediate values) and integer-represented weights… allowing efficient computation and limiting the amount of memory for storing the integer values” Examiner notes that integer represented weights corresponds to wt in integer format)
Further Choi teaches, before rounding or truncating to a nearest integer value (Section 4 “the rounding function satisfies round(x) = sign(x)⌊ |x| + 0.5⌋, where ⌊x⌋ is the largest integer smaller than or equal to x”) and wherein the dithered quantizer function applies dithering values d associated with the parameters wt, respectively (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
… and U1, . . . , Unsd are independent and identically distributed uniform random variables” Examiner notes that U corresponds to d, which is associated with the weight parameters wt because both are used together in a single calculation, the dithered quantizer function depicted above.)
Mohassel/Choi does not explicitly teach, by adding values at a higher level of precision
Aldrich, however, when addressing issues related to adding dither to floating point numbers teaches, by adding values at a higher level of precision (See Fig 2.3A A graphical depiction of adding dither of a higher level of precision, a mantissa with a larger bit width, to floating point values of a given level of precision. 
[Figure image: media_image4.png]
The dither is represented by the grey shading, while the pink data represents the floating point precision. The bold vertical bar represents the decimal in floating point format. Last ¶ of page 8 “In this example we clearly did add dither that is uncorrelated noise”)
 It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate adding dither to floating point numbers as taught by Aldrich to the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to “perform this quantization without producing distortion. As we know, however, simply truncating off excess bits yields distortion, as this is the same as quantization. The error will inherently be correlated to the original waveform. We must apply dither in this situation in order to prevent the quantization error from being quantization distortion” (Aldrich, Page 3 ¶1 Dithering Fixed Point Data)

Claims 7, 8, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Mohassel/Choi, further in view of Ando et al., “Dither NN: An Accurate Neural Network with Dithering for Low Bit-Precision Hardware”, hereinafter Ando.

Regarding Claim 7
Mohassel/Choi teach the method of claim 3
Mohassel/Choi does not explicitly teach, wherein the parameters wt are in format having a given level of precision after quantization is applied
Ando, however, when addressing issues related to quantizing neuronal weights with a dither neural network teaches, wherein the parameters wt are in format having a given level of precision after quantization is applied (See Fig 3A and Section IV B ¶04 “the quantization error to be added to the next neuron’s weighted-sum in the dithering algorithm is calculated using the sum and sign of the current neuron… This is because the binary-quantized network is a special case of fixed-point (or linear integer) quantization where the bit width is 1 " Dithering is applied to every neuron weighted sum in the network. This modification being part of the network architecture would then apply the dithered quantizer function to each iteration during training. So that wt would be a quantized weight with a bit width of 1 in the binary-quantized network special case.) 
Further Choi teaches, wherein the dithered quantizer function dither() is implemented using a rounding function,  and wherein the dithering values d are selected in a range of -0.5 to 0.5 of an increment of a value at the given level of precision. (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval" 
[Equation image: media_image1.png]
… , and U1, . . . , Unsd [dithering values d] are independent and identically distributed uniform random variables with the support of [−∆/2, ∆/2]” When ∆ is selected to be 1 as taught by Ando, the dithering values would be selected in the range of [ -½, ½ ].)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a dither circuit for weight values as taught by Ando to the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate weight dithering to “[minimize] the total quantization error for low-precision quantization” (Ando, Section IV ¶01)

Regarding Claim 8
Mohassel/Choi teach the method of claim 3
Mohassel/Choi does not explicitly teach, wherein the parameters wt are in format having a given level of precision after quantization is applied
Ando, however, when addressing issues related to quantizing neuronal weights with a dither neural network teaches, wherein the parameters wt are in format having a given level of precision after quantization is applied (See Fig 3A and Section IV B ¶04 “the quantization error to be added to the next neuron’s weighted-sum in the dithering algorithm is calculated using the sum and sign of the current neuron… This is because the binary-quantized network is a special case of fixed-point (or linear integer) quantization where the bit width is 1 " Dithering is applied to every neuron weighted sum in the network. This modification being part of the network architecture would then apply the dithered quantizer function to each iteration during training. So that wt would be a quantized weight with a bit width of 1 in the binary-quantized network special case.) 
Further, Choi teaches wherein the dithered quantizer function dither() is implemented using a floor function, and wherein the dithering values d are selected in a range of 0.0 to 1.0 of an increment of a value at the given level of precision. (Section 4 “First, we randomize spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with the interval [equation image media_image1.png not reproduced] … , and U1, . . . , Unsd [dithering values d] are independent and identically distributed uniform random variables with the support of [−∆/2, ∆/2]” When ∆ is selected to be 2, the dithering values would be selected in the range of [-1, 1]. Because this range encompasses the range [0, 1], selecting from the larger range would inherently include selecting from the smaller range as well. “round(x) = sign(x)⌊|x| + 0.5⌋, where ⌊x⌋ is the largest integer smaller than or equal to x” Examiner notes that the round function described is equivalent to the floor function claimed.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a dither circuit for weight values, as taught by Ando, into the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate weight dithering to “[minimize] the total quantization error for low-precision quantization” (Ando, Section IV ¶01).
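For illustration only, a floor-based dithered quantizer of the kind recited in claim 8 can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the cited references: the function names are hypothetical, the increment is assumed to be 1, and the dither range is assumed to be the half-open interval [0, 1).

```python
import math


def dithered_floor_quantize(w: float, d: float) -> int:
    """Quantize w to an integer by flooring after adding a dither d in [0, 1)."""
    if not 0.0 <= d < 1.0:
        raise ValueError("dither must lie in [0, 1)")
    return math.floor(w + d)


def mean_over_dither_grid(w: float, steps: int = 1000) -> float:
    """Average the quantizer output over an evenly spaced grid of dither values."""
    total = sum(dithered_floor_quantize(w, i / steps) for i in range(steps))
    return total / steps
```

Averaged over evenly spaced dither values, the output for w = 2.25 is 2.25, which shows why a floor with dither in [0, 1) behaves like a round with dither shifted by half an increment.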

Regarding Claim 14
Mohassel/Choi teach the method of claim 1.
Mohassel/Choi do not explicitly teach wherein the dithered quantizer function is the same in each of the multiple iterations, and wherein the dithered quantizer function is applied in all of the multiple iterations. Ando, however, when addressing issues related to quantizing neuronal weights with a dither neural network, teaches this limitation. (Section IV B ¶04 “the quantization error to be added to the next neuron’s weighted-sum in the dithering algorithm is calculated using the sum and sign of the current neuron” Examiner notes that dithering is applied to every neuron weighted sum in the network. This modification, being part of the network architecture, would then apply the dithered quantizer function, or the dithering algorithm, in each iteration during training. Further, Section V A ¶01 “When we intend to use a certain operation in the hidden layers of a neural network, its derivative must be defined for applying the back propagation” Since Ando derives the backpropagation for the new architecture that contains the dithering circuit, this implies that the authors apply dithering during backpropagation, that is, during all training iterations.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a dither circuit for weight values, as taught by Ando, into the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate weight dithering to “[minimize] the total quantization error for low-precision quantization” (Ando, Section IV ¶01).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Mohassel/Choi, further in view of Hsu et al. “A Study on Speech Enhancement Using Exponent-Only Floating Point Quantized Neural Network (EOFP-QNN)”, hereinafter Hsu.

Regarding Claim 11
Mohassel/Choi teach the method of claim 1.
Mohassel/Choi do not explicitly teach wherein the training the at least some of the parameters further includes: converting values, including the at least some of the parameters, from a first format to a second format, the second format having a lower precision than the first format, wherein the first format is a first floating-point format having m1 bits of precision for mantissa values and e1 bits of precision for exponent values, wherein the second format is a second floating-point format: having m2 bits of precision for mantissa values and e2 bits of precision for exponent values, wherein m1 > m2, and wherein e1 > e2; or having a shared exponent value.
Hsu, however, when addressing issues related to quantizing model parameters of a machine learning model, teaches wherein the training the at least some of the parameters further includes: converting values, including the at least some of the parameters, from a first format to a second format, the second format having a lower precision than the first format (Section 3.1 ¶01 “make the model learn how to quantize the parameters during the training phase” Examiner notes that quantizing by definition includes converting values from a first format to a second format, wherein the second format has lower precision.) wherein the first format is a first floating-point format having m1 bits of precision for mantissa values and e1 bits of precision for exponent values, wherein the second format is a second floating-point format: having m2 bits of precision for mantissa values and e2 bits of precision for exponent values (Section 3.3 ¶01-02 “According to the format of the single-precision floating point, it is obvious that there are at most 23 bits that we can quantize in the mantissa-quantization… Therefore, we propose the statistical exponent-quantization to further compress the DL-based model” Examiner notes that, given that the quantization is done on a floating point number and quantizes both the mantissa and the exponent, the first format will include a mantissa and an exponent, while the second format will also include a mantissa and an exponent.) (Section 3.2 ¶02 “Algorithm 1 presents the mantissa-quantization, which is executed after the backward-propagation…The last n bits are removed by taking the intersection with the binary mask, and the binary bits is converted back to the floating point p. After all the parameters are updated (quantized)”) and wherein e1 > e2; or having a shared exponent value. (Section 3.3 ¶03 “Algorithm 2 presents the exponent-quantization for each parameter...It is clear that there is no performance degradation when applying the exponent-quantization since we only reduce the number of bits to represent a parameter value instead of changing the value”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate floating point quantization, as taught by Hsu, into the disclosed invention of Mohassel/Choi. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate floating point quantization so that “model sizes of the quantized models were only 8.75% and 21.89% for BLSTM and FCN…results suggest that… may be able to install an SE system with a compressed DL-based model in embedded devices to operate in an IoT environment.” (Hsu, Conclusion)
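For illustration only, the mask-based removal of the last n mantissa bits quoted from Hsu Section 3.2 can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Hsu's Algorithm 1: it assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits), and the function name is hypothetical.

```python
import struct


def quantize_mantissa(value: float, n: int) -> float:
    """Clear the last n mantissa bits of value viewed as a single-precision float.

    Assumes the IEEE 754 binary32 layout; n must be between 0 and 23.
    (Hypothetical helper, not taken from the cited reference.)
    """
    if not 0 <= n <= 23:
        raise ValueError("n must be between 0 and 23")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Binary mask with the lowest n bits cleared; sign and exponent are kept.
    mask = (0xFFFFFFFF << n) & 0xFFFFFFFF
    # Convert the masked bit pattern back to a floating-point value.
    (quantized,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return quantized
```

For example, clearing 22 mantissa bits of -1.75 leaves -1.5, because only the most significant mantissa bit is retained.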

Response to Arguments
In response to Applicant’s request to add the missing reference, Mohassel, to the PTO-892, Examiner has added the reference.
In response to Applicant’s amendments, the Examiner has withdrawn the objections and the rejections under 35 USC 112(d) and 35 USC 112(b). 
Applicant's arguments filed June 16th 2021 have been fully considered but they are not persuasive.

Regarding claims 1, 17, and 20
Applicant states that Choi teaches away from the cited language of claims 1, 17, and 20. Applicant asserts that “the compression scheme of Choi happens after training of the CNN model” and further states that the claims recite “a dithered quantizer function is used when updating parameters as part of training.” However, the training of Choi occurs in two steps. First, the model is pre-trained, as correctly assessed by Applicant. Second, further training is performed that utilizes the dithered quantizer function, which corresponds to the claim limitation pertaining to updating parameters during the training. Choi discloses that “their shared quantized value in the codebook is updated by gradient descent using the average gradient… where t is the iteration time, n is the learning rate… After the codebook is updated, individual weights are updated by following their shared quantized value in the codebook” (see Choi pg 7, Fine tuning the codebook). In this way, Choi teaches parameter updates over t iterations, wherein the updated codebook is based on the results of the dithered quantization described. Therefore, the compression scheme of Choi happens during auxiliary training of the model, corresponding to the cited claims.
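For illustration only, the codebook fine-tuning quoted from Choi can be read as the following update. This is a minimal sketch of one plain reading of the quoted passage, not Choi's exact formula, which is elided in the quotation: the function name, argument names, and list-based data layout are hypothetical assumptions.

```python
def fine_tune_codebook(codebook, assignments, grads, lr):
    """One gradient-descent step on shared quantized values.

    codebook:    list of shared quantized values, one per cluster
    assignments: cluster index of each individual weight
    grads:       gradient of the loss with respect to each individual weight
    lr:          learning rate
    (Hypothetical helper; the exact update in the reference is not reproduced.)
    """
    sums = [0.0] * len(codebook)
    counts = [0] * len(codebook)
    for k, g in zip(assignments, grads):
        sums[k] += g
        counts[k] += 1
    # Each shared value moves against the average gradient of its weights.
    updated = [
        c - lr * (s / n) if n else c
        for c, s, n in zip(codebook, sums, counts)
    ]
    # Individual weights then follow their shared quantized value.
    weights = [updated[k] for k in assignments]
    return updated, weights
```

Under this reading, the parameters of the model change in every iteration of the fine-tuning step, since each weight takes the newly updated value of its codebook entry.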
In response to applicant’s argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art.  See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988), In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992), and KSR International Co. v. Teleflex, Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007). 
Regarding claims 1, 17, and 20
In this case, Applicant states that one “may have been motivated to use uniform random dithering as part of that compression scheme” yet one would not have been motivated “to use a dithered quantizer function to change how parameters are updated during training.” Given that Applicant states that one may have been motivated to use the compression scheme of Choi, but not the 
Regarding claims 5 and 6
	Applicant states that “Mohassel and Choi fail to teach… cited language of claim 1,” that Aldrich fails to remedy this because it does not address a training process for a machine learning model, and that the combination of Mohassel and/or Choi with Aldrich further teaches away. As stated previously, Mohassel and Choi do teach the limitations of claim 1. Further, Examiner recognizes that Aldrich does not address the training process for a machine learning model; however, Aldrich teaches the limitations lacking in Mohassel/Choi for claims 5 and 6, as stated in the rejection provided. In particular, Aldrich describes “aspects of dithering for fixed-point and floating-point data,” which is of interest to Mohassel/Choi as they deal with the representation of such data types. Thus, one would be motivated to combine Mohassel/Choi with Aldrich.
Regarding claims 7, 8, and 14
	Applicant states that “Mohassel and Choi fail to teach… cited language of claim 1,” that Ando fails to remedy this because it does not address updating parameters as part of training, and that the combination of Mohassel and/or 
Regarding claim 11
	Applicant states that “Mohassel and Choi fail to teach… cited language of claim 1,” that Hsu fails to remedy this because it does not address dithering, and that the combination of Mohassel and/or Choi with Hsu further teaches away. As stated previously, Mohassel and Choi do teach the limitations of claim 1, so this argument is considered moot. Further, Examiner recognizes that Hsu does not address dithering; however, Hsu teaches the limitations lacking in Mohassel/Choi for claim 11, as stated in the rejection provided. In particular, Hsu describes “aspects of quantization of values in a neural network,” which is of interest to Mohassel/Choi as Choi, in particular, is motivated to reduce model sizes for use on mobile devices [low compute power device] (Choi Abstract and Introduction ¶01) and Hsu describes .
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 5:00 pm (EST).

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

	/J.R.G./          Examiner, Art Unit 2122 

/BRIAN M SMITH/Primary Examiner, Art Unit 2122