DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the Application filed on 12/20/2021.  Claims 1-17 and 19-21 are pending in the case.  Claims 1, 17 and 20 are independent claims. Claims 1, 17 and 20 are amended. Claim 18 is cancelled. Claim 21 is new.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/10/2021 has been entered.

Response to Arguments
Applicant's arguments filed 07/14/2022 have been fully considered but they are not persuasive.
Applicant argues that it would not have been obvious to modify Faraone with Gray because Gray only describes applying a dithering function to an input signal, not to backpropagation information. However, Gray teaches not only a dithering function but a dithered quantizer function, and Faraone teaches using a quantizer function. As noted in the rejection, Gray discloses the benefits of modifying a quantizer function by adding a dithering component. Faraone already teaches a quantizer function for backpropagation information; Gray simply teaches an improvement to quantizer functions generally. The updated rejection highlights the elements of the claim taught by Faraone and those elements taught by Gray.

Claim Objections
Claim 1 is objected to because of the following informalities:  
There appears to be a typographical error in the following limitation: “apply the machine learning model to at least some of the training data the a;”. The extra words “the a” are disregarded.

Appropriate correction is required.


Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-17 and 19-21 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Each independent claim includes the limitation “quantizing the dithered parameter updates during backpropagation of the parameter updates”. There is no support for this limitation because the specification does not state that quantization is performed during “backpropagation of the parameter updates”. The specification states that quantization is used, but not that the step of quantization occurs during backpropagation. Using quantization generally means that quantized terms are used. The following are a few examples of this in the specification: “When the parameters are adjusted, dithered quantization of parameters can be used as described in section V” (P0057), “For example, a machine learning tool uses dithered quantization of parameters during training” (P0017).


Claim Rejections - 35 U.S.C. § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA  35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1, 13, 14, 17, 20 are rejected under 35 U.S.C. § 103 as being unpatentable over Faraone et al. (“SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks”, hereinafter Faraone) in view of Gray et al. (“Dithered Quantizers”, hereinafter Gray).

As to independent claim 1,
Faraone teaches, a method implemented in a computer system comprising at least one hardware processor and at least one memory coupled to the at least one hardware processor: (“In this paper we present Learning Symmetric Quantization (SYQ), a method to design binary/ternary networks with fine-grained scaling coefficients which preserve these complexities. We do this by learning a symmetric weight codebook”, pg 8 Section 7.2 “Table 7 shows the resource and performance estimates provided by Vivado HLS of the described hardware architecture for a target Xilinx ZU3 FPGA device”. The SYQ algorithm is implemented on a physical processor configured to perform the operations, including a memory.) receiving training data; initializing parameters of a machine learning model; (“The overall SYQ training process is summarized in Algorithm 1….
[image: media_image1.png]
 , pg 5-6 Section 5.3, the minibatch corresponds to training data, Wt are some initial parameters) training at least some of the parameters in multiple iterations based on the training data, ( “DNN training is an iterative process which has a feedforward path to compute the output and a backpropagation path to calculate gradients and update its parameters for learning.” Pg 1 Section 1 ¶ 3)  [in each of the multiple iterations during the training:] applying the machine learning model to at least some of the training data; (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image2.png]
 see algorithm 1 , pg 5-6 Section 5.3 ¶ 2. In this step the training data is applied to the model with weights Q) [in each of the multiple iterations during the training: ] based at least in part on results of the applying the machine learning model, determining parameter updates to the at least some of the parameters; and updating the at least some of the parameters, (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image3.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2) quantizing the… parameter updates during backpropagation of the parameter updates. (pg 3 section 3.2 ¶ 1 and algorithm 1 pg 6 “weight matrices for each layer Wl are approximated by a function f, resulting in a quantized weight matrix Ql… During training Ql is used for inference and backpropagation, and the corresponding elements in Wl are updated based on these gradients”. The updates are based on Q weights quantized by a quantizer function. The quantized parameters are used in both the inference stage and the backpropagation stage. Examiner notes that Algorithm 1 performs the step of quantizing after performing weight updates
[image: media_image4.png]
 . This appears to be equivalent to the steps described in paragraph 082 of the specification “more formally, suppose parameters w for a machine learning model are updated between iteration during training as 
[image: media_image5.png]
”. As in the art weights are first updated, the terms within the parenthetical, then a quantization function is applied. Whether or not this occurs “during backpropagation” or “immediately after backpropagation” is only a matter of determining descriptive labels of a process that is identical regardless of the label.) and outputting the parameters.  (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image6.png]
 see algorithm 1 , pg 5-6 Section 5.3 ¶ 2)
Faraone does not explicitly teach, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters
Gray, however, when addressing dithered quantization of input values teaches, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters (pg 1 Introduction ¶ 04 “In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process”. Gray describes a quantizer function which employs dither. In Gray, the “parameters” are considered to be the input signal Xn, or values to be quantized. Un is the signal generated as a result of adding dither values to a signal Xn, which is later quantized.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the quantization step in Faraone with the dithered quantizer of Gray. One would have been motivated to make such a combination so that “the quantizer error e, is uniformly distributed on (-Δ/2, Δ/2] and is independent of the original input signal X” (pg 2 column 1 Gray). Furthermore, “purposeful distortion of an input signal [dithering] is common, because it can result in a more subjectively pleasing reproduction and because, under certain conditions, it can cause the quantization error to behave in a statistically nice fashion.” (pg 1 abstract Gray)
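For illustration only, the proposed combination can be sketched in Python as follows. This is a minimal sketch and is not taken from Faraone, Gray, or the specification: the function names (`quantize`, `dithered_quantize`, `dithered_update`), the step size, and the choice of dither uniform on [-step/2, step/2) are assumptions chosen to mirror the dithered quantizer described in Gray, applied to an updated weight w + Δw.

```python
import math
import random


def quantize(x, step):
    """Uniform quantizer: round x to the nearest multiple of step (ties round up)."""
    return math.floor(x / step + 0.5) * step


def dithered_quantize(x, step, rng):
    """Add a dither value drawn uniformly from [-step/2, step/2) to x, then quantize."""
    dither = (rng.random() - 0.5) * step
    return quantize(x + dither, step)


def dithered_update(weights, updates, step, rng):
    """Apply each update to its weight, then quantize the dithered sum (assumed ordering)."""
    return [dithered_quantize(w + dw, step, rng) for w, dw in zip(weights, updates)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    step = 1.0
    # An update smaller than step/2 is always lost by plain quantization ...
    print(quantize(0.0 + 0.1, step))
    # ... but is preserved on average when dither is added before quantizing.
    samples = [dithered_quantize(0.0 + 0.1, step, rng) for _ in range(20000)]
    print(sum(samples) / len(samples))
```

The design point the sketch is meant to show is that the quantizer itself is unchanged; only its input is perturbed, which is why the modification can be described as an improvement to quantizer functions generally.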

Regarding Claim 13
	Faraone/Gray teach claim 1, the rejection is incorporated
Further Gray teaches, the dithered quantizer function uses dithering values d based at least in part on output of a random number generator, (abstract “suitably chosen random dither signals can cause the quantization error to be signal independent, uniformly distributed white noise”; pg 1 column 2 “Simulations provided strong evidence that this simple randomization could indeed provide significant improvement in quality… In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal… where Wn is a random process, independent of the signal Xn, called a dither process.” A dither signal adds random values to the input signal according to a random process corresponding to a random number generator.) and wherein the dithering values are random values having a power spectrum of white noise or blue noise. (pg 2 column 1 “the dither signal has a uniform probability density function on (-Δ/2, Δ/2],” a random process whose PDF is uniform is a signal whose power spectrum is white, as described in ¶ 0120 of the specification.)
Faraone and Gray are combinable at least for the reasons set forth in claim 1

Regarding Claim 14
Faraone/Gray teach claim 1, the rejection is incorporated
Faraone teaches, the … quantizer function is the same in each of the multiple iterations, and wherein the … quantizer function is applied in all of the multiple iterations. 
 ( pg 1 introduction ¶03 “DNN training is an iterative process which has a feedforward path to compute the output and a backpropagation path to calculate gradients and update its parameters for learning” See algorithm 1 
[image: media_image8.png]
 for each batch or for multiple iterations the same quantizer function is applied to weights of the neural network model.)
Further Gray teaches, a dithered quantizer function; (“In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process” pg 1 Introduction ¶ 04)
Faraone and Gray are combinable at least for the reasons set forth in claim 1


As to independent claim 17
Faraone teaches, a computer system comprising: a buffer, in memory of the computer system, configured to receive training data; a buffer, in memory of the computer system, configured to store the parameters (“In this paper we present Learning Symmetric Quantization (SYQ), a method to design binary/ternary networks with fine-grained scaling coefficients which preserve these complexities. We do this by learning a symmetric weight codebook”, pg 2 ¶ 01, “As in the equivalent layer-wise scaling architecture, we can still maintain one multiplier in hardware and only increase memory slightly to store the scaling coefficients.” Coefficients or parameters are stored in memory, which is a buffer.); and a machine learning tool, implemented with one or more processors of the computer system, configured to perform operations comprising: initializing parameters of a machine learning model; (“The overall SYQ training process is summarized in Algorithm 1….
[image: media_image1.png]
 , pg 5-6 Section 5.3, the minibatch corresponds to training data, Wt are some initial parameters pg 8 Section 7.2 “Table 7 shows the resource and performance estimates provided by Vivado HLS of the described hardware architecture for a target Xilinx ZU3 FPGA device” The algorithm SYQ, is implemented on a physical processor configured to perform the operations) training at least some of the parameters in multiple iterations based on the training data, ( “DNN training is an iterative process which has a feedforward path to compute the output and a backpropagation path to calculate gradients and update its parameters for learning.” Pg 1 Section 1 ¶ 3)  [in each of the multiple iterations during the training:] applying the machine learning model to at least some of the training data; (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image2.png]
 see algorithm 1 , pg 5-6 Section 5.3 ¶ 2. In this step the training data is applied to the model with weights Q) [in each of the multiple iterations during the training: ] based at least in part on results of the applying the machine learning model, determining parameter updates to the at least some of the parameters; (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image3.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2) quantizing the… parameter updates during backpropagation of the parameter updates. (pg 3 section 3.2 ¶ 1 and algorithm 1 pg 6 “weight matrices for each layer Wl are approximated by a function f, resulting in a quantized weight matrix Ql… During training Ql is used for inference and backpropagation, and the corresponding elements in Wl are updated based on these gradients”. The updates are based on Q weights quantized by a quantizer function. The quantized parameters are used in both the inference stage and the backpropagation stage. Examiner notes that Algorithm 1 performs the step of quantizing after performing weight updates
[image: media_image4.png]
 . This appears to be equivalent to the steps described in paragraph 082 of the specification “more formally, suppose parameters w for a machine learning model are updated between iteration during training as 
[image: media_image5.png]
”. As in the art weights are first updated, the terms within the parenthetical, then a quantization function is applied. Whether or not this occurs “during backpropagation” or “immediately after backpropagation” is only a matter of determining descriptive labels of a process that is identical regardless of the label.) and outputting the parameters.  (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image6.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2)

Faraone does not explicitly teach, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters
Gray, however, when addressing dithered quantization of an input signal teaches, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters (pg 1 Introduction ¶ 04 “In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process”. Gray describes a quantizer function which employs dither. In Gray, the “parameters” are considered to be the input signal Xn, or values to be quantized. Un is the signal generated as a result of adding dither values to a signal Xn, which is later quantized.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the quantization step in Faraone with the dithered quantizer of Gray. One would have been motivated to make such a combination so that “the quantizer error e, is uniformly distributed on (-Δ/2, Δ/2] and is independent of the original input signal X” (pg 2 column 1 Gray). Furthermore, “purposeful distortion of an input signal [dithering] is common, because it can result in a more subjectively pleasing reproduction and because, under certain conditions, it can cause the quantization error to behave in a statistically nice fashion.” (pg 1 abstract Gray)

As to independent claim 20
Faraone teaches, One or more computer-readable media having stored thereon computer-executable instructions for causing one or more processing units, when programmed thereby, to perform operations comprising: (“In this paper we present Learning Symmetric Quantization (SYQ), a method to design binary/ternary networks with fine-grained scaling coefficients which preserve these complexities. We do this by learning a symmetric weight codebook”, pg 8 Section 7.2 “Table 7 shows the resource and performance estimates provided by Vivado HLS of the described hardware architecture for a target Xilinx ZU3 FPGA device” The algorithm SYQ, is implemented on a physical processor configured to perform the operations, including a computer readable media.) receiving training data initializing parameters of a machine learning model;  (“The overall SYQ training process is summarized in Algorithm 1….
[image: media_image1.png]
 , pg 5-6 Section 5.3, the minibatch corresponds to training data, Wt are some initial parameters) training at least some of the parameters in multiple iterations based on the training data, ( “DNN training is an iterative process which has a feedforward path to compute the output and a backpropagation path to calculate gradients and update its parameters for learning.” Pg 1 Section 1 ¶ 3)  [in each of the multiple iterations during the training:] applying the machine learning model to at least some of the training data (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image2.png]
 see algorithm 1 , pg 5-6 Section 5.3 ¶ 2. In this step the training data is applied to the model with weights Q) [in each of the multiple iterations during the training: ] based at least in part on results of the applying the machine learning model, determining parameter updates to the at least some of the parameters; (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image3.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2) quantizing the… parameter updates during backpropagation of the parameter updates. (pg 3 section 3.2 ¶ 1 and algorithm 1 pg 6 “weight matrices for each layer Wl are approximated by a function f, resulting in a quantized weight matrix Ql… During training Ql is used for inference and backpropagation, and the corresponding elements in Wl are updated based on these gradients”. The updates are based on Q weights quantized by a quantizer function. The quantized parameters are used in both the inference stage and the backpropagation stage. Examiner notes that Algorithm 1 performs the step of quantizing after performing weight updates
[image: media_image4.png]
 . This appears to be equivalent to the steps described in paragraph 082 of the specification “more formally, suppose parameters w for a machine learning model are updated between iteration during training as 
[image: media_image5.png]
”. As in the art weights are first updated, the terms within the parenthetical, then a quantization function is applied. Whether or not this occurs “during backpropagation” or “immediately after backpropagation” is only a matter of determining descriptive labels of a process that is identical regardless of the label.) and outputting the parameters.  (  “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image6.png]
 see algorithm 1 , pg 5-6 Section 5.3 ¶ 2)
Faraone does not explicitly teach, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters
Gray, however, when addressing dithered quantization of input values teaches, adding respective dither values to respective parameter updates to provide dithered parameter updates; quantizing the dithered parameters (pg 1 Introduction ¶ 04 “In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process”. Gray describes a quantizer function which employs dither. In Gray, the “parameters” are considered to be the input signal Xn, or values to be quantized. Un is the signal generated as a result of adding dither values to a signal Xn, which is later quantized.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the quantization step in Faraone with the dithered quantizer of Gray. One would have been motivated to make such a combination so that “the quantizer error e, is uniformly distributed on (-Δ/2, Δ/2] and is independent of the original input signal X” (pg 2 column 1 Gray). Furthermore, “purposeful distortion of an input signal [dithering] is common, because it can result in a more subjectively pleasing reproduction and because, under certain conditions, it can cause the quantization error to behave in a statistically nice fashion.” (pg 1 abstract Gray)

Claim(s) 2-4, 7-10, 12, 15 and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over Faraone/Gray as applied above, and further in view of Mohassel et al. (US 20200242466 A1, hereinafter Mohassel).

Regarding Claim 2
	Faraone/Gray teach claim 1, the rejection is incorporated
Further Faraone teaches, wherein the machine learning model is a neural network having multiple layers, (“All other CONV layers are quantized with SYQ pixel-wise scaling” pg 6 Section 6.1 ¶ 2)
Faraone/Gray does not explicitly teach, each of the multiple layers including one or more nodes, and wherein the parameters include one or more of: weights for connections between the nodes; biases for at least some of the nodes; a count of the multiple layers; for each of the multiple layers, a count of nodes in the layer; and a control parameter for batch normalization. 
However Mohassel  when addressing training of a multi-layer neural network teaches, wherein the machine learning model is a neural network having multiple layers, each of the multiple layers including one or more nodes, and wherein the parameters include one or more of: weights for connections between the nodes;  a count of the multiple layers; for each of the multiple layers, a count of nodes in the layer; (¶0092 “FIG. 3B shows an example of a neural network with m−1 hidden layers" Further claim 21 states “initializing values for M sets of d weights W for a machine learning model, each set of d weights W corresponding to one of M nodes of a layer") biases for at least some of the nodes; (¶0074 “Usually, a bias b is introduced…this can be easily achieved by appending a dummy feature equal to 1 for each x.sub.i” The dummy feature means that the node that is presented with X.sub.i is a bias node) and a control parameter for batch normalization. (¶0244 “as depicted in step 10 of FIG. 5. A normalization factor |B| [control parameter] can be used when more than one training sample is used per iteration, e.g., a described herein for a mini-batch mode")
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the neural network training method of Faraone/Gray with the neural network training method of Mohassel. One would have been motivated to make such a combination because Mohassel clarifies usual elements of neural networks, such as the one in Faraone/Gray. In particular, by introducing biases, batch normalization and multi-layer networks, Mohassel enhances the neural network of Faraone/Gray in order “to learn more complicated relationships between high dimensional input and output data” (Mohassel ¶0092), and so that the method can be applied to the problem presented by Mohassel, namely “Privacy-preserving machine learning via secure multiparty computation (MPC) provides a promising solution by allowing different entities to train various models on their joint data without revealing any information beyond the outcome.” (Mohassel para. 07)
Faraone and Gray are combinable at least for the reasons set forth in claim 1

Regarding Claim 3
	Faraone/Gray teach claim 1, the rejection is incorporated
Faraone teaches, wherein the updating the at least some of the parameters includes, for each of the multiple iterations as a given iteration: calculating final parameters wt+1 for the given iteration based upon starting parameters wt for the given iteration and parameter updates Δwt in the given iteration, (“The overall SYQ training process is summarized in Algorithm 1….
[image: media_image3.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2. The weight updates are based on unquantized starting parameters wt and parameter updates.)
Further Gray teaches, wherein dither() is the dithered quantizer function (“In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process” pg 1 Introduction ¶ 04. Before quantizing the value Xn, the dither is added.)
Faraone/Gray does not explicitly teach, [wt+1 = f(wt + Δ wt) ]
However Mohassel when addressing training of a multi-layer neural network teaches, [wt+1 = f(wt + Δ wt) ] ( ¶ 0079 “The SGD algorithm works as follows:… a coefficient w.sub.j is updated as 
[image: media_image9.png]
” the update weights are the sum of non-quantized initial weights and parameter updates. )
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the neural network training method of Faraone/Gray with the neural network training method of Mohassel. One would have been motivated to make such a combination so that the method can be applied to the problem presented by Mohassel, namely “Privacy-preserving machine learning via secure multiparty computation (MPC) provides a promising solution by allowing different entities to train various models on their joint data without revealing any information beyond the outcome.” (Mohassel para. 07)
Faraone and Gray are combinable at least for the reasons set forth in claim 1

Regarding Claim 4
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Further Faraone/Gray teaches, wherein the updating the at least some of the parameters further includes determining dithering values d associated with the parameters wt, respectively, (Faraone “The overall SYQ training process is summarized in Algorithm 1….
[image: media_image3.png]
 see algorithm 1, pg 5-6 Section 5.3 ¶ 2. Weights are updated according to the results of a quantizer function which determines the values Q. Further, in Gray, “In a dithered quantizer, instead of quantizing an input signal Xn directly, one quantizes a signal
[image: media_image7.png]
 where Wn is a random process, independent of the signal Xn, called a dither process” pg 1 Introduction ¶ 04. To quantize an input value, a dither is determined and added to the input.)
Further Faraone/Gray/Mohassel teaches, and wherein the dithered quantizer function dither() is implemented as: [1] round(wt + Δwt + d), wherein round() is a rounding function; [2] round( (wt + Δwt + d) / p ) × p, wherein p is a level of precision; [3] floor(wt + Δwt + d), wherein floor() is a floor function; [4] floor( (wt + Δwt + d) / p ) × p, wherein p is a level of precision; or [5] quant(wt + Δwt + d), wherein quant() is a quantizer function; (Faraone Section 3.2 “for each layer Wl are approximated by a function f, resulting in a quantized weight matrix Ql: [image: media_image10.png]” Gray teaches the “dithered” aspect of the quantizer function, Gray pg 1 Introduction ¶ 4 “In a dithered quantizer, instead of quantizing an input signal X, directly, one quantizes a signal [image: media_image7.png] where W, is a random process, independent of the signal X, called a dither process”; thus Faraone/Gray disclose Q = f(W + d). Further, Mohassel ¶ 0079 “The SGD algorithm works as follows:… a coefficient w.sub.j is updated as [image: media_image9.png]”; thus Faraone/Gray/Mohassel teach Q = f((Wj – ΔW) + d). The recited limitation lists alternatives joined by “or”; to reject the claim, only one of the five alternatives needs to be taught.)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.
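For illustration only, the five alternative forms recited in the limitation above can be sketched as follows. This is a minimal sketch and is not taken from Faraone, Gray, or Mohassel: the function names are hypothetical, `dw` stands for the parameter update Δwt, and rounding ties are assumed to go upward.

```python
import math


def round_half_up(x):
    """Round to the nearest integer; ties go up (an assumption of this sketch)."""
    return math.floor(x + 0.5)


def dithered_round(w, dw, d):
    """Form [1]: round(w + dw + d)."""
    return round_half_up(w + dw + d)


def dithered_round_p(w, dw, d, p):
    """Form [2]: round((w + dw + d) / p) * p, where p is the level of precision."""
    return round_half_up((w + dw + d) / p) * p


def dithered_floor(w, dw, d):
    """Form [3]: floor(w + dw + d)."""
    return math.floor(w + dw + d)


def dithered_floor_p(w, dw, d, p):
    """Form [4]: floor((w + dw + d) / p) * p, where p is the level of precision."""
    return math.floor((w + dw + d) / p) * p


def dithered_quant(w, dw, d, quant):
    """Form [5]: quant(w + dw + d) for any caller-supplied quantizer function."""
    return quant(w + dw + d)
```

In each form the dither d is added to the updated parameter before the quantizing operation is applied, which is the common structure the rejection relies on.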


Regarding Claim 7
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Further Faraone teaches, wherein the parameters wt are in format having a given level of precision after quantization is applied (Section 6.1 pg 6 “The VGG and ResNet models were initialized from floating point baseline weights…. All other CONV layers are quantized with SYQ pixel-wise scaling”)
Further Gray teaches, wherein the dithered quantizer function dither() is implemented using a rounding function, (pg 3 Section 2 ¶001 “Consider the dithered quantizer of Fig. 4, and define the quantizer error… [image: media_image11.png]” “In this case, the normalized quantizer error can be expressed as … [image: media_image12.png]” the quantizer performs the rounding operations described in the equations.) and wherein dithering values d for the dithered quantizer function dither() are selected in a range of -0.5 to 0.5 of an increment of a value at the given level of precision. (Section 2 ¶ 01 “We require that the input and dither signals are chosen so that the quantizer does not overload. In particular, if there are M quantizer levels spaced Δ apart,… [image: media_image13.png]” the dither signal is selected such that the absolute value of the output is less than MΔ/2. This is equivalent to falling within the range of [- MΔ/2, MΔ/2].)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.


Regarding Claim 8
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Faraone teaches, wherein the parameters wt are in format having a given level of precision after quantization is applied (Section 6.1 pg 6 “The VGG and ResNet models were initialized from floating point baseline weights…. All other CONV layers are quantized with SYQ pixel-wise scaling”)
Further Gray teaches, wherein the dithered quantizer function dither() is implemented using a floor function (pg 3 Section 2 ¶001 “Consider the dithered quantizer of Fig. 4, and define the quantizer error… [image: media_image11.png]” “In this case, the normalized quantizer error can be expressed as … [image: media_image12.png] … where [image: media_image14.png] is the fractional part operator defined in (6)” pg 3 column 2 ¶ 3 “The following notation will be used throughout. Every real number r can be uniquely written in the form… [image: media_image15.png] where [image: media_image16.png] denotes the greatest integer less than or equal to r and [image: media_image17.png] is the fractional part of r” the quantizer performs the rounding operations described in the equations, which include a fractional part operator and a floor function for the integer part.) and wherein dithering values d for the dithered quantizer function dither() are selected in a range of 0.0 to 1.0 of an increment of a value at the given level of precision. (Section 2 ¶ 01 “We require that the input and dither signals are chosen so that the quantizer does not overload. In particular, if there are M quantizer levels spaced Δ apart,… [image: media_image13.png]” the dither signal, Wn, is selected such that the absolute value of the output is less than MΔ/2. This is equivalent to falling within the range of at least [- MΔ/2, MΔ/2]. When MΔ/2 is greater than 1 the range is at least [-1,1], which includes the range [0,1].)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.
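As an illustrative aid only, the relationship between the two dither ranges recited in claims 7 and 8 can be sketched as follows. The sketch assumes a unit increment, a uniformly distributed dither, and round-half-up rounding; none of these assumptions, and none of the function names, are taken from Gray. Under those assumptions, a floor function with a dither d in [0.0, 1.0) gives the same result as a rounding function with the shifted dither d - 0.5 in [-0.5, 0.5), and the average of many dithered results approaches the unquantized value.

```python
import math
import random


def round_half_up(x):
    """Round to the nearest integer; ties go up (an assumption of this sketch)."""
    return math.floor(x + 0.5)


def floor_with_dither(x, d):
    """Floor-based dithered quantizer; d is assumed to lie in [0.0, 1.0)."""
    return math.floor(x + d)


def round_with_dither(x, d):
    """Rounding-based dithered quantizer; d is assumed to lie in [-0.5, 0.5)."""
    return round_half_up(x + d)


def mean_dithered_floor(x, trials, seed=0):
    """Average of floor_with_dither over uniformly drawn dither values."""
    rng = random.Random(seed)
    total = sum(floor_with_dither(x, rng.random()) for _ in range(trials))
    return total / trials
```

For example, `mean_dithered_floor(2.3, 20000)` is expected to land close to 2.3, whereas an undithered floor would always return 2.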


Regarding Claim 9
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Further Mohassel teaches, wherein the determining the parameter updates Δwt includes multiplying unscaled parameter updates Δw't by a learning rate η, as Δwt = η × Δw't. (¶0079 “where α is a learning rate defining the magnitude to move towards the minimum in each iteration… The SGD algorithm works as follows: [image: media_image18.png]” Examiner notes that the equation corresponds to the SGD algorithm for weight updates. The partial derivative corresponds to Δw't.)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.
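For illustration only, the scaling of an unscaled update by a learning rate can be sketched as follows. The function names are hypothetical, and subtracting the scaled update from the coefficient is an assumption that follows the description of the SGD update quoted from Mohassel rather than any particular implementation.

```python
def scaled_update(unscaled_update, learning_rate):
    """Scaled update = learning rate x unscaled update (e.g. a partial derivative)."""
    return learning_rate * unscaled_update


def sgd_step(w, gradient, learning_rate):
    """One SGD-style step: move w against the gradient by the scaled update."""
    return w - scaled_update(gradient, learning_rate)
```

With a coefficient of 1.0, a partial derivative of 0.5, and a learning rate of 0.5, the scaled update is 0.25 and the updated coefficient is 0.75.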


Regarding Claim 10
	Faraone/Gray teach claim 1, the rejection is incorporated
	Mohassel, however, teaches, wherein the machine learning model is a neural network, wherein the applying the machine learning model to at least some of the training data includes forward propagating input values for the each of the multiple iterations as a given iteration through multiple layers of the neural network, and wherein the determining the parameter updates to the at least some of the parameters includes: calculating a measure of loss based on output of the neural network after the forward propagation and expected output for the at least some of the training data; and backward propagating the measure of loss through the multiple layers of the neural network. (¶0094 “In the forward propagation for each iteration, the matrix X.sub.i of the ith layer [multiple layers of the neural network] is computed as X.sub.i=ƒ(X.sub.i−1×W.sub.i). In the backward propagation, given a cost function such as the cross entropy function [measure of loss based on output], the update function for each coefficient in each neuron can be expressed in a closed form” Examiner notes that in order to calculate the cross entropy in the backward propagation step, the expected outputs must be known, which implies that the two-step method of forward then backward propagation is performed using training data. Further, the following quote discusses utilizing the cross entropy function on training data: ¶0233 “a cost function is identified that provides an accuracy of the set of d weights in predicting the outputs Y of the set of training samples. The type of cost function can depend on the machine learning model used. Examples of cost functions are provided in section II, and include a sum of the squared error and a cross entropy function”)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.

Regarding Claim 12
	Faraone/Gray teach claim 1, the rejection is incorporated
Mohassel, however, teaches, the machine learning model is a deep neural network, a support vector machine, a Bayesian network, a decision tree, or a linear classifier. (Mohassel ¶0092 “FIG. 3B shows an example of a neural network with m−1 hidden layers” The depicted machine learning model consists of multiple hidden layers. By definition, the model depicted is a deep neural network because it consists of more than 1 hidden layer.)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.


Regarding Claim 15
Faraone/Gray teach claim 1, the rejection is incorporated
Further Mohassel teaches, the training data includes multiple examples, each of the multiple examples having one or more attributes and a label, and wherein each of the multiple iterations uses a single example or mini-batch of examples randomly selected from among any remaining examples, for an epoch, of the multiple examples. (¶0081 “Instead of selecting one sample of data per iteration, a small batch of samples can be selected randomly… |B| denotes the mini-batch size, usually ranging from 2 to 200” ¶0083 “all the training samples and select the mini-batch in each iteration sequentially, until all the samples are used once. This is referred to as one epoch (e.g., after all training samples are used once)… a testing dataset can be used to test the accuracy of the current w…each data sample in the testing dataset can be calculated as the prediction  and is compared to the corresponding output label")
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.
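As an illustrative aid only, the selection of mini-batches from the remaining examples of an epoch can be sketched as follows. The function name and the shuffle-then-slice approach are assumptions of this sketch and are not taken from Mohassel; the sketch simply ensures that every example is used exactly once per epoch.

```python
import random


def epoch_minibatches(examples, batch_size, seed=None):
    """Yield mini-batches drawn at random, without replacement, for one epoch."""
    rng = random.Random(seed)
    remaining = list(examples)
    rng.shuffle(remaining)  # random order over all examples of the epoch
    for start in range(0, len(remaining), batch_size):
        # Each slice is taken from examples not yet used in this epoch.
        yield remaining[start:start + batch_size]
```

Setting `batch_size` to 1 corresponds to using a single example per iteration.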

Regarding Claim 16
Faraone/Gray teach claim 1, the rejection is incorporated
Further Mohassel teaches, the initializing the parameters includes: for an initial iteration of an initial epoch, setting the parameters to random values. (¶0141 “The weighting coefficients w also be secret shared between the training computers (e.g., the two servers). The weighting coefficients w can be initialized to random values…The weighting coefficients w can be updated and remain secret shared after each iteration in the SGD, until the end of training when it is reconstructed” Examiner notes that epoch corresponds to the end of training)
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.

Regarding Claim 19
	Claim 19 is rejected at least for the reason set forth in claim 17 and claim 10


Claim(s) 5, 6 are rejected under 35 U.S.C. § 103 as being unpatentable over Faraone/Gray/Mohassel as applied above, and further in view of Nika Aldrich “Exploring Dither in Floating-Point Systems”, hereinafter Aldrich.

Regarding Claim 5
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Faraone teaches, wherein the parameters wt are in a floating-point format having a given level of precision for mantissa values, (Section 6.1 pg 6 “The VGG and ResNet models were initialized from floating point baseline weights…. All other CONV layers are quantized with SYQ pixel-wise scaling”) 
Faraone/Gray/Mohassel teaches, and wherein the dithered quantizer function applies dithering values d associated with the parameters wt, respectively (Gray pg 1 Section 1 ¶ 04 “In a dithered quantizer, instead of quantizing an input signal X, directly, one quantizes a signal [image: media_image7.png] where W, is a random process, independent of the signal X, called a dither process”)
Faraone/Gray/Mohassel does not explicitly teach, by adding mantissa values at a higher level of precision before rounding or truncating to a nearest mantissa value for the given level of precision.  
Aldrich, however, when addressing issues related to adding dither to floating point numbers teaches, by adding mantissa values at a higher level of precision before rounding or truncating to a nearest mantissa value for the given level of precision. (See Fig 2.3A, a graphical depiction of adding dither of a higher level of precision, a mantissa with a larger bit width, to floating point values of a given level of precision. [image: media_image19.png] The dither is represented by the grey shading, while the pink data represents the floating point precision. The bold vertical bar represents the decimal in floating point format. Last ¶ of page 8 “In this example we clearly did add dither that is uncorrelated noise”)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate adding dither to floating point numbers as taught by Aldrich to the disclosed invention of Faraone/Gray/Mohassel. One of ordinary skill in the art would have been motivated to make this modification in order to “perform this quantization without producing distortion. As we know, however, simply truncating off excess bits yields distortion, as this is the same as quantization. The error will inherently be correlated to the original waveform. We must apply dither in this situation in order to prevent the quantization error from being quantization distortion” (Aldrich, Page 3 ¶1 Dithering Fixed Point Data).
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.



Regarding Claim 6
	Faraone/Gray/Mohassel teach claim 3, the rejection is incorporated
Faraone/Gray/Mohassel teaches, and wherein the dithered quantizer function applies dithering values d associated with the parameters wt, respectively, (Gray pg 1 Section 1 para. 04 “In a dithered quantizer, instead of quantizing an input signal X, directly, one quantizes a signal [image: media_image7.png] where W, is a random process, independent of the signal X, called a dither process”)
Further Mohassel teaches, wherein the parameters wt are in an integer format or fixed-point format (¶0015 “the training can involve multiplying these integers (and other intermediate values) and integer-represented weights… allowing efficient computation and limiting the amount of memory for storing the integer values” Examiner notes that integer represented weights corresponds to wt in integer format)
Faraone/Gray/Mohassel does not explicitly teach, by adding values at a higher level of precision before rounding or truncating to a nearest integer value.  
Aldrich, however, when addressing issues related to adding dither to floating point numbers teaches, by adding values at a higher level of precision before rounding or truncating to a nearest integer value. (See Fig 2.3A, a graphical depiction of adding dither of a higher level of precision, a mantissa with a larger bit width, to floating point values of a given level of precision. [image: media_image19.png] The dither is represented by the grey shading, while the pink data represents the floating point precision. The bold vertical bar represents the decimal in floating point format. Last ¶ of page 8 “In this example we clearly did add dither that is uncorrelated noise”)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate adding dither to floating point numbers as taught by Aldrich to the disclosed invention of Faraone/Gray/Mohassel. One of ordinary skill in the art would have been motivated to make this modification in order to “perform this quantization without producing distortion. As we know, however, simply truncating off excess bits yields distortion, as this is the same as quantization. The error will inherently be correlated to the original waveform. We must apply dither in this situation in order to prevent the quantization error from being quantization distortion” (Aldrich, Page 3 ¶1 Dithering Fixed Point Data).
Faraone and Gray and Mohassel are combinable at least for the reasons set forth in claim 2.




Claim(s) 11 are rejected under 35 U.S.C. § 103 as being unpatentable over Faraone/Gray/Mohassel as applied above, and further in view of Hsu et al. “a study on speech enhancement using exponent-only floating point quantized neural network (eofp-qnn)”, hereinafter Hsu.

Regarding Claim 11
	Faraone/Gray teach claim 1, the rejection is incorporated
Faraone/Gray does not explicitly teach, wherein the training the at least some of the parameters further includes: converting values, including the at least some of the parameters, from a first format to a second format, the second format having a lower precision than the first format, wherein the first format is a first floating-point format having m1 bits of precision for mantissa values and e1 bits of precision for exponent values, wherein the second format is a second floating-point format: having m2 bits of precision for mantissa values and e2 bits of precision for exponent values, wherein m1 > m2, and wherein e1 > e2; or having a shared exponent value.
Hsu, however, when addressing issues related to quantizing model parameters of a machine learning model teaches, wherein the training the at least some of the parameters further includes: converting values, including the at least some of the parameters, from a first format to a second format, the second format having a lower precision than the first format (Section 3.1 ¶01 "make the model learn how to quantize the parameters during the training phase" Examiner notes that quantizing by definition includes converting values from a first format to a second format, wherein the second format has lower precision.) wherein the first format is a first floating-point format having m1 bits of precision for mantissa values and e1 bits of precision for exponent values, wherein the second format is a second floating-point format: having m2 bits of precision for mantissa values and e2 bits of precision for exponent values (Section 3.3 ¶01-02 “According to the format of the single-precision floating point, it is obvious that there are at most 23 bits that we can quantize in the mantissa-quantization… Therefore, we propose the statistical exponent-quantization to further compress the DL-based model” Examiner notes that, because the quantization is performed on a floating point number and quantizes both the mantissa and the exponent, the first format will include a mantissa and exponent, and the second format will also include a mantissa and exponent) wherein m1 > m2 (Section 3.2 ¶02 “Algorithm 1 presents the mantissa-quantization, which is executed after the backward-propagation…The last n bits are removed by taking the intersection with the binary mask, and the binary bits is converted back to the floating point p. After all the parameters are updated (quantized)") and wherein e1 > e2; or having a shared exponent value. (Section 3.3 ¶03 “Algorithm 2 presents the exponent-quantization for each parameter...It is clear that there is no performance degradation when applying the exponent-quantization since we only reduce the number of bits to represent a parameter value instead of changing the value")
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate floating point quantization as taught by Hsu to the disclosed invention of Faraone/Gray. One of ordinary skill in the art would have been motivated to make this modification in order to incorporate floating point quantization so that “model sizes of the quantized models were only 8.75% and 21.89% for BLSTM and FCN…results suggest that… may be able to install an SE system with a compressed DL-based model in embedded devices to operate in an IoT environment.” (Hsu, Conclusion)
Faraone and Gray are combinable at least for the reasons set forth in claim 1.
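For illustration only, removing the last n mantissa bits of a single-precision value with a binary mask, as described in the passage quoted from Hsu, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Hsu's Algorithm 1: the function name is hypothetical, and the value is assumed to be representable in IEEE 754 single precision (1 sign bit, 8 exponent bits, 23 mantissa bits).

```python
import struct


def quantize_mantissa(value, n_bits_removed):
    """Zero the last n mantissa bits of a float32 value using a binary mask."""
    if not 0 <= n_bits_removed <= 23:
        raise ValueError("a float32 mantissa has 23 bits")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    # Mask with ones everywhere except the lowest n bits.
    mask = (0xFFFFFFFF << n_bits_removed) & 0xFFFFFFFF
    # Convert the masked bit pattern back to a floating point value.
    (result,) = struct.unpack(">f", struct.pack(">I", bits & mask))
    return result
```

For example, 1.75 has the two leading mantissa bits set, so removing 22 bits leaves 1.5 and removing all 23 bits leaves 1.0.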

Claim(s) 21 are rejected under 35 U.S.C. § 103 as being unpatentable over Faraone/Gray as applied above, and further in view of Zhang et al “LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks”, hereinafter Zhang.

Regarding Claim 21
	Faraone/Gray teach claim 1, the rejection is incorporated
	Faraone/Gray teach the dithered quantizer function.
Faraone/Gray does not explicitly teach, wherein (1) the… quantizer function is different between at least a portion of the multiple iterations, or (2) the dithered quantization function is not applied during one or more iterations of the multiple iterations
Zhang, however, when addressing issues related to learning an optimal quantizer function for quantizing weights of a neural network during training teaches, wherein (1) the… quantizer function is different between at least a portion of the multiple iterations, or (2) the dithered quantization function is not applied during one or more iterations of the multiple iterations (pg 5 “To get better network quantizers and improve the accuracy of a quantized network, we propose to jointly train the network and its quantizers” Section 3.3 ¶02-03 “Here we present an algorithm based on quantization error minimization which optimizes our quantizers in the forward passes during training… Our goal is to find an optimal quantizer basis v” Algorithm 1 pg 8, for each training iteration a new basis vector corresponding to the quantizer function is computed and updated. Given that the basis vector is updated each iteration of training the quantizer must be different between at least a portion of iterations.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate learning an optimal quantization function as taught by Zhang to the disclosed invention of Faraone/Gray. One of ordinary skill in the art would have been motivated to make this modification because for quantized models such as the one used by Faraone “there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN… The comprehensive experiments…show that our method works consistently well for various network structures… surpassing previous quantization methods in terms of accuracy by an appreciable margin.” (Abstract Zhang)
Faraone and Gray are combinable at least for the reasons set forth in claim 1


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Hubara et al “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations” discusses in Algorithm 1 that, before the forward propagation, weight updates are quantized or clipped to be used in the next iteration.

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.R.G./
Examiner, Art Unit 2122

/ERIC NILSSON/Primary Examiner, Art Unit 2122