Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1, 8, and 15 have been amended, and claims 6-7, 13-14, and 20 have been canceled. Currently, claims 1-5, 8-12, and 15-19 are pending.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08/31/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4, 8-9, 11, 15-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” arXiv:1511.00363, 2016 (“Courbariaux”).
Regarding claim 1, Shi teaches the method of providing an adaptive bit-width neural network model on a computing device, comprising: 
the computing device, wherein the computing device has one or more processors and memory (Shi, para. 0099);
obtaining a first neural network model that includes a plurality of layers, wherein each layer of the plurality of layers has a respective set of parameters, and each parameter is expressed with a level of data precision that corresponds to an original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”);
reducing a footprint of the first neural network model on the computing device by using respective reduced bit-widths for storing the respective sets of parameters of different layers of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example in which, after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14 bits in the first neural network to 10 bits in the second neural network), wherein: preferred values of the respective reduced bit-widths are determined through multiple iterations of forward propagation through the first neural network model using a validation data set (Shi, para. 0071-0072, fig. 8(34, 36), “At each optimization iteration, the following actions are taken: calculate and generate a cost J(θ): forward propagate data [training data] using θ to get a value from the cost J(θ).”) until a predefined information loss threshold is met by respective response statistics of the two or more layers (Shi, para. 0060, fig. 8(68), detailing that the end point terminator compares the aggregate weight costs (and accuracy costs) to a target threshold; if the target threshold has not been met, the forward computation component of the bit-depth optimization engine goes through another iteration, until the target threshold is reached), wherein, during reducing the footprint of the first neural network model, for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and generating a reduced neural network model that includes the plurality of layers, wherein each layer of two or more of the plurality of layers includes a respective set of quantized parameters, and each quantized parameter is expressed with the preferred values of the respective reduced bit-widths for the layer as determined through the multiple iterations (Shi, para. 0061, fig. 8(36, 40), detailing an example in which, after many iterations of the forward computation and backward computation of the bit-depth optimization engine, the bit-depth optimization engine reduces the number of bits from 14 bits in the first neural network to 10 bits in the second neural network (i.e., an average decrease of 4 binary bits per weight)).
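For illustration only, and not as a characterization of Shi's bit-depth optimization engine, the general pattern recited above (repeating forward propagation at progressively smaller bit-widths until an information-loss threshold would be exceeded) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the single linear layer, the mean-squared-error loss measure, and the symmetric uniform quantizer are all hypothetical and do not appear in the application or in the cited references.

```python
def quantize_uniform(values, bits, max_abs):
    """Snap each value to a symmetric uniform grid with 2**(bits-1) - 1 positive levels.

    Hypothetical helper for illustration; requires bits >= 2.
    """
    if max_abs == 0:
        return list(values)
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    quantized = []
    for v in values:
        clipped = max(-max_abs, min(max_abs, v))
        quantized.append(round(clipped / step) * step)
    return quantized


def layer_forward(weights, inputs):
    """Forward propagation through one linear layer: one dot product per input row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in inputs]


def mean_squared_error(a, b):
    """Assumed information-loss measure: mean squared difference of layer responses."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def search_bit_width(weights, validation_inputs, original_bits, loss_threshold):
    """Return the smallest bit-width (>= 2) whose loss stays within the threshold.

    Each candidate bit-width costs one forward pass over the validation data.
    The search stops at the first bit-width that exceeds the threshold.
    """
    reference = layer_forward(weights, validation_inputs)
    max_abs = max(abs(w) for w in weights)
    chosen = original_bits
    for bits in range(original_bits - 1, 1, -1):
        quantized = quantize_uniform(weights, bits, max_abs)
        response = layer_forward(quantized, validation_inputs)
        if mean_squared_error(reference, response) > loss_threshold:
            break
        chosen = bits
    return chosen


if __name__ == "__main__":
    w = [0.5, -0.25, 1.0]
    x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
    print(search_bit_width(w, x, original_bits=14, loss_threshold=1e-4))
```

Running the search independently for each layer would, in general, yield a different bit-width per layer, because each layer's weights and responses tolerate quantization differently.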
Shi does not teach: while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer.
However, Pan teaches while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths (Pan, para. 0052, fig. 4, “In one embodiment, the first optimal quantization step-size [degree of quantization] may be different from the second optimal quantization step-size, and [because of that,] the first fixed-point format [the number of bits] may be different from the second fixed-point format.”); performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer (Pan, para. 0039-0040, table 1, detailing that, by performing the uniform quantization method of Manner I using the error function of equation (4), the maximum bit length under a Gaussian distribution is 8 bits).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, in view of Pan, to teach while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer. The motivation to do so would be to provide different degrees of quantization for different bit-widths for layers based on the distribution of data (see Pan, para. 0038-0044, para. 0049, table 2, table 3, detailing that, if the distribution of the data to be quantized follows a Gaussian distribution, uniform quantization could be performed to find the optimal quantization step-size for a given layer rather than non-uniform quantization that works on many different probability distributions; accordingly, the number of bits for an integer (i.e., m) in a fixed-point format is calculated differently for non-uniform quantization than for uniform quantization).
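As a hedged illustration of the concept relied upon above (different layers stored in different fixed-point formats as a result of different uniform quantization step sizes), the following minimal sketch may be helpful. It is not Pan's Manner I procedure and does not reproduce Pan's equations; the function names, the layer names, and the example (total bits, fractional bits) formats are hypothetical values chosen for illustration.

```python
def to_fixed_point(values, total_bits, frac_bits):
    """Uniformly quantize values to signed integer codes with step 2**-frac_bits.

    Codes are clipped to the range representable in total_bits (two's complement).
    """
    step = 2.0 ** -frac_bits
    lowest = -(2 ** (total_bits - 1))
    highest = 2 ** (total_bits - 1) - 1
    return [max(lowest, min(highest, round(v / step))) for v in values]


def from_fixed_point(codes, frac_bits):
    """Map integer codes back to real values for use in forward propagation."""
    step = 2.0 ** -frac_bits
    return [c * step for c in codes]


# Hypothetical per-layer formats: (total bits, fractional bits).
# A smaller step size (more fractional bits) gives a finer degree of quantization.
LAYER_FORMATS = {"layer1": (8, 6), "layer2": (6, 3)}

if __name__ == "__main__":
    weights = [0.5, -0.25, 3.0]
    for name, (total_bits, frac_bits) in LAYER_FORMATS.items():
        codes = to_fixed_point(weights, total_bits, frac_bits)
        print(name, codes, from_fixed_point(codes, frac_bits))
```

In this sketch the value 3.0 saturates in the 8-bit format with 6 fractional bits but is represented exactly in the 6-bit format with 3 fractional bits, which shows why the choice of format can depend on the distribution of the data in each layer.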
Shi does not teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer.
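However, Courbariaux teaches: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer (Courbariaux, pgs. 3-4, Algorithm 1(step 3), “A key point to understand with BinaryConnect is that we only binarize the weights during the forward and backward propagations (steps 1 and 2) but not during the parameter update (step 3), as illustrated in Algorithm 1…[h]ence, at training time, BinaryConnect randomly picks one of two values for each weight, for each minibatch, for both the forward and backward propagation phases of backprop. However, the SGD [i.e., stochastic gradient descent] update is accumulated in a real-valued variable storing the parameter.”).1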
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, in view of Courbariaux, to teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer. The motivation to do so would be to still use the stochastic gradient descent (SGD) algorithm as the learning optimizer (Courbariaux, pg. 3, sec. 2.3, Propagations vs updates, “Keeping good precision weights during the updates is necessary for SGD to work at all.”).
Regarding claim 2, Shi, as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1. Pan further teaches wherein:
a first layer of the plurality of layers in the reduced neural network model has a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0054, fig. 5(S515, S520), detailing that the first neural network data (of floating point type) is optimally quantized into a smaller range of fixed-point numbers and fed into the first layer), a second layer of the plurality of layers in the reduced neural network model has a second reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0055, fig. 5(S535, S550), detailing that the first layer outputs a smaller range of fixed-point numbers that are fed into the second layer), and the first reduced bit-width is distinct from the second reduced bit-width in the reduced neural network model (Pan, para. 0050-0052, fig. 4, detailing that, in determining the optimal quantization step size for two layers, different optimal quantization steps may be determined, and because of this the first range of fixed-point numbers for the first layer may be different from the range of fixed-point numbers for the second layer).
Regarding claim 4, Shi, as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1. Pan further teaches wherein expressing the respective set of parameters of the first layer with the first reduced bit-width includes performing non-uniform quantization on the respective set of parameters of the first layer to generate a first set of quantized parameters for the first layer (Pan, para. 0045-0046, table 2, para. 0054, fig. 5(S510, S515, S520), detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform quantization equation of (14), and thereafter, the fixed-point format of the data is inputted to the first convolutional layer), and a maximal boundary value for the non-uniform quantization of the first layer is selected based on the baseline statistical distribution of activation values for the first layer during each forward propagation through the first layer (Pan, para. 0045-0046, detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform equation of (14), where the definite integral’s upper limit is bounded by $b_q$; $b_q$ is defined as $\frac{1}{2}(\hat{x}_q + \hat{x}_{q+1})$, where both $\hat{x}_q$ and $\hat{x}_{q+1}$ are estimated using the distribution of the data to be quantized at a certain quantization level).
Regarding claim 8, Shi teaches the computing device, comprising: 
one or more processors and memory (Shi, para. 0099); 
obtaining a first neural network model that includes a plurality of layers, wherein each layer of the plurality of layers has a respective set of parameters, and each parameter is expressed with a level of data precision that corresponds to an original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”);
reducing a footprint of the first neural network model on the computing device by using respective reduced bit-widths for storing the respective sets of parameters of different layers of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example in which, after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14 bits in the first neural network to 10 bits in the second neural network), wherein: preferred values of the respective reduced bit-widths are determined through multiple iterations of forward propagation through the first neural network model using a validation data set (Shi, para. 0071-0072, fig. 8(34, 36), “At each optimization iteration, the following actions are taken: calculate and generate a cost J(θ): forward propagate data [training data] using θ to get a value from the cost J(θ).”) until a predefined information loss threshold is met by respective response statistics of the two or more layers (Shi, para. 0060, fig. 8(68), detailing that the end point terminator compares the aggregate weight costs (and accuracy costs) to a target threshold; if the target threshold has not been met, the forward computation component of the bit-depth optimization engine goes through another iteration, until the target threshold is reached), wherein, during reducing the footprint of the first neural network model, for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and generating a reduced neural network model that includes the plurality of layers, wherein each layer of two or more of the plurality of layers includes a respective set of quantized parameters, and each quantized parameter is expressed with the preferred values of the respective reduced bit-widths for the layer as determined through the multiple iterations (Shi, para. 0061, fig. 8(36, 40), detailing an example in which, after many iterations of the forward computation and backward computation of the bit-depth optimization engine, the bit-depth optimization engine reduces the number of bits from 14 bits in the first neural network to 10 bits in the second neural network (i.e., an average decrease of 4 binary bits per weight)).
Shi does not teach: while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer.
However, Pan teaches while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths (Pan, para. 0052, fig. 4, “In one embodiment, the first optimal quantization step-size [degree of quantization] may be different from the second optimal quantization step-size, and [because of that,] the first fixed-point format [the number of bits] may be different from the second fixed-point format.”); performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer (Pan, para. 0039-0040, table 1, detailing that, by performing the uniform quantization method of Manner I using the error function of equation (4), the maximum bit length under a Gaussian distribution is 8 bits).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s computing device, in view of Pan, to teach while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer. The motivation to do so would be to provide different degrees of quantization for different bit-widths for layers based on the distribution of data (see Pan, para. 0038-0044, para. 0049, table 2, table 3, detailing that, if the distribution of the data to be quantized follows a Gaussian distribution, uniform quantization could be performed to find the optimal quantization step-size for a given layer rather than non-uniform quantization that works on many different probability distributions; accordingly, the number of bits for an integer (i.e., m) in a fixed-point format is calculated differently for non-uniform quantization than for uniform quantization).
Shi does not teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer.
However, Courbariaux teaches: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer (Courbariaux, pgs. 3-4, Algorithm 1(step 3), “A key point to understand with BinaryConnect is that we only binarize the weights during the forward and backward propagations (steps 1 and 2) but not during the parameter update (step 3), as illustrated in Algorithm 1…[h]ence, at training time, BinaryConnect randomly picks one of two values for each weight, for each minibatch, for both the forward and backward propagation phases of backprop. However, the SGD [i.e., stochastic gradient descent] update is accumulated in a real-valued variable storing the parameter.”).2
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s computing device, in view of Courbariaux, to teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer. The motivation to do so would be to still use the stochastic gradient descent (SGD) algorithm as the learning optimizer (Courbariaux, pg. 3, sec. 2.3, Propagations vs updates, “Keeping good precision weights during the updates is necessary for SGD to work at all.”).
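The quoted passage of Courbariaux can be illustrated, under stated assumptions, by the following minimal sketch of one training step in which reduced-precision weights are used for the propagations while the update is accumulated in a full-precision copy of the weights. The function names, the single linear unit, the squared-error loss, and the deterministic sign binarization are assumptions of this sketch (the quoted passage describes a random selection between two values), and the sketch is not a reproduction of Algorithm 1.

```python
def binarize(real_weights):
    """Map each real-valued weight to +1.0 or -1.0 (deterministic sign, an assumption)."""
    return [1.0 if w >= 0 else -1.0 for w in real_weights]


def training_step(real_weights, inputs, target, learning_rate):
    """One SGD step for a single linear unit with loss 0.5 * (output - target) ** 2.

    Propagations use the binarized weights; the update is applied to the
    real-valued weights, which are never overwritten by their binarized values.
    """
    binary_weights = binarize(real_weights)

    # Forward propagation with binarized weights.
    output = sum(w * x for w, x in zip(binary_weights, inputs))

    # Backward propagation: gradient of the loss with respect to each weight.
    error = output - target
    gradients = [error * x for x in inputs]

    # Parameter update accumulated in the real-valued (full-precision) weights.
    return [w - learning_rate * g for w, g in zip(real_weights, gradients)]


if __name__ == "__main__":
    weights = [0.2, -0.1]
    weights = training_step(weights, inputs=[1.0, 2.0], target=0.0, learning_rate=0.1)
    print(weights, binarize(weights))
```

Because the small updates accumulate in the real-valued weights, a weight can cross zero over successive steps and change its binarized value, which would not be possible if the update were applied directly to the two-valued weights.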
Regarding claim 9, Shi, as modified in view of Pan and in view of Courbariaux, teaches the computing device of claim 8. Pan further teaches wherein:
a first layer of the plurality of layers in the reduced neural network model has a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0054, fig. 5(S515, S520), detailing that the first neural network data (of floating point type) is optimally quantized into a smaller range of fixed-point numbers and fed into the first layer), a second layer of the plurality of layers in the reduced neural network model has a second reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0055, fig. 5(S535, S550), detailing that the first layer outputs a smaller range of fixed-point numbers that are fed into the second layer), and the first reduced bit-width is distinct from the second reduced bit-width in the reduced neural network model (Pan, para. 0050-0052, fig. 4, detailing that, in determining the optimal quantization step size for two layers, different optimal quantization steps may be determined, and because of this the first range of fixed-point numbers for the first layer may be different from the range of fixed-point numbers for the second layer).
Regarding claim 11, Shi, as modified in view of Pan and in view of Courbariaux, teaches the computing device of claim 8. Pan further teaches wherein expressing the respective set of parameters of the first layer with the first reduced bit-width includes performing non-uniform quantization on the respective set of parameters of the first layer to generate a first set of quantized parameters for the first layer (Pan, para. 0045-0046, table 2, para. 0054, fig. 5(S510, S515, S520), detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform quantization equation of (14), and thereafter, the fixed-point format of the data is inputted to the first convolutional layer), and a maximal boundary value for the non-uniform quantization of the first layer is selected based on the baseline statistical distribution of activation values for the first layer during each forward propagation through the first layer (Pan, para. 0045-0046, detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform equation of (14), where the definite integral’s upper limit is bounded by $b_q$; $b_q$ is defined as $\frac{1}{2}(\hat{x}_q + \hat{x}_{q+1})$, where both $\hat{x}_q$ and $\hat{x}_{q+1}$ are estimated using the distribution of the data to be quantized at a certain quantization level).
Regarding claim 15, Shi teaches a non-transitory computer-readable storage medium, storing instructions, which, when executed by one or more processors, cause the processors to perform operations comprising: 
obtaining a first neural network model that includes a plurality of layers, wherein each layer of the plurality of layers has a respective set of parameters, and each parameter is expressed with a level of data precision that corresponds to an original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”);
reducing a footprint of the first neural network model on the computing device by using respective reduced bit-widths for storing the respective sets of parameters of different layers of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example in which, after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14 bits in the first neural network to 10 bits in the second neural network), wherein: preferred values of the respective reduced bit-widths are determined through multiple iterations of forward propagation through the first neural network model using a validation data set (Shi, para. 0071-0072, fig. 8(34, 36), “At each optimization iteration, the following actions are taken: calculate and generate a cost J(θ): forward propagate data [training data] using θ to get a value from the cost J(θ).”) until a predefined information loss threshold is met by respective response statistics of the two or more layers (Shi, para. 0060, fig. 8(68), detailing that the end point terminator compares the aggregate weight costs (and accuracy costs) to a target threshold; if the target threshold has not been met, the forward computation component of the bit-depth optimization engine goes through another iteration, until the target threshold is reached), wherein, during reducing the footprint of the first neural network model, for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and generating a reduced neural network model that includes the plurality of layers, wherein each layer of two or more of the plurality of layers includes a respective set of quantized parameters, and each quantized parameter is expressed with the preferred values of the respective reduced bit-widths for the layer as determined through the multiple iterations (Shi, para. 0061, fig. 8(36, 40), detailing an example in which, after many iterations of the forward computation and backward computation of the bit-depth optimization engine, the bit-depth optimization engine reduces the number of bits from 14 bits in the first neural network to 10 bits in the second neural network (i.e., an average decrease of 4 binary bits per weight)).
Shi does not teach: the computing device, while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer.
However, Pan teaches while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths (Pan, para. 0052, fig. 4, “In one embodiment, the first optimal quantization step-size [degree of quantization] may be different from the second optimal quantization step-size, and [because of that,] the first fixed-point format [the number of bits] may be different from the second fixed-point format.”); performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer (Pan, para. 0039-0040, table 1, detailing that by performing the uniform quantization method of Manner I using the error function of equation (4), the maximum bit length under a Gaussian distribution is 8-bits).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method in view of Pan to teach the computing device, while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer. The motivation to do so would be to provide different degrees of quantization, and thus different bit-widths, for different layers based on the distribution of the data (see Pan, para. 0038-0044, para. 0049, table 2, table 3, detailing that if the distribution of the data to be quantized follows a Gaussian distribution, uniform quantization could be performed to find the optimal quantization step-size for a given layer, rather than non-uniform quantization, which works on many different probability distributions; accordingly, the number of bits for an integer (i.e., m) in a fixed-point format is calculated differently for non-uniform quantization than for uniform quantization).
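By way of illustration only, the layer-wise uniform quantization discussed above can be sketched as follows. The sketch is not taken from Shi or Pan; the function name `uniform_quantize`, the symmetric signed grid, the example weights, and the per-layer bit-widths of 8 and 4 are all assumptions chosen for the example.

```python
def uniform_quantize(values, bit_width):
    """Symmetric uniform quantization onto a signed grid of `bit_width` bits.

    Hypothetical helper for illustration; not the procedure of any cited reference.
    """
    if bit_width < 2:
        raise ValueError("bit_width must be at least 2 for a signed grid")
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    levels = 2 ** (bit_width - 1) - 1      # e.g. 127 for 8 bits
    step = max_abs / levels                # one uniform step size for the layer
    quantized = []
    for v in values:
        q = round(v / step)                # nearest grid index
        q = max(-levels, min(levels, q))   # clip to the representable range
        quantized.append(q * step)
    return quantized


# Assumed example data: two layers, each with its own reduced bit-width.
layer_weights = {
    "layer1": [0.82, -0.41, 0.07, -0.93],
    "layer2": [0.15, 0.60, -0.33, 0.02],
}
layer_bits = {"layer1": 8, "layer2": 4}

quantized = {
    name: uniform_quantize(weights, layer_bits[name])
    for name, weights in layer_weights.items()
}
```

In this sketch the step size, and therefore the degree of quantization, differs from layer to layer because each layer is given its own bit-width.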
Shi as modified in view of Pan does not teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer.
However, Courbariaux teaches: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer (Courbariaux, pgs. 3-4, Algorithm 1 (step 3), “A key point to understand with BinaryConnect is that we only binarize the weights during the forward and backward propagations (steps 1 and 2) but not during the parameter update (step 3), as illustrated in Algorithm 1…[h]ence, at training time, BinaryConnect randomly picks one of two values for each weight, for each minibatch, for both the forward and backward propagation phases of backprop. However, the SGD [i.e., stochastic gradient descent] update is accumulated in a real-valued variable storing the parameter.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s non-transitory computer-readable storage medium, in view of Courbariaux, to teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer. The motivation to do so would be to retain the ability to use the Stochastic Gradient Descent algorithm as the learning optimizer (Courbariaux, pg. 3, sec. 2.3 Propagations vs updates, “Keeping good precision weights during the updates is necessary for SGD to work at all.”).
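By way of illustration only, the scheme described in the quoted Courbariaux passage (reduced-precision weights used for propagation, with the update accumulated in a real-valued variable) can be sketched on a one-weight model. The sketch is not Courbariaux's Algorithm 1: the deterministic rounding grid, the squared-error loss, the learning rate, and the names `quantize` and `train` are assumptions made for the example.

```python
def quantize(w, step=0.25):
    """Round a weight to the nearest multiple of `step` (assumed grid)."""
    return round(w / step) * step


def train(samples, w=0.0, lr=0.05, epochs=200, step=0.25):
    """Fit y = w * x by SGD, propagating with a quantized copy of w."""
    for _ in range(epochs):
        for x, target in samples:
            wq = quantize(w, step)      # quantized weight used for propagation
            error = wq * x - target     # forward pass with the quantized weight
            grad = 2.0 * error * x      # gradient of squared error w.r.t. wq
            w -= lr * grad              # update accumulates in the real-valued w
    return w, quantize(w, step)


# Assumed toy data generated by y = 0.6 * x.
real_w, deployed_w = train([(1.0, 0.6), (2.0, 1.2)])
```

Because the small updates accumulate in the real-valued weight, training can move between grid points even though each individual update is much smaller than the quantization step.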

a first layer of the plurality of layers in the reduced neural network model has a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0054, fig. 5(S515, S520), detailing that the first neural network data (of floating point type) is optimally quantized into a smaller range of fixed-point numbers and fed into the first layer), a second layer of the plurality of layers in the reduced neural network model has a second reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0055, fig. 5(S535, S550), detailing that the first layer outputs a smaller range of fixed-point numbers that are fed into the second layer), and the first reduced bit-width is distinct from the second reduced bit-width in the reduced neural network model (Pan, para. 50-52, fig. 4, detailing that in determining the optimal quantization step size for two layers, it may be the case that different optimal quantization steps may be determined and because of this the first range of fixed-point numbers for the first layer may be different than the range of fixed-point numbers for the second layer).
Regarding claim 18, Shi as modified in view of Pan and in view of Courbariaux teaches the method of claim 15; Pan further teaches wherein:
expressing the respective set of parameters of the first layer with the first reduced bit-width includes performing non-uniform quantization on the respective set of parameters of the first layer to generate a first set of quantized parameters for the first layer (Pan, para. 0045-0046, table 2, para. 0054, fig. 5(S510, S515, S520), detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform quantization equation (14), and thereafter, the fixed-point format of the data is inputted to the first layer), and a maximal boundary value for the non-uniform quantization of the first layer is selected based on the baseline statistical distribution of activation values for the first layer during each forward propagation through the first layer (Pan, para. 0045-0046, detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform quantization equation (14), where the definite integral’s upper limit is bounded by $b_q$; $b_q$ is defined as $\frac{1}{2}(\hat{x}_q + \hat{x}_{q+1})$, where both $\hat{x}_q$ and $\hat{x}_{q+1}$ are estimated using the distribution of the data to be quantized at a certain quantization level).
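By way of illustration only, the midpoint rule $b_q = \frac{1}{2}(\hat{x}_q + \hat{x}_{q+1})$ recited above can be sketched as follows. The reconstruction levels in the example are assumed values, not values from Pan, and the helper names `decision_boundaries` and `quantize` are hypothetical.

```python
import bisect


def decision_boundaries(levels):
    """Return b_q = (x_hat_q + x_hat_{q+1}) / 2 for sorted reconstruction levels."""
    levels = sorted(levels)
    return [0.5 * (a + b) for a, b in zip(levels, levels[1:])]


def quantize(x, levels):
    """Map x to the reconstruction level whose decision interval contains it."""
    levels = sorted(levels)
    bounds = decision_boundaries(levels)
    return levels[bisect.bisect_right(bounds, x)]


# Assumed non-uniform levels, denser near zero where most of the data would lie.
levels = [-2.0, -0.6, 0.0, 0.6, 2.0]
bounds = decision_boundaries(levels)
```

Each boundary sits halfway between two neighboring reconstruction levels, so an input is always assigned to the nearest level.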
Claim Rejections - 35 USC § 103
Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations.” ArXiv:1511.00363, 2016 (“Courbariaux”), and further in view of Tu et al. (Tu, Ming, et al. “Ranking the Parameters of Deep Neural Networks Using the Fisher Information.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, “Tu”).
 Regarding claim 3, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1, wherein reducing the footprint of the first neural network includes: 
for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”): while the respective set of parameters of the first layer are expressed with a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example that after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14-bits in the first neural network to 10-bits in the second neural network).
Shi as modified in view of Pan and in view of Courbariaux does not teach: collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.
However, Tu teaches collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model (Tu, page 2 col. 2, page 3, algorithm 1, detailing that to compute the Fisher Information Matrix, a fully trained deep neural network is inputted with parameters $\theta$, input data X, and the input-output function $Y = f(X; \theta)$, which is the original network output distribution); collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model (Tu, page 2 cols. 1-2, page 3, algorithm 1, detailing that a small random perturbation around $\theta$ (the optimal value) is used to modify the original network output distribution),
determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer (Tu, page 2 col. 2, page 3, algorithm 1, detailing that the $D_\alpha$ divergence as calculated by equation (4) is used to compare the original network output distribution and the modified network output distribution); and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold (Tu, page 3 col. 2 para. 1, pg. 4 section 4.2, fig. 5, detailing that the parameters that are still left (after pruning) are assigned bit-widths based on their threshold of importance in the output of the neural network as determined by the diagonal components of the FIM matrix).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan, and as modified by Courbariaux, and in view of Tu to teach collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.
The motivation to do so would be to have a way to rank the most important parameters of a neural network based on non-parametric statistics (see Tu, pg. 1, cols. 1-2, sec. 1, detailing that by developing a method to rank parameters by their relative importance, without making assumptions about the neural network’s outputted distributional statistics, this ranking can be applied to arbitrary problems (i.e., problems in which the distribution is not known) and used to assign more bits to more relevant parameters).
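By way of illustration only, the claimed comparison of a baseline activation distribution against a modified (reduced bit-width) activation distribution can be sketched as follows. The sketch is not Tu's Fisher-information procedure: it substitutes a histogram-based Kullback-Leibler divergence for the $D_\alpha$ divergence, models the layer as a single ReLU unit, and uses a simplified stopping rule (stop at the first bit-width whose divergence exceeds the threshold). All function names, the random data, and the candidate bit-widths are assumptions.

```python
import math
import random


def quantize(values, bit_width):
    """Symmetric uniform quantization of a weight vector (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    levels = 2 ** (bit_width - 1) - 1
    step = max_abs / levels
    return [round(v / step) * step for v in values]


def layer_activations(weights, inputs):
    """ReLU activation of one unit for every input sample."""
    return [max(0.0, sum(w * x for w, x in zip(weights, sample))) for sample in inputs]


def histogram(values, hi, bins=16, eps=1e-9):
    """Normalized histogram on [0, hi], smoothed so that no bin is exactly zero."""
    counts = [0] * bins
    for v in values:
        idx = min(bins - 1, int(bins * v / hi)) if hi > 0 else 0
        counts[idx] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def smallest_bit_width(weights, inputs, candidate_bits, threshold):
    """Lower the bit-width until the activation divergence exceeds `threshold`."""
    base_act = layer_activations(weights, inputs)
    hi = max(base_act)
    baseline = histogram(base_act, hi)          # baseline statistical distribution
    best = None
    for bits in sorted(candidate_bits, reverse=True):
        mod_act = layer_activations(quantize(weights, bits), inputs)
        modified = histogram(mod_act, hi)       # modified statistical distribution
        if kl_divergence(baseline, modified) > threshold:
            break
        best = bits
    return best


# Assumed stand-in for a validation data set.
rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(8)]
inputs = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(500)]
chosen = smallest_bit_width(weights, inputs, [2, 4, 8, 16], threshold=0.01)
```

The threshold of 0.01 is arbitrary; in practice it would be the predefined information-loss tolerance for the layer.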
Regarding claim 10, Shi as modified in view of Pan and in view of Courbariaux, teaches the computing device of claim 8, wherein reducing the footprint of the first neural network includes: 
for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”): while the respective set of parameters of the first layer are expressed with a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example that after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14-bits in the first neural network to 10-bits in the second neural network).
Shi as modified in view of Pan and in view of Courbariaux does not teach: collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.

However, Tu teaches collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model (Tu, page 2 col. 2, page 3, algorithm 1, detailing that to compute the Fisher Information Matrix, a fully trained deep neural network is inputted with parameters $\theta$, input data X, and the input-output function $Y = f(X; \theta)$, which is the original network output distribution); collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model (Tu, page 2 cols. 1-2, page 3, algorithm 1, detailing that a small random perturbation around $\theta$ (the optimal value) is used to modify the original network output distribution),
determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer (Tu, page 2 col. 2, page 3, algorithm 1, detailing that the $D_\alpha$ divergence as calculated by equation (4) is used to compare the original network output distribution and the modified network output distribution); and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold (Tu, page 3 col. 2 para. 1, pg. 4 section 4.2, fig. 5, detailing that the parameters that are still left (after pruning) are assigned bit-widths based on their threshold of importance in the output of the neural network as determined by the diagonal components of the FIM matrix).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s computing device, as modified by Pan, and as modified by Courbariaux, and in view of Tu to teach collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.
The motivation to do so would be to have a way to rank the most important parameters of a neural network based on non-parametric statistics (see Tu, pg. 1, cols. 1-2, sec. 1, detailing that by developing a method to rank parameters by their relative importance, without making assumptions about the neural network’s outputted distributional statistics, this ranking can be applied to arbitrary problems (i.e., problems in which the distribution is not known) and used to assign more bits to more relevant parameters).

Regarding claim 17, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 15, wherein reducing the footprint of the first neural network includes: 
for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”): while the respective set of parameters of the first layer are expressed with a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Shi, para. 0034-0035, fig. 4(48, 36, 40), detailing an example that after many cycles of adjusting weights and re-calculating costs performed by the bit-depth optimization engine, the bit-depth optimization engine reduces the average bit size from 14-bits in the first neural network to 10-bits in the second neural network).
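Shi as modified in view of Pan and in view of Courbariaux does not teach: collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.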
However, Tu teaches collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model (Tu, page 2 col. 2, page 3, algorithm 1, detailing that to compute the Fisher Information Matrix, a fully trained deep neural network is inputted with parameters $\theta$, input data X, and the input-output function $Y = f(X; \theta)$, which is the original network output distribution); collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model (Tu, page 2 cols. 1-2, page 3, algorithm 1, detailing that a small random perturbation around $\theta$ (the optimal value) is used to modify the original network output distribution),
determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer (Tu, page 2 col. 2, page 3, algorithm 1, detailing that the $D_\alpha$ divergence as calculated by equation (4) is used to compare the original network output distribution and the modified network output distribution); and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold (Tu, page 3 col. 2 para. 1, pg. 4 section 4.2, fig. 5, detailing that the parameters that are still left (after pruning) are assigned bit-widths based on their threshold of importance in the output of the neural network as determined by the diagonal components of the FIM matrix).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan, and as modified by Courbariaux and in view of Tu to teach collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold.
The motivation to do so would be to have a way to rank the most important parameters of a neural network based on non-parametric statistics (see Tu, pg. 1, cols. 1-2, sec. 1, detailing that by developing a method to rank parameters by their relative importance, without making assumptions about the neural network’s outputted distributional statistics, this ranking can be applied to arbitrary problems (i.e., problems in which the distribution is not known) and used to assign more bits to more relevant parameters).
Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations.” ArXiv:1511.00363, 2016 (“Courbariaux”), and further in view of Gupta et al. (Gupta et al., “Deep learning with limited numerical precision.” In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML’15), 2015, pgs. 1737–1746, “Gupta”).
Regarding claim 5, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1, wherein obtaining the first neural network model that includes the plurality of layers includes: 
during training of the first neural network: for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and adding the integer regularization term to a bias term during forward propagation through the first layer (Shi, para. 0066-0079, detailing that the bit depth penalty term is the sum of both weights and biases and that the bit depth penalty term is added to the overall cost function $J(\theta)$, in which $\theta$ (the parameters) are forward propagated) such that gradients during backward propagation through the first layer are altered to push values of the first set of parameters toward integer values (Shi, para. 0071-0080, fig. 5, detailing that the bit depth penalty term is incorporated by the gradient update in equation (5) and the cost function encourages weight values near bit-depth boundaries to take on lower integer ranges).
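By way of illustration only, a regularization term that alters gradients so as to push parameter values toward integers can be sketched as follows. The sketch is neither Shi's bit-depth penalty nor Gupta's rounding scheme: the squared distance to the nearest integer, the strength and learning-rate values, and the function names are all assumptions made for the example.

```python
def integer_regularizer(weights, strength=1.0):
    """Penalty that is zero exactly when every weight is an integer."""
    return strength * sum((w - round(w)) ** 2 for w in weights)


def integer_regularizer_grad(weights, strength=1.0):
    """Gradient of the penalty, treating round(w) as locally constant."""
    return [2.0 * strength * (w - round(w)) for w in weights]


def regularized_step(weights, task_grad, lr=0.1, strength=1.0):
    """One gradient-descent step on the task loss plus the integer penalty."""
    reg_grad = integer_regularizer_grad(weights, strength)
    return [w - lr * (g + r) for w, g, r in zip(weights, task_grad, reg_grad)]


# Assumed example: with a zero task gradient, only the penalty moves the weights.
weights = [0.8, -1.3, 2.1]
for _ in range(20):
    weights = regularized_step(weights, [0.0, 0.0, 0.0])
```

With the task gradient set to zero, each step shrinks every weight's distance to its nearest integer by a constant factor, which is the sense in which the added term alters the backward-propagated gradients.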

However, Gupta teaches obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights (Gupta, pgs. 3-4, sec. 3.1, pg. 4 sec. 4 col. 2 para. 1, detailing the stochastic rounding as the following:
$$\begin{cases} \mathrm{floor}(x) & \text{w.p. } 1 - \dfrac{x - \mathrm{floor}(x)}{\epsilon} \\ \mathrm{floor}(x) + \epsilon & \text{w.p. } \dfrac{x - \mathrm{floor}(x)}{\epsilon} \end{cases}$$
where floor is the floor function, x equals the value of each parameter (e.g., weights), and floor(x) equals the integer portion of each parameter (e.g., weights) as defined by $\epsilon$, which is the smallest positive number that may be represented in a given fixed-point format).
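For illustration, the stochastic rounding rule quoted above can be sketched in Python as follows. This is a minimal sketch and not code from Gupta: the function name `stochastic_round` and its parameters are hypothetical, and the sketch assumes that floor(x) denotes the largest integer multiple of ϵ that does not exceed x.

```python
import math
import random


def stochastic_round(x: float, eps: float, rng: random.Random) -> float:
    """Round x to a multiple of eps.

    Returns floor(x) with probability 1 - (x - floor(x)) / eps,
    and floor(x) + eps with probability (x - floor(x)) / eps.
    Assumption: floor(x) is the largest multiple of eps not exceeding x.
    """
    lower = math.floor(x / eps) * eps      # floor(x) on the eps grid
    p_up = (x - lower) / eps               # probability of rounding up
    return lower + eps if rng.random() < p_up else lower


# Usage: round 0.3 onto a grid with step 0.25 (two fractional bits).
rng = random.Random(0)
rounded = [stochastic_round(0.3, 0.25, rng) for _ in range(5)]
```

Because the probability of rounding up is proportional to the distance from floor(x), the expected value of the rounded result equals x under this rule.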
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan and Courbariaux, in view of Gupta to teach obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. The motivation to do so would be to further constrain the parameters in neural networks to take on lower precision when dealing with fixed-point numbers (see Gupta, pg. 2 sec. 1, col. 1 para. 2, detailing that the key finding is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that stochastic rounding is applied while operating on fixed-point numbers).
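For illustration, a regularization term based on the difference between weights and their integer portions, added to a cost and folded into a gradient update, might be sketched as follows. This is a hypothetical sketch under stated assumptions and does not reproduce any equation from Shi or Gupta: the squared form of the penalty, the coefficient `lam`, the learning rate `lr`, and all function names are assumptions chosen only to make the idea concrete.

```python
import math


def integer_penalty(weights, lam):
    """Hypothetical penalty: lam * sum of (w - floor(w))**2 over the weights."""
    return lam * sum((w - math.floor(w)) ** 2 for w in weights)


def integer_penalty_grad(weights, lam):
    """Gradient of the penalty, treating floor(w) as locally constant."""
    return [2.0 * lam * (w - math.floor(w)) for w in weights]


def sgd_step(weights, task_grad, lam, lr):
    """One gradient step on (task cost + integer penalty)."""
    pen_grad = integer_penalty_grad(weights, lam)
    return [w - lr * (g + p) for w, g, p in zip(weights, task_grad, pen_grad)]


# Usage: with a zero task gradient, only the penalty moves the weights.
weights = [1.25, 2.0]
updated = sgd_step(weights, task_grad=[0.0, 0.0], lam=1.0, lr=0.1)
```

In this sketch a weight that already sits on an integer receives no penalty gradient, while a non-integer weight is pulled toward its integer portion.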
 obtaining the first neural network model that includes the plurality of layers includes: 
during training of the first neural network: for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and adding the integer regularization term to a bias term during forward propagation through the first layer (Shi para. 0066-79, detailing that the bit depth penalty term is the sum of both weights and biases and that the bit depth penalty term is added to the overall cost function $J(\theta)$, in which $\theta$ (the parameters) are forward propagated) such that gradients during backward propagation through the first layer are altered to push values of the first set of parameters toward integer values (Shi para. 0071-80, fig. 5, detailing that the bit depth penalty term is incorporated by the gradient update in equation (5) and the cost function encourages weight values near bit-depth boundaries to take on lower integer ranges).
Shi as modified in view of Pan and in view of Courbariaux does not teach: obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. 
However, Gupta teaches obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights (Gupta, pgs. 3-4, sec. 3.1, pg. 4
$$\begin{cases} \mathrm{floor}(x) & \text{w.p. } 1 - \dfrac{x - \mathrm{floor}(x)}{\epsilon} \\ \mathrm{floor}(x) + \epsilon & \text{w.p. } \dfrac{x - \mathrm{floor}(x)}{\epsilon} \end{cases}$$
where floor is the floor function, x equals the value of each parameter (e.g., weights), and floor(x) equals the integer portion of each parameter (e.g., weights) as defined by $\epsilon$, which is the smallest positive number that may be represented in a given fixed-point format).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan and Courbariaux, in view of Gupta to teach obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. The motivation to do so would be to further constrain the parameters in neural networks to take on lower precision when dealing with fixed-point numbers (see Gupta, pg. 2 sec. 1, col. 1 para. 2, detailing that the key finding is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that stochastic rounding is applied while operating on fixed-point numbers).
Regarding claim 19, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 15, wherein obtaining the first neural network model that includes the plurality of layers includes: 
during training of the first neural network: for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and adding the integer regularization term to a bias term during forward propagation through the first layer (Shi para. 0066-79, detailing that the bit depth penalty term is the sum of both weights and biases and that the bit depth penalty term is added to the overall cost function $J(\theta)$, in which $\theta$ (the parameters) are forward propagated) such that gradients during backward propagation through the first layer are altered to push values of the first set of parameters toward integer values (Shi para. 0071-80, fig. 5, detailing that the bit depth penalty term is incorporated by the gradient update in equation (5) and the cost function encourages weight values near bit-depth boundaries to take on lower integer ranges).
Shi as modified in view of Pan and in view of Courbariaux does not teach: obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. 
However, Gupta teaches obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights (Gupta, pgs. 3-4, sec. 3.1, pg. 4 sec. 4 col. 2 para. 1, detailing the stochastic rounding as the following:
$$\begin{cases} \mathrm{floor}(x) & \text{w.p. } 1 - \dfrac{x - \mathrm{floor}(x)}{\epsilon} \\ \mathrm{floor}(x) + \epsilon & \text{w.p. } \dfrac{x - \mathrm{floor}(x)}{\epsilon} \end{cases}$$
where floor is the floor function, x equals the value of each parameter (e.g., weights), and floor(x) equals the integer portion of each parameter (e.g., weights) as defined by $\epsilon$, which is the smallest positive number that may be represented in a given fixed-point format).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan and 
Response to Arguments
Applicant’s arguments, filed 11/10/2020, with respect to the rejection of claims 1-2, 4, 6-9, 11, 13-16, 18, and 20 under 35 U.S.C. §103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” ArXiv:1511.00363, 2016 (“Courbariaux”).
Shi teaches: wherein during reducing the footprint of the first neural network model, for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model. See pages 4, 9, 14, and 15 of the current Office Action for further clarification on this limitation. Pan teaches: performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer. See pages 5, 8, 10, 15, and 16 of the current Office Action for further clarification on this limitation. And Courbariaux teaches: forgoing 

Conclusion	

9. 	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM CLARK STANDKE whose telephone number is (571)270-1806.  The examiner can normally be reached on 7:00-5:00 M-Th.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to 

/ADAM C STANDKE/Examiner, Art Unit 2122   

/ERIC NILSSON/Primary Examiner, Art Unit 2122                                                                                                                                                                                                        

1 Note: In accordance with MPEP §2173.01(I), this limitation has been given its broadest reasonable interpretation consistent with the specification. The specification details that forgoing uniform quantization on the first set of parameters/weights with the predefined reduced bit-width happens during the gradient update portion (i.e., after backpropagation).
2 Note: In accordance with MPEP §2173.01(I), this limitation has been given its broadest reasonable interpretation consistent with the specification. The specification details that forgoing uniform quantization on the first set of parameters/weights with the predefined reduced bit-width happens during the gradient update portion (i.e., after backpropagation).
3 Note: In accordance with MPEP §2173.01(I), this limitation has been given its broadest reasonable interpretation consistent with the specification. The specification details that forgoing uniform quantization on the first set of parameters/weights with the predefined reduced bit-width happens during the gradient update portion (i.e., after backpropagation).