DETAILED ACTION
Claims 1, 5, 8, 12, 15, and 19 have been amended. Currently, claims 1-5, 8-12, and 15-19 are pending.  
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4, 8-9, 11, 15-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” arXiv:1511.00363, 2016 (“Courbariaux”).
Regarding claim 1, Shi teaches the method of providing an adaptive bit-width neural network model on a computing device, comprising: 
the computing device, wherein the computing device has one or more processors and memory (Shi, para. 0099);
obtaining a first neural network model that includes a plurality of layers, wherein each layer of the plurality of layers has a respective set of parameters, and each parameter is expressed with a level of data precision that corresponds to an original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”);
reducing a footprint of the first neural network model on the computing device by using respective reduced bit-widths for storing the respective sets of parameters of different layers of the first neural network model, the respective sets of parameters include a first set of weights, and a respective reduced bit-width of the first set of weights is greater or equal to 4-bits wherein (Shi, paras. 0034-0035, fig. 4(48, 36, 40), detailing an example in which, after many cycles of adjusting weights and re-calculating costs, the bit-depth optimization engine reduces the average bit size from 14 bits in the first neural network to 10 bits in the second neural network. Note: as the example in para. 0035 of Shi details, bit-depth optimization reduces the initial 14-bit weights to an “average bit-depth…[of] 10 bits…a reduction of 4 binary bits,” which corresponds to the claim limitation of a respective reduced bit-width of the first set of weights is greater or equal to 4-bits; see also footnote 1 for the broadest reasonable interpretation when it comes to the use of alternative language)1: preferred values of the respective reduced bit-widths are determined through multiple iterations of forward propagation through the first neural network model using a validation data set (Shi, paras. 0071-0072, fig. 8(34, 36), “At each optimization iteration, the following actions are taken: calculate and generate a cost J(θ): forward propagate data [training until a predefined information loss threshold is met by respective response statistics of the two or more layers (Shi, para. 0060, fig. 8(68), detailing that the end point terminator compares the aggregate weight costs (and accuracy costs) to a target threshold; if the target threshold has not been met, the forward computation component of the bit-depth optimization engine goes through another iteration, until the target threshold is reached), wherein during reducing the footprint of the first neural network model, for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and generating a reduced neural network model that includes the plurality of layers, wherein each layer of two or more of the plurality of layers includes a respective set of quantized parameters, and each quantized parameter is expressed with the preferred values of the respective reduced bit-widths for the layer as determined through the multiple iterations (Shi, para. 0061, fig. 8(36, 40), detailing an example in which, after many iterations of the forward computation and backward computation of the bit-depth optimization engine, the bit-depth optimization engine reduces the number of bits from 14 bits in the first neural network to 10 bits in the second neural network (i.e., an average decrease of 4 binary bits per weight)).
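The bit-width arithmetic in the Shi example cited above can be checked with a short sketch. The helper name `bits_needed` and the per-weight bit depths in `optimized_bits` are hypothetical values chosen only so that the average matches the 10-bit figure quoted from Shi; they are not taken from the reference.

```python
def bits_needed(max_value: int) -> int:
    """Number of binary bits needed to represent integers 0..max_value."""
    return max(1, max_value.bit_length())

# Shi, para. 0031 example: weight values from 0 to 16K-1 (16K = 16 * 1024).
initial_bits = bits_needed(16 * 1024 - 1)

# Hypothetical per-weight bit depths after optimization (illustrative only),
# chosen so that the average equals the 10 bits quoted from Shi, para. 0035.
optimized_bits = [8, 10, 12, 10]
average_bits = sum(optimized_bits) / len(optimized_bits)

# Average reduction in binary bits per weight.
reduction = initial_bits - average_bits
```

Under these assumptions the initial width is 14 bits and the average reduction is 4 bits per weight, which is the arithmetic the citation relies on.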
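Shi does not teach: while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer.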
However, Pan teaches while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths (Pan, para. 0052, fig. 4, “In one embodiment, the first optimal quantization step-size [degree of quantization] may be different from the second optimal quantization step-size, and [because of that,] the first fixed-point format [the number of bits] may be different from the second fixed-point format.”); performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer (Pan, paras. 0039-0040, table 1, detailing that by performing the uniform quantization method of Manner I using the error function of equation (4), the maximum bit length under a Gaussian distribution is 8 bits).
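As a generic illustration of the concept discussed above (uniform quantization with a layer-specific step size and bit-width), a minimal sketch follows. It is not Pan's Manner I procedure and does not use Pan's error function; the function name `uniform_quantize`, the step sizes, and the sample values are all assumptions made for illustration.

```python
def uniform_quantize(values, step, num_bits):
    """Round each value to the nearest multiple of `step`,
    clipped to the signed integer range of `num_bits` bits."""
    q_min = -(2 ** (num_bits - 1))
    q_max = 2 ** (num_bits - 1) - 1
    quantized = []
    for v in values:
        level = round(v / step)                # nearest quantization level
        level = max(q_min, min(q_max, level))  # clip to the representable range
        quantized.append(level * step)
    return quantized

# Hypothetical parameters quantized with two different step sizes and
# bit-widths, standing in for two layers with different fixed-point formats.
layer1 = uniform_quantize([0.26, -1.3, 0.9], step=0.25, num_bits=8)
layer2 = uniform_quantize([0.26, -1.3, 0.9], step=0.5, num_bits=4)
```

With these assumed inputs the same parameters map to different quantized values in the two layers, which is the sense in which a different step size corresponds to a different degree of quantization.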
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, in view of Pan, to teach while each of two or more layers of the first neural network model is expressed with different degrees of quantization corresponding to different reduced bit-widths; performing uniform quantization on the first set of parameters with a predefined reduced bit-width that is smaller than the original bit-width of the first neural network model during the forward propagation through the first layer. The motivation to do so would be to provide different degrees of quantization for different
Shi does not teach: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer.
However, Courbariaux teaches: forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer (Courbariaux, pgs. 3-4, Algorithm 1 (step 3), “A key point to understand with BinaryConnect is that we only binarize the weights during the forward and backward propagations (steps 1 and 2) but not during the parameter update (step 3), as illustrated in Algorithm 1…[h]ence, at training time, BinaryConnect randomly picks one of two values for each weight, for each minibatch, for both the forward and backward propagation phases of backprop. However, the SGD [i.e., stochastic gradient descent] update is accumulated in a real-valued variable storing the parameter.”).2
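The three steps in the quoted passage can be sketched for a single linear unit as follows. This is a simplified illustration, not the algorithm as published: it assumes deterministic sign binarization (rather than the random selection mentioned in the quotation), a squared-error loss, and hypothetical names `binarize` and `train_step`.

```python
def binarize(w):
    """Deterministic binarization (assumed here): +1 if w >= 0, else -1."""
    return 1.0 if w >= 0 else -1.0

def train_step(real_weights, inputs, target, lr=0.1):
    """One training step for a linear unit y = sum(w_b * x), loss 0.5 * (y - t)**2."""
    # Steps 1 and 2: forward and backward propagation use the binarized weights.
    binary_weights = [binarize(w) for w in real_weights]
    output = sum(wb * x for wb, x in zip(binary_weights, inputs))
    error = output - target
    grads = [error * x for x in inputs]  # gradient with respect to each binarized weight
    # Step 3: the update is accumulated in the real-valued weights (no binarization),
    # clipped to [-1, 1].
    return [max(-1.0, min(1.0, w - lr * g)) for w, g in zip(real_weights, grads)]

weights = train_step([0.3, -0.2], inputs=[1.0, 2.0], target=0.0)
```

In this sketch the stored weights stay real-valued after the update (approximately 0.4 and 0.0 for the assumed inputs), while only the propagation steps see the values +1 and -1.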

Regarding claim 2, Shi, as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1. Pan further teaches wherein:
a first layer of the plurality of layers in the reduced neural network model has a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0054, fig. 5(S515, S520), detailing that the first neural network data (of floating point type) is optimally quantized into a smaller range of fixed-point numbers and fed into the first layer), a second layer of the plurality of layers in the reduced neural network model has a second reduced bit-width that is smaller than the original bit-width of the first neural network model (Pan, para. 0055, fig. 5(S535, S550), detailing that the first layer outputs a smaller range of fixed-point numbers that are fed into the second layer), and the first reduced bit-width is distinct from the second reduced bit-width in the reduced neural network model (Pan, para. 50-52, fig. 4, detailing that in determining the optimal quantization step size for two layers, it may be the case that different optimal quantization steps may be 
Regarding claim 4, Shi, as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1. Pan further teaches expressing the respective set of parameters of the first layer with the first reduced bit-width includes performing non-uniform quantization on the respective set of parameters of the first layer to generate a first set of quantized parameters for the first layer (Pan, paras. 0045-0046, table 2, para. 0054, fig. 5(S510, S515, S520), detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform quantization equation of (14), and thereafter, the fixed-point format of the data is inputted to the first convolutional layer) and a maximal boundary value for the non-uniform quantization of the first layer is selected based on the baseline statistical distribution of activation values for the first layer during each forward propagation through the first layer (Pan, paras. 0045-0046, detailing that the optimal quantization step size can be iteratively found over the original data using the non-uniform equation of (14), where the definite integral’s upper limit is bounded by $b_q$; $b_q$ is defined as $\frac{1}{2}(\hat{x}_q + \hat{x}_{q+1})$, where both $\hat{x}_q$ and $\hat{x}_{q+1}$ are estimated using the distribution of the data to be quantized at a certain quantization level).
Referring to independent claims 8 and 15, they are rejected on the same basis as independent claim 1 since they are analogous claims.
Dependent claims 9 and 16 are rejected on the same basis as dependent claim 2, and dependent claims 11 and 18 are rejected on the same basis as dependent claim 4, since they are analogous claims.
Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” arXiv:1511.00363, 2016 (“Courbariaux”), and further in view of Tu et al. (Tu, Ming, et al., “Ranking the Parameters of Deep Neural Networks Using the Fisher Information,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, “Tu”).
 Regarding claim 3, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1, wherein reducing the footprint of the first neural network includes: 
for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”): while the respective set of parameters of the first layer are expressed with a first reduced bit-width that is smaller than the original bit-width of the first neural network model (Shi, paras. 0034-0035, fig. 4(48, 36, 40), detailing an example in which, after many cycles of adjusting weights and re-calculating costs, the bit-depth optimization engine reduces the average bit size from 14 bits in the first neural network to 10 bits in the second neural network).
Shi as modified in view of Pan and in view of Courbariaux does not teach: collecting a respective baseline statistical distribution of activation values for the first 
However, Tu teaches collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model (Tu, page 2, col. 2, page 3, algorithm 1, detailing that to compute the Fisher Information Matrix, a fully trained deep neural network is inputted with parameters $\theta$, input data $X$, and the input-output function $Y = f(X; \theta)$, which is the original network output distribution); collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model (Tu, page 2, cols. 1-2, page 3, algorithm 1, detailing that a small random perturbation around $\theta$ (the optimal value) is used to modify the original network output distribution),
determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer (Tu, page 2, col. 2, page 3, algorithm 1, detailing that the $D_\alpha$ divergence as calculated by equation (4) is used to compare the original network output distribution and the modified network output distribution); and identifying a minimum value of the first reduced bit-width for which a reduction in the predefined divergence due to a further reduction of bit-width for the first layer is below a predefined threshold (Tu, page 3, col. 2, para. 1, page 4, section 4.2, fig. 5, detailing that the parameters that are still left (after pruning) are assigned bit-widths based on their threshold of importance in the output of the neural network as determined by the diagonal components of the FIM matrix).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan, and as modified by Courbariaux and in view of Tu to teach collecting a respective baseline statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, while the respective sets of parameters of the plurality of layers are expressed with the original bit-width of the first neural network model; collecting a respective modified statistical distribution of activation values for the first layer as the validation data set is forward propagated as input through the first neural network model, determining a predefined divergence between the respective modified statistical distribution of activation values for the first layer and the respective baseline statistical distribution of activation values for the first layer; and identifying a minimum value of 
The motivation to do so would be to have a way to rank the most important parameters of a neural network based on non-parametric statistics (see Tu, pg. 1, cols. 1-2, sec. 1, detailing that by developing a method to rank parameters by their relative importance, without making assumptions about the neural network’s outputted distributional statistics, this ranking can be applied to arbitrary problems (i.e., problems in which the distribution is not known) and can assign more bits to more relevant parameters).
Dependent claims 10 and 17 are rejected on the same basis as dependent claim 3 since they are analogous claims.
Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Shi et al. (US 20170270408, “Shi”) in view of Pan et al. (US 20180349758, “Pan”) and in view of Courbariaux et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” arXiv:1511.00363, 2016 (“Courbariaux”), and further in view of Gupta et al. (Gupta et al., “Deep Learning with Limited Numerical Precision,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML’15), 2015, pgs. 1737–1746, “Gupta”).
Regarding claim 5, Shi as modified in view of Pan and in view of Courbariaux, teaches the method of claim 1, wherein obtaining the first neural network model that includes the plurality of layers includes: 
during training of the first neural network: for a first layer of the two or more layers that has a first set of parameters expressed with the level of data precision corresponding to the original bit-width of the first neural network model (Shi, para. 0031, fig. 4(36), “The weights used in neural network 36 [the first neural network] may have initial values such as 0 to 16K-1, which can be represented by a 14-bit weight.”):
and adding the integer regularization term to a bias term during forward propagation through the first layer (Shi, paras. 0066-0079, detailing that the bit depth penalty term is the sum of both weights and biases and that the bit depth penalty term is added to the overall cost function $J(\theta)$, in which $\theta$ (the parameters) are forward propagated) such that gradients during backward propagation through the first layer are altered to push values of the first set of parameters toward integer values (Shi, paras. 0071-0080, fig. 5, detailing that the bit depth penalty term is incorporated by the gradient update in equation (5) and the cost function encourages weight values near bit-depth boundaries to take on lower integer ranges).
Shi as modified in view of Pan and in view of Courbariaux does not teach: obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. 
However, Gupta teaches obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights (Gupta, pgs. 3-4, sec. 3.1, pg. 4, sec. 4, col. 2, para. 1, detailing the stochastic rounding
$$\begin{cases} \mathrm{floor}(x) & \text{w.p. } 1 - \dfrac{x - \mathrm{floor}(x)}{\epsilon} \\ \mathrm{floor}(x) + \epsilon & \text{w.p. } \dfrac{x - \mathrm{floor}(x)}{\epsilon} \end{cases}$$
where floor is the floor function, $x$ equals the value of each parameter (e.g., weights), and floor($x$) equals the integer portion of each parameter (e.g., weights) as defined by $\epsilon$, which is the smallest positive number that may be represented in a given fixed-point format).
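The rounding rule cited above can be sketched as follows. This is an illustrative sketch only: the function name `stochastic_round` is hypothetical, and the code assumes that floor(x) denotes the largest multiple of eps that is less than or equal to x.

```python
import math
import random

def stochastic_round(x, eps, rng=random):
    """Round x to a multiple of eps: down with probability 1 - (x - floor)/eps,
    up with probability (x - floor)/eps."""
    lower = math.floor(x / eps) * eps  # assumed: largest multiple of eps <= x
    p_up = (x - lower) / eps           # probability of rounding up
    return lower + eps if rng.random() < p_up else lower

# Seeded generator so the illustration is repeatable.
rng = random.Random(0)
samples = [stochastic_round(0.3, eps=0.25, rng=rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

Each sample is either 0.25 or 0.5, and the sample mean stays close to the unrounded value 0.3, which reflects the unbiased character of this rounding rule.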
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shi’s method, as modified by Pan and Courbariaux, in view of Gupta to teach obtaining an integer regularization term corresponding to the first layer in accordance with a difference between a first set of weights that corresponds to the first layer and integer portions of the first set of weights. The motivation to do so would be to further constrain the parameters in neural networks to take on lower precision when dealing with fixed-point numbers (see Gupta, pg. 2, sec. 1, col. 1, para. 2, detailing that the key finding is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that stochastic rounding is applied while operating on fixed-point numbers).
Dependent claims 12 and 19 are rejected on the same basis as dependent claim 5 since they are analogous claims.
Response to Arguments
Applicant's arguments filed 06/29/2021 have been fully considered but they are not persuasive. Applicant states that it is uncertain which portion of the specification suggests that forgoing uniform quantization happens “after backpropagation.” See pgs. 12-13 of Applicant’s 06/29/2021 Remarks. Examiner respectfully reminds Applicant that the Patent and Trademark Office ("PTO") gives claims their broadest reasonable construction "in light of the specification as it would be interpreted by one of ordinary skill in the art." In re Am. Acad. of Sci. Tech. Ctr., 367 F.3d 1359, 1364[, 70 USPQ2d 1827, 1830] (Fed. Cir. 2004) (Emphasis added). Indeed, the rules of the PTO require that application claims must "conform to the invention as set forth in the remainder of the specification and the terms and phrases used in the claims must find clear support or antecedent basis in the description so that the meaning of the terms in the claims may be ascertainable by reference to the description." 37 CFR 1.75(d)(1). Page 16 of Applicant’s specification details a sample training process with the integer weight regularization and the 8-bit uniform quantization added. Para. 0054. The following training process of paragraphs 0055-0056 is recited verbatim from the specification (Fig. 1):
[Fig. 1: training process of paragraphs 0055-0056 of Applicant’s specification (image not reproduced)]
According to the Applicant’s specification, the backward propagation phase happens only when the weights and biases are being updated. 
[Fig. 2: image not reproduced]
Fig. 2 details the backward pass of the backpropagation algorithm for a two-layer network in which the errors from the output units $\delta_k$ and hidden units $\delta_h$ are propagated backwards before updating the weights $w_{ji}$. Mitchell, Tom M. "Machine learning." (1997). What Applicant lists in the specification for backpropagation is only the gradient descent/weight-update portion. Hence the office action of 4/22/2021 included a footnote explaining Examiner’s reasoning when mapping the prior art of Courbariaux to Applicant’s claim 1 limitation of: “forgoing performance of the uniform quantization on the first set of parameters with the predefined reduced bit-width during the backward propagation through the first layer.”(Emphasis added).  
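The ordering relied upon above (error terms first, weight update second) can be illustrated with a short sketch of one stochastic-gradient step for a two-layer network of sigmoid units, in the style of the textbook backpropagation algorithm. It is offered only as an aid to reading and is not taken from Applicant's specification or from any cited reference: the function names, the learning rate `eta`, the omission of bias terms, and the sample values are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One stochastic-gradient step for a two-layer sigmoid network (no bias terms)."""
    # Forward pass: hidden-unit outputs o_h, then output-unit outputs o_k.
    o_h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    o_k = [sigmoid(sum(w * oh for w, oh in zip(row, o_h))) for row in w_out]
    # Backward pass, part 1: error terms delta_k for the output units.
    delta_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o_k, t)]
    # Backward pass, part 2: error terms delta_h for the hidden units,
    # computed from the output weights as they stand before any update.
    delta_h = [
        oh * (1 - oh) * sum(w_out[k][h] * delta_k[k] for k in range(len(w_out)))
        for h, oh in enumerate(o_h)
    ]
    # Weight update: performed only after both sets of error terms exist.
    new_w_out = [
        [w + eta * delta_k[k] * o_h[h] for h, w in enumerate(row)]
        for k, row in enumerate(w_out)
    ]
    new_w_hidden = [
        [w + eta * delta_h[h] * x[i] for i, w in enumerate(row)]
        for h, row in enumerate(w_hidden)
    ]
    return delta_k, delta_h, new_w_hidden, new_w_out


# Hypothetical example: two inputs, two hidden units, one output unit, zero initial weights.
delta_k, delta_h, new_w_hidden, new_w_out = backprop_step(
    x=[1.0, 0.0],
    t=[1.0],
    w_hidden=[[0.0, 0.0], [0.0, 0.0]],
    w_out=[[0.0, 0.0]],
)
```

In this sketch the weight-update lines are the final step of the function; everything that precedes them in the backward pass is the propagation of the error terms.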
While it is true that Courbariaux does not teach the limitation of amended claim 1 that “the respective sets of parameters include a first set of weights, and a respective reduced bit-width of the first set of weights is greater or equal to 4-bits,” Examiner must respectfully remind Applicant that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986) (Emphasis added). Shi does teach the above limitation in paras. 0034-0035 by stating the following: “[a]s an example, the average number of bits for each weight in neural network 36 is 14 bits. Bit-depth optimization engine 48 reduces this average bit-depth to 10 bits in low-bit-depth neural network 40, a reduction of 4 binary bits. This is a 28% reduction in the storage requirements for weights in weights memory 100 (FIG. 1).”(Emphasis added).  
For the foregoing reasons, Examiner is not withdrawing the 103 rejection.
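The arithmetic behind the quoted passage of Shi can be checked with a short calculation. The variable names below are hypothetical; the bit-depths of 14 and 10 are the figures quoted from Shi, and the threshold of 4 is the figure recited in the claim limitation.

```python
# Figures quoted from Shi, paras. 0034-0035.
original_bits = 14
reduced_bits = 10

# Fractional reduction in weight storage: 4 / 14, roughly 28.6%,
# which is consistent with the "28%" stated in the quoted passage.
reduction = (original_bits - reduced_bits) / original_bits

# The reduced bit-depth of 10 is compared against the 4-bit floor recited in the claim.
meets_claimed_floor = reduced_bits >= 4
```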
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke whose telephone number is (571)270-1806.  The examiner can normally be reached on 9:30AM-6:30PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Adam Clark Standke
Assistant Examiner
Art Unit 2122


/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126

1 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to mapping one or more elements to the claim but not all.
2 In accordance with MPEP §2173.01(I), this limitation has been given its broadest reasonable interpretation consistent with the specification. The specification details that forgoing uniform quantization on the first set of parameters/weights with the predefined reduced bit-width happens during the gradient update portion (i.e., after backpropagation).