DETAILED ACTION
This action is responsive to application 17/002,814, filed on 08/26/2020, in which claims 1-8 are presented for examination.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 08/26/2020 and 05/05/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-2, 7 and 8 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhou et al. (DOREFA-NET: TRAINING LOW BITWIDTH CONVOLUTIONAL NEURAL NETWORKS WITH LOW BITWIDTH GRADIENTS), hereinafter referred to as Zhou.

Regarding claim 1, Zhou teaches an information processing device, comprising: a memory; and a processor coupled to the memory (Zhou,  abstract “We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass….Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware…”) and configured to:
determine a plurality of bit ranges after quantization for at least one of a plurality of types of variables to be used in a neural network (Zhou, abstract “We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients [one of a plurality of types of variables] are stochastically quantized to low bitwidth [a plurality of bit ranges after quantization] numbers before being propagated to convolutional layers [neural network]. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1% top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.” Note: The quoted passage shows that weights, activations, and gradients are a plurality of types of variables for which bit quantization is done (quantized to low bitwidth) for use in convolutional layers (neural network); see also Pg. 7, Table 1, which teaches W, A, G parameters with different bit ranges).
calculate a plurality of recognition rates of the neural network by using each of a plurality of variable groups which includes the plurality of types of variables, and in which a bit range of at least one of the plurality of types of variables is different (Zhou 3.1 “We use the prediction accuracy of several CNN [calculate a plurality of recognition rates of the neural network] models on SVHN dataset to evaluate the efficacy of configurations. As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) [plurality of variable groups] one should choose.” Note: As seen in the table below, the variable group (W, A, G) has different bit ranges in different combination groups. For example, in the first row (W, A, G) is (1, 1, 2) and in the next row it is (1, 1, 4), so “G” has changed, representing a different bit range from the previous row. Also note that the recognition rate of a model is the same as its prediction accuracy, as both terms represent how well the model performs in determining the accurate output)


[Image media_image1.png: table from Zhou showing (W, A, G) bit-width combinations and prediction accuracy]


and determine to use a variable group of the plurality of variable groups, the variable group having a maximum recognition rate among the plurality of calculated recognition rates, for calculation of the neural network. (Zhou 3.1 “Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above passages show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups. Also, the table above shows that different combinations of variable values have different accuracy rates and that maximum accuracy (recognition rate) can be achieved by keeping (W, A) = (1, 2) and G ≥ 4.)
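For illustration only, the selection step recited in claim 1 can be sketched as follows. The function name `select_variable_group` and the accuracy values are hypothetical placeholders; they are not taken from Zhou's table or from the application. The sketch assumes only that each variable group is a (W, A, G) tuple of bit widths with one measured recognition rate.

```python
from typing import Dict, Tuple

# A variable group is a (W, A, G) tuple of bit widths for weights,
# activations, and gradients.
VariableGroup = Tuple[int, int, int]


def select_variable_group(rates: Dict[VariableGroup, float]) -> VariableGroup:
    """Return the variable group whose recognition rate is the maximum."""
    if not rates:
        raise ValueError("at least one variable group is required")
    # max() over the keys, ranked by the associated recognition rate.
    return max(rates, key=rates.get)


# Hypothetical recognition rates (placeholders, not values from Zhou).
example_rates = {
    (1, 1, 2): 0.90,
    (1, 1, 4): 0.94,
    (1, 2, 4): 0.96,
}
best_group = select_variable_group(example_rates)
```

With these placeholder values, the group (1, 2, 4) is selected because it carries the largest rate; ties resolve to the first group encountered.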

Regarding claim 2, Zhou teaches the information processing device of claim 1.
Zhou further teaches wherein the processor is configured to: 
execute the calculation of the neural network in such a manner that the calculation is executed in a plurality of calculation cycles each of which includes a group determination period and a calculation execution period (Zhou, Pg. 7, Table 1 teaches a plurality of calculation cycles; note that in each row a combination group of parameters is determined (each row representing a cycle of calculations), and the calculations are then performed using different models to determine model accuracy; here the time to determine the combination of each group is interpreted as the group determination period, and the calculation of accuracy is interpreted as the calculation execution period (see Pg. 1, Introduction, “run-time”)),
calculate the recognition rates and determine the variable group having the maximum recognition rate are operated in the group determination period, and in each of the plurality of calculation cycles, (Zhou 3.1 “Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Also, Zhou introduction “For example, the training process of a DCNN may take up to weeks on a modern multi-GPU server for large datasets like ImageNet (Deng et al., 2009). In light of this, substantial research efforts are invested in speeding up DCNNs at both run-time and training-time, [determination period] on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Han et al., 2015b) and specialized computer hardware (Farabet et al., 2011; Pham et al., 2012; Chen et al., 2014a;b). Various approaches like quantization (Wu et al., 2015) and sparsification (Han et al., 2015a) have also been proposed.”
In addition, Zhou 3.1 “We use the prediction accuracy of several CNN models on SVHN dataset to evaluate the efficacy of configurations. [plurality of calculation cycles]”)
execute the calculation in the calculation execution period by using the variable group determined in the group determination period. (Zhou 3.1 “As balancing between multiple factors like training time [determination period], inference time [calculation execution period], model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) [variable group] one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.”)
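For illustration only, the cycle structure recited in claim 2 can be sketched as follows. The names `run_calculation_cycles`, `evaluate`, and `execute` are hypothetical and appear neither in Zhou nor in the application. The sketch assumes that each cycle first scores every candidate variable group (group determination period) and then runs the calculation with the selected group only (calculation execution period).

```python
from typing import Callable, Dict, List, Sequence, Tuple

VariableGroup = Tuple[int, int, int]


def run_calculation_cycles(
    num_cycles: int,
    candidates: Sequence[VariableGroup],
    evaluate: Callable[[VariableGroup, int], float],
    execute: Callable[[VariableGroup, int], None],
) -> List[VariableGroup]:
    """Run cycles made of a group determination period followed by a
    calculation execution period; return the group chosen in each cycle."""
    if not candidates:
        raise ValueError("at least one candidate variable group is required")
    chosen: List[VariableGroup] = []
    for cycle in range(num_cycles):
        # Group determination period: score every candidate group.
        rates: Dict[VariableGroup, float] = {
            group: evaluate(group, cycle) for group in candidates
        }
        best = max(rates, key=rates.get)
        # Calculation execution period: use only the group chosen above.
        execute(best, cycle)
        chosen.append(best)
    return chosen
```

The caller supplies `evaluate` (a recognition-rate measurement) and `execute` (the training or inference step); both are assumptions of this sketch rather than features of either reference.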


Regarding claim 7, Zhou teaches an information processing method for causing a processor included in an information processing device to execute a process, the process comprising:
determining a plurality of bit ranges after quantization for at least one of a plurality of types of variables to be used in a neural network; (Zhou, abstract “We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients [one of a plurality of types of variables] are stochastically quantized to low bitwidth [a plurality of bit ranges after quantization] numbers before being propagated to convolutional layers [neural network]. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1% top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.” Note: The quoted passage shows that weights, activations, and gradients are a plurality of types of variables for which bit quantization is done (quantized to low bitwidth) for use in convolutional layers (neural network); see also Pg. 7, Table 1, which teaches W, A, G parameters with different bit ranges).
calculating a plurality of recognition rates of the neural network by using each of a plurality of variable groups which includes the plurality of types of variables, and in which a bit range of at least one of the plurality of types of variables is different; (Zhou 3.1 “We use the prediction accuracy of several CNN [calculate a plurality of recognition rates of the neural network] models on SVHN dataset to evaluate the efficacy of configurations. As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) [plurality of variable groups] one should choose.” Note: As seen in the table below, the variable group (W, A, G) has different bit ranges in different combination groups. For example, in the first row (W, A, G) is (1, 1, 2) and in the next row it is (1, 1, 4), so “G” has changed, representing a different bit range from the previous row. Also note that the recognition rate of a model is the same as its prediction accuracy, as both terms represent how well the model performs in determining the accurate output)


[Image media_image1.png: table from Zhou showing (W, A, G) bit-width combinations and prediction accuracy]


and determining to use a variable group of the plurality of variable groups, the variable group having a maximum recognition rate among the plurality of calculated recognition rates, for calculation of the neural network. (Zhou 3.1 “Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above passages show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups. Also, the table above shows that different combinations of variable values have different accuracy rates and that maximum accuracy (recognition rate) can be achieved by keeping (W, A) = (1, 2) and G ≥ 4.)


Regarding claim 8, Zhou teaches a non-transitory computer-readable storage medium storing a program that causes a computer to execute a process, the process comprising:
determining a plurality of bit ranges after quantization for at least one of a plurality of types of variables to be used in a neural network; (Zhou 3.1 “Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above passages show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups. Also, the table above shows that different combinations of variable values have different accuracy rates and that maximum accuracy (recognition rate) can be achieved by keeping (W, A) = (1, 2) and G ≥ 4).
calculating a plurality of recognition rates of the neural network by using each of a plurality of variable groups which includes the plurality of types of variables, and in which a bit range of at least one of the plurality of types of variables is different; (Zhou 3.1 “We use the prediction accuracy of several CNN [calculate a plurality of recognition rates of the neural network] models on SVHN dataset to evaluate the efficacy of configurations. As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) [plurality of variable groups] one should choose.” Note: As seen in the table below, the variable group (W, A, G) has different bit ranges in different combination groups. For example, in the first row (W, A, G) is (1, 1, 2) and in the next row it is (1, 1, 4), so “G” has changed, representing a different bit range from the previous row. Also note that the recognition rate of a model is the same as its prediction accuracy, as both terms represent how well the model performs in determining the accurate output)


[Image media_image1.png: table from Zhou showing (W, A, G) bit-width combinations and prediction accuracy]


and determining to use a variable group of the plurality of variable groups, the variable group having a maximum recognition rate among the plurality of calculated recognition rates, for calculation of the neural network. (Zhou 3.1 “Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above passages show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups. Also, the table above shows that different combinations of variable values have different accuracy rates and that maximum accuracy (recognition rate) can be achieved by keeping (W, A) = (1, 2) and G ≥ 4.)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-6 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Diril et al. (US 20190205094 A1), hereinafter referred to as Diril.

Regarding claim 3, Zhou teaches the information processing device of claim 1.
Zhou further teaches wherein the processor is configured to determine,
based on distribution of most significant bits when a determination target variable of the plurality of bit ranges is represented by a fixed-point number (Zhou 2.6 teaches use of fixed-point numbers), the plurality of bit ranges from the most significant bit side of the distribution (Table 2 teaches ranges based on significant bits; note that here the bit range/width is the most significant bit).
However, Zhou does not teach that the bit width/range is the same as the most significant bit.
Diril teaches:
the bit width/range is the same as the most significant bit (Diril, paragraph “[0043] While FIG. 2 describes an embodiment in which precision levels 202 are generated externally to processing element 122(1) (e.g., as a result of quantizing or otherwise compressing weights 114), FIG. 3 illustrates a processing element 122(2) that, in addition to multiplier group identification unit 220, multiplier groups 230, accumulator 240, and activation unit 250, may include a precision level determination unit 310 that provides precision level information similar to precision levels 202, as depicted in FIG. 2. In some examples, precision level determination unit 310 may determine a value of each bit of each weight 114 to determine the precision level (e.g., bit width, or number of significant bits) of each weight 114. In other embodiments, precision level determination unit 310 may also compress (e.g., quantize) weights 114, as discussed above, as weights 114 are received at processing element 122(2). The various components of processing element 122(2) that are included in processing 122(1), in at least some examples, may operate as described above.” Note: the paragraph above teaches determining precision levels (bit ranges) based on most significant bits)
In view of the teaching of Diril, it would have been obvious for a person of ordinary skill in the art to apply the teaching of Zhou at the time the application was filed in order to determine the precision level of a learning attribute such as weight (Diril, “[0043] While FIG. 2 describes an embodiment in which precision levels 202 are generated externally to processing element 122(1) (e.g., as a result of quantizing or otherwise compressing weights 114), FIG. 3 illustrates a processing element 122(2) that, in addition to multiplier group identification unit 220, multiplier groups 230, accumulator 240, and activation unit 250, may include a precision level determination unit 310 that provides precision level information similar to precision levels 202, as depicted in FIG. 2. In some examples, precision level determination unit 310 may determine a value of each bit of each weight 114 to determine the precision level (e.g., bit width, or number of significant bits) of each weight 114. In other embodiments, precision level determination unit 310 may also compress (e.g., quantize) weights 114, as discussed above, as weights 114 are received at processing element 122(2). The various components of processing element 122(2) that are included in processing 122(1), in at least some examples, may operate as described above.”).
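For illustration only, one possible reading of "distribution of most significant bits" for fixed-point values can be sketched as follows. The names `msb_distribution` and `candidate_bit_ranges`, and the sliding-window interpretation of "from the most significant bit side", are assumptions of this sketch; they are not disclosed by Zhou or Diril and are not asserted to be the applicant's method.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def msb_distribution(values: Iterable[int]) -> Dict[int, int]:
    """Histogram of most-significant-bit positions of fixed-point magnitudes.

    Position 0 is the least significant bit; zero values are skipped.
    """
    counts: Counter = Counter()
    for value in values:
        magnitude = abs(value)
        if magnitude:
            counts[magnitude.bit_length() - 1] += 1
    return dict(counts)


def candidate_bit_ranges(
    dist: Dict[int, int], width: int, num_candidates: int
) -> List[Tuple[int, int]]:
    """Candidate (high_bit, low_bit) windows of `width` bits, starting at the
    highest observed MSB position and sliding toward the LSB."""
    if not dist:
        return []
    top = max(dist)
    ranges: List[Tuple[int, int]] = []
    for shift in range(num_candidates):
        high = top - shift
        low = high - width + 1
        if low < 0:
            break  # the window would extend below bit 0
        ranges.append((high, low))
    return ranges
```

Each returned window is one candidate bit range after quantization; the values passed in are assumed to be the integer representation of the fixed-point variable.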


Regarding claim 4, Zhou as modified by Diril teaches the information processing device of claim 3.
Zhou further teaches wherein the processor is configured to: 
calculate quantization errors when the determination target variable of the plurality of bit ranges is quantized in the plurality of bit ranges (Zhou 2.5 “We have demonstrated deterministic quantization to produce low bitwidth weights and activations [determination target variable]. However, we find stochastic quantization is necessary for low bitwidth gradients [plurality of bit ranges] to be effective. This is in agreement with experiments of (Gupta et al., 2015) on 16-bit weights and 16-bit gradients. To quantize gradients to low bitwidth, [quantized in the plurality of bit ranges] it is important to note that gradients are unbounded and may have significantly larger value range than activations. Recall in Eqn. 11, we can map the range of activations to [0, 1] by passing values through differentiable nonlinear functions……. To further compensate the potential bias introduced by gradient quantization, we introduce an extra noise function N(k) = σ / (2^k − 1) where σ ~ Uniform(−0.5, 0.5). The noise therefore has the same magnitude as the possible quantization error [calculate quantization errors]. We find that the artificial noise to be critical for achieving good performance……..”)
and determine the plurality of bit ranges from the most significant bit side of the distribution in ascending order of the calculated quantization errors. (Zhou 3.2 “From the table, it can be seen that increasing bitwidth [determine the plurality of bit ranges] of activation from 1-bit to 2-bit and even to 4-bit, [in ascending order] while still keep 1-bit weights, leads to significant accuracy increase, approaching the accuracy of model where both weights and activations are 32-bit. Rounding gradients to 6-bit produces similar accuracies as 32-bit gradients, in experiments of “1-1-6” v.s. “1-1-32”, “1-2-6” v.s. “1-2-32”, and “1-3-6” v.s. “1-3-32”.”
Also, Zhou 3.2.1 “Figure 1 shows the evolution of accuracy v.s. epoch curves of DoReFa-Net. It can be seen that quantizing gradients to be 6-bit does not cause the training curve to be significantly different from not quantizing gradients. However, using 4-bit gradients as in “1-2-4” leads to significant accuracy degradation. [ascending order of the calculated quantization errors]”)
However, Zhou does not teach:
and determine the plurality of bit ranges from the most significant bit side of the distribution in ascending order of the calculated quantization errors.
Diril teaches:
and determine the plurality of bit ranges from the most significant bit side of the distribution in ascending order of the calculated quantization errors. (Diril, paragraph “[0043] While FIG. 2 describes an embodiment in which precision levels 202 are generated externally to processing element 122(1) (e.g., as a result of quantizing or otherwise compressing weights 114), FIG. 3 illustrates a processing element 122(2) that, in addition to multiplier group identification unit 220, multiplier groups 230, accumulator 240, and activation unit 250, may include a precision level determination unit 310 that provides precision level information similar to precision levels 202, as depicted in FIG. 2. In some examples, precision level determination unit 310 may determine a value of each bit of each weight 114 to determine the precision level (e.g., bit width, or number of significant bits [distribution of most significant bits]) of each weight 114. In other embodiments, precision level determination unit 310 may also compress (e.g., quantize) weights 114, as discussed above, as weights 114 are received at processing element 122(2). The various components of processing element 122(2) that are included in processing 122(1), in at least some examples, may operate as described above.” Note: the paragraph above teaches determining precision levels (bit ranges) based on most significant bits)
In view of the teaching of Diril, it would have been obvious for a person of ordinary skill in the art to apply the teaching of Zhou at the time the application was filed in order to determine the precision level of a learning attribute such as weight (Diril, “[0043] While FIG. 2 describes an embodiment in which precision levels 202 are generated externally to processing element 122(1) (e.g., as a result of quantizing or otherwise compressing weights 114), FIG. 3 illustrates a processing element 122(2) that, in addition to multiplier group identification unit 220, multiplier groups 230, accumulator 240, and activation unit 250, may include a precision level determination unit 310 that provides precision level information similar to precision levels 202, as depicted in FIG. 2. In some examples, precision level determination unit 310 may determine a value of each bit of each weight 114 to determine the precision level (e.g., bit width, or number of significant bits) of each weight 114. In other embodiments, precision level determination unit 310 may also compress (e.g., quantize) weights 114, as discussed above, as weights 114 are received at processing element 122(2). The various components of processing element 122(2) that are included in processing 122(1), in at least some examples, may operate as described above.”).
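For illustration only, the relationship between a k-bit quantizer, its quantization error, and the noise function N(k) = σ / (2^k − 1) quoted from Zhou 2.5 can be sketched as follows. The uniform quantizer `quantize_k` is assumed here to round a value in [0, 1] to the nearest of 2^k levels; the helper `rank_bitwidths_by_error` is hypothetical and is not taken from Zhou, Diril, or the application.

```python
import random
from typing import Iterable, List, Sequence


def quantize_k(r: float, k: int) -> float:
    """Uniformly quantize r in [0, 1] to the nearest of 2**k levels."""
    levels = (1 << k) - 1
    return round(r * levels) / levels


def quantization_error(r: float, k: int) -> float:
    """Absolute error introduced by k-bit quantization of r."""
    return abs(r - quantize_k(r, k))


def noise(k: int, rng: random.Random) -> float:
    """N(k) = sigma / (2**k - 1) with sigma ~ Uniform(-0.5, 0.5)."""
    sigma = rng.uniform(-0.5, 0.5)
    return sigma / ((1 << k) - 1)


def rank_bitwidths_by_error(
    samples: Sequence[float], bitwidths: Iterable[int]
) -> List[int]:
    """Order candidate bit widths by ascending mean quantization error."""
    if not samples:
        raise ValueError("at least one sample is required")

    def mean_error(k: int) -> float:
        return sum(quantization_error(r, k) for r in samples) / len(samples)

    return sorted(bitwidths, key=mean_error)
```

Under these assumptions both the rounding error and the noise are bounded in magnitude by 0.5 / (2^k − 1), which is consistent with the quoted statement that the noise has the same magnitude as the possible quantization error.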

Regarding claim 5, Zhou as modified by Diril teaches the information processing device of claim 3.
Zhou further teaches wherein the processor is configured to 
execute learning of the neural network by using the determined variable group having the maximum recognition rate. (Zhou 3.1 “We use the prediction accuracy of several CNN models [execute learning of the neural network] on SVHN dataset to evaluate the efficacy of configurations. Model A is a CNN that costs about 80 FLOPs for one 40x40 image, and it consists of seven convolutional layers and one fully-connected layer. Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou 3.1 “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above passages show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups. Also, the table above shows that different combinations of variable values have different accuracy rates and that maximum accuracy (recognition rate) can be achieved by keeping (W, A) = (1, 2) and G ≥ 4.)

Regarding claim 6, Zhou as modified by Diril teaches the information processing device of claim 3.
Zhou further teaches wherein the processor is configured to 
execute inference of the neural network by using the determined variable group having the maximum recognition rate (Zhou, section 3.1: “We use the prediction accuracy of several CNN models [execute learning of the neural network] on SVHN dataset to evaluate the efficacy of configurations. Model A is a CNN that costs about 80 FLOPs for one 40x40 image, and it consists of seven convolutional layers and one fully-connected layer. Model B, C, D is derived from Model A by reducing the number of channels for all seven convolutional layers [calculation of the neural network] by 50%, 75%, 87.5%, respectively. The listed prediction accuracy is the maximum accuracy [maximum recognition rate among the plurality of calculated recognition rates] on test set over 200 epochs. We use ADAM (Kingma & Ba, 2014) learning rule with 0.001 as learning rate.” Also, Zhou, section 3.1: “As balancing between multiple factors like training time, inference time, model size and accuracy is more a problem of practical trade-off, there will be no definite conclusion as which combination of (W, A, G) one should choose. Nevertheless, we find in these experiments that weights, activations and gradients are progressively more sensitive to bitwidth, and using gradients with G ≤ 4 would significantly degrade prediction accuracy. Based on these observations, we take (W, A) = (1, 2) and G ≥ 4 as rational combinations and use them for most of our experiments on ImageNet dataset.” Note: The above paragraphs show that the maximum recognition rate (prediction accuracy) is calculated among different variable groups.
Also, the table above shows that different combinations of variable values have different accuracy rates and that a maximum accuracy (recognition rate) can be achieved; thus, by keeping (W, A) = (1, 2) and G ≥ 4, maximum accuracy can be achieved. Here, the combination (W, A) = (1, 2) and G ≥ 4 is determined to be the best group for inference. Note that inference is being performed, as Table 2 teaches the inference complexity of each group, and Pg. 12, section 5, teaches the inference process).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 10943039 B1 teaches an accumulator, quantizing data based on the most significant bit to the least significant bit to define a range for selection of bits.
US 20040223541 A1 teaches “a method for efficient evaluation of measurement values from a bit error rate measurement for indication of channel quality, the characteristic of the transmission channel, and the bit error rate which is dependent on it, are taken into account via their stochastic distribution. This results in the bit error rate being quantized in a form matched to the channel transmission, for indication of the channel quality.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUMA WASEEM whose telephone number is (571)272-1316. The examiner can normally be reached Monday-Friday (9:00 am - 5:00 pm) EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Justin C. Mikowski, can be reached on (571)272-8525. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HUMA WASEEM/Examiner, Art Unit 4184

/DIANE L LO/Primary Examiner, Art Unit 2466