DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending in this Office action.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yao et al. (US 20200380357 A1) in view of Kundu et al. (US 20180314940 A1), and further in view of Wang et al. (US 20190385050 A1).
Regarding claim 1, Yao teaches a method for neural network quantization (See Yao: Fig. 20, and [0188], “Aspects of INQ techniques will be described with reference to FIGS. 20, 21A-21C, and 22. FIG. 20 is a flowchart illustrating operations in a method for incremental network quantization”), the method comprising:
performing feedforward and backpropagation learning for a plurality of cycles on a first neural network having a first bit precision (See Yao: Fig. 18, and [0164], “Supervised learning is a learning method in which training is performed as a mediated operation, such as when the training dataset 1802 includes input paired with the desired output for the input, or where the training dataset includes input having known output and the output of the neural network is manually graded. The network processes the inputs and compares the resulting outputs against a set of expected or desired outputs. Errors are then propagated back through the system. The training framework 1804 can adjust to adjust the weights that control the untrained neural network 1806. The training framework 1804 can provide tools to monitor how well the untrained neural network 1806 is converging towards a model suitable to generating correct answers based on known input data. The training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with a trained neural network 1808. The trained neural network 1808 can then be deployed to implement any number of machine learning operations”);
obtaining weight differences (See Yao: Fig. 15, and [0153], “Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to adjust the weights within the model to reduce the output error of the network. Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network”) between an initial weight and an updated weight determined by the learning of each cycle for each of layers in the first neural network;
analyzing a statistic of the weight differences (See Yao: Fig. 18, and [0164], “The training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with a trained neural network 1808. The trained neural network 1808 can then be deployed to implement any number of machine learning operations”) for each of the layers;
determining one or more layers, from among the layers, to be quantized with a second bit precision lower than the first bit precision, based on the analyzed statistic (See Yao: Fig. 20, and [0203], “At operation 2020 the weights of a DNN model are partitioned into two groups. FIGS. 21A-21B illustrates an example of a DNN model 2110 in which the weights are divided into two a first group 2115 represented by dashed lines between the nodes on the model and a second group 2120 represented by solid lines between the nodes. Referring to FIG. 22, the first row illustrates results from the first iteration of the proposed three operations. The top left cube 2210 illustrates weight partition operation (operation 2020) generating two disjoint groups. The middle cube 2210 illustrates the quantization operation (2025) on the first weight group, in which the shaded cells are represented in powers of two. The top right cube illustrates the re-training operation (operation 2030) on the second weight group (i.e., the shaded cells). At operation 2035 the quantization and retraining operations are repeated until the model weights are fully quantized as powers of two or zero. This is illustrated in the transition between FIG. 21B and FIG. 21C. In FIG. 22, the lower row depicts results from the second, third, and fourth iterations of the INQ. In the figure, the accumulated portion of the weights that have been quantized undergoes from 50%->75%->87.5%->100%”); and
generating a second neural network by quantizing the determined one or more layers with the second bit precision (See Yao: Figs. 20-22, and [0189], “FIGS. 20, 21A-21C, and 22 illustrate an overview of an INQ for learning lossless low-bit DNN model from any pre-trained full-precision reference on-the-fly. The final low-precision models are efficient both for memory and computation. Further aspects of INQ techniques are described below”).
However, Yao fails to explicitly disclose obtaining weight differences between an initial weight and an updated weight determined by the learning of each cycle for each of layers in the first neural network; and analyzing a statistic of the weight differences for each of the layers.
Kundu, however, teaches obtaining weight differences between an initial weight and an updated weight determined by the learning of each cycle for each of layers in the first neural network (See Kundu: Figs. 14A-F, and [0197], “If the group of pre-trained weights eligible for residual computation at block 1427, the logic 1420 can cause computational logic to proceed to block 1428 to compute residual weights for the group. The residual weights can be computed based on the difference between the full precision pre-trained weights of the group and the ternary weights computed at block 1424. Operations of the logic 1420 can continue, at block 1430, to ternarize the residual weights. At block 1432, the logic 1420 can store the ternarized residual weights. The logic 1420 can then cause computational logic to store the ternarized weights at block 1434 in conjunction with storing the ternarized residual weights at block 1432, or to store only the ternarized weights if the logic 1420 is to bypass computation of residual weights for the group at block 1427”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yao to obtain weight differences between an initial weight and an updated weight determined by the learning of each cycle for each of layers in the first neural network, as taught by Kundu, in order to reduce the error associated with the threshold-based ternarization (See Kundu: Figs. 14A-F, and [0188], “The error associated with the threshold-based ternarization can be reduced using a residual ternary representation by compensating for the loss in accuracy due to the low precision of the ternarized weights. Using residual representation can compensate for the loss in accuracy due to low-precision of the ternary representation, enabling low precision inference to be performed using ternary weights without requiring a re-training operation”). Yao teaches a method and system that may quantize the DNN weights by partitioning the weights into groups and quantizing the group weights by minimizing the error, while Kundu teaches a system and method that may quantize and compress the data based on the input data statistics. Therefore, it would have been obvious to one of ordinary skill in the art to modify Yao in view of Kundu to perform the weight quantization based on the weight statistics. The motivation to modify Yao in view of Kundu is “Use of known technique to improve similar devices (methods, or products) in the same way”.
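For orientation only, the residual computation described in the quoted Kundu paragraphs (a difference between full-precision weights and their ternary approximation, followed by ternarizing that difference) can be pictured with the minimal Python sketch below. It is not taken from Kundu. The function names, the rule that the threshold is a fixed `ratio` of the largest magnitude, and the choice of the scale as the mean magnitude of the retained weights are all assumptions made for illustration.

```python
def ternarize(weights, ratio=0.5):
    """Approximate weights as alpha * s, with each s in {-1, 0, +1}.

    Assumption: the threshold is `ratio` times the largest magnitude, and
    alpha is the mean magnitude of the weights that exceed the threshold.
    """
    threshold = ratio * max(abs(w) for w in weights)
    signs = [1 if w > threshold else -1 if w < -threshold else 0 for w in weights]
    kept = [abs(w) for w, s in zip(weights, signs) if s != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, signs


def residual_ternary(weights, ratio=0.5):
    """Ternarize the weights, then ternarize the residual (the difference)."""
    alpha, signs = ternarize(weights, ratio)
    # Residual: full-precision weight minus its ternary approximation.
    residual = [w - alpha * s for w, s in zip(weights, signs)]
    r_alpha, r_signs = ternarize(residual, ratio)
    return (alpha, signs), (r_alpha, r_signs), residual


def squared_error(weights, approx):
    return sum((w - a) ** 2 for w, a in zip(weights, approx))


weights = [0.9, -0.8, 0.1, -0.05, 0.4]
(alpha, signs), (r_alpha, r_signs), residual = residual_ternary(weights)
ternary_only = [alpha * s for s in signs]
with_residual = [alpha * s + r_alpha * rs for s, rs in zip(signs, r_signs)]
```

In this toy input the residual term recovers the mid-sized weight that the first ternarization zeroed out, so `with_residual` is closer to `weights` than `ternary_only`.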
However, Yao, as modified by Kundu, fails to explicitly disclose analyzing a statistic of the weight differences for each of the layers.
Wang, however, teaches analyzing a statistic of the weight differences for each of the layers (See Wang: Fig. 1, and [0029], “The quantizer component 106 and the quantizer management component 108 can be associated with (e.g., communicatively and/or functionally connected to) each other and with the processor component 102 and the data store 104 via one or more buses. The quantizer component 106 can quantize respective weights of a set of weights based at least in part on respective quantization scale values to generate respective quantized weights, in accordance with the defined quantization criteria, as more fully described herein. The quantizer management component 108 can determine or estimate the respective quantization scale values based at least in part on statistical information and/or statistical functions associated with weight distributions of respective subsets of weights of the set of weights, as more fully described herein”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yao to analyze a statistic of the weight differences for each of the layers, as taught by Wang, in order to reduce or minimize the quantization error and thereby reduce or minimize degradation of accuracy due to quantization of the weights with respect to the training of the deep learning model or system (See Wang: Fig. 2, and [0039], “With regard to the quantization scale, there can be respective quantization scale values that, for a set of weights, can be associated with respective quantization errors for the weights. For each set of weights, there can be a quantization scale value of the respective quantization values that can minimize the quantization error of the weights, wherein such quantization scale value can be defined as”). Yao teaches a method and system that may quantize the DNN weights by partitioning the weights into groups and quantizing the group weights by minimizing the error, while Wang teaches a system and method that may quantize weights based on the weight distribution in order to reduce the error associated with the weight quantization. Therefore, it would have been obvious to one of ordinary skill in the art to modify Yao in view of Wang to perform the weight quantization based on the weight distribution. The motivation to modify Yao in view of Wang is “Use of known technique to improve similar devices (methods, or products) in the same way”.
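For orientation only, the sequence of steps recited in claim 1 can be pictured with the following minimal Python sketch. It is not taken from the application or from Yao, Kundu, or Wang. The function names, the reading of "weight differences" as the before/after difference of each learning cycle, the use of a mean square as the statistic, the fixed `count` of layers to select, and the symmetric uniform quantizer are all assumptions made for illustration.

```python
def per_cycle_differences(history):
    """history[c] holds one layer's weights after cycle c (history[0] is the start).

    Assumption: the difference for a cycle is (weights after) - (weights before).
    """
    return [
        [after - before for before, after in zip(history[c], history[c + 1])]
        for c in range(len(history) - 1)
    ]


def mean_square(diffs_per_cycle):
    """Assumed statistic: mean of the squared differences over all cycles."""
    flat = [d for cycle in diffs_per_cycle for d in cycle]
    return sum(d * d for d in flat) / len(flat)


def layer_statistics(histories):
    return {name: mean_square(per_cycle_differences(h)) for name, h in histories.items()}


def select_layers(stats, count):
    """Pick the `count` layers whose statistic is smallest."""
    return sorted(stats, key=stats.get)[:count]


def quantize_uniform(weights, bits):
    """Hypothetical symmetric uniform quantizer; assumes bits >= 2."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    scale = peak / levels if peak > 0 else 1.0
    return [round(w / scale) * scale for w in weights]


def quantize_network(histories, count, bits):
    """Quantize only the selected layers; the other layers keep their final weights."""
    chosen = set(select_layers(layer_statistics(histories), count))
    return {
        name: quantize_uniform(h[-1], bits) if name in chosen else list(h[-1])
        for name, h in histories.items()
    }


histories = {
    "conv1": [[0.5, -0.5], [0.6, -0.4], [0.7, -0.3]],  # weights move a lot per cycle
    "fc": [[1.0, 2.0], [1.01, 2.0], [1.01, 2.01]],     # weights barely move
}
second_network = quantize_network(histories, count=1, bits=4)
```

In this toy input the `fc` layer has the smaller statistic, so it is the one quantized, while `conv1` is passed through unchanged.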
Regarding claim 2, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Wang teaches the method of claim 1, wherein the statistic comprises a mean square of weight differences for the each of the layers (See Wang: Fig. 1, and [0040], “In some embodiments, the quantizer management component 108 can efficiently estimate or determine the desired quantization scale value α.sub.w* of the respective quantization scale values, to desirably reduce the quantization error of the weights, as a function of a first statistical function, E(w.sup.2), and a second statistical function, such as E(|w|), wherein E(w.sup.2) can be the expected value (e.g., mean value or average value) of the squared values of the weight values (w) of the weight distribution for the set of weights, or portion thereof, and wherein E(|w|) can be the expected value (e.g., mean value or average value) of the absolute values of the weight values of the weight distribution for the set of weights, or portion thereof”).
Regarding claim 3, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Yao teaches the method of claim 1, further comprising sorting the layers in order of a size of the analyzed statistic, wherein the determining of the one or more layers to be quantized comprises identifying layers having a relatively small analyzed statistic size from among the sorted layers (See Yao: Fig. 20, and [0202], “where A.sub.l.sup.(1) denotes the first weight group that needs to be quantized, and A.sub.l.sup.(2) denotes the other weight group that needs to be re-trained. In some examples a pruning-inspired strategy is used to divide the weights of each layer of a pre-trained DNN model into two disjoint groups by determining their absolute values (operation 2010) and comparing their absolute values with layer-wise thresholds which are automatically determined by a given splitting ratio (i.e., a threshold) (operation 2015). In some examples a binary matrix T.sub.l to help distinguish above two categories of weights. That is, T.sub.l(i,j)=0 means W.sub.l(i,j)∈A.sub.l.sup.(1), and T.sub.l(i,j)=1 means W.sub.l(i,j)∈A.sub.l.sup.(2)”).
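For orientation only, the weight-partition step quoted from Yao [0202] (absolute values compared against a layer-wise threshold derived from a splitting ratio, recorded in a binary matrix T) can be pictured with the minimal Python sketch below. It is not taken from Yao. The function name is hypothetical, and the choice to place the larger-magnitude weights in the first (to-be-quantized) group is an assumption, since the quoted passage does not state which side of the threshold goes to which group.

```python
def partition_mask(layer_weights, split_ratio):
    """Return a binary matrix T for one layer.

    T[i][j] == 0: weight belongs to the first group (to be quantized).
    T[i][j] == 1: weight belongs to the second group (to be re-trained).
    Assumption: the first group holds the largest-magnitude weights.
    """
    magnitudes = sorted((abs(w) for row in layer_weights for w in row), reverse=True)
    n_first = int(round(split_ratio * len(magnitudes)))
    if n_first == 0:
        return [[1 for _ in row] for row in layer_weights]
    # Layer-wise threshold implied by the splitting ratio.
    threshold = magnitudes[n_first - 1]
    # Ties at the threshold all fall into the first group.
    return [[0 if abs(w) >= threshold else 1 for w in row] for row in layer_weights]


layer = [[0.9, -0.1], [0.3, -0.7]]
mask = partition_mask(layer, split_ratio=0.5)
```

Note that this partition operates on individual weights within a layer, whereas the claim language above refers to sorting and selecting whole layers.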
Regarding claim 4, Yao, Kundu, and Wang teach all the features with respect to claim 3 as outlined above. Further, Kundu teaches the method of claim 3, wherein the determining of the one or more layers to be quantized comprises identifying the one or more layers to be quantized using a binary search algorithm, in response to an accuracy loss of a neural network being within a threshold in comparison with the first neural network when some layers among the sorted layers are quantized with the second bit precision (See Kundu: Figs. 14A-F, and [0204], “In one embodiment, a sub 8-bit inference pipeline uses ternary weights and 8-bit activations, with minimal or no re-training. The full-precision weights may be converted to low-precision, such that the element-wise distance between full-precision and low-precision weights is small. Consequently, the low-precision weights remain in the neighborhood of pre-trained full-precision weights in the search space”).
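For orientation only, the binary-search limitation recited in claim 4 can be pictured with the minimal Python sketch below. It is not taken from the application or from any cited reference. The function names and the `evaluate_loss` callback are hypothetical, and the sketch assumes that the accuracy loss does not decrease as more of the sorted layers are quantized, which is what makes a binary search applicable.

```python
def layers_within_loss(sorted_layers, evaluate_loss, threshold):
    """Find the longest prefix of `sorted_layers` whose quantization keeps
    the accuracy loss within `threshold`.

    evaluate_loss(layers) is a hypothetical callback that returns the accuracy
    loss, relative to the unquantized network, when `layers` are quantized.
    """
    low, high = 0, len(sorted_layers)
    while low < high:
        mid = (low + high + 1) // 2  # upper midpoint so the loop always advances
        if evaluate_loss(sorted_layers[:mid]) <= threshold:
            low = mid       # quantizing `mid` layers is acceptable; try more
        else:
            high = mid - 1  # too much loss; try fewer
    return sorted_layers[:low]


# Toy stand-in for a real evaluation: each layer adds a fixed integer cost.
toy_cost = {"a": 5, "b": 10, "c": 10, "d": 20}

def toy_loss(layers):
    return sum(toy_cost[name] for name in layers)

chosen = layers_within_loss(["a", "b", "c", "d"], toy_loss, threshold=25)
```

With four layers the search evaluates the network three times here, rather than once per candidate prefix.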
Regarding claim 5, Yao, Kundu, and Wang teach all the features with respect to claim 4 as outlined above. Further, Yao teaches the method of claim 4, wherein the accuracy loss comprises a recognition rate of the neural network (See Yao: Fig. 19, and [0183], “In a third aspect these three operations are repeated on the latest re-trained weight group in an iterative manner until all the weights are quantized, thereby acting as an incremental network quantization and accuracy enhancement procedure. The INQ techniques described herein can resolve aforementioned issues and performed pretty well on the ImageNet large scale classification task using all known DNN models including AlexNet, VGG-16, GoogLeNet and ResNets. Specifically, techniques employing 5-bit, 4-bit and 3-bit low-precision models (re-trained with 8-16 epochs, i.e., 1-2 days on a GPU) have improved or almost same accuracy compared with 32-bit full-precision models. Even for 2-bit ternary models, the accuracy of techniques described herein meets or exceeds other ternary and binary results with significant margins (>2.9%/4.2%) in top-5/top-1 recognition rate”).
Regarding claim 6, Yao, Kundu, and Wang teach all the features with respect to claim 3 as outlined above. Further, Yao teaches the method of claim 3, wherein the determining of the one or more layers to be quantized comprises determining a number of layers from among the sorted layers to be the one or more layers in ascending order of the size of the analyzed statistic (See Yao: Figs. 14A-B, and [0142], “Before a machine learning algorithm can be used to model a particular problem, the algorithm is trained using a training data set. Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set. For example, during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized”).
Regarding claim 7, Yao, Kundu, and Wang teach all the features with respect to claim 3 as outlined above. Further, Yao teaches the method of claim 3, wherein the determining of the one or more layers to be quantized comprises not determining a layer having the smallest analyzed statistic size from among the sorted layers to be the one or more layers to be quantized (See Yao: Figs. 20-22, and [0203], “At operation 2020 the weights of a DNN model are partitioned into two groups. FIGS. 21A-21B illustrates an example of a DNN model 2110 in which the weights are divided into two a first group 2115 represented by dashed lines between the nodes on the model and a second group 2120 represented by solid lines between the nodes. Referring to FIG. 22, the first row illustrates results from the first iteration of the proposed three operations. The top left cube 2210 illustrates weight partition operation (operation 2020) generating two disjoint groups. The middle cube 2210 illustrates the quantization operation (2025) on the first weight group, in which the shaded cells are represented in powers of two. The top right cube illustrates the re-training operation (operation 2030) on the second weight group (i.e., the shaded cells). At operation 2035 the quantization and retraining operations are repeated until the model weights are fully quantized as powers of two or zero. This is illustrated in the transition between FIG. 21B and FIG. 21C. In FIG. 22, the lower row depicts results from the second, third, and fourth iterations of the INQ. In the figure, the accumulated portion of the weights that have been quantized undergoes from 50%->75%->87.5%->100%”).
Regarding claim 8, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Kundu teaches the method of claim 1, wherein
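For orientation only, the incremental schedule described in the quoted Yao paragraph [0203] (partition, quantize one group to powers of two or zero, re-train the rest, and repeat as the quantized portion grows 50% -> 75% -> 87.5% -> 100%) can be pictured with the minimal Python sketch below. It is not taken from Yao. The function names, the exponent range, rounding in the log domain, the largest-magnitude-first ordering, and the optional `retrain` callback (a placeholder for the re-training operation, skipped by default) are all assumptions made for illustration.

```python
import math


def nearest_power_of_two(w, min_exp=-3, max_exp=0):
    """Map w to sign(w) * 2**e, or to zero if it is too small.

    Assumption: the exponent is chosen by rounding log2(|w|), which is an
    approximation of "nearest" in the linear domain.
    """
    if w == 0:
        return 0.0
    exp = round(math.log2(abs(w)))
    if exp < min_exp:
        return 0.0
    return math.copysign(2.0 ** min(exp, max_exp), w)


def incremental_quantize(weights, portions=(0.5, 0.75, 0.875, 1.0), retrain=None):
    """Quantize a growing portion of the weights, freezing each one once quantized."""
    weights = list(weights)
    frozen = [False] * len(weights)
    # Assumption: ordering is fixed up front by the initial magnitudes.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    for portion in portions:
        target = math.ceil(portion * len(weights))
        for i in order[:target]:
            if not frozen[i]:
                weights[i] = nearest_power_of_two(weights[i])
                frozen[i] = True
        if retrain is not None:
            # Placeholder hook: should update only the weights not yet frozen.
            weights = retrain(weights, frozen)
    return weights


quantized = incremental_quantize([0.9, 0.3, -0.12, 0.05])
```

After the final step every weight in this toy input is either zero or a signed power of two.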
the first neural network has layers of fixed point parameters of the first bit precision and is quantized from a third neural network having layers of floating point parameters of a third bit precision that is higher than the first bit precision (See Kundu: Figs. 14A-F, and [0214], “Hardware logic can be configured to perform low-precision fixed point computations using a 32-bit accumulator for low precision computations. In one embodiment, a static grouping mechanism is used to ternarize weights at each convolution layer and quantize the group scaling factors to 8-bit fixed point values. The activations at each layer are quantized to 8-bits. One embodiment uses 8-bit precision for weights of the first convolution and fully connected layers to prevent loss accumulation. Batch normalization parameters may be computed during the inference phase to compensate for the shift in variance that quantization introduces”), and
the quantized second neural network comprises the determined one or more layers have fixed-point parameters of the second bit precision and other layers with the fixed-point parameters of the first bit precision (See Kundu: Figs. 14A-F, and [0215], “FIG. 14D illustrates a flow diagram of operations associated with computational logic, according to an embodiment. The illustrated operations and logic can be applied before computation to quantize a pre-trained full precision model for use in inferencing computation on hardware optimized for INT8 compute. Portions of the logic can also be applied on a layer-by-layer basis for machine learning compute operations to re-quantize activation tensors between layers. In one embodiment conversion logic 1450 includes first conversion sub-logic 1453 that accepts as input a single precision floating point (FP32) weight tensor 1451 and second conversion sub-logic 1454 that accepts an FP32 activation tensor 1452. In one embodiment, the first conversion sub-logic 1453 converts the FP32 weight tensor 1451 to an INT8 ternary tensor 1455, while the second conversion sub-logic 1454 converts the input FP32 activation tensor 1452 to an INT8 tensor 1456. The INT8 ternary tensor 1455 has the formulation (α×Ŵ) and includes one or more sets of ternarized (e.g., 2-bit) weights (Ŵ) and one or more 8-bit scaling factors (α). In one embodiment, some instances of the INT8 ternary tensor 1455 can use 8-bit weights (Ŵ). For example, the weights associated with a first layer and the fully-connected layers of a neural network can use 8-bit weights to reduce the degree of loss accumulation across subsequent layers”).
Regarding claim 9, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Kundu teaches the method of claim 1, further comprising quantizing the layers other than the one or more layers to layers of fixed-point parameters of a fourth bit precision that is lower than the first bit precision and higher than the second bit precision, in response to the first neural network having layers of floating-point parameters of the first bit precision (See Kundu: Figs. 14A-F, and [0216], “The quantized weight and activation data can be provided to a low precision parallel compute unit 1464. In one embodiment, the parallel compute unit 1464 performs 32-bit accumulation can store data internally within registers having at least 32-bits of precision. The internal registers can include an input register 1457 for ternary weights and an input register 1458 for INT8 activations. The parallel compute unit 1464 includes logic 1461 to perform a convolution operation that generates single precision floating point output in the form of an FP32 activation tensor 1452. In one embodiment, the FP32 activation tensor can be re-quantized by conversion logic 1462 to convert the output activation to INT8 for subsequent processing”),
wherein the quantized second neural network comprises the determined one or more layers having fixed-point parameters of the second bit precision and the layers have fixed-point parameters of the fourth bit precision (See Kundu: Figs. 4A-F, and [0103], “For the shared programming model, the graphics acceleration module 446 or an individual graphics processing engine 431-432, N selects a process element using a process handle. In one embodiment, process elements are stored in system memory 411 and are addressable using the effective address to real address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 431-432, N (that is, calling system software to add the process element to the process element linked list). The lower 16-bits of the process handle may be the offset of the process element within the process element linked list”).
Regarding claim 10, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Yao teaches a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method defined in claim 1 (See Yao: Fig. 10, and [0116], “FIG. 10 illustrates exemplary graphics software architecture for a data processing system 1000 according to some embodiments. In some embodiments, software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor core(s) 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system”).
Regarding claim 11, Yao, Kundu, and Wang teach all the features with respect to claim 1 as outlined above. Further, Yao, Kundu, and Wang teach an apparatus for neural network quantization (See Yao: Fig. 20, and [0188], “Aspects of INQ techniques will be described with reference to FIGS. 20, 21A-21C, and 22. FIG. 20 is a flowchart illustrating operations in a method for incremental network quantization”), the apparatus comprising:
a processor (See Yao: Fig. 1, and [0027], “FIG. 1 is a block diagram of a processing system 100, according to an embodiment. In various embodiments the system 100 includes one or more processors 102 and one or more graphics processors 108, and may be a single processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the system 100 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices”) configured to:
perform feedforward and backpropagation learning for a plurality of cycles on a first neural network having a first bit precision (See Yao: Fig. 18, and [0164], “Supervised learning is a learning method in which training is performed as a mediated operation, such as when the training dataset 1802 includes input paired with the desired output for the input, or where the training dataset includes input having known output and the output of the neural network is manually graded. The network processes the inputs and compares the resulting outputs against a set of expected or desired outputs. Errors are then propagated back through the system. The training framework 1804 can adjust to adjust the weights that control the untrained neural network 1806. The training framework 1804 can provide tools to monitor how well the untrained neural network 1806 is converging towards a model suitable to generating correct answers based on known input data. The training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with a trained neural network 1808. The trained neural network 1808 can then be deployed to implement any number of machine learning operations”);
obtain weight differences (See Yao: Fig. 15, and [0153], “Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to adjust the weights within the model to reduce the output error of the network. Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network”) between an initial weight and an updated weight determined by the learning of each cycle for each of layers in the first neural network (See Kundu: Figs. 14A-F, and [0197], “If the group of pre-trained weights eligible for residual computation at block 1427, the logic 1420 can cause computational logic to proceed to block 1428 to compute residual weights for the group. The residual weights can be computed based on the difference between the full precision pre-trained weights of the group and the ternary weights computed at block 1424. Operations of the logic 1420 can continue, at block 1430, to ternarize the residual weights. At block 1432, the logic 1420 can store the ternarized residual weights. The logic 1420 can then cause computational logic to store the ternarized weights at block 1434 in conjunction with storing the ternarized residual weights at block 1432, or to store only the ternarized weights if the logic 1420 is to bypass computation of residual weights for the group at block 1427”);
analyze a statistic of the weight differences (See Yao: Fig. 18, and [0164], “The training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with a trained neural network 1808. The trained neural network 1808 can then be deployed to implement any number of machine learning operations”) for each of the layers (See Wang: Fig. 1, and [0029], “The quantizer component 106 and the quantizer management component 108 can be associated with (e.g., communicatively and/or functionally connected to) each other and with the processor component 102 and the data store 104 via one or more buses. The quantizer component 106 can quantize respective weights of a set of weights based at least in part on respective quantization scale values to generate respective quantized weights, in accordance with the defined quantization criteria, as more fully described herein. The quantizer management component 108 can determine or estimate the respective quantization scale values based at least in part on statistical information and/or statistical functions associated with weight distributions of respective subsets of weights of the set of weights, as more fully described herein”);
determine one or more layers, from among the layers, to be quantized with a second bit precision lower than the first bit precision, based on the analyzed statistic (See Yao: Fig. 20, and [0203], “At operation 2020 the weights of a DNN model are partitioned into two groups. FIGS. 21A-21B illustrates an example of a DNN model 2110 in which the weights are divided into two a first group 2115 represented by dashed lines between the nodes on the model and a second group 2120 represented by solid lines between the nodes. Referring to FIG. 22, the first row illustrates results from the first iteration of the proposed three operations. The top left cube 2210 illustrates weight partition operation (operation 2020) generating two disjoint groups. The middle cube 2210 illustrates the quantization operation (2025) on the first weight group, in which the shaded cells are represented in powers of two. The top right cube illustrates the re-training operation (operation 2030) on the second weight group (i.e., the shaded cells). At operation 2035 the quantization and retraining operations are repeated until the model weights are fully quantized as powers of two or zero. This is illustrated in the transition between FIG. 21B and FIG. 21C. In FIG. 22, the lower row depicts results from the second, third, and fourth iterations of the INQ. In the figure, the accumulated portion of the weights that have been quantized undergoes from 50%->75%->87.5%->100%”); and
generate a second neural network by quantizing the determined one or more layers with the second bit precision (See Yao: Figs. 20-22, and [0189], “FIGS. 20, 21A-21C, and 22 illustrate an overview of an INQ for learning lossless low-bit DNN model from any pre-trained full-precision reference on-the-fly. The final low-precision models are efficient both for memory and computation. Further aspects of INQ techniques are described below”).
Regarding claim 12, Yao, Kundu, and Wang teach all the features with respect to claim 11 as outlined above. Further, Wang teaches that the apparatus of claim 11, wherein the statistic comprises a mean square of weight differences for each of the layers (See Wang: Fig. 1, and [0040], “In some embodiments, the quantizer management component 108 can efficiently estimate or determine the desired quantization scale value α.sub.w* of the respective quantization scale values, to desirably reduce the quantization error of the weights, as a function of a first statistical function, E(w.sup.2), and a second statistical function, such as E(|w|), wherein E(w.sup.2) can be the expected value (e.g., mean value or average value) of the squared values of the weight values (w) of the weight distribution for the set of weights, or portion thereof, and wherein E(|w|) can be the expected value (e.g., mean value or average value) of the absolute values of the weight values of the weight distribution for the set of weights, or portion thereof”).
Regarding claim 13, Yao, Kundu, and Wang teach all the features with respect to claim 11 as outlined above. Further, Yao teaches that the apparatus of claim 11, wherein the processor is further configured to:
sort the layers in order of a size of the analyzed statistic (See Yao: Fig. 20, and [0202], “where A.sub.l.sup.(1) denotes the first weight group that needs to be quantized, and A.sub.l.sup.(2) denotes the other weight group that needs to be re-trained. In some examples a pruning-inspired strategy is used to divide the weights of each layer of a pre-trained DNN model into two disjoint groups by determining their absolute values (operation 2010) and comparing their absolute values with layer-wise thresholds which are automatically determined by a given splitting ratio (i.e., a threshold) (operation 2015). In some examples a binary matrix T.sub.l to help distinguish above two categories of weights. That is, T.sub.l(i,j)=0 means W.sub.l(i,j)∈A.sub.l.sup.(1), and T.sub.l(i,j)=1 means W.sub.l(i,j)∈A.sub.l.sup.(2)”); and
determine layers having relatively small analyzed statistic size from among the sorted layers to be the one or more layers to be quantized (See Yao: Figs. 20-22, and [0203], “At operation 2020 the weights of a DNN model are partitioned into two groups. FIGS. 21A-21B illustrates an example of a DNN model 2110 in which the weights are divided into two a first group 2115 represented by dashed lines between the nodes on the model and a second group 2120 represented by solid lines between the nodes. Referring to FIG. 22, the first row illustrates results from the first iteration of the proposed three operations. The top left cube 2210 illustrates weight partition operation (operation 2020) generating two disjoint groups. The middle cube 2210 illustrates the quantization operation (2025) on the first weight group, in which the shaded cells are represented in powers of two. The top right cube illustrates the re-training operation (operation 2030) on the second weight group (i.e., the shaded cells). At operation 2035 the quantization and retraining operations are repeated until the model weights are fully quantized as powers of two or zero. This is illustrated in the transition between FIG. 21B and FIG. 21C. In FIG. 22, the lower row depicts results from the second, third, and fourth iterations of the INQ. In the figure, the accumulated portion of the weights that have been quantized undergoes from 50%->75%->87.5%->100%”). 
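For illustration only, the limitations recited in claims 12 and 13 (a per-layer mean square of weight differences, sorting the layers by that statistic, and choosing the layers with relatively small values) can be sketched as follows. This is a minimal sketch of the claim language, not code taken from Yao, Kundu, or Wang; the function names, the dictionary layout, and the `count` parameter are assumptions introduced here.

```python
from typing import Dict, List, Sequence


def mean_square(diffs: Sequence[float]) -> float:
    """Mean of the squared weight differences for one layer."""
    return sum(d * d for d in diffs) / len(diffs)


def layers_by_statistic(weight_diffs: Dict[str, Sequence[float]]) -> List[str]:
    """Return layer names sorted in ascending order of the statistic."""
    stats = {name: mean_square(d) for name, d in weight_diffs.items()}
    return sorted(stats, key=stats.get)


def select_layers(weight_diffs: Dict[str, Sequence[float]], count: int) -> List[str]:
    """Pick the `count` layers with the smallest statistic (hypothetical policy)."""
    return layers_by_statistic(weight_diffs)[:count]


# Hypothetical per-layer weight differences between learning cycles.
example = {
    "conv1": [0.5, -0.5],
    "conv2": [0.1, 0.1],
    "fc": [1.0, -1.0],
}
print(select_layers(example, 2))
```

In this sketch the layers whose weights changed least during learning are treated as the candidates for the lower bit precision, which is how the claim ties the sorted order to the selection.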
Regarding claim 14, Yao, Kundu, and Wang teach all the features with respect to claim 13 as outlined above. Further, Kundu teaches that the apparatus of claim 13, wherein the processor is further configured to determine the one or more layers to be quantized using a binary search algorithm, in response to an accuracy loss of a neural network being within a threshold in comparison with the first neural network when some layers among the sorted layers are quantized with the second bit precision (See Kundu: Figs. 14A-F, and [0204], “In one embodiment, a sub 8-bit inference pipeline uses ternary weights and 8-bit activations, with minimal or no re-training. The full-precision weights may be converted to low-precision, such that the element-wise distance between full-precision and low-precision weights is small. Consequently, the low-precision weights remain in the neighborhood of pre-trained full-precision weights in the search space”).
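For illustration only, the binary search recited in claim 14 can be sketched as a search over how many of the sorted layers may be quantized while the accuracy loss stays within a threshold. This is a minimal sketch of the claim language, not code from Kundu or the other references. It assumes the accuracy loss does not decrease as more layers are quantized, and `accuracy_loss` is a hypothetical callback that evaluates a candidate set of layers against the first neural network.

```python
from typing import Callable, List


def max_quantizable_prefix(
    sorted_layers: List[str],
    accuracy_loss: Callable[[List[str]], float],
    threshold: float,
) -> List[str]:
    """Binary-search the longest prefix of sorted_layers whose loss is within threshold.

    Assumes accuracy_loss is non-decreasing in the number of quantized layers.
    """
    lo, hi = 0, len(sorted_layers)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if accuracy_loss(sorted_layers[:mid]) <= threshold:
            lo = mid  # quantizing the first `mid` layers is acceptable
        else:
            hi = mid - 1  # too much loss; try fewer layers
    return sorted_layers[:lo]


# Hypothetical loss model: each quantized layer costs 0.5 points of accuracy.
layers = ["a", "b", "c", "d"]
print(max_quantizable_prefix(layers, lambda q: 0.5 * len(q), 1.2))
```

The search evaluates the network a logarithmic number of times in the number of layers, rather than once per candidate count.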
Regarding claim 15, Yao, Kundu, and Wang teach all the features with respect to claim 14 as outlined above. Further, Yao teaches that the apparatus of claim 14, wherein the accuracy loss comprises a recognition rate of the neural network (See Yao: Fig. 19, and [0183], “In a third aspect these three operations are repeated on the latest re-trained weight group in an iterative manner until all the weights are quantized, thereby acting as an incremental network quantization and accuracy enhancement procedure. The INQ techniques described herein can resolve aforementioned issues and performed pretty well on the ImageNet large scale classification task using all known DNN models including AlexNet, VGG-16, GoogLeNet and ResNets. Specifically, techniques employing 5-bit, 4-bit and 3-bit low-precision models (re-trained with 8-16 epochs, i.e., 1-2 days on a GPU) have improved or almost same accuracy compared with 32-bit full-precision models. Even for 2-bit ternary models, the accuracy of techniques described herein meets or exceeds other ternary and binary results with significant margins (>2.9%/4.2%) in top-5/top-1 recognition rate”).
Regarding claim 16, Yao, Kundu, and Wang teach all the features with respect to claim 13 as outlined above. Further, Yao teaches that the apparatus of claim 13, wherein the processor is further configured to determine a number of layers from among the sorted layers to be the one or more layers in ascending order of the size of the analyzed statistic (See Yao: Figs. 14A-B, and [0142], “Before a machine learning algorithm can be used to model a particular problem, the algorithm is trained using a training data set. Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set. For example, during a supervised learning training process for a neural network, the output produced by the network in response to the input representing an instance in a training data set is compared to the “correct” labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is backward propagated through the layers of the network. The network is considered “trained” when the errors for each of the outputs generated from the instances of the training data set are minimized”).
Regarding claim 17, Yao, Kundu, and Wang teach all the features with respect to claim 13 as outlined above. Further, Yao teaches that the apparatus of claim 13, wherein the processor is further configured to not determine a layer having the smallest analyzed statistic size from among the sorted layers to be the one or more layers to be quantized (See Yao: Figs. 20-22, and [0203], “At operation 2020 the weights of a DNN model are partitioned into two groups. FIGS. 21A-21B illustrates an example of a DNN model 2110 in which the weights are divided into two a first group 2115 represented by dashed lines between the nodes on the model and a second group 2120 represented by solid lines between the nodes. Referring to FIG. 22, the first row illustrates results from the first iteration of the proposed three operations. The top left cube 2210 illustrates weight partition operation (operation 2020) generating two disjoint groups. The middle cube 2210 illustrates the quantization operation (2025) on the first weight group, in which the shaded cells are represented in powers of two. The top right cube illustrates the re-training operation (operation 2030) on the second weight group (i.e., the shaded cells). At operation 2035 the quantization and retraining operations are repeated until the model weights are fully quantized as powers of two or zero. This is illustrated in the transition between FIG. 21B and FIG. 21C. In FIG. 22, the lower row depicts results from the second, third, and fourth iterations of the INQ. In the figure, the accumulated portion of the weights that have been quantized undergoes from 50%->75%->87.5%->100%”).
Regarding claim 18, Yao, Kundu, and Wang teach all the features with respect to claim 11 as outlined above. Further, Kundu teaches that the apparatus of claim 11, wherein
the first neural network has layers of fixed point parameters of the first bit precision and is quantized from a third neural network having layers of floating point parameters of a third bit precision that is higher than the first bit precision (See Kundu: Figs. 14A-F, and [0214], “Hardware logic can be configured to perform low-precision fixed point computations using a 32-bit accumulator for low precision computations. In one embodiment, a static grouping mechanism is used to ternarize weights at each convolution layer and quantize the group scaling factors to 8-bit fixed point values. The activations at each layer are quantized to 8-bits. One embodiment uses 8-bit precision for weights of the first convolution and fully connected layers to prevent loss accumulation. Batch normalization parameters may be computed during the inference phase to compensate for the shift in variance that quantization introduces”), and
the quantized second neural network comprises the determined one or more layers having fixed-point parameters of the second bit precision and other layers with the fixed-point parameters of the first bit precision (See Kundu: Figs. 14A-F, and [0215], “FIG. 14D illustrates a flow diagram of operations associated with computational logic, according to an embodiment. The illustrated operations and logic can be applied before computation to quantize a pre-trained full precision model for use in inferencing computation on hardware optimized for INT8 compute. Portions of the logic can also be applied on a layer-by-layer basis for machine learning compute operations to re-quantize activation tensors between layers. In one embodiment conversion logic 1450 includes first conversion sub-logic 1453 that accepts as input a single precision floating point (FP32) weight tensor 1451 and second conversion sub-logic 1454 that accepts an FP32 activation tensor 1452. In one embodiment, the first conversion sub-logic 1453 converts the FP32 weight tensor 1451 to an INT8 ternary tensor 1455, while the second conversion sub-logic 1454 converts the input FP32 activation tensor 1452 to an INT8 tensor 1456. The INT8 ternary tensor 1455 has the formulation (α×Ŵ) and includes one or more sets of ternarized (e.g., 2-bit) weights (Ŵ) and one or more 8-bit scaling factors (α). In one embodiment, some instances of the INT8 ternary tensor 1455 can use 8-bit weights (Ŵ). For example, the weights associated with a first layer and the fully-connected layers of a neural network can use 8-bit weights to reduce the degree of loss accumulation across subsequent layers”).
Regarding claim 19, Yao, Kundu, and Wang teach all the features with respect to claim 11 as outlined above. Further, Kundu teaches that the apparatus of claim 11, wherein the processor is further configured to quantize layers other than the one or more layers, to layers of fixed-point parameters of a fourth bit precision that is lower than the first bit precision and higher than the second bit precision, in response to the first neural network having layers of floating-point parameters of the first bit precision (See Kundu: Figs. 14A-F, and [0216], “The quantized weight and activation data can be provided to a low precision parallel compute unit 1464. In one embodiment, the parallel compute unit 1464 performs 32-bit accumulation can store data internally within registers having at least 32-bits of precision. The internal registers can include an input register 1457 for ternary weights and an input register 1458 for INT8 activations. The parallel compute unit 1464 includes logic 1461 to perform a convolution operation that generates single precision floating point output in the form of an FP32 activation tensor 1452. In one embodiment, the FP32 activation tensor can be re-quantized by conversion logic 1462 to convert the output activation to INT8 for subsequent processing”), and
the quantized second neural network comprises the determined one or more layers having fixed-point parameters of the second bit precision and the layers having fixed-point parameters of the fourth bit precision (See Kundu: Figs. 4A-F, and [0103], “For the shared programming model, the graphics acceleration module 446 or an individual graphics processing engine 431-432, N selects a process element using a process handle. In one embodiment, process elements are stored in system memory 411 and are addressable using the effective address to real address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 431-432, N (that is, calling system software to add the process element to the process element linked list). The lower 16-bits of the process handle may be the offset of the process element within the process element linked list”).
Regarding claim 20, Yao, Kundu, and Wang teach all the features with respect to claim 11 as outlined above. Further, Yao teaches that the apparatus of claim 11, further comprising a memory storing instructions that, when executed, configures the processor to perform the learning, obtain the weight differences, analyze the statistic, determine the one or more layers, and generate the second neural network (See Yao: Fig. 10, and [0116], “FIG. 10 illustrates exemplary graphics software architecture for a data processing system 1000 according to some embodiments. In some embodiments, software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor core(s) 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system”).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571) 270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached on 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GORDON G LIU/
Primary Examiner, Art Unit 2612