DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2/7/2022 has been entered. Presently, claims 1-5, 8-12, and 15-18 remain pending. Claims 
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 8, and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 8-10, and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. ("Google's neural machine translation system: Bridging the gap between human and machine translation.") in view of Rizvi et al. ("GPGPU accelerated deep object classification on a heterogeneous mobile platform.") and Zhu et al. ("Trained ternary quantization.").
Regarding Claim 1,
Wu teaches one or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: 
processing, …, a trained convolutional neural network (CNN) to generate a processed CNN (pg. 9; For example, in [43], it is demonstrated that a convolutional neural network model can be sped up by a factor of 4-6 with minimal loss on classification accuracy on the ILSVRC-12 benchmark.), the trained CNN having weights in a floating-point format (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below: Weights are represented as floating point before conversion.), wherein the executable computer program instructions provide a machine learning framework to provide a library of machine learning primitives to accelerate machine-learning operations (pg. 10; All other operations, including all the activations (sigmoid, tanh) and elementwise operations (⊙, +) are done using 16-bit integer operations. And pg. 1; To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. An activation function is a primitive; see paragraph [0171] of Applicant’s specification.), processing the trained CNN includes quantizing the weights in the floating-point format to generate weights in an 8-bit integer format having a static precision (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below:), wherein quantizing the weights includes: 
quantizing the weights from the floating-point format to the 8-bit integer format (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below:); and 
performing an inference operation utilizing the processed CNN with the weights in the 8-bit integer format (pg. 10; In quantized inference, the weight matrix Ws is quantized into 8 bits as in equation 12, and the matrix multiplication is done using 8 bit arithmetic.).
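For illustration only, the kind of weight quantization described in the cited passages of Wu (an 8-bit integer matrix paired with a float vector s) can be sketched as follows. The sketch assumes one scale per row equal to that row's maximum absolute value and a symmetric range of -127 to 127; equation 12 of Wu is not reproduced in this action, so these details, and the function names `quantize_rows_int8` and `dequantize_rows`, are assumptions rather than a statement of the reference's method.

```python
def quantize_rows_int8(weights):
    """Quantize each row of a float matrix to signed 8-bit integers.

    Assumption: one float scale per row, set to the row's maximum
    absolute value, with integers in the symmetric range [-127, 127].
    """
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row)
        if scale == 0.0:
            # An all-zero row needs no scaling.
            quantized.append([0] * len(row))
        else:
            quantized.append([round(w / scale * 127.0) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Recover approximate float weights from the integers and row scales."""
    return [[q * s / 127.0 for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    q, s = quantize_rows_int8([[0.2, -1.0, 0.6]])
    print(q, s)
    print(dequantize_rows(q, s))
```

Because the scales are fixed once the trained weights are known, the integer precision in this sketch does not change between layers or inputs at inference time.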
While Wu discloses quantizing weights from a floating-point format to the 8-bit integer format, Wu does not explicitly disclose 
processing, via a graphics multiprocessor having a single instruction multiple thread (SIMT) architecture
generating a quantization table to enable non-uniform quantization of the weights, wherein generating the quantization table includes executing a quantization primitive provided by the machine learning framework, and 
quantizing the weights from the floating-point format to the 8-bit integer format using the quantization table;
However, Rizvi teaches
processing, via a graphics multiprocessor having a single instruction multiple thread (SIMT) architecture (pg. 1; CPUs are well-suited for sequential tasks due to higher operational frequencies, whereas GPUs can execute concurrent tasks efficiently thanks to their Single Instruction Multiple Threads (SIMT) architecture.)
Wu and Rizvi are analogous arts because both are directed towards the same field of endeavor of implementing CNNs using GPUs.
It would have been obvious to one of ordinary skill in the art to modify the GPUs of Wu with the SIMT architecture of Rizvi.
Doing so would allow for concurrently executing tasks for neural network operations. Task scheduling can reduce the training and testing time of neural networks (pg. 1).
Zhu teaches
generating a quantization table to enable non-uniform quantization of the weights, wherein generating the quantization table includes executing a quantization primitive provided by the machine learning framework (pg. 4; To learn the ternary value (codebook), we introduce two quantization factors W_l^p and W_l^n for positive and negative weights in each layer l. During feed-forward, quantized ternary weights w_l^t are calculated as: eq (6). The codebook (i.e. quantization table) is generated during neural network feed-forward operation (i.e. primitive).), and 
quantizing the weights from the floating-point format to the 8-bit integer format using the quantization table (pg. 1; In this paper, we propose Trained Ternary Quantization which uses two full-precision scaling coefficients W_l^p, W_l^n for each layer l, and quantize the weights to {−W_l^n, 0, +W_l^p} instead of traditional {-1, 0, +1} or {-E, 0, +E} where E is the mean of the absolute weight value, which is not learned. And pg. 3, section 4.1; During gradient descent we learn both the quantized ternary weights (the codebook), and choose which of these values is assigned to each weight (choosing the codebook index).);
Wu and Zhu are analogous arts because they are directed towards the same field of endeavor of quantization of neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the quantization of weights of Wu with the codebook of Zhu.
Doing so would allow for reducing the precision of weights in neural networks. This method can lead to improved accuracy and is shown to outperform full-precision models (Abs.).
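For illustration only, the codebook-and-index arrangement described in the cited passages of Zhu can be sketched as follows. The sketch assumes a three-entry codebook {−W_l^n, 0, +W_l^p} and a simple magnitude threshold for choosing the index of each weight; eq (6) of Zhu is not reproduced in this action, so the threshold rule, the fixed (not learned) scaling values, and the names `ternary_quantize` and `lookup` are assumptions.

```python
def ternary_quantize(weights, w_pos, w_neg, threshold):
    """Map each float weight to an index into a three-entry codebook.

    Assumptions: w_pos and w_neg are supplied as fixed positive floats
    (the learning step is not modeled), and a weight maps to zero when
    its magnitude does not exceed the threshold.
    """
    codebook = [-w_neg, 0.0, w_pos]  # non-uniformly spaced levels
    indices = []
    for w in weights:
        if w > threshold:
            indices.append(2)   # positive level
        elif w < -threshold:
            indices.append(0)   # negative level
        else:
            indices.append(1)   # zero level
    return codebook, indices


def lookup(codebook, indices):
    """Reconstruct quantized weights by table lookup."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    table, idx = ternary_quantize([0.9, -0.7, 0.05, -0.02], 0.8, 0.6, 0.1)
    print(table, idx, lookup(table, idx))
```

The levels are non-uniform because the positive and negative entries need not have the same magnitude.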
Regarding Claim 2,
Wu, Rizvi, and Zhu teach the one or more storage mediums of claim 1. Zhu further teaches wherein the quantization table is structured to maintain accuracy of inference by the processed CNN after quantization of the weights of the trained CNN (pg. 1; To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models (32, 44, 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet.).
Regarding Claim 3,
Wu, Rizvi, and Zhu teach the one or more storage mediums of claim 2. Wu further teaches wherein the quantization of the weights of the trained CNN is performed without retraining (pg. 10; In quantized inference, the weight matrix Ws is quantized into 8 bits as in equation 12, and the matrix multiplication is done using 8 bit arithmetic. The calculations within the softmax function and the attention model are not quantized during inference. It is worth emphasizing that during training of the model we use full-precision floating point numbers. The only constraints we add to the model during training are the clipping of the RNN accumulator values into [−δ, δ] and softmax logits into [−γ, γ]. Quantization is performed without retraining.).
Regarding Claim 8,
Wu teaches a system comprising: 
a memory to store data including data relating to one or more convolutional neural networks (CNNs) (pg. 9; For example, in [43], it is demonstrated that a convolutional neural network model can be sped up by a factor of 4-6 with minimal loss on classification accuracy on the ILSVRC-12 benchmark.) and instructions associated with a machine learning framework to provide a library of machine learning primitives to accelerate machine-learning operations (pg. 10; All other operations, including all the activations (sigmoid, tanh) and elementwise operations (⊙, +) are done using 16-bit integer operations. And pg. 1; To accelerate the final translation speed, we employ low-precision arithmetic during inference computations.); 
wherein the one or more graphics multiprocessors are to: 
process a trained CNN to generate a processed CNN (pg. 9, section 6; Many of those previous studies [19, 20, 43, 27] however mostly focus on CNN models with relatively few layers… To reduce quantization errors, additional constraints are added to our model during training so that it is quantizable with minimal impact on the output of the model.), the trained CNN having weights in a floating-point format, wherein processing the trained CNN includes for the one or more graphics multiprocessors (pg. 4; The model is partitioned into multiple GPUs to speed up training. In our setup, we have 8 encoder LSTM layers (1 bi-directional layer and 7 uni-directional layers), and 8 decoder layers.) to quantize the weights in the floating-point format to generate weights in an 8-bit integer format having a static precision (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below: Weights are represented as floating point before conversion.), wherein quantizing the weights includes for the one or more graphics multiprocessors to: 
quantize the weights from the floating-point format to the 8-bit integer format (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below:); and 
perform an inference operation utilizing the processed CNN with weights in the 8-bit integer format (pg. 10; In quantized inference, the weight matrix Ws is quantized into 8 bits as in equation 12, and the matrix multiplication is done using 8 bit arithmetic.).
Wu does not explicitly disclose
one or more processors including one or more graphics multiprocessors having a single instruction multiple thread (SIMT) architecture; and 
generate a quantization table to enable non-uniform quantization of the weights, wherein to generate the quantization table includes to accelerate operations associated with a quantizationAMENDMENT AND RESPONSE UNDER 37 CFR § 1.116Page 4Serial Number: 16/283,021Atty. Dkt. P116243-C1Filing Date: 2/22/19Title: CONVOLUTIONAL NEURAL NETWORK OPTIMIZATION primitive provided by the machine learning framework to cause generation of the quantization table via the one or more graphics multiprocessors, and 
quantize the weights from the floating-point format …using the quantization table; and 
However, Rizvi teaches
one or more processors including one or more graphics multiprocessors having a single instruction multiple thread (SIMT) architecture (pg. 1; CPUs are well-suited for sequential tasks due to higher operational frequencies, whereas GPUs can execute concurrent tasks efficiently thanks to their Single Instruction Multiple Threads (SIMT) architecture.);
Wu and Rizvi are analogous arts because both are directed towards the same field of endeavor of implementing CNNs using GPUs.
It would have been obvious to one of ordinary skill in the art to modify the GPUs of Wu with the SIMT architecture of Rizvi.
Doing so would allow for concurrently executing tasks for neural network operations. Task scheduling can reduce the training and testing time of neural networks (pg. 1).
Zhu teaches
generate a quantization table to enable non-uniform quantization of the weights, wherein to generate the quantization table includes to accelerate operations associated with a quantization primitive provided by the machine learning framework to cause generation of the quantization table via the one or more graphics multiprocessors (pg. 4; To learn the ternary value (codebook), we introduce two quantization factors W_l^p and W_l^n for positive and negative weights in each layer l. During feed-forward, quantized ternary weights w_l^t are calculated as: eq (6). The codebook (i.e. quantization table) is generated during neural network feed-forward operation (i.e. primitive).), and 
quantize the weights from the floating-point format …using the quantization table (pg. 1; In this paper, we propose Trained Ternary Quantization which uses two full-precision scaling coefficients W_l^p, W_l^n for each layer l, and quantize the weights to {−W_l^n, 0, +W_l^p} instead of traditional {-1, 0, +1} or {-E, 0, +E} where E is the mean of the absolute weight value, which is not learned. And pg. 3, section 4.1; During gradient descent we learn both the quantized ternary weights (the codebook), and choose which of these values is assigned to each weight (choosing the codebook index).); and 
Wu and Zhu are analogous arts because they are directed towards the same field of endeavor of quantization of neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the quantization of weights of Wu with the codebook of Zhu.
Doing so would allow for reducing the precision of weights in neural networks. This method can lead to improved accuracy and is shown to outperform full-precision models (Abs.).
Regarding Claim 9,
Claim 9 is the system corresponding to the computer-readable storage medium of claim 2. Claim 9 is substantially similar to claim 2 and is rejected on the same grounds.
Regarding Claim 10,
Claim 10 is the system corresponding to the computer-readable storage medium of claim 3. Claim 10 is substantially similar to claim 3 and is rejected on the same grounds.
Regarding Claim 15,
Wu teaches …the graphics multiprocessor comprising: 
a plurality of processing cores (pg. 11; In all cases, decoding is done on a single machine with two Intel Haswell CPUs, which consists in total of 88 CPU cores (hyperthreads).); and 
one or more cache memories to cache data for the plurality of processing cores (pg. 11; When it is decoded on TPU, certain operations, such as embedding lookup and attention module, remain on the CPU, and all other quantized operations are off-loaded to the TPU.); wherein the graphics multiprocessor is to: 
process a trained convolutional neural network (CNN) to generate a processed CNN (pg. 9, section 6; Many of those previous studies [19, 20, 43, 27] however mostly focus on CNN models with relatively few layers… To reduce quantization errors, additional constraints are added to our model during training so that it is quantizable with minimal impact on the output of the model.), the trained CNN having weights in a floating-point format, wherein processing the trained CNN includes to quantize, via the graphics multiprocessor, the weights in the floating-point format to generate weights in an 8-bit integer format having a static precision (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below: Weights are represented as floating point before conversion.), wherein to quantize the weights includes, via the graphics multiprocessor, to: 
quantize the weights from the floating-point format to the 8-bit integer format (pg. 10; When doing quantized inference, we replace all the floating point operations in equations 10 and 11 with fixed-point integer operations with either 8-bit or 16-bit resolution. The weight matrix W above is represented using an 8-bit integer matrix WQ and a float vector s, as shown below:); and 
perform an inference operation utilizing the processed CNN with the weights in the 8-bit integer format (pg. 10; In quantized inference, the weight matrix Ws is quantized into 8 bits as in equation 12, and the matrix multiplication is done using 8 bit arithmetic. The calculations within the softmax function and the attention model are not quantized during inference.).
Wu does not explicitly disclose
a graphics multiprocessor having a single instruction multiple thread (SIMT) architecture,
generate a quantization table to enable non-uniform quantization of the weights, wherein to generate the quantization table includes to accelerate operations associated with a quantization primitive provided by a machine learning framework to cause generation of the quantization table via the one or more graphics multiprocessors, and 
quantize the weights…using the quantization table;
However, Rizvi teaches
a graphics multiprocessor having a single instruction multiple thread (SIMT) architecture (pg. 1; CPUs are well-suited for sequential tasks due to higher operational frequencies, whereas GPUs can execute concurrent tasks efficiently thanks to their Single Instruction Multiple Threads (SIMT) architecture.);
Wu and Rizvi are analogous arts because both are directed towards the same field of endeavor of implementing CNNs using GPUs.
It would have been obvious to one of ordinary skill in the art to modify the GPUs of Wu with the SIMT architecture of Rizvi.
Doing so would allow for concurrently executing tasks for neural network operations. Task scheduling can reduce the training and testing time of neural networks (pg. 1).
Zhu teaches
generate a quantization table to enable non-uniform quantization of the weights, wherein to generate the quantization table includes to accelerate operations associated with a quantization primitive provided by a machine learning framework to cause generation of the quantization table via the one or more graphics multiprocessors (pg. 4; To learn the ternary value (codebook), we introduce two quantization factors W_l^p and W_l^n for positive and negative weights in each layer l. During feed-forward, quantized ternary weights w_l^t are calculated as: eq (6). The codebook (i.e. quantization table) is generated during neural network feed-forward operation (i.e. primitive).), and 
quantize the weights…using the quantization table (pg. 1; In this paper, we propose Trained Ternary Quantization which uses two full-precision scaling coefficients W_l^p, W_l^n for each layer l, and quantize the weights to {−W_l^n, 0, +W_l^p} instead of traditional {-1, 0, +1} or {-E, 0, +E} where E is the mean of the absolute weight value, which is not learned. And pg. 3, section 4.1; During gradient descent we learn both the quantized ternary weights (the codebook), and choose which of these values is assigned to each weight (choosing the codebook index).);
Wu and Zhu are analogous arts because they are directed towards the same field of endeavor of quantization of neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the quantization of weights of Wu with the codebook of Zhu.
Doing so would allow for reducing the precision of weights in neural networks. This method can lead to improved accuracy and is shown to outperform full-precision models (Abs.).
Regarding Claim 16,
Claim 16 is the graphics multiprocessor corresponding to the computer-readable storage medium of claim 2. Claim 16 is substantially similar to claim 2 and is rejected on the same grounds.
Regarding Claim 17,
Claim 17 is the graphics multiprocessor corresponding to the computer-readable storage medium of claim 3. Claim 17 is substantially similar to claim 3 and is rejected on the same grounds.

Claims 4-5, 11-12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wu/Rizvi/Zhu and further in view of Yao et al. (US 20180046894 A1).
Regarding Claim 4,
Wu, Rizvi, and Zhu teach the one or more storage mediums of claim 1. 
Wu, Rizvi, and Zhu do not explicitly disclose
wherein the floating-point format is a 32-bit floating-point format.
However, Yao (US 20180046894 A1) teaches
wherein the floating-point format is a 32-bit floating-point format (para [0118] For CaffeNet, as shown in Exp 1, the top-5 accuracy is 77.70% when 32-bit floating-point numbers are used.).
Wu, Rizvi, Zhu, and Yao are analogous arts because they are directed towards the same field of endeavor of quantizing neural network weights.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN of Wu with the weight quantization of Yao.
Doing so would allow for less energy and memory consumption when performing the operations of the CNN. This allows the model to be deployed to embedded systems with limited battery and resources (para [0005]).
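For illustration only, the memory-consumption rationale can be made concrete by comparing the storage needed for the same number of weights in the formats discussed in this action. The sketch below uses the standard sizes reported by Python's `struct` module for 32-bit float, 16-bit float, and signed 8-bit integer values; the function name `weight_storage_bytes` and the weight count in the example are hypothetical and are not taken from any cited reference.

```python
import struct

# Bytes per stored weight, using struct's standard (non-native) sizes.
BYTES_PER_WEIGHT = {
    "float32": struct.calcsize("<f"),  # IEEE 754 single precision
    "float16": struct.calcsize("<e"),  # IEEE 754 half precision
    "int8": struct.calcsize("<b"),     # signed 8-bit integer
}


def weight_storage_bytes(num_weights, fmt):
    """Return the raw storage, in bytes, for num_weights values in fmt."""
    return num_weights * BYTES_PER_WEIGHT[fmt]


if __name__ == "__main__":
    for name in ("float32", "float16", "int8"):
        print(name, weight_storage_bytes(1000, name))
```

This counts raw weight storage only; any scale factors or lookup tables kept alongside the integers would add to the 8-bit figure.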
Regarding Claim 5,
Wu, Rizvi, and Zhu teach the one or more storage mediums of claim 1.
Wu, Rizvi, and Zhu do not explicitly disclose
wherein the floating-point format is a 16-bit floating-point format.
However, Yao (US 20180046894 A1) teaches
wherein the floating-point format is a 16-bit floating-point format (para [0118] When employing static-precision 16-bit quantization and 8/4-bit dynamic-precision quantization, the top-5 accuracy results are 77.12% and 76.64% respectively.).
Wu, Rizvi, Zhu, and Yao are analogous arts because they are directed towards the same field of endeavor of quantizing neural network weights.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN of Wu with the weight quantization of Yao.
Doing so would allow for less energy and memory consumption when performing the operations of the CNN. This allows the model to be deployed to embedded systems with limited battery and resources (para [0005]).
Regarding Claim 11,
Claim 11 is the system corresponding to the computer-readable storage medium of claim 4. Claim 11 is substantially similar to claim 4 and is rejected on the same grounds.
Regarding Claim 12,
Claim 12 is the system corresponding to the computer-readable storage medium of claim 5. Claim 12 is substantially similar to claim 5 and is rejected on the same grounds.
Regarding Claim 18, 
Wu, Rizvi, and Zhu teach the graphics multiprocessor of claim 15.
Wu, Rizvi, and Zhu do not explicitly disclose
wherein the floating-point format is a floating point format selected from a set of floating point formats including a 16-bit floating-point format and a 32-bit floating-point format.
However, Yao teaches
wherein the floating-point format is a floating point format selected from a set of floating point formats including a 16-bit floating-point format and a 32-bit floating-point format (para [0118] For CaffeNet, as shown in Exp 1, the top-5 accuracy is 77.70% when 32-bit floating-point numbers are used. When employing static-precision 16-bit quantization and 8/4-bit dynamic-precision quantization, the top-5 accuracy results are 77.12% and 76.64% respectively.).
Wu, Rizvi, Zhu, and Yao are analogous arts because they are directed towards the same field of endeavor of quantizing neural network weights.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN of Wu with the weight quantization of Yao.
Doing so would allow for less energy and memory consumption when performing the operations of the CNN. This allows the model to be deployed to embedded systems with limited battery and resources (para [0005]).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/H.N./Examiner, Art Unit 2121                                                                                                                                                                                                        
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145