DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This office action is responsive to the amendment received 09/27/2022.

In the response to the Non-Final Office Action dated 07/01/2022, the applicant states that claims 1-20 are pending. Claims 1, 13, and 17 are independent claims.


Claims 1, 5-6, 11, 13, 15, 17, and 19-20 are amended. In summary, claims 1-20 are pending in the current application.

Response to Arguments
Applicant's arguments filed 09/27/2022 have been fully considered.
Regarding the 35 U.S.C. 112(b) rejection, the amendment has cured the basis of the rejection. Therefore, the 35 U.S.C. 112(b) rejection is hereby withdrawn.

	Regarding claim 1, the applicant argues that Ould-ahmed-vall and Jia fail to disclose at least the newly added features of the independent claims. The newly added features are “to utilize the first set of arithmetic units to generate feedback for updating one or more parameters of the second set of arithmetic units based at least in part on determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision”. The arguments regarding “determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision” are persuasive. Therefore, the 35 U.S.C. 103 rejection has been withdrawn. However, upon further consideration, new grounds of rejection are made in view of newly applied art. The arguments regarding “to utilize the first set of arithmetic units to generate feedback for updating one or more parameters of the second set of arithmetic units based at least in part on” are not persuasive. The examiner cannot concur with the applicant for the following reasons:
Ould-ahmed-vall discloses “generate feedback for updating one or more parameters of the second set of arithmetic units based at least in part on the inferences performed on the first portion of the input data by the set of arithmetic units”. For example, in Fig. 10 and paragraph [0162], Ould-ahmed-vall teaches to implement a recurrent function and a feedback mechanism 1005 to enable a ‘memory’ of previous states, and an output layer 1006 to output a result. In Fig. 11 and paragraph [0166], Ould-ahmed-vall teaches errors are then propagated back through the system.
Jia teaches “to utilize the first set of arithmetic units to generate feedback”. For example, on page 3, Fig. 1, and 3 SYSTEM OVERVIEW, Jia teaches a training module; Jia further teaches forward/backward computation with mixed-precision and model update with LARS.
On page 3, Fig. 2, and 4.1 Mixed-Precision Training with LARS, Jia teaches that the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward, i.e., feedback from FP32 to FP16 as illustrated in Fig. 2. On page 6 and 5.3 Convergence Analysis, Jia teaches that a master copy of weights was updated in FP32 to avoid loss of accuracy, and all tensors in the forward and backward passes were in FP16.
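For illustration only, the FP32 master-copy idea cited above can be sketched as follows. This is a minimal sketch under stated assumptions, not Jia's implementation: LARS is replaced by a plain gradient step, the model is a single scalar weight, and the helper names (`to_fp16`, `to_fp32`, `update_fp16_only`, `update_with_fp32_master`) are hypothetical. It only shows why a small update can be lost when the weight is kept in FP16 but survives when accumulated in an FP32 master copy and cast back afterward.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (FP16)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to IEEE 754 single precision (FP32)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def update_fp16_only(weight: float, grad: float, lr: float, steps: int) -> float:
    """Keep the weight in FP16 and apply each update directly in FP16."""
    w = to_fp16(weight)
    for _ in range(steps):
        w = to_fp16(w - to_fp16(lr * grad))
    return w


def update_with_fp32_master(weight: float, grad: float, lr: float, steps: int) -> float:
    """Accumulate updates in an FP32 master copy, then cast back to FP16.

    Assumption: a plain gradient step stands in for the LARS update.
    """
    master = to_fp32(weight)
    for _ in range(steps):
        grad_fp16 = to_fp16(grad)        # gradient as produced by an FP16 backward pass
        grad_fp32 = to_fp32(grad_fp16)   # cast up before the update
        master = to_fp32(master - to_fp32(lr * grad_fp32))
    return to_fp16(master)               # cast back for the FP16 forward pass


# Hypothetical values chosen so each individual update is below FP16 resolution near 1.0.
fp16_result = update_fp16_only(1.0, -1.0, 1e-4, 10)
master_result = update_with_fp32_master(1.0, -1.0, 1e-4, 10)
```

With these values, the FP16-only weight never moves because each step is smaller than half the FP16 spacing near 1.0, whereas the FP32 master copy accumulates the ten steps before being cast back.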

Claims 13 and 17 are not allowable due to the newly cited art and similar reasons as discussed above. 

Claim Objections
Claim 15 is objected to because of the following informality: the claim number 15 in claim 15 is struck through. Because the number 15 identifies claim 15, it should not be struck through. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 12-14, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Ould-ahmed-vall (US 20180307494 A1), in view of Jia (Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes), and further in view of Chai (US 20200134461 A1).
Regarding claim 1 (Currently amended), Ould-ahmed-vall discloses an apparatus (Fig. 13; [0180]: inferencing system on a chip (SOC) 1300 is suitable for performing inferencing using a trained model) comprising: 
a memory (Fig. 13; [0180]: the SOC 1300 includes on-chip memory 1305 that enables a shared on-chip data pool that is accessible by each of the processing components); 
a processor coupled to the memory (Fig. 13; [0180]: the SOC 1300 includes a media processor 1302, a vision processor 1304, a GPGPU 1306 and a multi-core processor 1308; multiple processors are coupled to on chip memory as illustrated in Fig. 13); 
the processor comprising a first set of arithmetic units having a first precision for floating-point computations and a second set of arithmetic units having a second precision for floating-point computations, the second precision being lower than the first precision ([0059]: some instances of the parallel processing unit 202 include higher precision floating point units relative to other instances; [0141]: a subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 32-bit floating point operations; second subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 16-bit; [0183]: the GPGPU 1306 supports instructions to perform low precision computations such as 8-bit and 4-bit integer vector operations; [0184]: a compute unit is configured to perform an FP32 operation; another compute unit is configured to perform a dual FP16 operation; a compute unit configured to perform 32-bit integer operations; perform four simultaneous 8-bit integer operations; [0185]); 
the processor being configured (Fig. 13; [0180]: the SOC 1300 includes a media processor 1302, a vision processor 1304, a GPGPU 1306 and a multi-core processor 1308): 
to obtain a machine learning model trained in the first precision (Fig. 11; [0164]: the neural network is trained using a training dataset 1102; generate and obtain a trained neural net; [0167]: trained neural network 1108 is capable of performing operations useful in reducing the dimensionality of data; [0198]: generate a trained neural network); 
to utilize the second set of arithmetic units to perform inference on at least a first portion of input data (Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112); 
generate feedback for updating one or more parameters of the second set of arithmetic units based at least in part on the inferences performed on the first portion of the input data by the set of arithmetic units (Fig. 10; [0162]: to implement a recurrent function and a feedback mechanism 1005 to enable a ‘memory’ of previous states, and an output layer 1006 to output a result; Fig. 11; [0166]: errors are then propagated back through the system); 
to tune one or more parameters of the second set of arithmetic units based at least in part on the feedback (Fig. 11; [0166]: the training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network; errors are then propagated back through the system; the training process continues until the neural network reaches a statistically desired accuracy associated with a trained neural net 1108); and 
to utilize the second set of arithmetic units with the tuned parameters to generate inference results for at least a second portion of the input data (Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112 without forgetting the knowledge instilled within the network during initial training; [0196]: perform the inferencing operations).
Ould-ahmed-vall fails to explicitly disclose:
to utilize the first set of arithmetic units to generate feedback;
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision;
the feedback generated by the first set of arithmetic units.
In the same field of endeavor, Jia teaches:
to utilize the first set of arithmetic units to generate feedback (page 3; Fig. 1; 3 SYSTEM OVERVIEW: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2; page 6; 5.3 Convergence Analysis: a master copy of weights was updated in FP32 to avoid loss of accuracy, and all tensors in the forward and backward passes were in FP16);
the feedback generated by the first set of arithmetic units (page 3; 3 SYSTEM OVERVIEW: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall to utilize the first set of arithmetic units to generate feedback, such that the feedback is generated by the first set of arithmetic units, as taught by Jia. The motivation for doing so would have been to improve the single GPU performance S and the system scaling efficiency; to incorporate optimizations such as forward/backward computation with mixed-precision and model update with LARS; to improve top-1 accuracy from 71.9% to 76.2%, which meets the baseline test accuracy; and to speed up single-node training performance, as taught by Jia on page 1, page 3 (3 SYSTEM OVERVIEW), page 7, and page 8.
Ould-ahmed-vall in view of Jia fails to explicitly disclose:
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision.
In the same field of endeavor, Chai teaches:
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision ([0033]: concurrently determine values of high-precision weights 114, low-precision weights 116, and bit precision values 118; [0072]: calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; [0077]: determine a set of quantized values for the layer; determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall and Jia to include determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision, as taught by Chai. The motivation for doing so would have been to calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; to determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer; to improve DNN learning capability; and to enable powerful DNNs for resource-constrained environments, as taught by Chai in paragraphs [0072], [0077], and [0152].
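For illustration only, the "determining a difference" limitation discussed above can be sketched as running the same model on the same inputs at two precisions and subtracting the predictions. This is a minimal sketch under stated assumptions and is not taken from Chai or from the application: the model is a hypothetical two-weight dot product, precision is emulated in software by rounding every intermediate value, and the names `round_to`, `predict`, `prediction_gap`, and `feedback` are invented for this example.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the format named by a struct code: "f" is FP32, "e" is FP16."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def predict(weights, inputs, fmt: str) -> float:
    """Dot product with every intermediate value rounded to the given precision."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        product = round_to(fmt, round_to(fmt, w) * round_to(fmt, x))
        acc = round_to(fmt, acc + product)
    return acc


def prediction_gap(weights, batch):
    """Per-sample difference between FP32 and FP16 predictions on the same inputs."""
    return [predict(weights, x, "f") - predict(weights, x, "e") for x in batch]


# Hypothetical model and first portion of input data.
weights = [0.1, 0.5]
first_portion = [[1.0, 2.0], [3.0, 4.0]]

gaps = prediction_gap(weights, first_portion)
# Assumption: the mean gap serves as a simple feedback signal for tuning.
feedback = sum(gaps) / len(gaps)
```

When every weight, input, and intermediate value is exactly representable in FP16 (for example weights `[0.5, 0.25]` and input `[2.0, 4.0]`), the gap is zero; otherwise the gap reflects the rounding error introduced by the lower precision.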

Regarding claim 2 (Original), Ould-ahmed-vall in view of Jia and Chai discloses the apparatus of claim 1, wherein at least one of the first set of arithmetic units and the second set of arithmetic units comprise respective floating-point units (Ould-ahmed-vall; [0059]: some instances of the parallel processing unit 202 include higher precision floating point units relative to other instances; [0184]: a compute unit is configured to perform an FP32 operation; a compute unit is configured to perform a dual FP16 operation).

Regarding claim 3 (Original), Ould-ahmed-vall in view of Jia and Chai discloses the apparatus of claim 1, wherein at least one of the first set of arithmetic units and the second set of arithmetic units comprise respective algorithmic logic units (Ould-ahmed-vall; [0051]: the scheduler 210 allocates work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and work distribution algorithms; Fig. 6; [0139]: a machine learning framework; create and optimize the main computational logic associated with the machine learning algorithm; perform the necessary computations using the primitives provided by the machine learning framework; [0166]: the training framework 1104 adjusts the weights that control the untrained neural network 1106).

Regarding claim 4 (Original), Ould-ahmed-vall in view of Jia and Chai discloses the apparatus of claim 1, wherein the first precision comprises 32-bit or higher floating-point computations, and the second precision comprises 16-bit or lower floating-point computations (Ould-ahmed-vall; [0141]: a subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 32-bit floating point operations; second subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 16-bit; [0183]: the GPGPU 1306 supports instructions to perform low precision computations such as 8-bit and 4-bit integer vector operations; [0184]: a compute unit is configured to perform an FP32 operation; another compute unit is configured to perform a dual FP16 operation; a compute unit configured to perform 32-bit integer operations; perform four simultaneous 8-bit integer operations).
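As general background on the two precisions recited in claim 4 (not taken from any cited reference), the practical difference between 32-bit and 16-bit floating point can be sketched with a single value. This is a minimal sketch that emulates both formats in software; the helper names `to_fp16` and `to_fp32` and the sample value are assumptions made for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (16-bit)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to IEEE 754 single precision (32-bit)."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 2048.5
fp32_value = to_fp32(value)  # FP32 has a 24-bit significand and keeps 2048.5 exactly
fp16_value = to_fp16(value)  # FP16 has an 11-bit significand; nearest neighbours are 2048 and 2050
```

The same input therefore yields different stored values at the two precisions, which is the kind of discrepancy a lower-precision inference path has to tolerate or compensate for.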

Regarding claim 5 (Currently amended), Ould-ahmed-vall in view of Jia and Chai discloses the apparatus of claim 1, wherein the processor further comprises a third set of arithmetic units having the first precision for floating-point computations (Ould-ahmed-vall; Fig. 13; [0180]: the SOC 1300 includes a media processor 1302, a vision processor 1304, a GPGPU 1306 and a multi-core processor 1308; processors are coupled to on chip memory as illustrated in Fig. 13), the processor being further configured to utilize the third set of arithmetic units to obtain the second set of predictions (Ould-ahmed-vall; Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112).

Regarding claim 12 (Original), Ould-ahmed-vall in view of Jia and Chai discloses the apparatus of claim 1, further comprising scratchpad memory coupling the memory, the first set of arithmetic units and the second set of arithmetic units (Ould-ahmed-vall; [0053]: on-chip memory; [0057]: various types of memory devices; Fig. 13; [0180]: the SOC 1300 includes on-chip memory 1305 that can enable a shared on-chip data pool that is accessible by each of the processing components).

Regarding claim 13 (Currently amended), Ould-ahmed-vall discloses a method (Fig. 13; [0180]: inferencing system on a chip (SOC) 1300 is suitable for performing inferencing using a trained model) comprising: 
obtaining, in a memory of a processing device comprising a first set of arithmetic units having a first precision for floating-point computations and a second set of arithmetic units having a second precision for floating-point computations, a machine learning model trained in the first precision, the second precision being lower than the first precision ([0059]: some instances of the parallel processing unit 202 include higher precision floating point units relative to other instances; [0141]: a subset of the floating point units in each of the compute clusters 706A-706H is configured to 32-bit floating point operations; second subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 16-bit; Fig. 11; [0164]: the neural network is trained using a training dataset 1102; generate and obtain a trained neural net; 
[0167]: trained neural network 1108 is capable of performing operations useful in reducing the dimensionality of data; [0198]: generate a trained neural network; Fig. 13; [0180]: the SOC 1300 includes a media processor 1302, a vision processor 1304, a GPGPU 1306 and a multi-core processor 1308; multiple processors are coupled to on chip memory as illustrated in Fig. 13; [0183]: the GPGPU 1306 supports instructions to perform low precision computations such as 8-bit and 4-bit integer vector operations; [0184]: a compute unit is configured to perform an FP32 operation; another compute unit is configured to perform a dual FP16 operation; a compute unit configured to perform 32-bit integer operations; perform four simultaneous 8-bit integer operations; [0185]); 
performing inference on at least a first portion of input data utilizing the second set of arithmetic units (Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112); 
generating feedback for updating one or more parameters of the second set of arithmetic units based at least in part on the inferences performed on the portion of the set of input data by the set of arithmetic units (Fig. 10; [0162]: to implement a recurrent function and a feedback mechanism 1005 to enable a ‘memory’ of previous states, and an output layer 1006 to output a result; Fig. 11; [0166]: errors are then propagated back through the system); 
tuning one or more parameters of the second set of arithmetic units based at least in part on the feedback (Fig. 11; [0166]: the training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network; errors are then propagated back through the system; the training process continues until the neural network reaches a statistically desired accuracy associated with a trained neural net 1108); and 
generating inference results for at least a second portion of the input data utilizing the second set of arithmetic units with the tuned parameters (Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112 without forgetting the knowledge instilled within the network during initial training; [0196]: perform the inferencing operations).
Ould-ahmed-vall fails to explicitly disclose:
utilizing the first set of arithmetic units to generate feedback;
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision;
the feedback generated by the first set of arithmetic units.
In the same field of endeavor, Jia teaches:
utilizing the first set of arithmetic units to generate feedback (page 3; Fig. 1; 3 SYSTEM OVERVIEW: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2; page 6; 5.3 Convergence Analysis: a master copy of weights was updated in FP32 to avoid loss of accuracy, and all tensors in the forward and backward passes were in FP16);
the feedback generated by the first set of arithmetic units (page 3; 3 SYSTEM OVERVIEW: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall to include utilizing the first set of arithmetic units to generate feedback, such that the feedback is generated by the first set of arithmetic units, as taught by Jia. The motivation for doing so would have been to improve the single GPU performance S and the system scaling efficiency; to incorporate optimizations such as forward/backward computation with mixed-precision and model update with LARS; to improve top-1 accuracy from 71.9% to 76.2%, which meets the baseline test accuracy; and to speed up single-node training performance, as taught by Jia on page 1, page 3 (3 SYSTEM OVERVIEW), page 7, and page 8.
Ould-ahmed-vall in view of Jia fails to explicitly disclose:
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision.
In the same field of endeavor, Chai teaches:
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision ([0033]: concurrently determine values of high-precision weights 114, low-precision weights 116, and bit precision values 118; [0072]: calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; [0077]: determine a set of quantized values for the layer; determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall and Jia to include determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision, as taught by Chai. The motivation for doing so would have been to calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; to determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer; to improve DNN learning capability; and to enable powerful DNNs for resource-constrained environments, as taught by Chai in paragraphs [0072], [0077], and [0152].

Regarding claim 14 (Original), the claim limitations are similar to the claim limitations recited in claim 4. Therefore, the same rationale used to reject claim 4 is also used to reject claim 14. 

Regarding claim 17 (Currently amended), Ould-ahmed-vall discloses a system (Fig. 13; [0180]: inferencing system on a chip (SOC) 1300 is suitable for performing inferencing using a trained model) comprising: 
a first set of arithmetic units having a first precision for floating-point computations ([0059]: some instances of the parallel processing unit 202 include higher precision floating point units relative to other instances; [0141]: a subset of the floating-point units in each of the compute clusters 706A-706H is configured to 32-bit floating point operations); 
a second set of arithmetic units coupled to the first set of arithmetic units, the second set of arithmetic units having a second precision for floating-point computations, the second precision being lower than the first precision ([0059]: some instances of the parallel processing unit 202 include higher precision floating point units relative to other instances; [0141]: a subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 32-bit floating point operations; second subset of the floating point units in each of the compute clusters 706A-706H is configured to perform 16-bit; [0183]: the GPGPU 1306 supports instructions to perform low precision computations such as 8-bit and 4-bit integer vector operations; [0184]: a compute unit is configured to perform an FP32 operation; another compute unit is configured to perform a dual FP16 operation; a compute unit configured to perform 32-bit integer operations; perform four simultaneous 8-bit integer operations; [0185]); and 
a memory coupled to the second set of arithmetic units (Fig. 13; [0180]: the SOC 1300 includes on-chip memory 1305 that enables a shared on-chip data pool that is accessible by each of the processing components); 
the second set of arithmetic units being configured to perform inference on input data stored in the memory utilizing a copy of a machine learning model obtained from the memory (Fig. 11; [0166]: the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112), the machine learning model being trained for the first precision for floating-point computations (Fig. 11; [0164]: the neural network is trained using a training dataset 1102; generate and obtain a trained neural net; 
 [0167]: trained neural network 1108 is capable of performing operations useful in reducing the dimensionality of data; [0198]: generate a trained neural network); 
the first set of arithmetic units being configured to tune one or more parameters of the second set of arithmetic units based at least in part on inference results generated by the second set of arithmetic units for at least a portion of the input data (Fig. 11; [0166]: the training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network; errors are then propagated back through the system; the training process continues until the neural network reaches a statistically desired accuracy associated with a trained neural net 1108; the trained neural network 1108 is deployed to implement any number of machine learning operations to generate an inference result 1114 based on input of new data 1112; [0168]: incremental learning enables the trained neural network 1108 to adapt to the new data 1112 without forgetting the knowledge instilled within the network during initial training; [0196]: perform the inferencing operations).
Ould-ahmed-vall fails to explicitly disclose:
to utilize the first set of arithmetic units to generate feedback;
determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision;
the feedback generated by the first set of arithmetic units.
In the same field of endeavor, Jia teaches:
to utilize the first set of arithmetic units to generate feedback (page 3; Fig. 1; 3 SYSTEM OVERVIEW: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2; page 6; 5.3 Convergence Analysis: a master copy of weights was updated in FP32 to avoid loss of accuracy, and all tensors in the forward and backward passes were in FP16);
the feedback generated by the first set of arithmetic units (page 3; 3 System Overview: training module; forward/backward computation with mixed-precision and model update with LARS; page 3; Fig. 2; 4.1 Mixed-Precision Training with LARS: the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward; feedback from FP32 to FP16 as illustrated in Fig. 2
    [Image: media_image4.png]
 ).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall to utilize the first set of arithmetic units to generate feedback, the feedback being generated by the first set of arithmetic units, as taught by Jia. The motivation for doing so would have been to improve the single-GPU performance S and the system scaling efficiency; to incorporate optimizations such as forward/backward computation with mixed precision and model update with LARS; to improve top-1 accuracy from 71.9% to 76.2%, which meets the baseline test accuracy; and to speed up single-node training performance, as taught by Jia at page 1; page 3, 3 System Overview; page 7; and page 8.
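For illustration of the cited mixed-precision scheme, the following is a minimal sketch of an update in which a master copy of the weights is kept in FP32 and a working copy is cast back to FP16. It is not code from Jia: the function names `to_fp16`, `to_fp32`, and `mixed_precision_step` are hypothetical, the update is plain gradient descent rather than LARS, and the precisions are emulated by rounding Python floats with the standard `struct` module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a value to the nearest IEEE 754 half-precision (FP16) number."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a value to the nearest IEEE 754 single-precision (FP32) number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mixed_precision_step(master_weights, grads_fp16, lr):
    """One update step (hypothetical helper, plain SGD instead of LARS).

    Gradients arrive in FP16, are cast to FP32, and are applied to the
    FP32 master weights. The result is then cast back to FP16 to form
    the working copy used by the lower-precision units.
    """
    new_master = [
        to_fp32(w - lr * to_fp32(g))
        for w, g in zip(master_weights, grads_fp16)
    ]
    working_fp16 = [to_fp16(w) for w in new_master]
    return new_master, working_fp16


if __name__ == "__main__":
    master, working = [1.0], [1.0]
    for _ in range(10):
        master, working = mixed_precision_step(master, [to_fp16(1.0)], 1e-4)
    # A single update of 1e-4 is too small to change an FP16 value near 1.0,
    # but the FP32 master copy accumulates ten of them.
    print(to_fp16(1.0 - 1e-4), master[0], working[0])
```

In this sketch the higher-precision copy is what carries the small updates forward, which is the sense in which the FP32 path supplies feedback to the FP16 path.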
Ould-ahmed-vall in view of Jia fails to explicitly disclose:
the inference results are determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision.
In the same field of endeavor, Chai teaches:
the inference results are determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision ([0033]: concurrently determine values of high-precision weights 114, low-precision weights 116, and bit precision values 118; [0072]: calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; [0077]: determine a set of quantized values for the layer; determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ould-ahmed-vall and Jia to include the inference results are determining a difference between (i) a first set of predictions obtained by performing inference on the first portion of the input data with the second set of arithmetic units and (ii) a second set of predictions obtained by performing inference on the first portion of the input data with one or more arithmetic units having the first precision, as taught by Chai. The motivation for doing so would have been to calculate and determine a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116; to determine a maximum value in the set of quantized values for the layer and a minimum value in the set of quantized values for the layer; to improve DNN learning capability; and to enable powerful DNNs for resource-constrained environments, as taught by Chai in paragraphs [0072], [0077], and [0152].
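For illustration of the cited comparison, the following is a minimal sketch that runs the same inputs through one set of high-precision weights and through a lower-precision copy of those weights, then measures the difference between the two sets of predictions. It is not code from Chai: the names `quantize`, `predict`, and `prediction_difference` are hypothetical, the model is a single dot product, the low-precision weights are obtained by uniform quantization between the minimum and maximum weight (an assumption), and the difference is taken as a mean squared difference (also an assumption).

```python
def quantize(weights, bits):
    """Uniformly quantize weights to 2**bits levels between their min and max."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def predict(weights, x):
    """Toy inference: a single dot product of the weights and one input vector."""
    return sum(w * xi for w, xi in zip(weights, x))


def prediction_difference(weights, inputs, bits):
    """Mean squared difference between high- and low-precision predictions.

    Both prediction sets are computed on the same inputs; only the
    precision of the weights differs.
    """
    low_precision = quantize(weights, bits)
    diffs = [predict(weights, x) - predict(low_precision, x) for x in inputs]
    return sum(d * d for d in diffs) / len(diffs)


if __name__ == "__main__":
    weights = [0.0, 0.3, 1.0]
    inputs = [[1.0, 1.0, 1.0]]
    # With 1 bit the weight 0.3 collapses to 0.0, so the predictions differ
    # by about 0.3; with 8 bits the difference is much smaller.
    print(prediction_difference(weights, inputs, 1))
    print(prediction_difference(weights, inputs, 8))
```

A difference computed this way could serve as the quantity on which a parameter update is based, which is the role the claim language assigns to it.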

Regarding claim 18 (Original), the claim limitations are similar to the claim limitations recited in claim 4. Therefore, the same rationale used to reject claim 4 is also used to reject claim 18.

Allowable Subject Matter
Claims 6-11, 15-16, and 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
                                                                                                                                                         
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Hai Tao Sun whose telephone number is (571)272-5630. The examiner can normally be reached 9:00AM-6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571) 272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/HAI TAO SUN/Primary Examiner, Art Unit 2616