DETAILED ACTION
1. 	This action is in response to claims filed 11/04/2020 for application 16744037 filed 01/15/2020. Currently, claims 1-20 are pending. Claims 1, 3, 8, 11, 12, and 16 have been amended and claims 18-20 are new.
2. 	In response to the amendments and arguments, the rejections under 35 USC § 102 and § 112(d) have been withdrawn. Furthermore, the objections to the Drawings and claims 11 and 12 are withdrawn.
3. 	The Double Patenting rejection will be maintained and reassessed in the case of allowance.
4. 	Amendments to the specification and drawings are acknowledged. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 7-12, 16, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Malaya et al. (US 2019/0171420 A1, "Malaya") in view of Sharma et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
Regarding claim 1, Malaya teaches a hardware accelerator for training quantized data, comprising: memory including multilevel memory hierarchy (see Para. 0002, “In an FPGA, the logic blocks can include elements such as lookup tables (LUTs)…that are programmed by inserting values into small Static Random Access Memories (SRAMs) or registers[i.e., internal memory].” & see Para. 0014, fig. 1 elements 110, 121, 122, 123, “[T]he host device 110 can be located in the same physical chassis as the FPGA devices 121-123 or on the same carrier board. The host device 110 can include a processor and memory [i.e., external memory].”); and a plurality of heterogenous precision compute units coupled to the memory (see Para. 0002 and 0014 and fig. 1), the plurality of heterogenous precision compute units to perform computations of mixed precision data types (see Para. 0025, fig. 2 elements 210, 211, 220, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations[i.e., precisions].” Note: The underlined portion is interpreted as each computational unit can include 
Malaya does not teach: having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type; including a first systolic array having a first type of heterogeneous precision compute units; wherein at least one heterogenous compute unit of the first systolic array is configured to perform an operation on input with the first precision type and data including a weight with the second precision type.
However Sharma teaches having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type (Sharma, pg. 766, fig. 3, fig.4, see fig. 4 in which each input buffer IBUF stores four 8-bit elements and each weight buffer WBUF stores four 2-bit elements that are read into the fusion Unit to do a matrix-vector multiplication and see fig. 3 in which each fusion unit has its own register. Note: It is being interpreted that each register attached to each fusion unit represent the private buffer); including a first systolic array having a first type of heterogeneous precision compute units; wherein at least one heterogenous compute unit of the first systolic array is configured to perform an operation on input with the first precision type and data including a weight with the second precision type (Sharma, pg. 767, sec. C. Bit Fusion Execution 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma to teach having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type; including a first systolic array having a first type of heterogeneous precision compute units; wherein at least one heterogenous compute unit of the first systolic array is configured to perform an operation on input with the first precision type and data including a weight with the second precision type. The motivation to do so would be to have different bit-widths for different DNN layers (Sharma, pg. 764, sec. I Introduction, “To that end, we leverage the following three algorithmic properties of DNNs to introduce a novel acceleration architecture, called Bit Fusion… DNNs are mostly a collection of massively parallel multiply-adds…[and] [t]he bitwidth of these operations can be reduced with no loss in accuracy…However, to preserve accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator design would either yield limited benefits to accommodate the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, Bit Fusion introduces the concept of runtime bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing a bit-flexible accelerator, which comprises an array of processing 
Regarding claim 7, Malaya in view of Sharma teaches the hardware accelerator of claim 1 wherein: the plurality of heterogenous precision compute units dynamically match varying precision requirements of NN training (see Malaya Para. 0050, fig. 2, element 230, fig. 4, element 400, fig. 6 elements 609, 617, “At block 617 the adjustment logic 230 reconfigures each of the computational units in the neural network 400 to use the next number representation as determined for the computational unit at block 609. In one embodiment, this reconfiguration is performed dynamically.”).
Regarding claim 8, Malaya teaches a data processing system comprising: 
a hardware processor (see Para. 0014, fig. 1 elements 110, 121, 122, 123, “Alternatively, the host device 110 can be located in the same physical chassis as the FPGA devices 121-123 or on the same carrier board. The host device 110 can include a processor and memory storing instructions that can be executed by the processor.”); and a hardware accelerator that includes a plurality of heterogenous symmetric precision compute units (see Para. 0019, fig. 2, elements 210, 220, 202(i), 230, “In one embodiment, the adjustment logic 230 determines which number representations are used in the computational units 210 and 220…[and] select[s] the same number representations[i.e., symmetric precision]…for both of the computational units 210 and 220.”) and asymmetric precision compute units to perform computations of mixed precision data types (see Para. 0019, fig. 2, elements 210, 220, 202(i), 230, “In one embodiment, the adjustment logic 230 determines which number representations are used in the computational units 210 and 220…[and] may select different number representations[i.e., asymmetric precision]… for each of the computational units 210 and 220.”) for a backward propagation phase of training quantized data of a neural network (see Para. 0036, fig. 4 elements 400, 401-420, 426(i), 427(i), “During the training phase, input weights for the neurons 401-420 are adjusted based on applying training data to the inputs of the neural network 400 and comparing the resulting outputs 426(i) and 427(i) with expected or desired outputs.”), wherein the hardware accelerator comprises a multilevel memory hierarchy (see Para. 0002, “In an FPGA, the logic blocks can include elements such as lookup tables (LUTs)…that are programmed by inserting values into small Static Random Access Memories (SRAMs) or registers[i.e., internal memory].” & see Para. 0014, fig. 1 elements 110, 121, 122, 123, “[T]he host device 110 can be located in the same physical chassis as the FPGA devices 121-123 or on the same carrier board. The host device 110 can include a processor and memory [i.e., external memory].”).
	Malaya does not teach: having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type, wherein at least one asymmetric precision compute unit is configured to support a first type of precision for a first operand and a second type of precision for a second operand.
However, Sharma teaches: having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type(Sharma, pg. 766, fig. 3, fig.4, see fig. 4 in which each input buffer IBUF stores four 8-bit elements and each weight buffer WBUF stores four 2-bit elements that are read into the fusion Unit to do a matrix-vector multiplication and see fig. 3 in which each fusion unit has its own register. Note: It is being interpreted that each register attached to each fusion unit represent the private buffer), wherein at least one asymmetric precision compute unit is configured to support a first type of precision for a first operand and a second type of precision for a second operand (Sharma, pg. 767, sec. C. Bit Fusion Execution Model, figure.  4, “Figure 4 illustrates the Bit Fusion systolic execution in the mixed bit-width mode using when an input vector is multiplied to a weight matrix. The input vector has 4 × N 8-bit elements that are being multiplied to a matrix with 4×N×M 2-bit elements[of data including a weight]. As such, the 16-BitBricks in a Fusion Unit logically compose to form four 8×2 Fused-PEs.”).
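For illustration only, the mixed bit-width composition described in the quoted passage of Sharma can be sketched in software. The sketch below is not taken from either reference: the names `fused_multiply` and `dot`, the restriction to unsigned operands, and the 2-bit brick width passed as a default parameter are assumptions made for this example. It shows an 8-bit input multiplied by a 2-bit weight using only 2-bit by 2-bit partial products that are shifted and summed.

```python
def fused_multiply(x, w, x_bits, w_bits, brick_bits=2):
    """Multiply unsigned x (x_bits wide) by unsigned w (w_bits wide) using
    only brick_bits-by-brick_bits partial products, shifted and summed.

    Hypothetical helper: unsigned-only, sign handling is omitted.
    Returns (product, number_of_partial_products).
    """
    if not (0 <= x < (1 << x_bits) and 0 <= w < (1 << w_bits)):
        raise ValueError("operand out of range for its declared bit width")
    mask = (1 << brick_bits) - 1
    total = 0
    bricks = 0
    for i in range(0, x_bits, brick_bits):        # slices of the input
        for j in range(0, w_bits, brick_bits):    # slices of the weight
            x_slice = (x >> i) & mask
            w_slice = (w >> j) & mask
            total += (x_slice * w_slice) << (i + j)
            bricks += 1
    return total, bricks


def dot(inputs, weights, x_bits=8, w_bits=2):
    """Dot product of 8-bit inputs and 2-bit weights via fused_multiply."""
    return sum(fused_multiply(x, w, x_bits, w_bits)[0]
               for x, w in zip(inputs, weights))


if __name__ == "__main__":
    print(fused_multiply(200, 3, 8, 2))
    print(dot([200, 17, 255, 0], [3, 1, 2, 0]))
```

Under these assumptions, one 8-bit by 2-bit product uses four 2-bit partial products, so sixteen such bricks would supply four 8×2 products at a time, which is consistent with the quoted statement that the 16 BitBricks compose into four 8×2 Fused-PEs.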
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma to teach having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type, wherein at least one asymmetric precision compute unit is configured to support a first type of precision for a first operand and a second type of precision for a second operand. The motivation to do so would be to have different bit-widths for different DNN layers (Sharma, pg. 764, sec. I Introduction, “To that end, we leverage the following three algorithmic properties of DNNs to introduce a novel acceleration architecture, called Bit Fusion… DNNs are mostly a collection of massively parallel multiply-adds…[and] [t]he bitwidth of these operations can be reduced with no loss in accuracy…However, to preserve accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator design would either yield limited benefits to accommodate the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, Bit Fusion introduces the concept of runtime bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing a bit-
	Regarding claim 9, Malaya in view of Sharma teaches the data processing system of claim 8 wherein: the hardware accelerator is implemented on a Field Programmable Gate Array (FPGA) (see Malaya Para 0017, fig. 1 elements 121, 122, 123, fig.2, elements 210, 220 “FIG. 2 illustrates a functional block diagram for a set of computational units 210 and 220 implemented in one or more of the FPGAs 121-123…”).
	Regarding claim 10, Malaya in view of Sharma teaches the data processing system of claim 8 wherein: the heterogenous symmetric precision compute units include a first compute unit to support a first type of precision for first and second operands (see Malaya Para. 0034, fig. 4 elements 400, 401-402, 421, “In the neural network 400, each neuron in the same layer[i.e., computational units] uses the same number representation[i.e., symmetric precision]…In particular, the neurons 401-402 of the input layer 421 perform calculations using an 8 bit floating point (FP8) representation.”) and a second compute unit to support a second type of precision for third and fourth operands (see Malaya Para. 0034, fig. 4 elements 400, 403-408, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation [i.e., symmetric precision]… Neurons 403-408 utilize a 16 bit floating point (FP16) representation.”).
	Regarding claim 11, Malaya in view of Sharma teaches the data processing system of claim 9 wherein: the heterogenous asymmetric precision compute units include a first compute unit to support a first type of precision for a first operand and a second type of precision for a second operand for a computation (see Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation, then calculate a second output value 203(2) using a second number representation.”).
	Regarding claim 12, Malaya in view of Sharma teaches the data processing system of claim 8 wherein: the heterogenous asymmetric precision compute units include a second compute unit to support a third type of precision for a first operand and a fourth type of precision for a second operand (see Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation[i.e., one type of precision], then calculate a second output value 203(2) using a second number representation[i.e., another type of precision].”).
Regarding claim 16, Malaya teaches a hardware accelerator for training quantized data, comprising: memory including multilevel hierarchy (see Para. 0002, “In an FPGA, the logic blocks can include elements such as lookup tables (LUTs)…that are programmed by inserting values into small Static Random Access Memories (SRAMs) or registers[i.e., internal memory].” & Para. 0014, fig. 1, “[T]he host device 110 can be located in the same physical chassis as the FPGA devices 121-123 or on the same carrier board. The host device 110 can include a processor and memory [i.e., external memory].”); and a first plurality of heterogenous precision compute units coupled to the memory (see Para. 0017, fig. 2 elements 210, 220, 121-123, “FIG. 2 illustrates a functional block diagram for a set of computational units 210 and 220 implemented in one or more of the FPGAs 121-123, according to an embodiment…Each of the computational units[that contain internal memory] 210 and 220 are circuits implemented by configuring one or more configurable logic blocks (CLBs) of one or more of the FPGA devices 121-123[ that contain external , the first plurality of heterogenous precision compute units to perform computations of mixed precision data types (see Para. 0025, fig. 2 elements 210, 220, 211, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations[i.e., precisions].”) for a backward propagation phase of training quantized data of the neural network (NN) (see Para. 0036, fig. 4 elements 401-420, 426(i), 427(i), “During the training phase, input weights for the neurons 401-420 are adjusted based on applying training data to the inputs of the neural network 400 and comparing the resulting outputs 426(i) and 427(i) with expected or desired outputs.”); and software-programmable precision, wherein the precision for the first plurality of heterogenous precision compute units to vary dynamically and be programmed through software (see Para. 0018, fig. 2 elements 210, 220, 230, “The different number representations (corresponding to different levels of precision) selected by the adjustment logic 230 to be used in the computational units 210 and 220 can be specified by a user [through programmable software] during creation of the architecture, determined at runtime based on repeated iterations of executing a particular task (e.g., a machine learning 'inference' task)…").
Malaya does not teach: having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type; wherein at least one heterogenous compute unit of the first plurality of heterogenous precision compute units is configured to perform an operation on input with the first precision type and data including a weight with the second precision type. 
However, Sharma teaches: having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type (Sharma, pg. 766, fig. 3, fig. 4, see fig. 4 in which each input buffer IBUF stores four 8-bit elements and each weight buffer WBUF stores four 2-bit elements that are read into the fusion Unit to do a matrix-vector multiplication and see fig. 3 in which each fusion unit has its own register. Note: It is being interpreted that each register attached to each fusion unit represent the private buffer); wherein at least one heterogenous compute unit of the first plurality of heterogenous precision compute units is configured to perform an operation on input with the first precision type and data including a weight with the second precision type (Sharma, pg. 767, sec. C. Bit Fusion Execution Model, figure 4, “Figure 4 illustrates the Bit Fusion systolic execution in the mixed bit-width mode using when an input vector is multiplied to a weight matrix. The input vector has 4 × N 8-bit elements that are being multiplied to a matrix with 4×N×M 2-bit elements[of data including a weight]. As such, the 16-BitBricks in a Fusion Unit logically compose to form four 8×2 Fused-PEs.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma to teach having an input buffer to store input with a first precision type and a private buffer for each compute unit to store data with the first precision type and a second precision type; including a first systolic array having a first type of heterogeneous precision compute units; wherein at least one heterogenous compute unit of the first systolic array is configured to perform an operation on input with the first precision type and data including a weight with the second precision type. The motivation to do so would be to have different bit-widths for different 
Regarding claim 18, Malaya in view of Sharma teaches the hardware accelerator of claim 16, wherein the first precision type comprises an 8-bit fixed-point value read from the input buffer and the second precision type comprises a 4 bit value for the weight (Sharma, pg. 766, fig. 3, fig.4, see fig. 4 in which each input buffer IBUF stores four 8-bit elements & see Sharma pgs. 768, B. Mapping Variable Bitwidth Operations to BitBricks, “When one of the operand’s bitwidths is larger, we use the formulation below: $A_{2n} \times B_{n} = 2^{n} \times (A_{2n})_{hi} \times B_{n} + 2^{0} \times (A_{2n})_{lo} \times B_{n}$ [where $A_{2n}$ represents a 2n-bit operand and $B_{n}$ represents a n-bit operand]… The next subsection details the design of a Fusion Unit that uses BitBricks to execute multiply-adds with variable bitwidths, up to 16-bit.” Note: It is being interpreted that $B_{n}$ in the formulation is a 4 bit value that represents the weight).
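For illustration only, the formulation quoted from Sharma can be checked numerically. The sketch below is not taken from the reference: the function name `split_multiply` and the restriction to unsigned operands are assumptions made for this example. With n = 4, it multiplies an 8-bit operand by a 4-bit operand using two 4-bit by 4-bit products, which corresponds to the interpretation applied to claim 18.

```python
def split_multiply(a, b, n):
    """Compute a * b for an unsigned 2n-bit a and an unsigned n-bit b as
    2^n * (a_hi * b) + 2^0 * (a_lo * b), where a_hi and a_lo are the upper
    and lower n bits of a. Hypothetical helper; sign handling is omitted.
    """
    if not (0 <= a < (1 << (2 * n)) and 0 <= b < (1 << n)):
        raise ValueError("operand out of range for its declared bit width")
    a_hi = a >> n                 # upper n bits of the 2n-bit operand
    a_lo = a & ((1 << n) - 1)     # lower n bits of the 2n-bit operand
    return ((a_hi * b) << n) + (a_lo * b)


if __name__ == "__main__":
    # 8-bit operand times 4-bit operand (n = 4)
    print(split_multiply(0xB7, 0xD, 4), 0xB7 * 0xD)
```

Both printed values agree for any in-range unsigned operands, because a = 2^n × a_hi + a_lo and multiplication distributes over the sum.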
the at least one heterogenous compute unit of the first plurality of heterogeneous precision compute units (Sharma, pg. 767, sec. C. Bit Fusion Execution Model, figure 4, “Figure 4 illustrates the Bit Fusion systolic execution in the mixed bit-width mode using when an input vector is multiplied to a weight matrix. The input vector has 4 × N 8-bit elements that are being multiplied to a matrix with 4×N×M 2-bit elements[of data including a weight]. As such, the 16-BitBricks in a Fusion Unit logically compose to form four 8×2 Fused-PEs.”) is configured with the software-programmable precision with a same precision for a first operand and a second operand or is configured with the first precision type for the first operand and the second precision type for the second operand (Sharma, pgs. 769, Table I, as table I details the setup opcode instruction has the operand specification of op0.bitwidth for one type of bit precision for the first operand and op1.bitwidth for another type of bit precision for the second operand & see Sharma pgs. 768, B. Mapping Variable Bitwidth Operations to BitBricks, “When one of the operand’s bitwidths is larger, we use the formulation below: $A_{2n} \times B_{n} = 2^{n} \times (A_{2n})_{hi} \times B_{n} + 2^{0} \times (A_{2n})_{lo} \times B_{n}$ [where $A_{2n}$ represents a 2n-bit operand and $B_{n}$ represents a n-bit operand].”).

Claims 2-5, 13, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Malaya et al. (US 2019/0171420 A1, "Malaya") in view of Sharma et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018 and in view of Jacob (Yaakov) et al. (US 2020/0097442 A1, “Jacob”).
Regarding claim 2, Malaya in view of Sharma teaches the hardware accelerator of claim 1 further comprising: having a first type of heterogeneous precision compute units (Malaya Para. 0034, fig. 4 elements 400, 409-415, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation… neurons 409-415 utilize a 32 bit floating point (FP32) representation.”); having a second type of heterogeneous precision compute units (Malaya Para. 0034, fig. 4 400, 403-408, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation… Neurons 403-408 utilize a 16 bit floating point (FP16) representation.”); having a third type of heterogeneous precision compute units (see Malaya Para. 0034, fig. 4 400, 401-402, 421, “In the neural network 400, each neuron in the same layer[i.e., computational units] uses the same number representation…In particular, the neurons 401-402 of the input layer 421 perform calculations using an 8 bit floating point (FP8) representation.”).
Malaya in view of Sharma does not teach: a first systolic array; a second systolic array; and a third systolic array. 
However Jacob teaches a first systolic array (Jacob,  see fig. 7 in which Group0  is a 4 by 32 systolic array); a second systolic array (Jacob, see fig. 7 in which Group1 is a 4 by 32 systolic array); and a third systolic array (Jacob, see fig. 7 in which Group2 is a 4 by 32 systolic array).
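Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the systolic arrays as taught by Jacob. The motivation to do so would be to speed up the computations performed by neural networks (Jacob, Para. 0003, “Systolic arrays may be used for accelerating deep-learning neural networks calculations….”).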
Regarding claim 3, Malaya in view of Sharma and in view of Jacob teaches the hardware accelerator of claim 2, wherein: the first type of heterogeneous precision compute units support 32 bit floating point precision (Malaya Para. 0034, fig. 4 elements 400, 409-415, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation… neurons 409-415 utilize a 32 bit floating point (FP32) representation.”). 
Malaya in view of Sharma does not teach: the second type of heterogeneous precision compute units support 16 bit fixed point precision, and the third type of heterogeneous precision compute units support binary precision.
However Jacob teaches the second type of heterogeneous precision compute units support 16 bit fixed point precision, and the third type of heterogeneous precision compute units support binary precision (see Jacob Para. 0051, Fig. 2, “Data elements…may refer to any type of data, in any required precision, including for example, 8-bit data, 16-bit data, 32-bit data, etc., in any applicable format, e.g., binary, signed, unsigned, floating point, etc.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma and Jacob to teach that the second type of heterogeneous precision compute units support 16 bit fixed point precision, and that the third type of heterogeneous precision compute units support binary precision. The motivation to do so would be to store different data types in internal registers for different purposes (see Jacob para. 0051, “As used herein, cycle may refer to a sequence of operations performed on data elements that end with storing the results in  and storing the fetched data elements in data shadow registers…, or performing a single stage of a calculation and storing the results in partial sum (PSUM) registers…”).
Regarding claim 4, Malaya in view of Sharma teaches the hardware accelerator of claim 1 wherein: the plurality of heterogenous precision compute units comprise mixed-precision bit width (see Malaya Para. 0025, fig. 2 elements 210, 220, 211, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations [i.e., asymmetric precision]. Mixed precision calculations can include, for example, adding a 16 bit floating point number with a 32 bit floating point number.”) that support a flexible range of precision for floating point and fixed point inputs (see Malaya Para. 0029, “In addition to the floating point representations as described above, other types of standard and non-standard number representations are also supported, such as integer and fixed point representations.”) including activations in a forward phase of training quantized data (see Malaya Para. 0036, fig. 4 elements 421-423, “[T]he training phase includes varying the precision of the different layers 421-423 by changing the number representations used by the layers 421-423 over multiple iterations of the training phase, and checking for a decrease in accuracy after each change.”) and gradients in the backward propagation phase (see Malaya Para. 0036, fig. 4 elements 400, 401-420, 426(i), 427(i), “During the training phase, input weights for the neurons 401-420 are adjusted based on applying training data to the inputs of the neural network 400 and comparing the resulting outputs 426(i) and 427(i) with expected or desired outputs.”).
	Malaya in view of Sharma does not teach: multiply-accumulate computation units.
However Jacob teaches multiply-accumulate computation units (see Jacob Para. 0058, fig. 4 elements 120, 480, “After data elements are multiplied by the appropriate weights and the results of the multiplications are summed [and then]…stored in PSUM registers 480….”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the multiply-accumulate computation unit as taught by Jacob. The motivation to do so would be to speed up the calculations for a convolutional neural network, since convolutional operations require many multiplications and summations (see Jacob Para. 0060, “Calculating a single element in a single…[output map] may be performed by summing multiplications of each element in a filter matrix by corresponding elements in all…[input maps] in an…[input map] layer. For example, element {0,0} of the filter may be multiplied by element {0,0} of each…[input map] in a layer of…[input maps], element {0,1} of the filter may be multiplied by element {0,1} of each…[input map] in the layer of…[input maps], and so on until element {n-1,n-1}, and the results of the multiplications may be summed to obtain a single element in the…[output map].”).
Regarding claim 5, Malaya in view of Sharma teaches the hardware accelerator of claim 1 comprising: having the plurality of heterogenous precision compute units (Malaya Para. 0030, fig. 4 element 400, “[M]ultiple computational units can be arranged in a neural network 400, as illustrated in FIG. 4.”) including a first type of precision compute units (Malaya Para. 0034, fig. 4 elements 400, 401-402, 421, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation…In particular, the neurons 401-402 of the input layer 421 perform calculations using an 8 bit floating point (FP8) representation.”), a second type of precision compute units (see Malaya Para. 0034, fig. 4 elements 400, 403-408, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation… Neurons 403-408 utilize a 16 bit floating point (FP16) representation.”), and a third type of precision compute units (see Malaya Para. 0034, fig. 4 elements 400, 409-415, “In the neural network 400, each neuron in the same layer [i.e., computational units] uses the same number representation… neurons 409-415 utilize a 32 bit floating point (FP32) representation.”).
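For illustration only, the multiply-accumulate pattern described in the quoted passage of Jacob Para. 0060 can be sketched in Python as follows. The function name output_element, the list-of-lists data layout, and the sample values are assumptions made for this sketch and are not taken from Jacob; the sketch only shows each filter element being multiplied by the corresponding element of every input map, with the products summed into a single output-map element.

```python
def output_element(filter_matrix, input_maps):
    """Sum filter[r][c] * input_map[r][c] over every position and every input map."""
    n = len(filter_matrix)
    acc = 0  # plays the role of a partial-sum accumulator (hypothetical stand-in)
    for input_map in input_maps:
        for r in range(n):
            for c in range(n):
                # one multiply-accumulate step per filter element and input map
                acc += filter_matrix[r][c] * input_map[r][c]
    return acc


# Hypothetical 2x2 filter and a layer of two 2x2 input maps.
filter_matrix = [[1, 0], [0, -1]]
input_maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
result = output_element(filter_matrix, input_maps)  # (1 - 4) + (5 - 8) = -6
```

Each output element in this sketch costs n*n multiplications per input map, which is the volume of multiply and sum operations that the stated motivation refers to.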
Malaya in view of Sharma does not teach: a fourth systolic array.
However Jacob teaches a fourth systolic array (Jacob, see fig. 7 element 750 which is a systolic array that contains 8 groups of systolic arrays within it).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the systolic arrays as taught by Jacob. The motivation to do so would be to speed up the computations performed by neural networks (Jacob, Para. 0003, “Systolic arrays may be used for accelerating deep-learning neural networks calculations….”).
Regarding claim 13, Malaya in view of Sharma teaches the data processing system of claim 8 wherein: the plurality of heterogenous precision compute units comprise mixed-precision bit width (see Malaya Para. 0025, fig. 2 elements 210, 220, 211, 221,  “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations[i.e., asymmetric precision]. Mixed precision calculations can include, for example, adding a 16 bit floating point number with a 32 bit floating point number.”) that support a flexible range of precision for floating point and fixed point inputs (see Malaya Para 0029, “In addition to the floating point representations as described above, other types of standard and non-standard number representations are also supported, such as integer and fixed point representations.”) including activations in a forward phase of training quantized data (see Malaya Para. 0036. Fig. 4 elements 421-423,  “[T]he training phase includes varying the precision of the different layers 421-423 by changing the number representations used by the layers 421-423 over multiple iterations of the training phase, and checking for a decrease in accuracy after each change.”) and gradients in the backward propagation phase (see Malaya Para. 0036. Fig. 4 elements 401-420, 426(i), 427(i), 400, “During the training phase, input weights for the neurons 401-420 are adjusted based on applying training data to the inputs of the neural network 400 and comparing the resulting outputs 426(i) and 427(i) with expected or desired outputs.”).
Malaya in view of Sharma does not teach: a first systolic array and a second systolic array. 
However Jacob teaches a first systolic array (Jacob, see fig. 7 in which Group0  is a 4 by 32 systolic array) and a second systolic array (Jacob, see fig. 7 in which Group1 is a 4 by 32 systolic array).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the systolic arrays as taught by Jacob. The motivation to do so would be to speed-up the computations performed by neural networks (Jacob, Para. 0003, “Systolic arrays may be used for accelerating deep-learning neural networks calculations….”).  
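For illustration only, the training-phase behavior quoted from Malaya Para. 0036 (changing the number representation used by a layer and checking for a decrease in accuracy after each change) can be pictured with the following Python sketch. The names lower_precisions, evaluate, and toy_evaluate, the candidate list, the tolerance parameter, and the accuracy values are all hypothetical and assumed for this sketch; Malaya does not disclose this code.

```python
CANDIDATES = ["FP32", "FP16", "FP8"]  # ordered from highest to lowest precision


def lower_precisions(layers, evaluate, tolerance=0.0):
    """Greedily lower each layer's precision; keep a change only if accuracy does not drop."""
    config = {layer: "FP32" for layer in layers}
    baseline = evaluate(config)
    for layer in layers:
        for candidate in CANDIDATES[1:]:
            trial = dict(config, **{layer: candidate})
            if evaluate(trial) >= baseline - tolerance:
                config = trial  # no decrease in accuracy: keep the change
            else:
                break  # accuracy decreased: discard the change, stop lowering this layer
    return config


def toy_evaluate(config):
    """Stand-in for a real training/validation run (assumed behavior)."""
    if config["output"] != "FP32":
        return 0.80  # pretend the output layer needs FP32
    if config["hidden"] == "FP8":
        return 0.85  # pretend the hidden layer needs at least FP16
    return 0.90


chosen = lower_precisions(["input", "hidden", "output"], toy_evaluate)
```

With the toy accuracy function above, the sketch settles on FP8 for the input layer, FP16 for the hidden layer, and FP32 for the output layer.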
Regarding claim 15, Malaya in view of Sharma teaches: having heterogenous asymmetric precision compute units that support a first type of precision for a first operand and a second type of precision for a second operand (see Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation [i.e., one type of precision], then calculate a second output value 203(2) using a second number representation [i.e., another type of precision].”); having heterogenous asymmetric precision compute units that support a third type of precision for a first operand and a fourth type of precision for a second operand (see Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation [i.e., one type of precision], then calculate a second output value 203(2) using a second number representation [i.e., another type of precision].”).
Malaya in view of Sharma does not teach: a first systolic array and a second systolic array. 
However Jacob teaches a first systolic array (Jacob, see fig. 7 in which Group0  is a 4 by 32 systolic array) and a second systolic array (Jacob, see fig. 7 in which Group1 is a 4 by 32 systolic array).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the systolic arrays as taught by Jacob. The motivation to do so would be to speed up the computations performed by neural networks (Jacob, Para. 0003, “Systolic arrays may be used for accelerating deep-learning neural networks calculations….”).
Regarding claim 19, Malaya in view of Sharma teaches the hardware accelerator of claim 16, further comprising: having the first plurality of heterogeneous precision compute units (Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation [i.e., one type of precision], then calculate a second output value 203(2) using a second number representation [i.e., another type of precision].”); having a second plurality of heterogeneous precision compute units (Malaya Para. 0017, fig. 2 elements 210, 220, 203(1), 203(2), “In one embodiment, one or both of the computational units 210 and 220 calculate a first output value 203(1) using a first number representation [i.e., one type of precision], then calculate a second output value 203(2) using a second number representation [i.e., another type of precision].”).
Malaya in view of Sharma does not teach: a first systolic array and a second systolic array. 
However Jacob teaches a first systolic array (Jacob, see fig. 7 in which Group0  is a 4 by 32 systolic array) and a second systolic array (Jacob, see fig. 7 in which Group1 is a 4 by 32 systolic array).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute FPGA arrays and compute unit based neural network hardware accelerator as taught by Malaya in view of Sharma with the systolic arrays as taught by Jacob. The motivation to do so would be to speed up the computations performed by neural networks (Jacob, Para. 0003, “Systolic arrays may be used for accelerating deep-learning neural networks calculations….”).

Claims 6, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Malaya et al. (US 2019/0171420 A1, "Malaya") in view of Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2018, and in view of Langhammer (US 8706790 B1, “Langhammer”).
Regarding claim 6, Malaya in view of Sharma teaches the hardware accelerator of claim 1 wherein: the plurality of heterogenous precision compute units perform computations on mixed precision datatypes (see Malaya Para. 0025, fig. 2 elements 210, 220, 211, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations [i.e., asymmetric precision]. Mixed precision calculations can include, for example, adding a 16 bit floating point number with a 32 bit floating point number.”).
Malaya in view of Sharma does not teach: without converting a low bit width precision datatype into a high bit width precision datatype to improve performance and energy-efficiency.
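However Langhammer teaches without converting a low bit width precision datatype into a high bit width precision datatype (see Langhammer Col. 5 lines 63-66, fig. 6 elements 60, 65, “In an alternative embodiment 60, a mixed fixed-and floating-point single-precision operation could be carried out without converting the fixed-point number to floating point 65 representation, as shown in FIG. 6.”) to improve performance and energy-efficiency (see Langhammer Col. 8 lines 56-61, “The resources needed-particularly in a programmable device-when carrying out a mixed-precision multiplication based floating-point operation (i.e., multiplication or division)are reduced by maintaining the mantissas of the operands in their native precisions instead of promoting the lower precision number to the higher precision.”).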
 Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma and Langhammer to teach: without converting a low bit width precision datatype into a high bit width precision datatype to improve performance and energy-efficiency. The motivation to do so would be to have the exponents of a mixed-type computation be handled by a different logic element to save on energy consumption (see Langhammer Col. 8 lines 61-63, “Exponents and other elements can be handled by the higher-precision logic as they do not consume significant resources.”).  
Regarding claim 14, Malaya in view of Sharma teaches the data processing system of claim 8, wherein: the plurality of heterogenous precision compute units perform computations on mixed precision datatypes (see Malaya Para. 0025, fig. 2 elements 210, 220, 211, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations [i.e., asymmetric precision]. Mixed precision calculations can include, for example, adding a 16 bit floating point number with a 32 bit floating point number.”).
Malaya in view of Sharma does not teach: without converting a low bit width precision datatype into a high bit width precision datatype to improve performance and energy-efficiency.
However Langhammer teaches without converting a low bit width precision datatype into a high bit width precision datatype (see Langhammer Col. 5 lines 63-66, fig. 6 elements 60, 65, “In an alternative embodiment 60, a mixed fixed-and floating-point single-precision operation could be carried out without converting the fixed-point number to floating point 65 representation, as shown in FIG. 6.”) to improve performance and energy-efficiency (see Langhammer Col. 8 lines 56-61, “The resources needed-particularly in a programmable device-when carrying out a mixed-precision multiplication based floating-point operation (i.e., multiplication or division)are reduced by maintaining the mantissas of the operands in their native precisions instead of promoting the lower precision number to the higher precision.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma and Langhammer to teach: without converting a low bit width precision datatype into a high bit width precision datatype to improve performance and energy-efficiency. The motivation to do so would be to have the exponents of a mixed-type computation be handled by a different logic element to save on energy consumption (see Langhammer Col. 8 lines 61-63, “Exponents and other elements can be handled by the higher-precision logic as they do not consume significant resources.”). 
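For illustration only, the idea in the quoted passage of Langhammer Col. 8 lines 56-61 (keeping the mantissas of the operands in their native precisions instead of promoting the lower precision number) can be sketched in Python as follows. The function names decode_half, decode_single, and mixed_multiply are hypothetical; the sketch assumes IEEE 754 half and single precision encodings, handles normal numbers only, and returns the exact unrounded product rather than rounding to any particular output format. It is not Langhammer's circuit.

```python
import math
import struct


def decode_half(x):
    """Return (sign, unbiased exponent, 11-bit significand) of a normal FP16 value."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    sign, exp, frac = bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF
    assert 0 < exp < 0x1F, "sketch handles normal numbers only"
    return sign, exp - 15, frac | 0x400  # restore the implicit leading 1


def decode_single(x):
    """Return (sign, unbiased exponent, 24-bit significand) of a normal FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign, exp, frac = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    assert 0 < exp < 0xFF, "sketch handles normal numbers only"
    return sign, exp - 127, frac | 0x800000  # restore the implicit leading 1


def mixed_multiply(half_value, single_value):
    """Multiply an FP16 operand by an FP32 operand without widening the FP16 significand."""
    s1, e1, m1 = decode_half(half_value)      # 11-bit significand, native width
    s2, e2, m2 = decode_single(single_value)  # 24-bit significand, native width
    product = m1 * m2                         # 11 x 24 bit integer multiply
    sign = -1.0 if s1 ^ s2 else 1.0
    # The significands carry scale factors of 2**10 and 2**23, removed here.
    return sign * math.ldexp(product, e1 + e2 - 10 - 23)


result = mixed_multiply(1.5, -2.25)  # 1.5 is stored as FP16, -2.25 as FP32
```

In this sketch the multiplier is 11 by 24 bits wide, whereas promoting the half precision operand to single precision first would call for a 24 by 24 bit multiplier.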
Regarding claim 17, Malaya in view of Sharma teaches the hardware accelerator of claim 16, wherein: the plurality of heterogenous precision compute units perform computations on mixed precision datatypes (see Malaya Para. 0025, fig. 2 elements 210, 220, 211, 221, “In one embodiment, each of the computational units 210 and 220 includes mixed precision logic 211 and 221, respectively, for performing mixed precision calculations, in which operations are performed with values represented according to different number representations [i.e., asymmetric precision]. Mixed precision calculations can include, for example, adding a 16 bit floating point number with a 32 bit floating point number.”).
Malaya in view of Sharma does not teach: without converting a low bit width precision datatype into a high bit width precision datatype.
 However Langhammer teaches without converting a low bit width precision datatype into a high bit width precision datatype (see Langhammer Col 5. lines 63-66, fig. 6 elements 60, 65, “In an alternative embodiment 60, a mixed fixed- and floating-point single-precision operation could be carried out without converting the fixed-point number to floating point 65 representation, as shown in FIG. 6.”). 
 Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Malaya’s hardware accelerator in view of Sharma and Langhammer to teach: without converting a low bit width precision datatype into a high bit width precision datatype. The motivation to do so would be to have the exponents of a mixed-type computation be handled by a different logic element to save on energy consumption (see Langhammer Col. 8 lines 61-63, “Exponents and other elements can be handled by the higher-precision logic as they do not consume significant resources.”). 


Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159.  See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1, 8, and 16 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 12, and 19 of copending Application No. 16/744,039 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because claims 1, 8, and 16 anticipate independent claims 1, 12, and 19 of copending Application No. 16/744,039, since the broad scope of claims 1, 8, and 16 encompasses claims 1, 12, and 19 of the copending application.

Application 16/744,037, Claim 1: A hardware accelerator for training quantized data, comprising: memory including multilevel memory hierarchy to store data; and a plurality of heterogenous precision compute units coupled to the memory, the plurality of heterogenous precision compute units to perform computations of mixed precision data types for training and inference in neural networks (NN).
Copending Application 16/744,039, Claim 1: A hardware accelerator for training quantized data, comprising: software controllable multilevel memory to store data; and a mixed precision array coupled to the memory, the mixed precision array includes an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.

Application 16/744,037, Claim 8: A data processing system comprising: a hardware processor; and a hardware accelerator that includes a plurality of heterogenous symmetric precision compute units and asymmetric precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.
Copending Application 16/744,039, Claim 12: A data processing system comprising: a hardware processor; memory; and a hardware accelerator coupled to the memory, the hardware accelerator includes a mixed precision array having an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.

Application 16/744,037, Claim 16: A hardware accelerator for training quantized data, comprising: memory including multilevel hierarchy to store data; and a plurality of heterogenous precision compute units coupled to the memory, the plurality of …
Copending Application 16/744,039, Claim 19: A computer implemented method for quantized neural network training comprising: storing data in a software controllable multilevel memory; receiving data for training with a mixed precision array; detecting zero value operands with detect logic of the mixed precision array; …

This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Response to Arguments
Applicant’s arguments with respect to claims 1, 8, and 16 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM CLARK STANDKE whose telephone number is (571)270-1806.  The examiner can normally be reached on 9:30am-6:30pm M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ADAM CLARK STANDKE/Examiner, Art Unit 2122                                                                                                                                                                                                         
/ERIC NILSSON/Primary Examiner, Art Unit 2122