DETAILED ACTION
1.	This communication is in response to Application No. 16/744,039 filed on January 15, 2020 in which claims 1-26 are presented for examination.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
3.	The information disclosure statements submitted on 11/04/2020, 05/11/2021, and 02/28/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Objections
4.	Claim 26 is objected to because of the following informalities: Claim 26 recites “The computer implemented method of Claim 5 […]”, however Claim 5 does not recite a method and instead recites a hardware accelerator. Claim 26 may instead be corrected to recite “The computer implemented method of Claim 25 […]”.  Appropriate correction is required.

Claim Interpretation
5.	The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

6.	The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
7.	This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  
Such claim limitation(s) is/are: 
“Compute units” in Independent Claims 1, 12, and their dependents.
“Non-zero selector” in Claims 7, 11, 18 and their dependents.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
8.	The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

9.	Claims 1, 7, 11, 12, 18 and their dependents are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

10.	Claim limitations “Compute units” in Claims 1, 12 and their dependents and “Non-zero selector” in Claims 7, 11, 18 and their dependents invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The applicant’s specification does not appear to provide sufficient structure for “compute units” and “non-zero selector”. The specification simply recites the components without describing how the structure performs the entire function in the claim language. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 102
11.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


12.	Claims 1, 12, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mellempudi et al. (hereinafter Mellempudi) (US PG-PUB 20180322607).
Regarding Claim 1, Mellempudi teaches a hardware accelerator for training quantized data, comprising: 
software controllable multilevel memory to store data (Mellempudi, Par. [0045], “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102. The memory hub 105 couples with an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. Additionally, the I/O hub 107 can enable a display controller, which may be included in the one or more processor(s) 102, to provide outputs to one or more display device(s) 110A.”, thus, a software controllable multilevel memory to store data is disclosed, more information on specific multi-core and processor memory can be found in par. [0089]); and 
a mixed precision array coupled to the memory, the mixed precision array includes an input buffer (Mellempudi, Par. [0188], “A method and apparatus are described to perform quantization and data representation of low-precision tensors in deep learning applications. Each low-precision tensor may contain a data buffer and associated metadata represented as a data structure. The metadata may contain information pertaining to data type (integer, fixed-point, float or any other custom data type), precision and shared exponent(s)/scaling factor(s) necessary for performing data conversions and arithmetic operations.”, thus a tensor, which may represent a multidimensional array, is comprised of mixed precision data and includes a data buffer for input), 
detect logic to detect zero value operands (Mellempudi, Par. [0195], “The common scale factor 1522 can be used to convert each magnitude integer 1524 (I.sub.x) into a floating-point value after a round of fixed-point computations. The magnitude integer 1524 (I.sub.x) is converted by passing the value into a leading zero detector logic 1530 (LZD) to generate an LZ.sub.x value 1532 that indicates the number of leading zeros in the magnitude integer 1524. The magnitude integer 1524 (I.sub.x) is left shifted by shift logic based on the LZ.sub.x value 1532 and the explicitly stored leading bit is restored to an implicit leading bit 1533. As with quantization, the sign bit 1520 is unchanged”, thus, detect logic to detect zero values is disclosed), and 
a plurality of heterogenous precision compute units to perform computations of mixed precision data types (Mellempudi, Par. [0219], “The instruction can cause one or more of the compute units (e.g., compute unit 1810) of the SIMT unit 1809 to perform computational operations associated with a neural network as described herein. In one embodiment, the instruction causes a compute unit 1810 to perform a dynamic precision computation using a dynamic fixed-point data type as described herein.”, therefore, a plurality of compute units are able to perform dynamic computations on mixed precision types) for a backward propagation phase of training quantized data of a neural network (Mellempudi, Par. [0158], “Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network.”, thus, backward propagation phase of a neural network is disclosed, and further details on the quantized data begins in Par. [0188]).
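
For illustration only, the leading zero detector (LZD) operation quoted from Mellempudi, Par. [0195], above may be sketched as follows. This is a minimal sketch and not Mellempudi's implementation; the function names and the 8-bit field width are hypothetical and chosen only for the example, and the restoration of the implicit leading bit is not modeled.

```python
def leading_zeros(x: int, width: int = 8) -> int:
    """Count the leading zero bits of x within a fixed-width field (hypothetical helper)."""
    if not 0 <= x < (1 << width):
        raise ValueError("x does not fit in the given width")
    # int.bit_length() is the position of the highest set bit (0 for x == 0).
    return width - x.bit_length()


def normalize(x: int, width: int = 8) -> tuple:
    """Left-shift x by its leading-zero count, as the quoted passage describes."""
    lz = leading_zeros(x, width)
    shifted = (x << lz) & ((1 << width) - 1)  # keep the result inside the field
    return lz, shifted
```

Under these assumptions, the 8-bit magnitude 0b00010110 has three leading zeros and is shifted to 0b10110000.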

Regarding Claim 12, Mellempudi teaches a data processing system (Mellempudi, Fig. 1, label 100 depicts the computing system) comprising: a hardware processor (Mellempudi, Fig. 1, label 101 depicting a plurality of processors); memory (Mellempudi, Fig. 1, label 105 depicting a memory hub and label 104 depicting system memory); and a hardware accelerator coupled to the memory (Mellempudi, Par. [0098], “One embodiment of the accelerator integration circuit 436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 446 and/or other accelerator devices. The graphics accelerator module 446 may be dedicated to a single application executed on the processor 407 or may be shared between multiple applications.”, thus, as also shown in Figures 4A and 4B, a hardware accelerator device is disclosed and is coupled to memory), the hardware accelerator includes […]
	The rest of the claim language in Claim 12 recites substantially the same limitations as Claim 1, in the form of a data processing system; therefore, it is rejected under the same rationale.
The reasons for anticipation have been noted in the rejection of Claim 1 above and are applicable herein.

Regarding Claim 19, Mellempudi teaches a computer implemented method for quantized neural network training comprising: 
storing data in a software controllable multilevel memory (Mellempudi, Par. [0045], “The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102. The memory hub 105 couples with an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. Additionally, the I/O hub 107 can enable a display controller, which may be included in the one or more processor(s) 102, to provide outputs to one or more display device(s) 110A.”, thus, a software controllable multilevel memory to store data is disclosed, more information on specific multi-core and processor memory can be found in par. [0089]); 
receiving data for training with a mixed precision array (Mellempudi, Par. [0188], “A method and apparatus are described to perform quantization and data representation of low-precision tensors in deep learning applications. Each low-precision tensor may contain a data buffer and associated metadata represented as a data structure. The metadata may contain information pertaining to data type (integer, fixed-point, float or any other custom data type), precision and shared exponent(s)/scaling factor(s) necessary for performing data conversions and arithmetic operations.”, thus a tensor, which may represent a multidimensional array, is comprised of mixed precision data and includes a data buffer for input); 
detecting zero value operands with detect logic of the mixed precision array (Mellempudi, Par. [0195], “The common scale factor 1522 can be used to convert each magnitude integer 1524 (I.sub.x) into a floating-point value after a round of fixed-point computations. The magnitude integer 1524 (I.sub.x) is converted by passing the value into a leading zero detector logic 1530 (LZD) to generate an LZ.sub.x value 1532 that indicates the number of leading zeros in the magnitude integer 1524. The magnitude integer 1524 (I.sub.x) is left shifted by shift logic based on the LZ.sub.x value 1532 and the explicitly stored leading bit is restored to an implicit leading bit 1533. As with quantization, the sign bit 1520 is unchanged”, thus, detect logic to detect zero values is disclosed); and 
performing, with a plurality of heterogenous precision compute units (Mellempudi, Par. [0219], “The instruction can cause one or more of the compute units (e.g., compute unit 1810) of the SIMT unit 1809 to perform computational operations associated with a neural network as described herein. In one embodiment, the instruction causes a compute unit 1810 to perform a dynamic precision computation using a dynamic fixed-point data type as described herein.”, therefore, a plurality of compute units are able to perform dynamic computations on mixed precision types), computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network (Mellempudi, Par. [0158], “Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the of the neural network.”, thus, backward propagation phase of a neural network is disclosed, and further details on the quantized data begins in Par. [0188]).

Claim Rejections - 35 USC § 103
13.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

14. 	Claims 2, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mellempudi et al. (hereinafter Mellempudi) (US PG-PUB 20180322607), in view of Chen et al. (hereinafter Chen) (“Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”).
Regarding Claim 2, Mellempudi teaches the hardware accelerator of claim 1.
Mellempudi does not explicitly disclose wherein the mixed precision array utilizes low-overhead desynchronized encoding for skipping zero value operands.
However, Chen teaches wherein the mixed precision array utilizes low-overhead desynchronized (Chen, Pg. 128-129, A. Overview, “Fig. 2 shows the top-level architecture and memory hierarchy of the Eyeriss system. It has two clock domains: the core clock domain for processing, and the link clock domain for communication with the off-chip DRAM through a 64-b bidirectional data bus. The two domains run independently and communicate through an asynchronous FIFO interface. The core clock domain consists of a spatial array of 168 PEs organized as a 12 × 14 rectangle, a 108-kB GLB, an RLC CODEC, and an ReLU module”, therefore, the array may utilize an asynchronous interface) encoding for skipping zero value operands (Chen, Pg. 132, B. Exploit Data Statistics, “RLC is used in Eyeriss to exploit the zeros in fmaps and save DRAM bandwidth. Fig. 8 shows an example of RLC encoding. Consecutive zeros with a maximum run length of 31 are represented using a 5-b number as the Run. The next value is inserted directly as a 16-b Level, and the count for run starts again. Every three pairs of run and level are packed into a 64-b word, with the last bit indicating if the word is the last one in the code. Based on our experiments using AlexNet with the ImageNet data set, the compression rate of RLC only adds 5%–10% overhead to the theoretical entropy limit”, thus, low-overhead RLC (run-length compression) encoding is used for skipping zero value operands (shown in Figure 12 on Pg. 134)).
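
For illustration only, the run-length compression (RLC) format quoted from Chen, Pg. 132, above (5-b Run, 16-b Level, three pairs per 64-b word, final bit marking the last word) may be sketched as follows. This is a minimal sketch and not the Eyeriss implementation; all function names are hypothetical, and the handling of runs longer than 31, trailing zeros, and padding of a partial word are assumptions not stated in the quoted passage.

```python
from typing import List, Tuple

MAX_RUN = 31          # largest value of the 5-bit Run field
LEVEL_MASK = 0xFFFF   # 16-bit Level field


def rlc_pairs(values: List[int]) -> List[Tuple[int, int]]:
    """Convert a stream of 16-bit values into (run, level) pairs."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < MAX_RUN:
            run += 1  # extend the current run of zeros
        else:
            # Non-zero value, or a zero arriving after a full run of 31
            # (assumption: that zero is stored as an explicit Level of 0).
            pairs.append((run, v & LEVEL_MASK))
            run = 0
    if run:
        # Assumption: trailing zeros are flushed as (run - 1) zeros plus a zero Level.
        pairs.append((run - 1, 0))
    return pairs


def pack_words(pairs: List[Tuple[int, int]]) -> List[int]:
    """Pack three (run, level) pairs plus a last-word flag into each 64-bit word."""
    words = []
    for i in range(0, len(pairs), 3):
        group = pairs[i:i + 3]
        group += [(0, 0)] * (3 - len(group))  # assumption: pad a partial word
        word = 0
        for run, level in group:
            word = (word << 21) | (run << 16) | level
        is_last = 1 if i + 3 >= len(pairs) else 0
        words.append((word << 1) | is_last)
    return words


def unpack_words(words: List[int], n: int) -> List[int]:
    """Decode packed words back into the first n values of the stream."""
    out = []
    for w in words:
        w >>= 1  # drop the last-word flag
        group = []
        for _ in range(3):
            group.append(((w >> 16) & 0x1F, w & LEVEL_MASK))
            w >>= 21
        for run, level in reversed(group):  # first pair sits in the highest bits
            out.extend([0] * run)
            out.append(level)
    return out[:n]
```

Because padding pairs decode as extra zeros, the decoder in this sketch takes the expected stream length `n` and truncates to it.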

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware accelerator of Claim 1 for training quantized data of a neural network, including memory and a mixed precision array, as disclosed by Mellempudi, to include utilizing low-overhead desynchronized encoding for skipping zero value operands, as disclosed by Chen. One of ordinary skill in the art would have been motivated to make this modification to save processing power by skipping unnecessary computations (Chen, Pg. 132, B. Exploit Data Statistics, “Even though the RS dataflow optimizes data movement for all data types, the intrinsic amount of data and the corresponding computation are still high. To further improve energy efficiency, data statistics of CNN is explored to: 1) reduce DRAM accesses using compression, which is the most energy consuming data movement per access, on top of the optimized dataflow; and 2) skip the unnecessary computations to save processing power (Section V-C).”).

Claim 13 recites substantially the same limitations as Claim 2 in the form of a data processing system; therefore, it is rejected under the same rationale.

Claim 20 recites substantially the same limitations as Claim 2 in the form of a computer implemented method; therefore, it is rejected under the same rationale.

15. 	Claims 3-11, 14-18, and 21-26 are rejected under 35 U.S.C. 103 as being unpatentable over Mellempudi et al. (hereinafter Mellempudi) (US PG-PUB 20180322607), in view of Chen et al. (hereinafter Chen) (“Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”), further in view of Han et al. (hereinafter Han) (US PG-PUB 20190196788).
Regarding Claim 3, Mellempudi teaches the hardware accelerator of claim 1.
Mellempudi does not explicitly disclose with desynchronized encoding and uses a desynchronization tag to remove synchronization between rows of the mixed precision array.
However, Chen teaches with desynchronized encoding (Chen, Pg. 128-129, A. Overview, “Fig. 2 shows the top-level architecture and memory hierarchy of the Eyeriss system. It has two clock domains: the core clock domain for processing, and the link clock domain for communication with the off-chip DRAM through a 64-b bidirectional data bus. The two domains run independently and communicate through an asynchronous FIFO interface. The core clock domain consists of a spatial array of 168 PEs organized as a 12 × 14 rectangle, a 108-kB GLB, an RLC CODEC, and an ReLU module”, therefore, the array may utilize an asynchronous interface and RLC (run-length compression) encoding) and uses a desynchronization tag (Chen, Pg. 133, B. Network-on-Chip, “The challenge is that the group of destination PEs varies across layers due to the differences in data type, convolution stride, and mapping. Broadcasting each data with a bit-vector tag of the same size of the PE array (i.e., 168 b), which indicates the IDs of destination PEs, can support any arbitrary mapping.”, thus, a tag is assigned to indicate destination processing elements) to remove synchronization between rows of the mixed precision array (Chen, Pgs. 128-129, A. Overview, “Fig. 2 shows the top-level architecture and memory hierarchy of the Eyeriss system. It has two clock domains: the core clock domain for processing, and the link clock domain for communication with the off-chip DRAM through a 64-b bidirectional data bus. The two domains run independently and communicate through an asynchronous FIFO interface. The core clock domain consists of a spatial array of 168 PEs organized as a 12 × 14 rectangle, a 108-kB GLB, an RLC CODEC, and an ReLU module. 
To transfer data for computation, each PE can either communicate with its neighbor PEs or the GLB through an NoC, or access a memory space that is local to the PE called spads (Section V-C).”, therefore, synchronization is removed through the use of an asynchronous interface and further, it is mentioned on Pgs. 133-134 and Figures 10 and 11 that only active rows that have a tag match are delivered to the processing elements, thus, enabling the asynchronous interface).  
Neither Mellempudi nor Chen explicitly discloses wherein the detect logic comprises a multi-lane adder logic.
However, Han teaches wherein the detect logic comprises a multi-lane adder logic (Han, Par. [0025], “FIG. 2 illustrates an exemplary architecture of a multiply-add array with 4 lanes in parallel, wherein the array comprises of four multipliers M1-M4 and four adders A1-A4. It should be noted that figures in the present disclosure will be illustrated with a 4-way SIMD, but the 4-way SIMD concept is scalable to be narrower or wider than 4 lanes.”, therefore, multi-lane adder logic with regards to an array is disclosed).
		
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware accelerator of Claim 1 comprising a hardware accelerator for training quantized data of a neural network including memory and a mixed precision array, as disclosed by Mellempudi, to include utilizing desynchronized encoding and a desynchronization tag to remove synchronization between rows, as disclosed by Chen. One of ordinary skill in the art would have been motivated to make this modification to save processing power and reduce memory accesses by enabling desynchronized/asynchronous encoding (Chen, Pg. 127, I. Introduction, “To address these challenges, it is crucial to design a compute scheme, called a dataflow, that can support a highly parallel compute paradigm while optimizing the energy cost of data movement from both on-chip and off-chip. The cost of data movement is reduced by exploiting data reuse in a multilevel memory hierarchy, and the hardware needs to be reconfigurable to support different shapes. To further improve energy efficiency, data statistics can also be exploited. Specifically, CNN data contains many zeros. Techniques such as compression and data adaptive processing can be applied to save both memory bandwidth and processing power”).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware accelerator for training quantized data of a neural network, including memory, a mixed precision array, and the use of desynchronized encoding and a desynchronization tag to remove synchronization between rows, as disclosed by Mellempudi in view of Chen, to include wherein the detect logic comprises a multi-lane adder logic, as disclosed by Han. One of ordinary skill in the art would have been motivated to make this modification to allow for different lanes in different rows to simultaneously work on producing different outputs, hence enabling power savings and increasing efficiency (Han, Par. [0004], “Embodiments of the present disclosure provide an architecture of a software programmable connection between a multiplier array and an adder array to enable reusing of the adders to perform either multiply-accumulate or multiply-reduce. As compared to conventional solutions, this architecture is more area- and power-efficient, which is important for neural network processing units where a substantial number of data lanes are implemented.”).

Regarding Claim 4, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 3, wherein the detect logic is configured to encode non-zero value operands as value, offset, and desynchronization tag to specify an identification (ID) of a sparse-vector that operates on each row (Chen, Pg. 133, “Instead, we implemented the GIN, as shown in Fig. 10, with two levels of hierarchy: Y-bus and X-bus. A vertical Y-bus consists of 12 horizontal X-buses, one at each row of the PE array, and each X-bus connects to 14 PEs in the row. Each X-bus has a row ID, and each PE has a col ID. These IDs are all reconfigurable, and a unique ID is given to each group of X-buses or PEs that receives the same data in a given CNN layer. Each data read from the GLB is augmented with a (row, col) tag by the top-level controller, and the GIN guarantees that the data are delivered to all and only the X-buses and then PEs with the ID that matches the tag within a single cycle”, therefore, a tag is produced, consisting of the non-zero value and the corresponding row and column (which may indicate offset), such that an ID is specified. Further, on Pg. 134, Figure 11 illustrates how the ID values are used and how rows and columns of the vectors are processed asynchronously depending on when the ID matches the corresponding tags).
The reasons of obviousness have been noted in the rejection of Claim 3 above and are applicable herein.
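
For illustration only, the tag-ID matching quoted from Chen, Pg. 133, above (a (row, col) tag compared first against each X-bus row ID and then against each PE col ID) may be sketched as follows. This is a minimal sketch and not the Eyeriss implementation; the function name, the list-based representation of the IDs, and the small 3 x 3 example array are hypothetical.

```python
from typing import List, Tuple


def deliver(tag: Tuple[int, int],
            row_ids: List[int],
            col_ids: List[List[int]]) -> List[Tuple[int, int]]:
    """Return the (row index, column index) of every PE that accepts a tagged value."""
    row_tag, col_tag = tag
    targets = []
    for r, rid in enumerate(row_ids):
        if rid != row_tag:
            continue  # unmatched X-bus: gated, nothing is delivered to this row
        for c, cid in enumerate(col_ids[r]):
            if cid == col_tag:
                targets.append((r, c))  # PE whose col ID matches the col tag
    return targets


# Hypothetical 3 x 3 array: rows 0 and 1 share row ID 0, so both see data tagged row 0.
example_row_ids = [0, 0, 1]
example_col_ids = [[0, 1, 1], [2, 1, 0], [1, 1, 1]]
example_targets = deliver((0, 1), example_row_ids, example_col_ids)
```

In this hypothetical configuration, a value tagged (0, 1) reaches PEs (0, 1), (0, 2), and (1, 1), while the third row is skipped entirely because its row ID does not match.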

Regarding Claim 5, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 4, wherein the multi-lane adder (See multi-lane adder teaching of Han reference in Claim 3 above) logic uses two tag-lanes within each column, the compute units (See compute unit teaching of Mellempudi reference in Claim 1 above) in each column share tag-lanes (Chen, Pgs. 133-134, “Each data read from the GLB is augmented with a (row, col) tag by the top-level controller, and the GIN guarantees that the data are delivered to all and only the X-buses and then PEs with the ID that matches the tag within a single cycle. The tag-ID matching is done using the Multicast Controller (MC). There are 12 MCs on the Y-bus to compare the row tag with the row ID of each X-bus, and 14 MCs on each of the X-buses to compare the col tag with the col ID of each PE.”, thus, each Y-bus contains 12 multicast controllers and each X-bus contains 14 multicast controllers, which may process the tag-ID pairs and in which each column shares lanes) and within each column, compute units forward their results to one of the tag-lanes using the least significant bit (LSB) of the desynchronization tag (Chen, Pg. 134, “The numbers of filters (p) and channels (q) that the PE processes at once are statically configured into the control of a PE, which determines the state of processing. This configuration controls the pattern with which the PE steps through the three spads. The datapath is pipelined into three stages: one stage for spad access, and the remaining two for computation. The computation consists of a 16-b two-stage pipelined multiplier and adder. Since the multiplication results are truncated from 32 to 16 b, the selection of 16 b out of the 32 b is configurable, and can be decided by the dynamic range of a layer from offline experiments. 
Spads are separated for three data types to provide enough access bandwidth”, thus, as also shown in Figure 12, the results are forwarded to the corresponding lanes for computation and are truncated accordingly. Similarly, within the Mellempudi reference, Par. [0225] describes a leading bit detector which can be used to detect leading ones or zeroes in a sum and can be used to determine if an output needs to be shifted).
The reasons of obviousness have been noted in the rejection of Claim 3 above and applicable herein.
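
For illustration only, the tag-lane selection recited in Claim 5 may be sketched as follows. The sketch assumes integer desynchronization tags and two lanes per column; the function name `route_to_tag_lanes` and the (tag, value) representation are hypothetical and are not drawn from Mellempudi, Chen, or Han.

```python
from typing import Dict, List, Tuple


def route_to_tag_lanes(results: List[Tuple[int, int]]) -> Dict[int, List[Tuple[int, int]]]:
    """Split the (tag, value) results of one column across two tag-lanes.

    Assumption: lane 0 is the "even" lane and lane 1 is the "odd" lane.
    """
    lanes: Dict[int, List[Tuple[int, int]]] = {0: [], 1: []}
    for tag, value in results:
        lane = tag & 1  # least significant bit of the tag selects the lane
        lanes[lane].append((tag, value))
    return lanes


# Hypothetical results produced by the compute units of a single column.
example = route_to_tag_lanes([(4, 10), (7, 3), (2, 5)])
```

In this sketch, results whose tags are even share one lane and results whose tags are odd share the other, so adjacent tags never contend for the same lane.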

Regarding Claim 6, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 5, wherein each lane includes select logic to determine whether the tag for a current row matches a previous row's tag for either odd or even tag-lanes, the values are added together and forwarded to the next row when the tags match, and results are stored locally when the tags do not match (Chen, Pgs. 133-134, “The tag-ID matching is done using the Multicast Controller (MC). There are 12 MCs on the Y-bus to compare the row tag with the row ID of each X-bus, and 14 MCs on each of the X-buses to compare the col tag with the col ID of each PE. The unmatched X-buses and PEs are gated to save energy. For flow control, the data are passed from the GLB down to the GIN only when all destination PEs have issued a ready signal. An example of the row and col ID setup for ifmap delivery using GIN in AlexNet is shown in Fig. 11.”, thus, as also shown in Figures 10 and 11, tag-ID matching is performed by the multicast controller, such that processing may be continued/completed if the tag-ID pair matches and where the unmatched tag-ID pairs are gated to save energy).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.
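
For illustration only, the tag-ID matching described in the quoted passage of Chen may be sketched as follows. The array dimensions (12 X-buses, 14 PEs per X-bus) come from the quoted passage; the function name `matched_pes` and the particular ID assignment are hypothetical and are not taken from Fig. 11 of Chen.

```python
from typing import List, Tuple


def matched_pes(tag: Tuple[int, int],
                row_ids: List[int],
                col_ids: List[List[int]]) -> List[Tuple[int, int]]:
    """Return the (row, col) positions of the PEs that receive data carrying `tag`."""
    row_tag, col_tag = tag
    targets = []
    for r, row_id in enumerate(row_ids):
        if row_id != row_tag:
            continue  # unmatched X-bus is gated, none of its PEs are checked
        for c, col_id in enumerate(col_ids[r]):
            if col_id == col_tag:
                targets.append((r, c))  # matched PE receives the data
    return targets


# Hypothetical ID setup: rows share an ID in groups of four,
# and column IDs repeat every seven PEs.
row_ids = [r // 4 for r in range(12)]
col_ids = [[c % 7 for c in range(14)] for _ in range(12)]
targets = matched_pes((1, 3), row_ids, col_ids)
```

Because several X-buses and PEs may be configured with the same ID, a single tagged datum reaches every matching PE and no others.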

Regarding Claim 7, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 6, wherein the detect logic includes zero value operand detector logic and non-zero selector (Chen, Pg. 135, “Data gating logic is implemented to exploit zeros in the ifmap for saving processing power. An extra 12-b Zero Buffer is used to record the position of zeros in the ifmap spad. If a zero ifmap value is detected from the zero buffer, the gating logic will disable the read of the filter spad and prevent the MAC datapath from switching. Compared with the PE design without the data gating logic, it can save the PE power consumption by 45%.”, thus, data gating logic includes zero value operand detection, such that the zero buffer records the position of zeroes. Further, only the non-zero values are selected for processing, analogous to the non-zero selector, as shown in the processing element architecture in Figure 12).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.
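
For illustration only, the data gating behavior described in the quoted passage of Chen may be modeled in software as follows. This is a functional sketch, not a description of the circuit; the function name `gated_mac` and the returned operation count are hypothetical.

```python
from typing import List, Tuple


def gated_mac(ifmap: List[int], weights: List[int]) -> Tuple[int, int]:
    """Multiply-accumulate that skips every position whose ifmap value is zero.

    Returns the accumulated sum and the number of MAC operations performed.
    """
    # Zero buffer: one flag per ifmap position, set where the value is zero.
    zero_buffer = [value == 0 for value in ifmap]
    acc = 0
    macs_performed = 0
    for is_zero, x, w in zip(zero_buffer, ifmap, weights):
        if is_zero:
            continue  # gated: the weight is not read and no MAC is issued
        acc += x * w
        macs_performed += 1
    return acc, macs_performed


# Two of the four hypothetical inputs are zero, so two MACs are skipped.
result = gated_mac([0, 2, 0, 3], [5, 6, 7, 8])
```

The accumulated sum is identical to that of an ungated multiply-accumulate; only the number of operations performed changes.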

Regarding Claim 8, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 7, wherein the zero value operand detector logic includes comparators to generate a bit-vector that corresponds to using a single bit for each bit of a sub-vector (Chen, Pg. 133, “Broadcasting each data with a bit-vector tag of the same size of the PE array (i.e., 168 b), which indicates the IDs of destination PEs, can support any arbitrary mapping.”, thus, a bit-vector corresponding to each bit of a processing element array/vector is generated).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.

Regarding Claim 9, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 8, wherein each bit in the bit-vector specifies if a corresponding value in the sub-vector is a zero value or a non-zero value (Chen, Pg. 132, “RLC is used in Eyeriss to exploit the zeros in fmaps and save DRAM bandwidth. Fig. 8 shows an example of RLC encoding. Consecutive zeros with a maximum run length of 31 are represented using a 5-b number as the Run. The next value is inserted directly as a 16-b Level, and the count for run starts again. Every three pairs of run and level are packed into a 64-b word, with the last bit indicating if the word is the last one in the code.”, therefore, as also shown in Figure 8, using run length compression encoding allows for the zeros to be exploited. Further, zero values are denoted as “Run” and non-zero values are denoted as “Level”, such that consecutive zeroes can be skipped).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.
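
For illustration only, the run/level encoding described in the quoted passage of Chen may be sketched as follows. The 5-b run, 16-b level, and three-pairs-per-64-b-word figures come from the quoted passage. The handling of runs longer than 31 and of trailing zeros, the bit ordering inside the packed word, and all function names are assumptions made for this sketch; the example data is hypothetical and non-negative.

```python
from typing import List, Tuple

MAX_RUN = 31  # largest run representable in 5 bits


def rlc_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Encode values as (run, level) pairs: `run` zeros followed by one level."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < MAX_RUN:
            run += 1
        else:
            # Assumption: when the run is full, the next zero is stored as a zero level.
            pairs.append((run, v))
            run = 0
    if run > 0:
        # Assumption: trailing zeros are closed with a zero level.
        pairs.append((run - 1, 0))
    return pairs


def rlc_decode(pairs: List[Tuple[int, int]]) -> List[int]:
    """Expand (run, level) pairs back into the original values."""
    out: List[int] = []
    for run, level in pairs:
        out.extend([0] * run)
        out.append(level)
    return out


def pack_word(pairs: List[Tuple[int, int]], last: bool) -> int:
    """Pack exactly three pairs (3 x 21 bits) plus a 1-bit last flag into 64 bits."""
    assert len(pairs) == 3
    word = 0
    for run, level in pairs:
        word = (word << 21) | (run << 16) | (level & 0xFFFF)
    return (word << 1) | int(last)


data = [0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]
pairs = rlc_encode(data)
word = pack_word(pairs, last=True)
```

Three pairs of 21 bits each occupy 63 bits, which leaves exactly one bit for the end-of-code flag described in the quoted passage.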

Regarding Claim 10, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 9, wherein when all bits in the bit-vector are zero values then the sub-vector is skipped entirely (Chen, Pg. 134, Figure 12, which depicts that zero values are skipped once detected), wherein if at least one bit in the bit-vector is non-zero value then the sub-vector is pushed to a FIFO queue (Chen, Pgs. 134-135, “Fig. 12 shows the architecture of a PE. FIFOs are used at the I/O of each PE to balance the workload between the NoC and the computation. The numbers of filters (p) and channels (q) that the PE processes at once are statically configured into the control of a PE, which determines the state of processing.”, therefore, FIFO queues are used for workload processing and as reiterated by Fig. 12 and on Pg. 135, the data gating logic exploits zeroes for saving processing power, such that only non-zero values are processed), along with its bit-vector and a desynchronization tag for identifying an input ID (Chen, Pgs. 133-134, “Broadcasting each data with a bit-vector tag of the same size of the PE array (i.e., 168 b), which indicates the IDs of destination PEs, can support any arbitrary mapping.”, thus, each non-zero value, its bit-vector, and tag form an ID which is also pushed to the FIFO queue).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.

Regarding Claim 11, Mellempudi in view of Chen further in view of Han teaches the hardware accelerator of claim 10, wherein the non-zero selector is configured to cause the FIFO queue to dequeue at least one sub-vector and to read only those sub-vectors that have at least one non-zero value (Chen, Pgs. 134-135, “Data gating logic is implemented to exploit zeros in the ifmap for saving processing power. An extra 12-b Zero Buffer is used to record the position of zeros in the ifmap spad. If a zero ifmap value is detected from the zero buffer, the gating logic will disable the read of the filter spad and prevent the MAC datapath from switching. Compared with the PE design without the data gating logic, it can save the PE power consumption by 45%.”, thus, if a zero value is detected from the zero buffer, the gating logic disables the read and the zero value is not read by the FIFO queue, hence only non-zero values are read).
The reasons for obviousness have been noted in the rejection of Claim 3 above and are applicable herein.
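
For illustration only, the limitations recited in Claims 8 through 11 (bit-vector generation, skipping of all-zero sub-vectors, FIFO enqueue with a tag, and non-zero selection on dequeue) may be sketched as follows. The sketch assumes that the desynchronization tag is simply the index of the sub-vector in the input; that assumption, the `Entry` layout, and the function names are hypothetical and are not drawn from the cited references.

```python
from collections import deque
from typing import Deque, List, Tuple

# (sub-vector, bit-vector, tag); the tag is assumed to be the input index.
Entry = Tuple[List[int], List[int], int]


def enqueue_nonzero(sub_vectors: List[List[int]], fifo: Deque[Entry]) -> None:
    """Push only the sub-vectors that contain at least one non-zero value."""
    for tag, sub_vector in enumerate(sub_vectors):
        # One comparator per element: bit is 1 where the value is non-zero.
        bit_vector = [1 if v != 0 else 0 for v in sub_vector]
        if not any(bit_vector):
            continue  # all-zero sub-vector is skipped entirely
        fifo.append((sub_vector, bit_vector, tag))


def dequeue_nonzero_values(fifo: Deque[Entry]) -> List[Tuple[int, int, int]]:
    """Pop one entry and return (tag, position, value) for its non-zero values."""
    sub_vector, bit_vector, tag = fifo.popleft()
    return [(tag, i, v) for i, (b, v) in enumerate(zip(bit_vector, sub_vector)) if b]


fifo: Deque[Entry] = deque()
enqueue_nonzero([[0, 0, 0, 0], [0, 3, 0, 1], [2, 0, 0, 0]], fifo)
```

In this sketch the first sub-vector never enters the queue, and each dequeue yields only the non-zero operands together with the tag that identifies their input.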

Claim 14 recites substantially the same limitations as Claim 3 in the form of a data processing system, therefore it is rejected under the same rationale.

Claim 15 recites substantially the same limitations as Claim 4 in the form of a data processing system, therefore it is rejected under the same rationale.

Claim 16 recites substantially the same limitations as Claim 5 in the form of a data processing system, therefore it is rejected under the same rationale.

Claim 17 recites substantially the same limitations as Claim 6 in the form of a data processing system, therefore it is rejected under the same rationale.

Claim 18 recites substantially the same limitations as Claim 7 in the form of a data processing system, therefore it is rejected under the same rationale.

Claim 21 recites substantially the same limitations as Claim 3 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Claim 22 recites substantially the same limitations as Claim 4 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Claim 23 recites substantially the same limitations as Claim 5 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Claim 24 recites substantially the same limitations as Claim 6 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Claim 25 recites substantially the same limitations as Claim 7 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Claim 26 recites substantially the same limitations as Claims 8 and 9 in the form of a computer implemented method, therefore it is rejected under the same rationale.

Conclusion
16.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
Nurvitadhi et al. (US PG-PUB 20190205746) disclosed a hardware accelerator that facilitates processing a sparse matrix for an arbitrary neural network, including the ability to perform mixed precision operations.
Burger et al. (US PG-PUB 20190057303) disclosed a hardware node with a mixed-signal matrix vector unit, that comprises vector lanes.
Malaya et al. (US PG-PUB 20190171420) disclosed computational units within a field programmable gate array (FPGA) configured to generate output values based on input values, also comprising mixed precision logic.
Liu et al. (US PG-PUB 20220164666) disclosed a method for performing efficient mixed-precision search for quantizers in an artificial neural network.
Daga et al. (US PG-PUB 20190325303) disclosed an apparatus to facilitate acceleration of machine learning operations, including mixed precision operations.
Duong et al. (US Patent 11170289) disclosed a neural network inference circuit (NNIC) for executing a neural network with multiple computation nodes.
Moradi et al. (“A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)”) disclosed a routing methodology that employs hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing memory and latency requirements to support a neural network.
Nandakumar et al. (“Mixed-precision training of deep neural networks using computational memory”) disclosed a mixed-precision architecture that combines a computational memory unit with a digital processing unit.
Moradi et al. (“An Event-Based Neural Network Architecture With an Asynchronous Programmable Synaptic Memory”) disclosed analog circuits and asynchronous digital circuits to implement networks of spiking neurons.
Mukkara et al. (“SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks”) disclosed a sparse convolutional neural network accelerator architecture which exploits zero-valued weights.
Han et al. (“Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”) disclosed a deep compression method for neural networks, consisting of pruning, trained quantization, and Huffman coding.
Yazdanbakhsh et al. (“ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks”) disclosed an end-to-end framework to automate deep quantization of neural networks.
Gupta et al. (“Deep Learning with Limited Numerical Precision”) disclosed the effect of limited precision data representation and computation on neural network training.

17.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Devika S Maharaj whose telephone number is 571-272-0829. The examiner can normally be reached Monday - Thursday 7:30am - 4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached at 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/D.S.M./Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123