Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on December 27, 2021, in which claims 1-20 are amended. Claims 1-20 are currently pending.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on December 14, 2021 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
Applicant's arguments with respect to the rejection of claims 1-20 under 35 U.S.C. 103(a), based on the amendment, have been considered and are persuasive. The arguments are, however, moot in view of the new ground of rejection set forth below.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3, 6, 8-11, 14, 16-18, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Moshovos (US 2021/0004668 A1).

Regarding claim 1, Moshovos teaches A deep learning network accelerator comprising: an encoder to compress an input activation vector and a weight vector to reduce sparsity ([Abstract] "Described is a neural network accelerator tile for exploiting input sparsity. The tile includes a weight memory to supply each weight lane with a weight and a weight selection metadata, an activation selection unit to receive a set of input activation values and rearrange the set of input activation values to supply each activation lane with a set of rearranged activation values" [¶0071] "Embodiments use compression to reduce off-chip and on-chip traffic" Exploiting input sparsity is interpreted as analogous to reducing sparsity.),
thereby generating a compressed input activation vector and a compressed weight vector ([¶0063] "The compression scheme considers input values, weights or activations, into groups of a fixed number of elements such as for example 16 or 256" Group is interpreted as synonymous with vector.);
a parallelism discovery unit to apply coordinate indexes for the compressed weight vector and for the compressed input activation vector to generate matching pairs of coordinate indexes ([¶0005] "the set of multiplexers including at least one multiplexer per pair of activation and weight lanes, each multiplexer configured to select a combination activation value for the activation lane from the activation lane set of rearranged activation values based on the weight lane weight selection metadata" Parallelism discovery unit is interpreted as synonymous with multiplexer containing combination units. Rearranging the activation lane with respect to the weight lane weight selection metadata is interpreted as synonymous with generating matching pairs of coordinate indexes.);
a decoder to generate column selects and row selects from the matching pairs and an array of computing elements to receive the column selects and the row selects from the decoder and to transform the column selects, the row selects, the compressed input activation vector, and the compressed weight vector into output activations of a deep learning network ([¶0005] "and a set of combination units, the set of combination units including at least one combination unit per multiplexer, each combination unit configured to combine the activation lane combination value with the weight lane weight to output a weight lane product." [¶0091] "Cambricon-X...exploits ineffectual weights (IW) in an inner product based accelerator. Non-zero weights are compacted in memory and tagged with deltas (distance between weights). Each cycle one PE (equivalent to our inner product unit) fetches 16 weights and selects the corresponding 16 activations from a vector of 256. Chained adders are used to decode the deltas into absolute offsets. It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights." The combination unit is taught as receiving the column selects and row selects and transforming them into an output activation of a deep learning network.).
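For illustration only, the data flow recited in claim 1 (compression, index matching, select generation, multiply-accumulate) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the circuitry of the instant application or of Moshovos: the function names `compress`, `match_pairs`, and `mac` are hypothetical, "column" is assumed to index the compressed weight vector, and "row" is assumed to index the compressed input activation vector.

```python
def compress(vec):
    """Drop zero entries; return (non-zero values, their channel indices)."""
    idx = [i for i, v in enumerate(vec) if v != 0]
    return [vec[i] for i in idx], idx


def match_pairs(w_idx, a_idx):
    """Return (column_select, row_select) pairs whose channel indices agree.

    column_select indexes the compressed weight vector and row_select
    indexes the compressed activation vector (assumed convention).
    """
    row_of_channel = {channel: row for row, channel in enumerate(a_idx)}
    return [
        (col, row_of_channel[channel])
        for col, channel in enumerate(w_idx)
        if channel in row_of_channel
    ]


def mac(w_vals, a_vals, selects):
    """Multiply each matched weight/activation pair and accumulate."""
    return sum(w_vals[col] * a_vals[row] for col, row in selects)


weights = [0, 2, 0, 3, 0, 4]
activations = [5, 0, 0, 6, 7, 8]

w_vals, w_idx = compress(weights)        # ([2, 3, 4], [1, 3, 5])
a_vals, a_idx = compress(activations)    # ([5, 6, 7, 8], [0, 3, 4, 5])
selects = match_pairs(w_idx, a_idx)      # channels 3 and 5 match
output_activation = mac(w_vals, a_vals, selects)
```

In this sketch only channels present in both compressed vectors contribute, so the result equals the dense inner product of the original vectors while skipping every zero operand.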

Regarding claim 2, Moshovos teaches The deep learning network accelerator of claim 1, wherein the parallelism discovery unit is adapted to: associate the compressed weight vector and the compressed input activation vector each with a corresponding vector of channel indices, and execute a parallel search utilizing the corresponding vector of channel indices to determine pairs of reducible input activations and weights. ([¶0005]  "and a set of combination units, the set of combination units including at least one combination unit per multiplexer, each combination unit configured to combine the activation lane combination value with the weight lane weight to output a weight lane product." With respect to the instant specification, the result of executing a parallel search to determine pairs of reducible input activations is to obtain a product of the weight and input activation to be used as a partial sum, which is identical to the output result of the matched pairs input in the combination units in Moshovos.). 

Regarding claim 3, Moshovos teaches The deep learning network accelerator of claim 2, the parallelism discovery unit comprising an array of comparators to execute the parallel search ([¶0005] “a set of combination units, the set of combination units including at least one combination unit per multiplexer” Set interpreted as synonymous with array.  Combination unit interpreted as synonymous with comparator.  Parallelism discovery unit interpreted as synonymous with multiplexer.). 

Regarding claim 6, Moshovos teaches The deep learning network accelerator of claim 1, further comprising:
operating a sequence decoder for each computing element of the array of computing elements to decode an encoded sequence generated by the parallelism discovery unit to obtain a corresponding vector of channel indices ([¶0091] “Non-zero weights are compacted in memory and tagged with deltas (distance between weights). Each cycle one PE (equivalent to our inner product unit) fetches 16 weights and selects the corresponding 16 activations from a vector of 256. Chained adders are used to decode the deltas into absolute offsets. It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights."). 
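The delta decoding described in the cited passage of Moshovos ([¶0091]) can be illustrated with a short sketch. It assumes, purely for illustration, that each delta is the distance from the previous non-zero weight position and that the first delta is measured from position 0; the cited passage does not state this convention. The names `decode_deltas` and `select_activations` are hypothetical.

```python
from itertools import accumulate


def decode_deltas(deltas):
    """Running sum of deltas, a software analogue of chained adders."""
    return list(accumulate(deltas))


def select_activations(activations, deltas):
    """Pick the activation at each decoded absolute offset."""
    return [activations[offset] for offset in decode_deltas(deltas)]


deltas = [2, 3, 1, 4]                 # assumed encoding of weight positions
activations = list(range(100, 116))   # 16 placeholder activation values

offsets = decode_deltas(deltas)
selected = select_activations(activations, deltas)
```

Each absolute offset depends on all earlier deltas, which is why the hardware described in the quotation chains its adders.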

Regarding claim 8, Moshovos teaches A deep learning network acceleration method comprising: receiving an input activation vector and a weight vector; ([Abstract] “Described is a neural network accelerator tile for exploiting input sparsity. The tile includes a weight memory to supply each weight lane with a weight and a weight selection metadata, an activation selection unit to receive a set of input activation values and rearrange the set of input activation values to supply each activation lane with a set of rearranged activation values” [¶0071] " Embodiments use compression to reduce off-chip and on-chip traffic" Exploiting input sparsity interpreted as analogous to reducing sparsity.).
compressing the input activation vector and the weight vector to omit one or more missing or null values, thus generating a compressed input activation vector and a compressed weight vector; ([¶0064] "An alternative scheme includes a bitmap where each bit represents whether a value within the group is equal to or different from zero as shown in Table 3. If the value is equal to zero, it is not coded at all. Therefore, the number of coded elements per group vary. This allows for higher compression ratios for data with large number of zeros." [¶0065] "FIG. 14 indicates the effectiveness of both compression schemes for both weight and activation data" Missing or null value is interpreted as synonymous with zero value.).
providing the compressed input activation vector and the compressed weight vector to an array of computing elements configured spatially based on a particular coordinate dimension of the compressed input activation vector and the compressed weight vector; ([¶0005] “ the set of multiplexers including at least one multiplexer per pair of activation and weight lanes, each multiplexer configured to select a combination activation value for the activation lane from the activation lane set of rearranged activation values based on the weight lane weight selection metadata" Rearranging the activation lane with respect to the weight lane weight selection metadata is interpreted as synonymous with generating matching pairs of coordinate indexes configured spatially based on a particular coordinate dimension.).
providing channel indices for the compressed input activation vector and the compressed weight vector to a parallelism discovery unit, the channel indices oriented in an input channel direction; ([¶0005] “ the set of multiplexers including at least one multiplexer per pair of activation and weight lanes, each multiplexer configured to select a combination activation value for the activation lane from the activation lane set of rearranged activation values based on the weight lane weight selection metadata" Parallelism discovery unit interpreted as synonymous with multiplexer containing combination units.  Rearranging the activation lane with respect to the weight lane weight selection metadata is interpreted as synonymous with generating matching pairs of coordinate indexes.).
operating the parallelism discovery unit on the channel indices to determine matching pairs of channel inputs utilizing an array of comparators, each matching pair comprising a coordinate index for the compressed weight vector and a coordinate index for the compressed input activation vector ([¶0005]  "and a set of combination units, the set of combination units including at least one combination unit per multiplexer, each combination unit configured to combine the activation lane combination value with the weight lane weight to output a weight lane product." [¶0091] "Cambricon-X...exploits ineffectual weights (IW) in an inner product based accelerator. Non-zero weights are compacted in memory and tagged with deltas (distance between weights). Each cycle one PE (equivalent to our inner product unit) fetches 16 weights and selects the corresponding 16 activations from a vector of 256. Chained adders are used to decode the deltas into absolute offsets.  It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights.").
generating column selects and row selects from the matching pairs of channel inputs and providing the column selects and the row selects along with the compressed input activation vector and the compressed weight vector to the array of computing elements to generate output activations for the deep learning network. ([¶0005]  "and a set of combination units, the set of combination units including at least one combination unit per multiplexer, each combination unit configured to combine the activation lane combination value with the weight lane weight to output a weight lane product." [¶0091] "Cambricon-X...exploits ineffectual weights (IW) in an inner product based accelerator. Non-zero weights are compacted in memory and tagged with deltas (distance between weights). Each cycle one PE (equivalent to our inner product unit) fetches 16 weights and selects the corresponding 16 activations from a vector of 256. Chained adders are used to decode the deltas into absolute offsets.  It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights." The combination unit is taught as receiving the column selects and row selects and transforming them into an output activation of a deep learning network.). 

Regarding claim 9, Moshovos teaches The method of claim 8, wherein compressing omits all missing values from the input activation vector and the weight vector. ([¶0064] "An alternative scheme includes a bitmap where each bit represents whether a value within the group is equal to or different from zero as shown in Table 3. If the value is equal to zero, it is not coded at all. Therefore, the number of coded elements per group vary. This allows for higher compression ratios for data with large number of zeros." [¶0065] "FIG. 14 indicates the effectiveness of both compression schemes for both weight and activation data" Missing or null value is interpreted as synonymous with zero value.). 
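The bitmap scheme quoted from Moshovos ([¶0064]) can be sketched as below. This is a minimal illustration under assumptions, not the encoding of Table 3 itself: the group size of 8 and the names `bitmap_compress` and `bitmap_decompress` are hypothetical.

```python
def bitmap_compress(group):
    """Return (bitmap, coded): one bit per element, zeros are not coded."""
    bitmap = [1 if value != 0 else 0 for value in group]
    coded = [value for value in group if value != 0]
    return bitmap, coded


def bitmap_decompress(bitmap, coded):
    """Rebuild the original group from the bitmap and the coded values."""
    values = iter(coded)
    return [next(values) if bit else 0 for bit in bitmap]


group = [0, 7, 0, 0, 3, 0, 0, 1]
bitmap, coded = bitmap_compress(group)
restored = bitmap_decompress(bitmap, coded)
```

Because the number of coded elements varies with the number of zeros in the group, sparser data yields a shorter coded list, consistent with the quoted statement about higher compression ratios.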

Regarding claim 10, claim 10 effectively mirrors claim 2 and is therefore rejected under a similar interpretation.

Regarding claim 11, claim 11 effectively mirrors claim 3 and is therefore rejected under a similar interpretation.

Regarding claim 14, claim 14 effectively mirrors claim 6 and is therefore rejected under a similar interpretation.

Regarding claim 16, claim 16 effectively mirrors claim 1 and is therefore rejected under a similar interpretation.

Regarding claim 17, Moshovos teaches The system of claim 16, further comprising two or more computing elements and two or more parallel discovery units, each of the two or more computing elements sending the compressed input activation vector and the compressed weight vector to a corresponding one of the two or more parallel discovery units (FIG. 9 shows that each tile, which contains the computing elements and multiplexers, outputs to a local activation memory, which can be dispatched to a corresponding second tile.).

Regarding claim 18, Moshovos teaches The system of claim 16, further comprising two or more computing elements having a spatial layout generating a set of output activations and a spatial compressor to reduce a number of the output activations based on the spatial layout. ([¶0005] “ the set of multiplexers including at least one multiplexer per pair of activation and weight lanes, each multiplexer configured to select a combination activation value for the activation lane from the activation lane set of rearranged activation values based on the weight lane weight selection metadata" Rearranging the activation lane with respect to the weight lane weight selection metadata is interpreted as synonymous with generating matching pairs of coordinate indexes configured spatially based on a particular coordinate dimension.  See also FIG. 9). 

Regarding claim 20, Moshovos teaches The system of claim 16, wherein: the decoder operates for one or more cycles on the column sequence, each cycle determining a column select from the column sequence and a row select from the row sequence; and ([¶0068] "For example, where all activations that are processed as a group by an accelerator employing a PRA structure happen to be zero, the accelerator will process them in a single cycle" [¶0091] "Each cycle one PE (equivalent to our inner product unit) fetches 16 weights and selects the corresponding 16 activations from a vector of 256. Chained adders are used to decode the deltas into absolute offsets. It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights." "Each cycle one PE" is interpreted as a processing element performing the operation in a single cycle. Determining column and row selects to pair activation and weight is interpreted as synonymous with decoding deltas into absolute offsets to pair up activations with corresponding weights.).
the multiply and accumulate unit operates for the one or more cycles to utilize the column select and the row select to determine one of the matched pairs of the compressed weight vector and the compressed input activation vector to generate a partial sum of a current cycle, the partial sum associated with a final cycle being the output activation. ([¶0005]  "and a set of combination units, the set of combination units including at least one combination unit per multiplexer, each combination unit configured to combine the activation lane combination value with the weight lane weight to output a weight lane product." [¶0032] "Each input activation is multiplied with k weights, one per filter of the set of filters 1200 as follows: each IPU 3100 accepts a vector of N weights per cycle, one per input activation, calculates N products, reduces them via an adder tree, and accumulates the result into an output register. Once a full window has been processed, usually over multiple cycles, the output register contains the corresponding output activation."). 
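The per-cycle accumulation quoted from Moshovos ([¶0032]) can be sketched as follows. This is an illustrative software model under assumptions, not the IPU 3100 hardware: the names `adder_tree` and `inner_product_unit` are hypothetical, and N = 4 lanes is chosen only to keep the example small.

```python
def adder_tree(values):
    """Reduce values pairwise, level by level, as an adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def inner_product_unit(weight_steps, activation_steps):
    """Return the partial sum held in the output register after each cycle."""
    output_register = 0
    partial_sums = []
    for weights, activations in zip(weight_steps, activation_steps):
        products = [w * a for w, a in zip(weights, activations)]
        output_register += adder_tree(products)
        partial_sums.append(output_register)
    return partial_sums


weight_steps = [[1, 2, 3, 4], [0, 1, 0, 2]]      # one list of N weights per cycle
activation_steps = [[1, 1, 1, 1], [5, 6, 7, 8]]  # one list of N activations per cycle

partial_sums = inner_product_unit(weight_steps, activation_steps)
output_activation = partial_sums[-1]
```

The partial sum after the final cycle is the value that the quoted passage calls the output activation.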

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 4, 5, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Moshovos in view of Wuxi.

Regarding claim 4, Moshovos teaches The deep learning network accelerator of claim 3, the array of comparators ([¶0005] "a set of combination units, the set of combination units including at least one combination unit per multiplexer" Set interpreted as synonymous with array. Combination unit interpreted as synonymous with comparator. Parallelism discovery unit interpreted as synonymous with multiplexer.). However, Moshovos does not explicitly teach the array of comparators being adapted to generate a binary output at each junction indicative of a comparison result.

Wuxi teaches adapted to generate a binary output at each junction indicative of a comparison result ([¶0042] “Step 6: When it is necessary to compare the similarity between the picture N and the picture M, first calculate the binary sequence code corresponding to the picture N and the picture M" In image similarity computations images are reduced to vectors, therefore picture is being interpreted as synonymous with vector which is what is compared at each junction of the instant). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use binary weights to accelerate a neural network. The combination of Moshovos and Wuxi would have been obvious because a person of ordinary skill in the art would be able to determine from Wuxi that creating a binary representation of a weight matrix decreases space complexity of a neural network.

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Moshovos in view of Wuxi.

Regarding claim 5, Moshovos teaches The deep learning network accelerator of claim 3, further comprising: a plurality of encoders coupled to columns of the array of comparators to determine the row selects of the matching pairs of channel inputs ([Abstract] "a set of multiplexers including at least one multiplexer per pair of activation and weight lanes, where each multiplexer is configured to select a combination activation value for the activation lane from the activation lane set of rearranged activation values based on the weight lane weight selection metadata, and a set of combination units including at least one combination unit per multiplexer" [¶0091] "Chained adders are used to decode the deltas into absolute offsets. It uses a 256-wide input activation crossbar to pair up activations with the corresponding weights. This approach is similar to the weight skipping accelerator of the present invention with a very large 16×16 lookahead window and encoded mux selects." Encoded mux (multiplexer) selects is interpreted as teaching an encoder coupled to the columns of the array of multiplexers which contains the combination units.). However, Moshovos does not explicitly teach the encoders being adapted to generate a valid bit per row select to indicate whether one of the matching pairs of channel inputs was found or not.

Wuxi teaches to generate a valid bit per row select to indicate whether one of the matching pairs of channel inputs was found or not ([¶0036] “Step 4: P represents the probability of activation of the i-th element of the output layer, P (Oi=1) indicates the probability of Oi=1. Oi only has two values, namely 1 or 0. 1 means activation, 0 means inactive, and Formula 2 gives the probability of activation of the i-th neuron of the output layer"). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use binary weights to accelerate a neural network. The combination of Moshovos and Wuxi would have been obvious because a person of ordinary skill in the art would be able to determine from Wuxi that creating a binary representation of a weight matrix decreases space complexity of a neural network.
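For reference, the claim language of claims 4 and 5 (a comparator array producing a binary output at each junction, and per-column encoders producing a row select with a valid bit) can be read against the following sketch. It illustrates the claim terms only and does not represent the disclosure of Moshovos or Wuxi; the names `comparator_array` and `encode_columns`, and the choice of the first matching row as the row select, are assumptions.

```python
def comparator_array(w_idx, a_idx):
    """grid[row][col] is 1 where activation index a_idx[row] equals w_idx[col]."""
    return [[1 if a == w else 0 for w in w_idx] for a in a_idx]


def encode_columns(grid, num_cols):
    """For each column, return (valid_bit, row_select).

    valid_bit is 0 and row_select is None when no junction in the column matched.
    """
    encoded = []
    for col in range(num_cols):
        matching_rows = [row for row, line in enumerate(grid) if line[col]]
        if matching_rows:
            encoded.append((1, matching_rows[0]))
        else:
            encoded.append((0, None))
    return encoded


w_idx = [1, 3, 5]       # channel indices of the compressed weight vector
a_idx = [0, 3, 4, 5]    # channel indices of the compressed activation vector

grid = comparator_array(w_idx, a_idx)
row_selects = encode_columns(grid, len(w_idx))
```

Here the first column has no match, so its valid bit is 0, while the second and third columns report rows 1 and 3 respectively.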

Regarding claim 12, claim 12 effectively mirrors claim 4 and is therefore rejected under a similar interpretation.

Regarding claim 13, claim 13 effectively mirrors claim 5 and is therefore rejected under a similar interpretation.

Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Moshovos in view of Kang.

Regarding claim 7, Moshovos teaches The deep learning network accelerator of claim 6, further comprising:
the sequence decoder adapted to separate the encoded sequence into a column sequence for the compressed weight vector and a row sequence for the compressed input activation vector ([¶0034] "FIGS. 4A and 4B depict an example of how the IPU 3100 of accelerator 3000 of FIG. 3 would process activations and steps, and in which N is set to 4 and k is set to 1. Activations are denoted as a_lane^step and weights are denoted as w_lane^step, where lane designates the activation column and weight row they appear at, and step designates the order in time in which they are multiplied." Moshovos explicitly teaches having a column for activations and a row for weights. Using a column for weights and a row for activations is a trivial change and would lead to an obvious and expected outcome.). While Moshovos implicitly teaches iteratively encoding the sequence, Moshovos does not explicitly teach to iteratively encode the column sequence.

Kang teaches to iteratively encode the column sequence ([¶0029] "step b decoding the data received…iterating the above Step a), b), c), d), e) to obtain the RNN activation sequences and computing the output sequence").

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, that matrix sparsity may be exploited for compression, and that the intended use of the accelerator includes column and row selects and transforms. It would further have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the compressed deep neural network system of Moshovos with the iterative encoding disclosed by Kang because of the benefit disclosed by Kang ([¶0157] "the device and method according to the present invention achieves better computation efficiency while reduces processing delay"). Iteratively encoding the column sequence would also be obvious to one of ordinary skill in the art since Moshovos feeds the activation and weight vector sequences into the combination units iteratively.

Regarding claim 15, claim 15 effectively mirrors claim 7 and is therefore rejected under a similar interpretation.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Moshovos in view of Andri ("YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights", 2016).

Regarding claim 19, Moshovos teaches The system of claim 18. However, Moshovos does not explicitly teach wherein the spatial compressor is a spatial adder tree.

Andri discusses a neural network accelerator and teaches wherein the spatial compressor is a spatial adder tree ([p. 238 B] “Considering that with the 12-bit MAC implementation 40% of the total total chip area is used for the filterbank and 40% are needed for the 12×12-bit multipliers and the accumulating adder trees, this leads to an enormously reduced area cost and complexity" area cost and complexity is being interpreted as synonymous with spatial complexity.).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use an adder tree to accelerate a neural network. The combination of Moshovos and Andri would have been obvious because a person of ordinary skill in the art would be able to determine from Andri that using an adder tree in a neural network accelerator "leads to an enormously reduced area cost and complexity" ([p. 238 B]).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Miranda Huang, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124