Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


EXAMINER’S NOTE
Features from Moshovos et al. (US 2021/0125046) that are relied on for the following rejections are supported by provisional applications (62/668,363, filed on May 8, 2018).


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-11 and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Moshovos et al. (US 2021/0125046, Moshovos hereinafter) in view of Kartik Hegde et al., “UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition”, April 18, 2018 (Kartik hereinafter), and Ovsiannikov et al. (US 2019/0392287, Ovsiannikov hereinafter).

As to claim 1, Moshovos teaches a semiconductor package (See FIGs. 3B and 3C) comprising:
 	a plurality of dice (See FIGs. 3B and 3C. Also, see FIG. 9) each comprising:
 	a central controller (e.g., para 40, “a bit-serial engine 3200”);
 	a global memory buffer (e.g., see FIG. 3B. Also, see FIG. 9, “PSpad, 9400”); and
 	a plurality of processing elements (e.g., see FIG. 3B. Also, see FIG. 9, “PE, 9110”) each comprising:
 		a weight buffer (See FIG. 4A) to receive from the central controller weight values (See FIGs. 3A-3C, para 41, “activations and weights are processed bit-serially, engine 3200 results in 16 output activations in P_a × P_w cycles, where P_a and P_w are the activation and weight precisions, respectively”) for a neural network;
 		an activation buffer (See FIG. 4A) to receive activation values for the neural network (See FIGs. 3A-3C, para 41, “activations and weights are processed bit-serially, engine 3200 results in 16 output activations in P_a × P_w cycles, where P_a and P_w are the activation and weight precisions, respectively”);
 		an accumulation memory buffer (e.g., “4600”, FIG. 4A) to collect partial sum values (e.g., “Psum”, FIGs. 4A and 4B, para 53, “Accumulation sub-element 4600 then accumulates the newly received partial sum with any partial sum held in an accumulator.” Also, see FIG. 9, “FIG. 9 is a schematic diagram of a tile 9000 containing 32 processor elements 9100 organized in an 8 by 4 grid. Input scratchpads, i.e., small local memory, provide activation and weight inputs, an activation pad 9200 providing activation inputs and a weight pad 9300 providing weight inputs. In some layers the activation pad 9200 provides weight inputs and the weight pad 9300 provides activation inputs. A third scratchpad, storage pad 9400, is used to store partial or complete output neurons”);
 		a plurality of multiply-accumulate units to combine (see FIGs. 6, 9-11), in parallel (e.g., “inter-value bit-level parallelism”), the weight values and the activation values into the partial sum values (e.g., see FIG. 9, para 71, “Tile 9000 implicitly treats all concurrently processed activation and weight pairs as a group and synchronizes processing across different groups; tile 9000 starts processing the next group when all the processing elements are finished processing all the terms of the current group”, “The concatenated values are then added via a 6-input adder tree to produce a 38b partial sum to be output to accumulation sub-element 4600”, “Embodiments presented above exploit inter-value bit-level parallelism” in paras 56 and 85).
 	However, Moshovos does not explicitly teach each of the multiply-accumulate units comprising: a weight collection buffer disposed in a data flow between the multiply-accumulate unit and the weight buffer; and a partial sum collection buffer disposed in the data flow between the multiply-accumulate unit and the accumulation memory buffer.
 	Kartik teaches processing elements each comprising: a weight buffer to receive from the central controller weight values for a neural network; an activation buffer to receive activation values for the neural network; an accumulation memory buffer to collect partial sum values; a plurality of multiply-accumulate units to combine, in parallel, the weight values and the activation values into the partial sum values (e.g., see pages 5-7, “the PE is made up of an input buffer, weight buffer, partial sum buffer, control logic and MAC unit (the non-grey components in Figure 6)”, “Every element of the filter is element-wise multiplied to every input element in the corresponding region, and the results are accumulated to provide a single partial sum. The partial sum is stored in the local partial sum buffer and is later accumulated with results of the dot products over the next RSCt-size filter tile”, “We assume accumulator ➁’s state is reset at the start of each sub-activation group, so the accumulator implicitly calculates 0 + z here. Both wiT1 and wiT2 read 0s, thus we proceed without further accumulations. 2) iiT reads 6 and wiT1 and wiT2 read 0 and 1, respectively. This means we are at the end of the sub-activation group (for filter k2), but not the activation group (for filter k1). Sum z+m is formed in accumulator ➁, which is sent (1) to accumulator ➂—as this represents the sum of only a part of the activation group for filter k1—and (2) to the MAC unit ➀ to multiply with a for filter k2. 3) Both wiT1 and wiT2 read 0s, accumulator ➁ starts accumulating the sub-activation group containing l. 4) Both wiT1 and wiT2 read 0s, accumulator ➁ builds l+y. 5) Both wiT1 and wiT2 read 1s, signifying the end of both the sub-activation and activation groups. Accumulator ➁ calculates l+y+h, while accumulator ➂ contains z+m for filter k1. The result from accumulator ➁ is sent (1) to the MAC Unit ➀—to multiply with b for filter k2—and (2) to accumulator ➂ to generate z+m+l+y+h. The result from accumulator ➂ finally reaches the MAC Unit ➀ to be multiplied with a” for “reading each input and weight from memory, and performing a multiply-accumulate (MAC) on that input-weight pair”, “store weights”, “store activations” in “I. INTRODUCTION”, pages 1-2, FIG. 2), each of the multiply-accumulate units comprising: a weight collection buffer disposed between the multiply-accumulate unit and the weight buffer; and a partial sum collection buffer disposed between the multiply-accumulate unit and the accumulation memory buffer (e.g., see pages 5-6, “the PE is made up of an input buffer, weight buffer, partial sum buffer, control logic and MAC unit (the non-grey components in Figure 6).”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).

Ovsiannikov teaches each of the multiply-accumulate units comprising: a weight collection buffer disposed in a data flow (e.g., “data transfers”, “an input value to the contents”) between the multiply-accumulate unit and the weight buffer (e.g., see FIG. 1B, para 262, “Each MU 103 may include a plurality of registers, e.g. a register file 127 containing 18 9-bit registers that may be referred to as “weight registers”, and a multiplier 126” for “data transfers” in para 260); and a partial sum collection buffer (e.g., “the register with the sum”) disposed in the data flow between the multiply-accumulate unit and the accumulation memory buffer (e.g., see FIG. 1B, para 262, “adder trees 128A and 128B in each MR column 133 sum up (reduce) resulting products from the sixteen MUs in the column to form a dot product” and “Each MR column also contains accumulators 130A and 130B, one for each adder tree 128A and 128B. As used herein, an “accumulator” is a combination of an adder and a register that is configured to add an input value to the contents of the register, and overwrite the contents of the register with the sum” in para 266). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Moshovos and Kartik by adopting the teachings of Ovsiannikov because “Storing weights in a compressed format may be beneficial to reduce amount of SRAM (and off-chip DDR) storage required to store weights, to reduce SRAM (and off-chip DDR) power associated with fetching weights and to speed up weight loading, in particular during fully-connected layer computation.” (See Ovsiannikov, para 513).
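
For illustration only, the accumulator behavior quoted from Ovsiannikov (an adder and a register that adds an input value to the register contents and overwrites the register with the sum) and the adder-tree reduction of products into a dot product can be sketched in software as follows. This is a minimal sketch and not Ovsiannikov's hardware; the names Accumulator, adder_tree, and mac_column, and the sample values, are hypothetical and do not appear in any cited reference.

```python
class Accumulator:
    """Hypothetical model: an adder plus a register."""

    def __init__(self):
        self.register = 0  # register contents start at zero (assumption)

    def add(self, value):
        # Add the input to the register contents and overwrite the register with the sum.
        self.register = self.register + value
        return self.register


def adder_tree(products):
    """Reduce a list of products to one sum by pairwise addition."""
    level = list(products)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:  # pad odd-length levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def mac_column(weights, activations, accumulator):
    """Multiply weight/activation pairs, reduce them, and accumulate the partial sum."""
    products = [w * a for w, a in zip(weights, activations)]
    return accumulator.add(adder_tree(products))


acc = Accumulator()
first = mac_column([1, 2, 3], [4, 5, 6], acc)   # dot product of the first group
second = mac_column([1, 1], [1, 1], acc)        # added to the held partial sum
```

In this sketch the register plays the role of a buffer holding the running partial sum between successive reductions.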


As to claim 2, Moshovos teaches logic to distribute the weight values and the activation values among the processing elements spatially by a depth of an input of the neural network (e.g., para 86, “depth-separable convolutional layers”), and temporally by a height and a width (See FIG. 2) of the input to the neural network (e.g., para 37, “FIG. 2, which shows a c × x × y input activation block 2100 and a set of N c × h × k filters 2200. The layer dot products each of these N filters (denoted f^0, f^1, ..., f^(N-1)) 2200 by a c × h × k subarray of input activation (or ‘window’), such as window 2300, to generate a single o_h × o_k output activation 2400” and “neural networks required 16b data widths or precisions only for some layers. Some neural networks require 16b data widths or precisions only for the activations, and few values require more than 8b. In some embodiments, a tile supports the worst-case data width required across all layers and all values” in para 77).
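
For illustration only, the layer computation quoted from Moshovos para 37 (each of N filters of size c × h × k is dot-multiplied with a c × h × k window of a c × x × y input activation block) can be sketched as follows. This is a minimal sketch under stated assumptions: a stride of 1 and no padding are assumed here and are not taken from the reference, and the function names window_dot and conv_layer are hypothetical.

```python
def window_dot(filt, acts, row, col):
    """Dot product of one c x h x k filter with the window whose corner is (row, col)."""
    total = 0
    for ch, plane in enumerate(filt):          # depth (channel) dimension
        for i, filt_row in enumerate(plane):   # filter height
            for j, weight in enumerate(filt_row):  # filter width
                total += weight * acts[ch][row + i][col + j]
    return total


def conv_layer(filters, acts):
    """Apply N filters to a c x x x y activation block (stride 1, no padding assumed)."""
    x, y = len(acts[0]), len(acts[0][0])
    h, k = len(filters[0][0]), len(filters[0][0][0])
    o_h, o_k = x - h + 1, y - k + 1            # output height and width
    return [
        [[window_dot(f, acts, r, q) for q in range(o_k)] for r in range(o_h)]
        for f in filters
    ]


# One channel, 3 x 3 input, one 1 x 2 x 2 filter of ones.
activations = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filters = [[[[1, 1], [1, 1]]]]
outputs = conv_layer(filters, activations)
```

In this sketch the channel loop corresponds to the depth of the input, and the loops over output rows and columns step over the height and width of the input.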

As to claim 3, Moshovos teaches wherein the height and the width are image dimensions (e.g., see FIG. 2).

As to claim 4, Moshovos does not teach wherein the multiply-accumulate units of each processing element compute a portion of a wide dot-product-accumulate as a partial result and forward the partial result to neighboring processing elements. However, Kartik teaches wherein the multiply-accumulate units of each processing element compute a portion of a wide dot-product-accumulate as a partial result and forward the partial result to neighboring processing elements (e.g., see FIGs. 5, 6 and 7, pages 4-7, “an accumulator added to sum activation groups for dot product factorization. ➂ is an additional set of accumulators for storing subactivation group partial sums. There are G and G−1 accumulator registers in components ➀ and ➂, respectively”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).

As to claim 5, Moshovos does not teach where the partial results are transformed into a final result by the processing elements and communicated to the global buffer. However, Kartik teaches where the partial results are transformed into a final result by the processing elements and communicated to the global buffer (e.g., see FIG. 5 and “V. ARCHITECTURE AND DATAFLOW”, “the DCNN and UCNN architectures consist of multiple Processing Elements (PEs) connected to a shared global buffer (L2)”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 6, Moshovos teaches logic to utilize the global memory buffer as a second-level buffer for the activation values (e.g., see FIG. 9).

As to claim 7, Moshovos teaches the global buffer staging the final results between layers of the deep neural network (see FIG. 9).

As to claim 8, see rejection of claims 1 and 2 above. Moshovos further teaches a semiconductor chip, to distribute weights and activations for layers of a neural network spatially across the processing elements by input output channel dimensions and temporally by input height and width (See FIG. 2); and the at least one controller to utilize the global memory buffer as a second-level buffer for the activation values (e.g., see FIGs. 3B and 3C).
  
As to claim 9, see rejection of claims 4 and 5 above. 

As to claim 10, Moshovos teaches logic to multicast one or both of the weights and the activations to the processing elements (e.g., see FIGs. 9 and 10).

As to claim 11, see rejection of claim 1 above. However, Moshovos does not teach wherein each of the weight buffer and the activation buffer each comprise a configurable address generator to enable different data flow computations by the multiply-accumulate units. Kartik teaches wherein each of the weight buffer and the activation buffer each comprise a configurable address generator to enable different data flow computations by the multiply-accumulate units (e.g., see FIG. 7, pages 5-7, “Once the data is available in the input and weight buffers, the control unit feeds the datapath with a weight and input element every cycle. They are MACed into a register that stores a partial sum over the convolution operation before writing back to the partial sum buffer” and “performing a multiply-accumulate (MAC) on that input-weight pair”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 13, Moshovos teaches wherein a depth of the weight collector and a depth of the accumulation collector are configurable by the controller (e.g., para 86, “depth-separable convolutional layers” for “both activations and weights”).

As to claim 14, Moshovos does not explicitly teach wherein a number N of the at least one multiply-accumulate units to make operational is configurable by the controller. However, Kartik teaches wherein a number N of the at least one multiply-accumulate units to make operational is configurable by the controller (e.g., see page 6, “A local counter triggers early MACs along with weight buffer ‘peeks’ at group boundaries”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 15, Moshovos does not teach wherein a number V of multiplications and additions by the at least one multiply-accumulate unit per clock cycle is configurable. However, Kartik teaches wherein a number V of multiplications and additions by the at least one multiply-accumulate unit per clock cycle (e.g., “every cycle”) is configurable (e.g., see right column of page 5, “the datapath with a weight and input element every cycle. They are MACed into a register that stores a partial sum over the convolution operation before writing back to the partial sum buffer.” for “MACs along with weight buffer ‘peeks’ at group boundaries”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 16, see rejection of claims 1 and 2 above. Moshovos further teaches a semiconductor neural network processor, comprising: a plurality of chips, each chip (see FIGs. 3B and 3C).

As to claim 17, Moshovos does not explicitly teach the weight buffer having a configurable operational size. However, Kartik teaches the weight buffer having a configurable operational size (e.g., see left column of page 8, “D. Spatial Vectorization”, “a new parameter VW to indicate the spatial vector size” for “the input buffer into VW banks and carefully architect the buffer so that exactly VW activations can be read every cycle”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 18, Moshovos does not explicitly teach the accumulation memory buffer having a configurable operational size. However, Kartik teaches the accumulation memory buffer having a configurable operational size (e.g., see right column of page 5, “The partial sum is stored in the local partial sum buffer and is later accumulated with results of the dot products over the next RSCt-size filter tile.”, “we can vectorize across output channels (amortizing input buffer reads) by replicating the lane and growing the weight buffer capacity”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 19, Moshovos does not explicitly teach wherein a number of operational multiply-accumulate units for the processing elements is configurable. However, Kartik teaches a number of operational multiply-accumulate units for the processing elements is configurable (e.g., see right column of page 6, “A local counter triggers early MACs along with weight buffer ‘peeks’ at group boundaries. In this work, we assume a maximum activation group size of 16. This means we can reduce multiplies by 16× in the best case, and the multiplier is 4 bits wider on one input.”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).


As to claim 20, Moshovos does not explicitly teach each processing element further comprising an activation buffer of configurable operational size. However, Kartik teaches each processing element further comprising an activation buffer of configurable operational size (e.g., see page 6, “the activation group size is  input tile size. Therefore, we set a maximum limit for the activation group size”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Moshovos by adopting the teachings of Kartik to obtain a design that “exploits weight repetition to reduce on-chip multiplies/memory reads and to compress network model size” (See Kartik, VIII. CONCLUSION).
Response to Arguments
	Response to Claim Rejections under 35 U.S.C. § 103 
 	Applicant argues that:
 	“Moshovos in view of Kartik does not teach or obviously suggest that the vector MAC unit (602) includes a weight collection buffer (636) disposed in a data flow between the multiply-accumulate unit (640) of the vector MAC (602) and the weight buffer (604), and also includes a partial sum collection buffer (638) disposed in a data flow between the multiply-accumulate unit (640) of the vector MAC (602) and the accumulation memory buffer (612).”
    
 	In response, Ovsiannikov et al. (US 2019/0392287) is added solely as directly corresponding evidence supporting the prior finding of common knowledge, as stated above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDOU K SEYE whose telephone number is (571)270-1062. The examiner can normally be reached M-F 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hyung SOUGH, can be reached at (571)272-6799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ABDOU K SEYE/Examiner, Art Unit 2194     
/CRAIG C DORAIS/Primary Examiner, Art Unit 2194