DETAILED ACTION
This action is in response to the claims filed 11/01/2021. Claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



Claim(s) 1-3, 5-7, 9-11, 13-15, 17-19 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Du et al., “A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things,” hereinafter Du.

Regarding claim 1
Du teaches, A device for performing computations of a convolutional neural network, the device comprising: a processing chip including: (pg 8 results ¶ 1 “The accelerator was implemented in TSMC 65nm technology and the layout characteristics of the accelerator are shown in Fig. 13… the core can support both arbitrary sized convolution layer and the pooling function, it can be used to accelerate major CNNs”) a first arrangement of a plurality of tensor arrays including circuitry configured for performing computations according to a default convolutional filter size; (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.)
a second arrangement of a plurality of memory cells configured to store outputs of corresponding ones of the tensor arrays; (Figure 3. The ACCU buffer stores the output of the tensor arrays, pg 7 column 2 “The ping-pong buffer is separated into two different subbuffers. During the convolution, only one buffer will be pointed to the accumulator while the other buffer will be connected to the pooling blocks and the readout blocks…When the accumulator finished accumulating one output feature, the ping-pong buffer will switch its sub-buffers directions, pointing the buffer that stores the output feature to the pooling blocks and the readout blocks.” the buffer is composed of two memory cells buffer A and buffer B, they store the output from the convolution as well as the pooling operations, the output of the tensor arrays) a plurality of interconnects connecting particular ones of the tensor arrays to particular ones of the memory cells, wherein each interconnect of the plurality of interconnects connects at least two tensor arrays and at least two memory cells; (Figure 9. As depicted in the figure each CU unit, tensor array, is connected to the ADD blocks which represent the addition in the ACCU buffer. As shown in the figure some of CU units are connected to output A, some are connected to output B. These outputs correspond to the separate sub buffers of the ACCU buffer (See figure 11). Two interconnects each connecting particular tensor arrays to particular memory is depicted in the annotated figure below:

    [Figure: media_image1.png (annotated Figure 9 of Du)]
)
a computer-readable memory storing instructions for configuring the processing chip to perform computations of the convolutional neural network; (Results pg 8 para. 1-4 “The accelerator was implemented in TSMC 65nm technology and the layout characteristics of the accelerator are shown in Fig. 13….The area estimation includes the logic cells, registers, and single port/dual port SRAMs  generated by the ARM compiler…To verify the performance of the accelerator, we have downloaded the hardware accelerator IP into the Xilinx Zynq-7200 FPGA and demonstrate the core’s functions using modified LeNet-5” the accelerator design has memory for storing logic and instructions. The design is tested on a FPGA ) and a controller configured by the instructions to: ( pg 8 Results para. 5 “Through using the DMA controller inside the FPGA, the accelerator can successfully access the data and the weights stored in the DRAM” the controller is part of the FPGA and directs data flow to implement to operations.) determine, for a particular convolution of a convolutional layer of the convolutional neural network,  a particular convolutional filter size used for the particular convolution; (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. 
The algorithm begins with examining the kernel size of the filter” given a particular kernel of a convolutional layer, the algorithm examines the size of that kernel) when the particular convolutional filter size equals the default convolutional filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size; when the particular convolutional filter size is less than the default convolutional filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size padded with zeros such that a padded portion with an unpadded portion of the default convolutional filter corresponds to the particular convolutional filter size; and when the particular convolutional filter size is greater than the default convolutional filter size, configure the processing unit to include multiple tensor arrays to perform the particular convolution. (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” As described, when a particular filter is too large, it is decomposed into filters of the default size. When a filter is too small, it is padded with zeros, as shown in Figure 5. The decomposed filters are used by the CU engine, which contains 16 CU tensor arrays, to perform the particular convolution.)
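
For illustration only, the filter decomposition quoted above from Du can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Du's hardware implementation: the function names (`conv2d_valid`, `decompose`, `conv2d_decomposed`), the use of nested lists, and the bottom/right placement of the zero padding are all hypothetical choices made here. It shows that a kernel zero-padded to a multiple of three and split into 3 × 3 tiles, each carrying a shift address, produces the same output as the original kernel.

```python
def conv2d_valid(image, kernel):
    """Reference 'valid' 2-D sliding-window sum of products on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]


def decompose(kernel, base=3):
    """Zero-pad a square kernel to a multiple of `base` and split it into
    base x base tiles. Each tile is returned with its (row, col) shift,
    i.e. the position of its top-left weight in the padded kernel."""
    k = len(kernel)
    padded = -(-k // base) * base  # round k up to a multiple of base
    ext = [[kernel[i][j] if i < k and j < k else 0 for j in range(padded)]
           for i in range(padded)]
    return [(r0, c0, [row[c0:c0 + base] for row in ext[r0:r0 + base]])
            for r0 in range(0, padded, base)
            for c0 in range(0, padded, base)]


def conv2d_decomposed(image, kernel, base=3):
    """Compute the same result as conv2d_valid using only base x base tiles."""
    k = len(kernel)
    oh, ow = len(image) - k + 1, len(image[0]) - k + 1
    extra = -(-k // base) * base - k
    # Assumption: pad the image with zeros so shifted tiles stay in bounds;
    # these positions only ever meet zero-valued padding weights.
    width = len(image[0]) + extra
    img = [row + [0] * extra for row in image] + [[0] * width for _ in range(extra)]
    out = [[0] * ow for _ in range(oh)]
    for r0, c0, tile in decompose(kernel, base):
        for r in range(oh):
            for c in range(ow):
                out[r][c] += sum(img[r + r0 + i][c + c0 + j] * tile[i][j]
                                 for i in range(base) for j in range(base))
    return out


image = [[(r * 7 + c * 3) % 11 for c in range(7)] for r in range(6)]
kernel5 = [[(i * 5 + j) % 4 - 1 for j in range(5)] for i in range(5)]
tiles = decompose(kernel5)
same = conv2d_decomposed(image, kernel5) == conv2d_valid(image, kernel5)
```

In this sketch a 5 × 5 kernel yields four tiles, while a 3 × 3 kernel and a zero-padded 2 × 2 kernel each yield a single tile, which mirrors the three cases recited in the claim.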

Regarding claim 2 
Du teaches claim 1
Du teaches, wherein the default convolutional filter size is 5x5x1. (Section 3 A “The filter’s kernel size in a typical CNN network can range from very small size (1 × 1) to very large size (11 × 1) …To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU….the accelerator needs to either leave the software to do the computation or add additional hardware unit for large kernel-sized filter convolution.” Although the exemplary hardware is designed for 3x3 convolutional filters, the author notes that additional hardware could be added to suit larger kernels, which include kernels of size 5x5, corresponding to 5x5x1.)

Regarding claim 3 
Du teaches claim 1
Du teaches, wherein the device is configured to: provide input into the processing chip for the processing chip to perform the particular convolution using the processing unit; and provide an output of the processing chip as an output of the convolutional neural network. (Section 3 para. 1 and Figure 3 “The overall streaming architecture of the CNN [convolutional neural network] accelerator is shown in Fig. 3. It is already proved that deep networks can be represented with 16-bit fixed-point number with stochastic rounding and incur little to no degradation in the classification accuracy…The accelerator includes a 96 Kbyte single port SRAM as the buffer bank to store the intermediate data and exchange data with the DRAM. The buffer bank is separated into two sets. One for the input data of the current layer and the other one is to store the output data” The buffer bank provides input data to the accelerator and stores output data, as depicted in Figure 3. The accelerator chip, or CU engine array, performs convolutions for the convolutional neural network.)
Regarding claim 5 
Du teaches claim 1
Du teaches, wherein at least one tensor array of the plurality of tensor arrays includes circuitry to perform a single multiplication operation. (Section 5 A “The CU engine array includes nine processing engines (PE) and an adder to combine the output. The PE provides a multiplication function for its input data and the filter’s weight and meanwhile passes its input data to the next stage’s PE through a D flip-flop….In the 3 × 3 convolution, the multiplied result will send to the adder in the CU to perform the summation and deliver the summed result to the final output” the tensor array contains multiple PE elements each capable of performing multiplication operations.)

Regarding claim 6 
Du teaches claim 1
Du teaches, wherein at least one tensor array of the plurality of tensor arrays includes circuitry to perform a plurality of multiplication operations.  (Section 5 A “The CU engine array includes nine processing engines (PE) and an adder to combine the output. The PE provides a multiplication function for its input data and the filter’s weight and meanwhile passes its input data to the next stage’s PE through a D flip-flop….In the 3 × 3 convolution, the multiplied result will send to the adder in the CU to perform the summation and deliver the summed result to the final output” the tensor array contains multiple PE elements each capable of performing multiplication operations.)
Regarding claim 7 
Du teaches claim 1
Du teaches, wherein the controller is further configured by the instructions to configure the processing chip into a plurality of processing units that collectively perform the computations of multiple layers of the convolutional neural network. (Section 5 A “As described in Section IV, the accelerator uses nine multipliers to form a CU and sixteen CUs to compose a CU engine.” The CU engine is composed of 16 processing units, each composed of 9 multipliers. Section Results para. 3 “To verify the performance of the accelerator, we have downloaded the hardware accelerator IP into the Xilinx Zynq-7200 FPGA and demonstrate the core’s functions using modified LeNet-5” The system is used to implement LeNet-5, which is a multi-layer convolutional neural network.)
Regarding claim 9 
Du teaches, A method for performing computations of a neural network, (pg 8 results para. 1 “The accelerator was implemented in TSMC 65nm technology and the layout characteristics of the accelerator are shown in Fig. 13… the core can support both arbitrary sized convolution layer and the pooling function, it can be used to accelerate major CNNs”) identifying a default filter size of a plurality of tensor arrays included in a processing chip, (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.) wherein the processing chip further includes a plurality of memory cells for storing outputs of corresponding ones of the tensor arrays (Figure 3. The ACCU buffer stores the output of the tensor arrays,  pg 7 column 2 “The ping-pong buffer is separated into two different subbuffers. During the convolution, only one buffer will be pointed to the accumulator while the other buffer will be connected to the pooling blocks and the readout blocks…When the accumulator finished accumulating one output feature, the ping-pong buffer will switch its sub-buffers directions, pointing the buffer that stores the output feature to the pooling blocks and the readout blocks.” the buffer is composed of two memory cells buffer A and buffer B, they store the output from the convolution as well as the pooling operations.) a plurality of interconnects connecting particular ones of the tensor arrays to particular ones of the memory cells, and wherein each interconnect of the plurality of interconnects connects at least two tensor arrays and at least two memory cells (Figure 9. 
As depicted in the figure each CU unit, tensor array, is connected to the ADD blocks which represent the addition in the ACCU buffer. As shown in the figure some of CU units are connected to output A, some are connected to output B. These outputs correspond to the separate sub buffers of the ACCU buffer (See figure 11). Two interconnects each connecting particular tensor arrays to particular memory is depicted in the annotated figure below:

    [Figure: media_image1.png (annotated Figure 9 of Du)]
)
determining, for a particular set of operations of a layer of the neural network, a particular filter size used for the particular set of operations; (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the Filter” given a particular kernel of convolutional layer examine the particular size of the kernel) determining that a particular filter size is greater than the default filter size; and in response to determining that the particular filter size is greater than the default filter size, configuring a processing unit to include multiple tensor arrays to perform the particular set of operations.  (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” as described when a particular filter is too large it is decomposed into filters of the default size. When a filter is too small it is padded with zeros as shown in figure 5. 
The decomposed filters are used by the CU engine, which contains 16 CU tensor arrays, to perform the particular convolution.)
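
For illustration only, the ping-pong ACCU buffer behavior quoted from Du in the rejection above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `PingPongBuffer`, its method names, and the choice to clear a sub-buffer when it is handed back to the accumulator are hypothetical and are not taken from Du. The sketch shows one sub-buffer pointed at the accumulator while the other is exposed for readout, with the roles switching once an output feature is complete.

```python
class PingPongBuffer:
    """Two sub-buffers: one accumulates, the other is exposed for readout."""

    def __init__(self, size):
        self._banks = [[0] * size, [0] * size]
        self._write = 0  # index of the sub-buffer pointed at the accumulator

    def accumulate(self, index, value):
        """Add a partial sum into the sub-buffer facing the accumulator."""
        self._banks[self._write][index] += value

    def swap(self):
        """Switch directions: the finished sub-buffer becomes readable.
        Assumption: the sub-buffer returning to the accumulator is cleared."""
        self._write ^= 1
        self._banks[self._write] = [0] * len(self._banks[self._write])

    def read(self):
        """Return a copy of the sub-buffer facing the pooling/readout side."""
        return list(self._banks[self._write ^ 1])


buf = PingPongBuffer(2)
buf.accumulate(0, 3)
buf.accumulate(0, 4)
buf.accumulate(1, 5)
buf.swap()                 # first output feature is complete
first_feature = buf.read()
buf.accumulate(0, 1)       # next feature accumulates without disturbing readout
still_first = buf.read()
```

The design point the sketch makes is that accumulation of the next output feature does not disturb the feature currently being read out.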
Regarding claim 10
	Claim 10 is rejected for the reasons set forth in claim 9 and claim 2
Regarding claim 11
	Claim 11 is rejected for the reasons set forth in claim 9 and claim 3
Regarding claim 13
	Claim 13 is rejected for the reasons set forth in claim 9 and claim 5
Regarding claim 14
	Claim 14 is rejected for the reasons set forth in claim 9 and claim 6
Regarding claim 15
	Claim 15 is rejected for the reasons set forth in claim 9 and claim 7

Regarding claim 17
Du teaches,  A controller comprising one or more processors: (pg 8 results para. 1 “The accelerator was implemented in TSMC 65nm technology and the layout characteristics of the accelerator are shown in Fig. 13… the core can support both arbitrary sized convolution layer and the pooling function, it can be used to accelerate major CNNs”) identify a default filter size of a plurality of tensor arrays included in a processing chip (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.)
wherein the processing chip further includes a plurality of memory cells for storing outputs of corresponding ones of the tensor arrays (Figure 3. The ACCU buffer stores the output of the tensor arrays,  pg 7 column 2 “The ping-pong buffer is separated into two different subbuffers. During the convolution, only one buffer will be pointed to the accumulator while the other buffer will be connected to the pooling blocks and the readout blocks…When the accumulator finished accumulating one output feature, the ping-pong buffer will switch its sub-buffers directions, pointing the buffer that stores the output feature to the pooling blocks and the readout blocks.” the buffer is composed of two memory cells buffer A and buffer B, they store the output from the convolution as well as the pooling operations.) a plurality of interconnects connecting particular ones of the tensor arrays to particular ones of the memory cells, and wherein each interconnect of the plurality of interconnects connects at least two tensor arrays and at least two memory cells (Figure 9. As depicted in the figure each CU unit, tensor array, is connected to the ADD blocks which represent the addition in the ACCU buffer. As shown in the figure some of CU units are connected to output A, some are connected to output B. These outputs correspond to the separate sub buffers of the ACCU buffer (See figure 11). Two interconnects each connecting particular tensor arrays to particular memory is depicted in the annotated figure below:

    [Figure: media_image1.png (annotated Figure 9 of Du)]
)
determine, for a particular set of operations of a layer of a neural network, a particular filter size used for the particular set of operations (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter” given a particular kernel of a convolutional layer, the algorithm examines the size of that kernel) when the particular filter size equals the default filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size; when the particular filter size is less than the default filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size padded with zeros such that an unpadded portion of the default filter corresponds to the particular filter size; and when the particular filter size is greater than the default filter size, configure the processing unit to include multiple tensor arrays to perform the particular set of operations. (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” As described, when a particular filter is too large, it is decomposed into filters of the default size. When a filter is too small, it is padded with zeros, as shown in Figure 5. The decomposed filters are used by the CU engine, which contains 16 CU tensor arrays, to perform the particular convolution.)

Regarding claim 18
	Claim 18 is rejected for the reasons set forth in claim 17 and claim 2
Regarding claim 19
	Claim 19 is rejected for the reasons set forth in claim 17 and claim 3

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 4, 8, 12, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Du, further in view of Young et al., US 20180165577 A1, hereinafter Young.

Regarding claim 4
	Du teaches claim 1
Du teaches,  wherein the controller is further configured by the instructions to configure a particular tensor array to perform a computation of a fully connected layer of the convolutional neural network (pg 2 Column 1 “Together with the integrated pooling function, our proposed accelerator architecture can support completed one-stop CNN acceleration with both arbitrarily sized convolution and reconfigurable pooling.” )
Du does not explicitly teach, by instructing the tensor array to use a center value of the default convolutional filter for processing input data and pad remaining values of the default convolutional filter with zeros.  
Young when addressing tensor operations in a systolic computation array teaches, by instructing the tensor array to use a center value of the default convolutional filter for processing input data and pad remaining values of the default convolutional filter with zeros.  ( para. 071 “In some implementations, the neural network implementation engine 150 of the system may zero-pad the input tensor, and may provide the zero-padded input tensor to the special-purpose hardware circuit 110.” 0072 “For example, for an 8×8 input tensor and a 3×3 window for an average pooling layer, a zero-padded input tensor would be a 10×10 tensor”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate zero padding of tensor data with a center value relating to the default filter, as taught by Young, into the disclosed invention of Du.
	One of ordinary skill in the art would have been motivated to make this modification in order to implement hardware that “allows for an inference of a neural network that includes an average pooling layer to be determined efficiently without modifying the hardware architecture of the special-purpose hardware circuit.” (¶0010 Young)
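
For illustration only, the claim 4 limitation discussed above (using a center value of the default filter and padding the remaining values with zeros) can be sketched as follows. This is one plausible reading of the claim language, not an implementation taken from Du, Young, or the application; the function names `cu_3x3`, `center_only_kernel`, and `fully_connected_neuron`, and the arbitrary `neighbours` fill value, are hypothetical. The sketch shows that a 3 × 3 multiply-and-sum unit loaded with a center-only kernel reduces to a single weight-times-input product, so accumulating such products yields the dot product of a fully connected neuron.

```python
def cu_3x3(window, kernel):
    """Nine multiplications followed by a sum, as in a 3 x 3 convolution unit."""
    return sum(window[i][j] * kernel[i][j] for i in range(3) for j in range(3))


def center_only_kernel(weight):
    """3 x 3 kernel holding `weight` at the center and zeros elsewhere."""
    kernel = [[0] * 3 for _ in range(3)]
    kernel[1][1] = weight
    return kernel


def fully_connected_neuron(inputs, weights, neighbours=99):
    """Dot product of inputs and weights computed only through cu_3x3.
    `neighbours` fills the non-center window positions with an arbitrary
    value to show that the zero weights cancel it out."""
    total = 0
    for x, w in zip(inputs, weights):
        window = [[neighbours] * 3 for _ in range(3)]
        window[1][1] = x
        total += cu_3x3(window, center_only_kernel(w))
    return total


neuron_output = fully_connected_neuron([1, 2, 3], [4, 5, 6])
```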

Regarding claim 8
	Du teaches claim 1
Du teaches, Outputs generated by a first subset of the processing units for one layer of the convolutional neural network to a second subset of processing units assigned to a next layer of the convolutional neural network. (Section 3 para. 5 “After finishing the computation of the 1st feature, the CNN accelerator will duplicate the convolution procedure described above with updated filter weights from the DRAM, to generate the next output feature. This procedure will be continuously reproduced till all the features are calculated. The overall diagram showing this procedure is drawn as Fig. 4.” As shown in figure 4, the engine for computing filter computations for layers, is output through a buffer and passed to the processing unit assigned to the next layer of computation.)
Du does not explicitly teach, wherein the plurality of processing units form an array, wherein the array comprises a plurality of systolic transfer structures to systolically transfer outputs
Young when addressing tensor operations in a systolic computation array teaches, wherein the plurality of processing units form an array, wherein the array comprises a plurality of systolic transfer structures to systolically transfer outputs. (0061 “The result is that the systolic array 406 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights.” 0026 “This specification describes special-purpose hardware circuitry that processes neural network layers, and optionally performs pooling on outputs of one or more neural network layers.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate systolic transfer between processing elements to implement neural network operations, as taught by Young, into the disclosed invention of Du.
	One of ordinary skill in the art would have been motivated to make this modification in order to implement hardware that “allows for an inference of a neural network that includes an average pooling layer to be determined efficiently without modifying the hardware architecture of the special-purpose hardware circuit.” (¶0010 Young)

Regarding claim 12
	Claim 12 is rejected for the reasons set forth in claim 9 and claim 4
Regarding claim 16
	Claim 16 is rejected for the reasons set forth in claim 9 and claim 13
Regarding claim 20
	Claim 20 is rejected for the reasons set forth in claim 9 and claim 4

Response to Arguments
Applicant's arguments filed 11/01/2021 have been fully considered but they are not persuasive. 
First, Applicant states that “Du…is not seen to disclose ‘each interconnect connecting at least two CUs and at least two memory cells’”; instead, the “CUs… provide their output in parallel to the ACCU buffer, rather than being interconnected to other CUs and memory cells.” Examiner notes that the present claims do not require that the individual CUs be connected directly to “other CUs.” The CUs in the prior art, which “provide their output in parallel to the ACCU buffer,” do read on the claim limitation when given the broadest reasonable interpretation. The present claims simply require a plurality of tensor arrays (CUs), a plurality of memory cells (ACCU buffer cells), and a plurality of interconnects connecting the tensor arrays and the memory cells. As described in the rejections of claims 1, 9, and 17, the data flow diagram depicts the connections between multiple tensor arrays and at least two memory cells. Although the CU elements do not appear to feed into each other in a systolic manner, the processing elements inside each CU element are connected to each other systolically.
Second, Applicant states that Young also does not teach these interconnections, stating “Young…not seen to disclose…at least two matrix computation units and at least two memory cells.” Examiner agrees that this appears to be true; however, as described previously, Du teaches the elements claimed.


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 4:00 pm (EST).
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at telephone number (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
	
/J.R.G./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122