DETAILED ACTION
This action is in response to the claims filed 01/05/2022. Claims 1-20 are pending and have been examined. Claims 1, 9, and 17 were amended. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/10/2021 has been entered.

Response to Arguments
Applicant’s arguments with respect to independent claim(s) 1, 9 and 17 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Examiner agrees the amendments are not taught by the previously applied art Du et al. or Young et al. However, the art Shafiee et al. teaches the amended limitations.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars” (hereinafter Shafiee), in view of Du et al., “A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things” (hereinafter Du).



Regarding claim 1
Shafiee teaches, A device for performing computations of a convolutional neural network, the device comprising: a processing chip including: (pg 3 section 3 “We first present an overview of the ISAAC architecture… At a high level (Figure 2), an ISAAC chip is composed of a number of tiles” pg 4 ¶02 “During inference, inputs are provided to ISAAC through an I/O interface and routed to the tiles implementing the first layer of the CNN” the processing chip performs computations for at least the first layer of the convolutional neural network) a first arrangement of a plurality of tensor arrays (pg 3 Section 3 ¶01 “Each tile is composed of…a number of in-situ multiply-accumulate (IMA) units” a collection of tensor processors form a “tensor array”. Each tile contains a tensor array, thus the chip includes a plurality of tensor arrays.) a second arrangement of a plurality of memory cells configured to store outputs of corresponding tensor arrays;  (pg 3 section 3 ¶01 “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units, and output registers to aggregate results, all connected with a shared bus” the output registers correspond to the memory cells as shown in figure 2, each tile contains an output register to store outputs.) a plurality of intra-element buses, each intra-element bus  connecting a tensor array to a memory cell to form a core  a plurality of inter-element buses, each inter-element bus connecting at least two cores; (See annotated Figure 2. Pg 4 ¶02 “At a high level, ISAAC implements a hierarchy of chips/tiles/IMAs/arrays and c-mesh/bus… The I/O interface is also used to communicate with other ISAAC chips” examiner notes that although the figure only indicates a single chip, the ISAAC implements “a hierarchy of chips” thus a plurality of chips each containing the Inter element bus shown.
[Figure: annotated Figure 2 of Shafiee (media_image1.png)]
)

Shafiee does not explicitly teach, circuitry configured for performing computations according to a default convolutional filter size;  a computer-readable memory storing instructions for configuring the processing chip to perform computations of the convolutional neural network; and a controller configured by the instructions to: determine, for a particular convolution of a convolutional layer of the convolutional neural network, a particular convolutional filter size used for the particular convolution; 2Attorney Docket No.: WDA-363 1 *C-US when the particular convolutional filter size equals the default convolutional filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size; when the particular convolutional filter size is less than the default convolutional filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size padded with zeros such that a padded portion with an unpadded portion of the default convolutional filter corresponds to the particular convolutional filter size; and when the particular convolutional filter size is greater than the default convolutional filter 
	Du however when addressing convolutional accelerators that supports arbitrary convolutional window size teaches, circuitry configured for performing computations according to a default convolutional filter size; (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.) a computer-readable memory storing instructions for configuring the processing chip to perform computations of the convolutional neural network; (Results pg 8 para. 1-4 “The accelerator was implemented in TSMC 65nm technology and the layout characteristics of the accelerator are shown in Fig. 13….The area estimation includes the logic cells, registers, and single port/dual port SRAMs  generated by the ARM compiler…To verify the performance of the accelerator, we have downloaded the hardware accelerator IP into the Xilinx Zynq-7200 FPGA and demonstrate the core’s functions using modified LeNet-5” the accelerator design has memory for storing logic and instructions. The design is tested on a FPGA ) and a controller configured by the instructions to: ( pg 8 Results para. 5 “Through using the DMA controller inside the FPGA, the accelerator can successfully access the data and the weights stored in the DRAM” the controller is part of the FPGA and directs data flow to implement to operations.) determine, for a particular convolution of a convolutional layer of the convolutional neural network, a particular convolutional filter size used for the particular convolution; (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. 
The algorithm begins with examining the kernel size of the Filter” given a particular kernel of a convolutional layer, the algorithm examines the size of that kernel) when the particular convolutional filter size equals the default convolutional filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size; when the particular convolutional filter size is less than the default convolutional filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular convolution using the default convolutional filter size padded with zeros such that a padded portion with an unpadded portion of the default convolutional filter corresponds to the particular convolutional filter size; and when the particular convolutional filter size is greater than the default convolutional filter size, configure the processing unit to include multiple tensor arrays to perform the particular convolution. (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” As described, when a particular filter is too large it is decomposed into filters of the default size. When a filter is too small it is padded with zeros, as shown in figure 5. The decomposed filters are used by the CU engine containing 16 CU tensor arrays to perform the particular convolution.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the reconfigurable accelerator as taught by Du into the disclosed invention of Shafiee.
One of ordinary skill in the art would have been motivated to combine these references because both references disclose convolutional neural network accelerators composed of convolutional engines which are in turn composed of processing elements. Du improves the neural accelerator of Shafiee by utilizing a framework which “optimizes the energy efficiency by reducing unnecessary data movement. It also supports arbitrary window sized convolution by using filter decomposition technique… The result shows that this accelerator can support most popular CNNs and achieve 434GOPS/W energy efficiency” (Du Conclusion).
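For illustration only, the filter decomposition quoted from Du above can be sketched in a few lines of Python. This is not code from Du, Shafiee, or the application. The function names (`pad_kernel`, `decompose`, `conv_direct`, `conv_decomposed`) are hypothetical, and placing the zero padding on the bottom and right edges of the kernel is an assumption, since the quoted passage says only that padding is added at the kernel boundary.

```python
from typing import List, Tuple

Matrix = List[List[float]]

UNIT = 3  # kernel size one convolution unit supports natively (per the Du quotation)


def pad_kernel(kernel: Matrix, unit: int = UNIT) -> Matrix:
    """Zero-pad a square kernel (bottom/right, an assumption) to a multiple of `unit`."""
    k = len(kernel)
    padded = -(-k // unit) * unit  # round k up to the next multiple of unit
    return [
        [kernel[r][c] if r < k and c < k else 0.0 for c in range(padded)]
        for r in range(padded)
    ]


def decompose(kernel: Matrix, unit: int = UNIT) -> List[Tuple[int, int, Matrix]]:
    """Split the padded kernel into unit x unit sub-kernels, each tagged with the
    (row, column) shift of its top-left weight in the original kernel."""
    big = pad_kernel(kernel, unit)
    parts = []
    for r0 in range(0, len(big), unit):
        for c0 in range(0, len(big), unit):
            sub = [row[c0:c0 + unit] for row in big[r0:r0 + unit]]
            parts.append((r0, c0, sub))
    return parts


def mac_at(image: Matrix, kernel: Matrix, top: int, left: int) -> float:
    """Multiply-accumulate `kernel` over the window at (top, left).
    Pixels outside the image are skipped; they only ever meet padded zero weights."""
    total = 0.0
    for r, row in enumerate(kernel):
        for c, w in enumerate(row):
            y, x = top + r, left + c
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += w * image[y][x]
    return total


def conv_direct(image: Matrix, kernel: Matrix) -> Matrix:
    """Reference result: slide the full-size kernel over the image (no padding)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[mac_at(image, kernel, y, x) for x in range(cols)] for y in range(rows)]


def conv_decomposed(image: Matrix, kernel: Matrix, unit: int = UNIT) -> Matrix:
    """Same result, computed only with unit x unit sub-kernels and their shifts."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    parts = decompose(kernel, unit)
    return [
        [sum(mac_at(image, sub, y + dr, x + dc) for dr, dc, sub in parts)
         for x in range(cols)]
        for y in range(rows)
    ]
```

Under these assumptions, a 5 × 5 kernel is padded to 6 × 6 and split into four 3 × 3 sub-kernels with shifts (0, 0), (0, 3), (3, 0), and (3, 3), and `conv_decomposed` returns the same values as `conv_direct` because the added weights are zero.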

Regarding claim 2 
Shafiee/Du teaches claim 1
Shafiee does not explicitly teach, wherein the default convolutional filter size is 5x5x1
Further Du teaches, wherein the default convolutional filter size is 5x5x1. (Section 3 A “The filter’s kernel size in a typical CNN network can range from very small size (1 × 1) to very large size (11 × 11) …To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU….the accelerator needs to either leave the software to do the computation or add additional hardware unit for large kernel-sized filter convolution.” Although the exemplary hardware is designed for 3x3 convolutional filters, the author notes that additional hardware could be added to suit larger kernels, which include kernels of size 5x5, corresponding to 5x5x1.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the specified convolutional filter size as taught by Du into the disclosed invention of Shafiee.
One of ordinary skill in the art would have been motivated to combine these references because implementing convolutional neural networks on physical hardware requires circuitry for implementing filter/kernel operations. While Du teaches kernel hardware for a 3x3x1 filter, it would be obvious for one of ordinary skill in the art to implement a filter circuit with a size chosen from the finite set of integer sizes in the range of 1 to 11, which would include a filter of size 5x5x1. “The filter’s kernel size in a typical CNN network can range from very small size (1 × 1) to very large size (11 ×11)… the accelerator needs add additional hardware unit for large kernel-sized filter convolution” (Du section 3A)

Regarding claim 3 
Shafiee/Du teaches claim 1
Further Shafiee teaches, wherein the device is configured to: provide input into the processing chip for the processing chip to perform the particular convolution using the processing unit; and provide an output of the processing chip as an output of the convolutional neural network. (Shafiee pg 4 column 1 para. 2 “During inference, inputs are provided to ISAAC through an I/O interface and routed to the tiles implementing the first layer of the CNN…. The dot-product operations involved in convolutional and classifier layers are performed on crossbar arrays; The aggregated result is then sent through the sigmoid operator and stored in the eDRAM banks of the tiles processing the next layer. The process continues until the final layer generates an output that is sent to the I/O interface” Inputs are provided to the chip to perform convolutions. The output of the final layer is the output of the CNN.)

Regarding claim 5 
Shafiee/Du teaches claim 1
Further Shafiee teaches, wherein at least one tensor array of the plurality of tensor arrays includes circuitry to perform a single multiplication operation. (Shafiee pg 3 column 2 para. 2 “The crossbar shown in Figure 1b achieves very high levels of parallelism… it performs vector-matrix multiplication in a single step.” Section 3 para. 1 “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units” each IMA unit performs multiplication operations, the IMA is circuitry contained within a tensor array. See also Du.)

Regarding claim 6 
Shafiee/Du teaches claim 1
Further Shafiee teaches, wherein at least one tensor array of the plurality of tensor arrays includes circuitry to perform a single multiplication operation. (Shafiee pg 3 column 2 para. 2 “The crossbar shown in Figure 1b achieves very high levels of parallelism… it performs vector-matrix multiplication in a single step.” Section 3 para. 1 “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units” each IMA unit performs multiple multiplication operations, the IMA is circuitry contained within a tensor array. And the IMA unit contains circuits described in Figure 1b which perform vector matrix multiplications. See also Du.)

Regarding claim 7 
Shafiee/Du teaches claim 1
Further Shafiee teaches, wherein the controller is further configured by the instructions to configure the processing chip into a plurality of processing units that collectively perform the computations of multiple layers of the convolutional neural network. ( pg 4 column 2 para. 1 “the tiles/IMAs of ISAAC have to be partitioned across the different CNN layers. For example, tiles 0-3 may be assigned to layer 0, tiles 4-11 may be assigned to layer 1, and so on” the plurality of tiles collectively are assigned different operations according to different layers of the CNN.)
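For illustration only, the tile partitioning quoted from Shafiee above (tiles 0-3 assigned to layer 0, tiles 4-11 assigned to layer 1) can be sketched as follows. The helper `partition_tiles` and its argument are hypothetical and are not taken from Shafiee; the sketch assumes tiles are handed out in consecutive index order.

```python
from typing import Dict, List


def partition_tiles(tiles_per_layer: List[int]) -> Dict[int, range]:
    """Assign consecutive tile indices to each layer, in layer order."""
    assignment: Dict[int, range] = {}
    next_tile = 0
    for layer, count in enumerate(tiles_per_layer):
        assignment[layer] = range(next_tile, next_tile + count)
        next_tile += count
    return assignment


# Example matching the quoted passage: 4 tiles for layer 0, 8 tiles for layer 1.
layout = partition_tiles([4, 8])
for layer, tiles in layout.items():
    print(f"layer {layer}: tiles {tiles.start}-{tiles.stop - 1}")
# prints "layer 0: tiles 0-3" then "layer 1: tiles 4-11"
```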

Regarding claim 9 
Shafiee teaches, A method for performing computations of a neural network, the method comprising: (pg 3 section 3 “We first present an overview of the ISAAC architecture… At a high level (Figure 2), an ISAAC chip is composed of a number of tiles” pg 4 ¶02 “During inference, inputs are provided to ISAAC through an I/O interface and routed to the tiles implementing the first layer of the CNN” the processing chip performs computations for at least the first layer of the convolutional neural network)  wherein the processing chip further includes a plurality of cores , and wherein each core  of the plurality of cores includes a tensor array connected to a memory cell  (See annotated Figure 2. Pg 4 ¶02 “At a high level, ISAAC implements a hierarchy of chips/tiles/IMAs/arrays and c-mesh/bus… The I/O interface is also used to communicate with other ISAAC chips” examiner notes that although the figure only indicates a single chip, the ISAAC implements “a hierarchy of chips” thus a plurality of chips each containing the Inter element bus shown.
[Figure: annotated Figure 2 of Shafiee (media_image1.png)]
) configuring a processing unit to include multiple tensor arrays from different cores to perform the particular set of operations. (pg 4 column 2 ¶01 “Therefore, unlike DaDianNao, the tiles/IMAs of ISAAC have to be partitioned across the different CNN layers. For example, tiles 0-3 may be assigned to layer 0, tiles 4-11 may be assigned to layer 1, and so on” the tiles or cores in the processing unit are assigned particular sets of operations across cores. Cores 0-3 operate according to the calculation of layer 0, whereas tiles 4-11 operate according to layer 1.)
Shafiee does not explicitly teach, identifying a default filter size of a plurality of tensor arrays included in a processing chip, determining, for a particular set of operations of a layer of the neural network, a particular filter size used for the particular set of operations; determining that a particular filter size is greater than the default filter size; and in response to determining that the particular filter size is greater than the default filter size, 
Du however when addressing convolutional accelerators that supports arbitrary convolutional window size teaches,  identifying a default filter size of a plurality of tensor arrays included in a processing chip, (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.) determining, for a particular set of operations of a layer of the neural network, a particular filter size used for the particular set of operations; (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the Filter” given a particular kernel of convolutional layer examine the particular size of the kernel) determining that a particular filter size is greater than the default filter size; and in response to determining that the particular filter size is greater than the default filter size, (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. 
Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” As described, when a particular filter is too large it is decomposed into filters of the default size. When a filter is too small it is padded with zeros, as shown in figure 5. The decomposed filters are used by the CU engine containing 16 CU tensor arrays to perform the particular convolution.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the reconfigurable accelerator as taught by Du into the disclosed invention of Shafiee.
One of ordinary skill in the art would have been motivated to combine these references because both references disclose convolutional neural network accelerators composed of convolutional engines which are in turn composed of processing elements. Du improves the neural accelerator of Shafiee by utilizing a framework which “optimizes the energy efficiency by reducing unnecessary data movement. It also supports arbitrary window sized convolution by using filter decomposition technique… The result shows that this accelerator can support most popular CNNs and achieve 434GOPS/W energy efficiency” (Du Conclusion).




Regarding claim 8
	Shafiee/Du teach claim 7
	Further Shafiee teaches, wherein the plurality of processing units form an array, wherein the array comprises a plurality of systolic transfer structures to systolically transfer outputs, Outputs generated by a first subset of the processing units for one layer of the convolutional neural network to a second subset of processing units assigned to a next layer of the convolutional neural network. (pg 4 column 2 para. 1 “For example, tiles 0-3 may be assigned to layer 0, tiles 4-11 may be assigned to layer 1, and so on…. In this case, tiles 0-3 would store all weights for layer 0 and perform all layer 0 computations in parallel. The outputs of layer 0 are sent to some of tiles 4-11; once enough layer 0 outputs are buffered, tiles 4-11 perform the necessary layer 1 computations, and so on.” The tiles or subset of processing units are designated for one layer, the outputs generated are then passed to a second subset of tiles which perform the next layer or composition of layers. This data flow pattern of passing the output into the input of another processing unit in a chain, or array, of processing units amounts to a systolic transfer. )

Regarding claim 10
	Claim 10 is rejected for the reasons set forth in claim 9 and claim 2
Regarding claim 11
	Claim 11 is rejected for the reasons set forth in claim 9 and claim 3
Regarding claim 13
	Claim 13 is rejected for the reasons set forth in claim 9 and claim 5

Regarding claim 14
	Claim 14 is rejected for the reasons set forth in claim 9 and claim 6
Regarding claim 15
	Claim 15 is rejected for the reasons set forth in claim 9 and claim 7
Regarding claim 16
	Claim 16 is rejected for the reasons set forth in claim 9 and claim 8

Regarding claim 17
Shafiee teaches, A controller comprising one or more processors configured (pg 3 section 3 “We first present an overview of the ISAAC architecture… At a high level (Figure 2), an ISAAC chip is composed of a number of tiles” pg 4 ¶02 “During inference, inputs are provided to ISAAC through an I/O interface and routed to the tiles implementing the first layer of the CNN” the processing chip performs computations for at least the first layer of the convolutional neural network) a plurality of tensor arrays included in a processing chip, (pg 3 Section 3 ¶01 “Each tile is composed of…a number of in-situ multiply-accumulate (IMA) units” a collection of tensor processors form a “tensor array”. Each tile contains a tensor array, thus the chip includes a plurality of tensor arrays.) wherein the processing chip further includes a plurality of cores and wherein each core of the plurality of cores connects a tensor array and a memory cell (See annotated Figure 2. Pg 4 ¶02 “At a high level, ISAAC implements a hierarchy of chips/tiles/IMAs/arrays and c-mesh/bus… The I/O interface is also used to communicate with other ISAAC chips” examiner notes that although the figure only indicates a single chip, the ISAAC implements “a hierarchy of chips” thus a plurality of chips each containing the Inter element bus shown.
[Figure: annotated Figure 2 of Shafiee (media_image1.png)]
)

Shafiee does not explicitly teach, identify a default filter size of a plurality of tensor arrays; determine, for a particular set of operations of a layer of a neural network, a particular filter size used for the particular set of operations; when the particular filter size equals the default filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size; when the particular filter size is less than the default filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size padded with zeros such that an unpadded portion of the default filter corresponds to the particular filter size; and when the particular filter size is greater than the default filter size, configure the processing unit to include multiple tensor arrays to perform the particular set of operations.

Du however when addressing convolutional accelerators that supports arbitrary convolutional window size teaches, identify a default filter size of a plurality of tensor arrays (Fig. 3 and pg 3 column 2 “CU engine is composed of sixteen convolution units to enable highly parallel convolution computation. Each unit can support the convolution with a kernel size up to three” the convolutional engine is composed of 16 convolution units or tensor arrays capable of performing computations of a default size, that size being 3.) determine, for a particular set of operations of a layer of a neural network, a particular filter size used for the particular set of operations (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the Filter” given a particular kernel of convolutional layer examine the particular size of the kernel) when the particular filter size equals the default filter size, configure a processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size; when the particular filter size is less than the default filter size, configure the processing unit to include one of the tensor arrays, and configure the processing unit to perform the particular set of operations using the default filter size padded with zeros such that an unpadded portion of the default filter corresponds to the particular filter size; and when the particular filter size is greater than the default filter size, configure the processing unit to include multiple tensor arrays to perform the particular set of operations. (pg 4 Section A para. 2 “To minimize the hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel sized (>3 × 3) convolution through using only 3×3-sized CU. The algorithm begins with examining the kernel size of the filter. If the original filter’s kernel size is not an exact multiple of three, zero padding weights will be added in the original filter’s kernel boundary to extend the original filter’s kernel size to be a multiple of three. Because the added weights in the boundary are 0, so the extended filter will result in same output value compared with the original filter during the computation. Next, the extended filters will be decomposed into several 3 × 3-sized filters. Each filter will be assigned a shift address based on its top left weight’s relative position in the original filter. For example, Fig. 5 is an example of decomposing a 5 × 5 filter into four 3 × 3 filters. One row and column zero padding are added in the original filter” As described, when a particular filter is too large it is decomposed into filters of the default size. When a filter is too small it is padded with zeros, as shown in figure 5. The decomposed filters are used by the CU engine containing 16 CU tensor arrays to perform the particular convolution.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the reconfigurable accelerator as taught by Du into the disclosed invention of Shafiee.
One of ordinary skill in the art would have been motivated to combine these references because both references disclose convolutional neural network accelerators composed of convolutional engines which are in turn composed of processing elements. Du improves the neural accelerator of Shafiee by utilizing a framework which “optimizes the energy efficiency by reducing unnecessary data movement. It also supports arbitrary window sized convolution by using filter decomposition technique… The result shows that this accelerator can support most popular CNNs and achieve 434GOPS/W energy efficiency” (Du Conclusion).
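For illustration only, the three filter-size cases recited in claim 17 (equal to, less than, and greater than the default filter size) can be sketched as a small decision routine. The names `UnitPlan` and `plan_processing_unit` are hypothetical. The count of tensor arrays used in the "greater than" case is an assumption borrowed from the Du decomposition quoted above (a 5 × 5 filter split into four 3 × 3 filters); the claim itself recites only "multiple tensor arrays".

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class UnitPlan:
    tensor_arrays: int  # number of tensor arrays grouped into the processing unit
    zero_padded: bool   # whether filter weights are padded with zeros


def plan_processing_unit(filter_size: int, default_size: int) -> UnitPlan:
    """Pick a processing-unit configuration for a square filter of `filter_size`."""
    if filter_size == default_size:
        # Case 1: the filter fits one tensor array exactly.
        return UnitPlan(tensor_arrays=1, zero_padded=False)
    if filter_size < default_size:
        # Case 2: one tensor array, default-size filter padded with zeros.
        return UnitPlan(tensor_arrays=1, zero_padded=True)
    # Case 3 (assumed sizing): tile the filter with default-size blocks.
    per_side = ceil(filter_size / default_size)
    return UnitPlan(
        tensor_arrays=per_side * per_side,
        zero_padded=filter_size % default_size != 0,
    )
```

With a default size of 3, a 5 × 5 filter maps to four tensor arrays with zero padding under this assumed sizing, which matches the Fig. 5 example described in the Du quotation.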

Regarding claim 18
	Claim 18 is rejected for the reasons set forth in claim 17 and claim 2
Regarding claim 19
	Claim 19 is rejected for the reasons set forth in claim 17 and claim 3

Claims 4, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shafiee/Du, and further in view of Young et al., US 20180165577 A1, hereinafter Young.

Regarding claim 4
	Shafiee/Du teaches claim 1 
Further Du teaches, wherein the controller is further configured by the instructions to configure a particular tensor array to perform a computation of a fully connected layer of the convolutional neural network (pg 2 Column 1 “Together with the integrated pooling function, our proposed accelerator architecture can support completed one-stop CNN acceleration with both arbitrarily sized convolution and reconfigurable pooling.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the reconfigurable accelerator, which includes tensor processors capable of implementing fully connected layers using the integrated pooling function to process a CNN on physical hardware, as taught by Du into the disclosed invention of Shafiee.
Du improves the neural accelerator of Shafiee by implementing pooling layers in hardware rather than in external digital processes. This is because when “pooling function is not implemented in [accelerators], the convolution results must be transferred to CPU/GPU to run pooling function and then fed back to the accelerator to compute the next layer. This data movement not only consumes much power but also limits overall performance.” (Du pg 2 column 2)
Shafiee/Du does not explicitly teach, by instructing the tensor array to use a center value of the default convolutional filter for processing input data and pad remaining values of the default convolutional filter with zeros.  
Young when addressing tensor operations in a systolic computation array teaches, by instructing the tensor array to use a center value of the default convolutional filter for processing input data and pad remaining values of the default convolutional filter with zeros. (para. 0071 “In some implementations, the neural network implementation engine 150 of the system may zero-pad the input tensor, and may provide the zero-padded input tensor to the special-purpose hardware circuit 110.” para. 0072 “For example, for an 8×8 input tensor and a 3×3 window for an average pooling layer, a zero-padded input tensor would be a 10×10 tensor”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate zero padding tensor data with a center value relating to the default filter as taught by Young into the disclosed invention of Shafiee/Du.
	One of ordinary skill in the art would have been motivated to make this modification in order to implement hardware that “allows for an inference of a neural network that includes an (¶0010 Young)
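For illustration only, the two ideas discussed for claim 4 can be sketched together: zero-padding an 8×8 input to 10×10 for a 3×3 window (the example quoted from Young), and a 3×3 filter that keeps only its center value with the remaining values set to zero. The helper names (`zero_pad`, `center_only_kernel`, `apply_window`) are hypothetical, and the sketch is not code from Young, Du, Shafiee, or the application. It only shows that such a filter reduces to a per-element weighting of the input.

```python
def zero_pad(tensor, border=1):
    """Surround a 2-D list with `border` rows/columns of zeros."""
    width = len(tensor[0]) + 2 * border
    padded = [[0] * width for _ in range(border)]
    for row in tensor:
        padded.append([0] * border + list(row) + [0] * border)
    padded.extend([0] * width for _ in range(border))
    return padded


def center_only_kernel(weight, size=3):
    """size x size filter whose only non-zero entry is the center value."""
    kernel = [[0] * size for _ in range(size)]
    kernel[size // 2][size // 2] = weight
    return kernel


def apply_window(padded, kernel):
    """Slide the kernel over the padded input and multiply-accumulate."""
    k = len(kernel)
    rows = len(padded) - k + 1
    cols = len(padded[0]) - k + 1
    return [
        [sum(kernel[r][c] * padded[y + r][x + c]
             for r in range(k) for c in range(k))
         for x in range(cols)]
        for y in range(rows)
    ]
```

Applying `apply_window(zero_pad(x), center_only_kernel(w))` to an 8×8 input `x` returns an 8×8 result in which every element equals `w` times the corresponding input element, since every non-center weight is zero.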

Regarding claim 12
	Claim 12 is rejected for the reasons set forth in claim 9 and claim 4
Regarding claim 20
	Claim 20 is rejected for the reasons set forth in claim 17 and claim 4

Conclusion
Prior Art
Qi et al. “FPGA Design of a Multicore Neuromorphic Processing System” discusses a multicore neuromorphic array, where each core is a crossbar array with an input/output buffer in either an analog or digital configuration.
Lu et al. “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks” discusses a convolutional accelerator in which several rows of processing elements compute 1D convolutions to make up a convolutional unit.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/J.R.G./Examiner, Art Unit 2122
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145