DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant’s amendment filed on April 01, 2021 has been considered and entered. 
Accordingly, claims 1-7, 11-17, and 21-26 are pending in this application. Claims 1-7 and 11-17 are currently amended; claims 21-26 are new.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Chen et al. (NPL – “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”), hereinafter Chen.
Regarding claim 1, Chen teaches a neural network accelerator comprising: 
a data cache circuit (Chen Fig. 2, Section III.A, and Section V.A-B; data cache circuit – global buffer (GLB), control, and Global Input Network (GIN));
an MxN multiplication accumulator array comprising first and second multiplication accumulation windows (Chen Fig. 2, MxN multiplication accumulator array – 12x14 PE array; Chen Fig. 5); and
an output control circuit (Chen Fig. 2 and Section V.A-B; output control circuit – output lines of PE array and Global output network (GON)), 
wherein the data cache circuit is configured to transmit a plurality of pieces of convolutional data and a plurality of convolutional parameters used for a convolution operation to the first multiplication accumulation window in the MxN multiplication accumulator array (Chen Fig. 2, Fig. 4-5 and page 4 left col lines 37-42. Fig. 2 shows GLB transmitting convolutional data (ifmap) and convolutional parameter (filter) to the PE array; Fig. 4 shows an example 3x3 PE set from the 12x14 PE array receiving filter and ifmap and performing 2D convolution where the first multiplication accumulation window is the 3x3 PE set in Fig. 4),
wherein the first multiplication accumulation window does not intersect with the second multiplication accumulation window (Chen Fig. 5 shows different PE sets mapped to the PE array that do not intersect with each other. For example, the two 11x7 PE sets in conv1 do not intersect with each other),
 wherein the plurality of convolutional parameters is determined by the data cache circuit based on a first convolutional parameter matrix, wherein the plurality of pieces of convolutional data are determined by the data cache circuit based on a first convolutional data matrix (Chen Fig. 2, Fig. 4-5 and page 4 left col lines 37-42. Fig. 2 shows GLB transmitting convolutional data (ifmap) and convolutional parameter (filter) to the PE array),
wherein the first convolutional parameter matrix comprises A rows and B columns, wherein the first convolutional data matrix comprises D rows and E columns, wherein the first multiplication accumulation window comprises A rows and C columns, and wherein A is an integer greater than or equal to 2, B and C are each an integer greater than or equal to 1, D is a positive integer greater than or equal to A, E is an integer greater than or equal to max (B, C), M is a positive integer greater than or equal to A, and N is a positive integer greater than or equal to C (Chen Fig. 2 and 4. In fig. 4, Chen shows an example where parameter matrix is 3x3; data matrix is 5x5; first multiplication accumulation window - 3x3 PE set; A = 3 which is greater than 2; B = 3 and greater than 1; C = 3 and greater than 1; D = 5 and greater than A; E = 5 and greater than 3; M = 12 and greater than A; N = 14 and greater than C),
wherein the first multiplication accumulation window comprises AxC processing elements (PE), wherein a processing element in an ith row and a jth column is marked as processing element PEi,j, i is set to an integer each time in ascending order of 1 to A in sequence, and corresponding to each value of i, j is set to an integer each time in ascending order of 1 to C in sequence, wherein a processing element PEx,y of the first multiplication accumulation window is configured to perform a multiplication operation on convolutional data of the processing element PEx,y and a convolutional parameter of the processing element PEx,y (Chen Fig. 4(a-c) shows a 3x3 PE set performing the convolution process as multiplication and accumulation operations),
wherein a second convolutional data matrix is configured as a convolutional data matrix belonging to convolutional data transmitted by the data cache circuit to the second multiplication accumulation window (Chen Figs. 2 and 5-6; the ifmap belonging to the second PE set would be mapped and transmitted to the second PE set),
wherein a second convolutional parameter matrix is configured as a convolutional parameter matrix belonging to a convolutional parameter transmitted by the data cache circuit to the second multiplication accumulation window (Chen Figs. 2 and 5-6; the filter belonging to the second PE set would be mapped and transmitted to the second PE set),
wherein the first and second convolutional parameter matrices are different when the first and second convolutional data matrices are the same (Chen Fig. 6B and page 5 left col lines 42-46), and
wherein the first and second convolutional parameter matrices are the same when the first and second convolutional data matrices are different (Chen Fig. 5 (conv2) and page 4 right col lines 40-44: "The 5x27 PE set of CONV2 is divided into two segments with dimensions 5x14 and 5x13, respectively, and each segment is independently mapped onto the PE array");
upon C being greater than or equal to 2, the processing element PEx,y is further configured to transmit the convolutional parameter of the processing element PEx,y to a processing element PEx,y+1, transmit the convolutional data of the processing element PEx,y to a processing element PEx-1,y+1, and respectively use the convolutional parameter of the processing element PEx,y and the convolutional data of the processing element PEx,y as multipliers of multiplication operations performed by the processing element PEx,y+1 and the processing element PEx-1,y+1, wherein X is an integer greater than or equal to 2 and less than or equal to A, Y is an integer greater than or equal to 1 and less than or equal to C-1, the convolutional data of the processing element PEx,y is one of the plurality of pieces of convolutional data transmitted by the data cache circuit, and the convolutional parameter of the processing element PEx,y is one of the plurality of convolutional parameters transmitted by the data cache circuit (Chen Fig. 4(a-b) shows filter data is transmitted horizontally to a neighbor PE and ifmap data is transmitted diagonally to the upper right PE);
the first multiplication accumulation window is configured to perform an addition operation on a product obtained after a processing element PEi,j performs a multiplication operation, to obtain a convolutional result, wherein J is an integer greater than or equal to 1 and less than or equal to C (Chen Fig. 4(c) and page 4 left col lines 37-42 rows of psum are accumulated vertically); and 
the output control circuit is configured to output the convolutional result (Chen Fig. 1 and 4(c) shows the rows of psum being output).
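The data flow relied on in the mapping of claim 1 can be illustrated with a short sketch. It is not taken from Chen or from the application; it assumes a stride of 1, no padding, and that PE(x, y) works on filter row x and ifmap row x+y-1, which is one reading of the horizontal filter reuse and diagonal ifmap reuse cited from Chen Fig. 4. All function names are hypothetical.

```python
def conv1d_row(ifmap_row, filt_row):
    """Work of a single PE: 1D convolution of one ifmap row with one filter row."""
    B = len(filt_row)
    return [sum(ifmap_row[c + k] * filt_row[k] for k in range(B))
            for c in range(len(ifmap_row) - B + 1)]


def pe_set_conv2d(ifmap, filt):
    """A x C PE set (assumed mapping): PE(x, y) pairs filter row x with ifmap row x+y-1."""
    A, B = len(filt), len(filt[0])
    D, E = len(ifmap), len(ifmap[0])
    C = D - A + 1                          # one PE column per output row
    out = []
    for y in range(1, C + 1):
        psum = [0] * (E - B + 1)
        for x in range(1, A + 1):          # psums accumulated along the PE column
            row = conv1d_row(ifmap[x + y - 2], filt[x - 1])
            psum = [p + r for p, r in zip(psum, row)]
        out.append(psum)
    return out


def direct_conv2d(ifmap, filt):
    """Reference result computed directly from the 2D definition."""
    A, B = len(filt), len(filt[0])
    D, E = len(ifmap), len(ifmap[0])
    return [[sum(ifmap[r + i][c + k] * filt[i][k]
                 for i in range(A) for k in range(B))
             for c in range(E - B + 1)]
            for r in range(D - A + 1)]


# 5x5 data matrix and 3x3 parameter matrix, matching the sizes cited from Fig. 4.
ifmap = [[r * 5 + c for c in range(5)] for r in range(5)]
filt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
result = pe_set_conv2d(ifmap, filt)
```

Under these assumptions the 3x3 PE set produces the same 3x3 output as the direct computation, with each PE column yielding one output row.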


Regarding claim 2, Chen teaches all the limitations of claim 1 as stated above. Further, Chen teaches an array control circuit configured to: determine the first multiplication accumulation window used for the convolution operation from the MxN multiplication accumulator array; determine a row quantity of the first multiplication accumulation window based on a row quantity of the first convolutional parameter matrix; and determine a column quantity of the first multiplication accumulation window based on the row quantity of the first convolutional parameter matrix and a row quantity of the first convolutional data matrix (Chen Fig. 2 and Section III.B; array control circuit – config scan chain, which reads the configuration bits to configure the accelerator for the processing of filters and fmaps in a certain shape, which includes setting up the PE array computation mappings and Network on Chip (NoC) data delivery patterns).

Regarding claim 3, Chen teaches all the limitations of claim 2 as stated above. Further, Chen teaches wherein the array control circuit determines the column quantity of the first multiplication accumulation window according to the following formula: C= D-A+1 (Chen Fig. 4, Section III.B and as discussed in claim 1, C=3, D=5, A=3, therefore C = 5-3+1 = 3).
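The arithmetic in the claim 3 mapping can be checked with a few lines. The helper names below are hypothetical, and the 12x14 array size is the one cited from Chen Fig. 2; this is only a sketch of the formula C = D-A+1 as applied above.

```python
def window_shape(A, D):
    """Rows and columns of the PE window for an A-row parameter matrix and D-row data matrix."""
    if A < 2 or D < A:
        raise ValueError("expects A >= 2 and D >= A")
    return A, D - A + 1        # C = D - A + 1


def fits_in_array(shape, M=12, N=14):
    """True if the window fits in an M x N array (12x14 as cited for Chen)."""
    rows, cols = shape
    return rows <= M and cols <= N


shape = window_shape(A=3, D=5)   # the Fig. 4 example: 3x3 filter, 5x5 ifmap
```

For the Fig. 4 example this gives a 3x3 window, which fits within the 12x14 array.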

Regarding claim 4, Chen teaches all the limitations of claim 1 as stated above. Further, Chen teaches wherein the first multiplication accumulation window is configured to:
perform a multiplication operation on convolutional data of a processing element PEi,1 in a first column and a convolutional parameter of the processing element PEi,1 in a tth clock cycle, to obtain a product X1, wherein the convolutional data of the processing element PEi,1 and the convolutional parameter of the processing element PEi,1 are transmitted by the data cache circuit to the processing element PEi,1 (Chen Fig. 1 and 4);
transmit a convolutional parameter of a processing element PEx,1 to a processing element PEx,2, transmit convolutional data of the processing element PEx,1 to a processing element PEx-1,2, and respectively use the convolutional parameter of the PEx,1 and the convolutional data of the processing element PEx,1 as multipliers of multiplication operations performed by the processing element PEx,2 and the processing element PEx-1,2 in a (t+1)th clock cycle, wherein x is set to an integer each time in ascending order of 2 to A in sequence (Chen Fig. 4 (a-b) rows of filter weight are transmitted across PEs horizontally and rows of ifmap pixel are transmitted across PEs diagonally. Further, Fig. 12 shows an internal structure of a PE and shows the ifmap and filter are used as multipliers for a multiplication operation); and 
when t is set to each integer in a range [nB+1, nB+B], perform, by using the following formula, an addition operation on all products Xi corresponding to all values of t, to obtain a convolutional result Si:
$$S_{1} = \sum_{t=nB+1}^{nB+B} \sum_{i=1}^{A} X_{i,1}^{t}$$
wherein n is an integer greater than or equal to 0 and less than or equal to (E-B) (Chen Fig. 6 (c), Section II discloses the convolution equation, and page 4 left col lines 30-33 rows of psum are accumulated across PEs vertically, then the psums from multiple primitives are accumulated together as part of the convolution operation).
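The double sum recited in claim 4 can be evaluated numerically as a sanity check. The sketch below assumes the operand schedule recited in claim 6 for the first column (in clock cycle nB+P, PEi,1 multiplies data a(i, n+P) by parameter b(i, P)); the function names and sample values are illustrative only.

```python
def product(ifmap, filt, i, t):
    """X_{i,1}^t: product formed by PE(i,1) in clock cycle t (both 1-indexed)."""
    B = len(filt[0])
    n, p = divmod(t - 1, B)            # t = n*B + P with P = p + 1
    return ifmap[i - 1][n + p] * filt[i - 1][p]


def s1(ifmap, filt, n):
    """S_1 for a given n: sum over t in [nB+1, nB+B] and i in [1, A]."""
    A, B = len(filt), len(filt[0])
    return sum(product(ifmap, filt, i, t)
               for t in range(n * B + 1, n * B + B + 1)
               for i in range(1, A + 1))


ifmap = [[r * 5 + c for c in range(5)] for r in range(5)]   # D = E = 5
filt = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]                    # A = B = 3
first_output_row = [s1(ifmap, filt, n) for n in range(5 - 3 + 1)]  # n = 0..E-B
```

With this schedule, each value of n yields one element of the first output row of the 2D convolution.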

Regarding claim 5, Chen teaches all the limitations of claim 3 as stated above. Further, Chen teaches wherein when C is greater than or equal to 2, the first multiplication accumulation window is further configured to:  
perform a multiplication operation on convolutional data of a processing element PEi,j' in a (J')th column and a convolutional parameter of the processing element PEi,j' in a Tth clock cycle, to obtain a product XT, wherein J' is an integer greater than or equal to 2 and less than or equal to C; the data cache circuit is further configured to: obtain the convolutional parameter of the PEi,j' after convolutional parameter of a processing element PEi,j’-1 is transmitted to the processing element PEi,j', obtain convolutional data of a processing element PEh,j' after convolutional data of a processing PEh+1,J'-1 is transmitted to the processing element PEh,j’, transmit a convolutional parameter of a processing element PEA,j' and convolutional data of the processing element PEA,j' by the data cache circuit to the processing element PEA,j', and h is set to an integer each time in ascending order of 1 to A-1 in sequence (Chen Fig. 4(a-b) rows of filter weight are transmitted across PEs horizontally and rows of ifmap pixel are transmitted across PEs diagonally. Further, Fig. 12 shows an internal structure of a PE and shows the ifmap and filter are used as multipliers for a multiplication operation); and 
when T is set to each integer in a range [nB+J', nB+J'+B-1], the first multiplication accumulation window is further configured to perform, by using the following formula, an addition operation on all products XT, corresponding to all values of T, to obtain a convolutional result Sj’:
$$S_{j'} = \sum_{T=nB+j'}^{nB+j'+B-1} \sum_{i=1}^{A} X_{i,j'}^{T}$$
n is an integer greater than or equal to 0 and less than or equal to (E-B) (Chen Fig. 6 (c), Section II discloses the convolution equation, and page 4 left col lines 30-33 rows of psum are accumulated across PEs vertically, then the psums from multiple primitives are accumulated together as part of the convolution operation).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of the commutative property of addition (NPL – “Commutative Property”).
Regarding claim 7, Chen teaches all the limitations of claim 1 as stated above. Further, Chen teaches wherein the first multiplication accumulation window is further configured to:  
transmit a convolutional intermediate result QA-1 obtained in a processing element PEA,J to the output control circuit for caching (Fig. 2 shows psum being output and stored in the global buffer);
transmit the convolutional intermediate result QA-1 to the processing element PE1,j in an (nB+J+1)th clock cycle, and use the convolutional intermediate result QA-1 as an initial accumulation value of an addition operation performed in the (nB+J+1)th clock cycle (Fig. 2 and 12 show psum being input back to the PE array and added to the product of the filter and ifmap multiplication); and
determine, as a convolutional result Sj, a convolutional intermediate result QA obtained in an ((n+1)B+J-1)th clock cycle (Fig. 12 shows the input psum being added to the result of the multiplication of the ifmap and filter to generate the next psum).
Further, Chen teaches the psum is accumulated vertically. The example shown in Fig. 4 shows the psums are accumulated vertically in an upward direction. Therefore, Chen does not explicitly teach in an (nB+J)th clock cycle, transmit a product X1 to a processing element PE2,j, and perform an addition operation on the product X1 and a product X2 to obtain a convolutional intermediate result Q1, wherein the product X1 is a product obtained after a processing element PE1,j performs a multiplication operation on convolutional data of the processing element PE1,j and a convolutional parameter of the processing element PE1,J in the (nB+J)th clock cycle, and the product X2 is a product obtained after the processing element PE2,j performs a multiplication operation on convolutional data of the processing element PE2,j and a convolutional parameter of the PE2,j in the (nB+J)th clock cycle; transmit, to a PEf+1,J, a convolutional intermediate result Qf-1 obtained after a processing element PEf,j performs an addition operation, wherein f is set to an integer each time in ascending order of 2 to A-1 in sequence; perform an addition operation on the convolutional intermediate result Qf-1 and a product Xf+1,j obtained after the processing element PEf+1,j performs a multiplication operation, to obtain a convolutional intermediate result Qf.
However, the commutative property of addition states that changing the order of operands does not change the result (page 1 definition).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the accelerator of Chen using the commutative property of addition and perform the psum accumulation vertically in a downward direction, such that PE2,1 in Fig. 4(c) receives the partial sum generated by PE1,1, adds its own result, and then transmits the intermediate result to PE3,1 for further accumulation, and the psum of row 1 is output by PE3,1 instead of PE1,1.
The motivation to do so is to accumulate the partial sums. As stated above, Chen teaches accumulating the psums in a vertical direction, and performing the psum addition in an upward or downward fashion would not change the result of the addition operation.
Therefore, Chen as modified in view of the commutative property teaches in an (nB+J)th clock cycle, transmit a product X1 to a processing element PE2,j, and perform an addition operation on the product X1 and a product X2 to obtain a convolutional intermediate result Q1, wherein the product X1 is a product obtained after a processing element PE1,j performs a multiplication operation on convolutional data of the processing element PE1,j and a convolutional parameter of the processing element PE1,J in the (nB+J)th clock cycle, and the product X2 is a product obtained after the processing element PE2,j performs a multiplication operation on convolutional data of the processing element PE2,j and a convolutional parameter of the PE2,j in the (nB+J)th clock cycle; transmit, to a PEf+1,J, a convolutional intermediate result Qf-1 obtained after a processing element PEf,j performs an addition operation, wherein f is set to an integer each time in ascending order of 2 to A-1 in sequence; perform an addition operation on the convolutional intermediate result Qf-1 and a product Xf+1,j obtained after the processing element PEf+1,j performs a multiplication operation, to obtain a convolutional intermediate result Qf.
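The accumulation-order point made in this rejection can be illustrated with a small sketch. It assumes integer products, for which the order of addition has no effect on the sum; the function names and the sample products are hypothetical and are not taken from Chen.

```python
def accumulate_upward(products):
    """Upward order as described for Chen Fig. 4(c): start at PE(A,j), finish at PE(1,j)."""
    psum = 0
    for x in reversed(products):
        psum += x
    return psum


def accumulate_downward(products):
    """Downward order of the proposed modification: Q1 = X1 + X2, then Qf = Qf-1 + Xf+1."""
    q = products[0]
    intermediates = []
    for x in products[1:]:
        q += x
        intermediates.append(q)       # Q1, Q2, ..., Q(A-1)
    return intermediates


products = [6, -4, 9]                  # X1, X2, X3 for a column of A = 3 PEs
up = accumulate_upward(products)
down = accumulate_downward(products)
```

Here the last downward intermediate result equals the upward sum, consistent with the statement that the direction of accumulation does not change the result.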

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Kato et al. (US-PGPUB 20100223219 A1), hereinafter Kato.
Regarding claim 6, Chen teaches all the limitations of claim 5 as stated above. Further, Chen teaches wherein the DxE convolutional data matrix comprises DxE pieces of convolutional data ap,q, p is set to an integer each time in ascending order of 1 to D in sequence, and corresponding to each value of p, q is set to an integer each time in ascending order of 1 to E in sequence; wherein the AxB convolutional parameter matrix comprises AxB convolutional parameters bp',q’, p' is set to an integer each time in ascending order of 1 to A in sequence, and corresponding to each value of p', q' is set to an integer each time in ascending order of 1 to B in sequence (Chen Fig. 4 DxE data matrix – 5x5 input fmap (ifmap); AxB parameter matrix – 3x3 filter weight);
and the data cache circuit further comprises:
a cache, configured to cache the DxE pieces of convolutional data and the AxB convolutional parameters (Chen Fig. 1 cache – global buffer).
Chen does not teach the data cache circuit comprising a counter, configured to determine, in an (nB+P)th clock cycle, that the convolutional data of the processing element PEi,1 is ai,n+P, and the convolutional parameter of the processing element PEi,1 is bi,P, wherein a value of P is an integer greater than or equal to 1 and less than or equal to B; and the counter is further configured to determine, in an (nB+J'+Z-1)th clock cycle, that the convolutional data of the processing element PEA,j' is aA+j-1,n+z, and the convolutional parameter of the processing element PEA,j' is bA,z, wherein a value of Z is an integer greater than or equal to 1 and less than or equal to B.

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the accelerator of Chen using Kato and include a counter in the top-level control, which coordinates traffic between the GLB and the PE array, for determining the filter and ifmap data transmitted to each PE in each clock cycle.
The motivation to do so is to use the counter to generate addresses for retrieving and loading filter and ifmap data to the PEs (Kato paragraph [0093]).
Therefore, the combination of Chen as modified in view of Kato teaches the data cache circuit comprising a counter, configured to determine, in an (nB+P)th clock cycle, that the convolutional data of the processing element PEi,1 is ai,n+P, and the convolutional parameter of the processing element PEi,1 is bi,P, wherein a value of P is an integer greater than or equal to 1 and less than or equal to B; and the counter is further configured to determine, in an (nB+J'+Z-1)th clock cycle, that the convolutional data of the processing element PEA,j' is aA+j-1,n+z, and the convolutional parameter of the processing element PEA,j' is bA,z, wherein a value of Z is an integer greater than or equal to 1 and less than or equal to B.
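The counter behavior recited for claim 6 amounts to decoding a clock-cycle number into operand indices. The sketch below is one possible reading, not Kato's or Chen's implementation: it assumes 1-indexed cycles, reads the garbled subscript "aA+j-1,n+z" as a(A+j'-1, n+Z), and uses hypothetical function names.

```python
def first_column_operands(cycle, i, B):
    """Operand indices for PE(i,1) in clock cycle nB+P: data a(i, n+P), parameter b(i, P)."""
    n, p = divmod(cycle - 1, B)        # cycle = n*B + P, 1 <= P <= B
    P = p + 1
    return (i, n + P), (i, P)


def last_row_operands(cycle, j, A, B):
    """Operand indices for PE(A,j') in cycle nB+j'+Z-1: data a(A+j'-1, n+Z), parameter b(A, Z)."""
    if cycle < j:
        raise ValueError("PE(A,j') has no operands before cycle j'")
    n, z = divmod(cycle - j, B)        # cycle - j' = n*B + (Z - 1)
    Z = z + 1
    return (A + j - 1, n + Z), (A, Z)


# Example with B = 3 filter columns and A = 3 filter rows.
col1_cycle4 = first_column_operands(cycle=4, i=2, B=3)
row3_cycle5 = last_row_operands(cycle=5, j=2, A=3, B=3)
```

For example, in cycle 4 with B = 3 the decode gives n = 1 and P = 1, so PE2,1 would receive data a(2, 2) and parameter b(2, 1) under this reading.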

Claims 11-15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Putnam et al. (NPL – “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”), hereinafter Putnam.

Regarding claims 11-15, Chen teaches all the limitations of claims 1-5, respectively, as stated above. Further, Chen teaches that the accelerator is communicatively coupled to an off-chip DRAM, and that the DRAM is configured to input the plurality of pieces of convolutional data and the plurality of convolutional parameters to the data cache circuit of the communications device (Chen Fig. 2 and page 3 left col lines 47-49).
Chen does not teach the device comprising a central processing unit (CPU), a double data rate synchronous dynamic random access memory (DDR SDRAM); and wherein the CPU is configured to control the communications device to start the convolution operation.
However, in the same field of endeavor, Putnam teaches a device comprising an accelerator coupled to a host CPU, and a DRAM which consists of two dual-rank DDR3-1600 DIMMs (Putnam Fig. 1 and 3 and Section 2.1).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen using Putnam and configure the DRAM to use two dual-rank DDR3-1600 DIMMs, and further to connect the accelerator to a host CPU so that the CPU controls operation of the accelerator.
The motivation to connect the accelerator to a host CPU is to incorporate the accelerator into a CPU system to accelerate workloads in large-scale systems (Putnam Section 6 first paragraph). The motivation to use a DDR3-1600 DRAM is to operate at DDR3 speeds for additional bandwidth capacity (Putnam page 3 left col lines 14-19).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of the commutative property of addition as applied to claim 7 above, and further in view of Putnam.
Regarding claim 17, Chen as modified in view of the commutative property of addition teaches all the limitations of claim 7 as stated above. Further, Chen teaches that the accelerator is communicatively coupled to an off-chip DRAM, and that the DRAM is configured to input the plurality of pieces of convolutional data and the plurality of convolutional parameters to the data cache circuit of the communications device (Chen Fig. 2 and page 3 left col lines 47-49).
Chen does not teach the device comprising a central processing unit (CPU), a double data rate synchronous dynamic random access memory (DDR SDRAM); and wherein the CPU is configured to control the communications device to start the convolution operation.
However, in the same field of endeavor, Putnam teaches a device comprising an accelerator coupled to a host CPU, and a DRAM which consists of two dual-rank DDR3-1600 DIMMs (Putnam Fig. 1 and 3 and Section 2.1).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen using Putnam and configure the DRAM to use two dual-rank DDR3-1600 DIMMs, and further to connect the accelerator to a host CPU so that the CPU controls operation of the accelerator.
The motivation to connect the accelerator to a host CPU is to incorporate the accelerator into a CPU system to accelerate workloads in large-scale systems (Putnam Section 6 first paragraph).
Therefore, the combination of Chen as modified in view of the commutative property of addition and Putnam teaches a device comprising a central processing unit (CPU), a double data rate synchronous dynamic random access memory (DDR SDRAM), and a convolution operation chip that are communicatively connected.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Kato as applied to claim 6 above, and further in view of Putnam.
Regarding claim 16, Chen as modified in view of Kato teaches all the limitations of claim 6 as stated above. Further, Chen teaches that the accelerator is communicatively coupled to an off-chip DRAM, and that the DRAM is configured to input the plurality of pieces of convolutional data and the plurality of convolutional parameters to the data cache circuit of the communications device (Chen Fig. 2 and page 3 left col lines 47-49).
Chen does not teach the device comprising a central processing unit (CPU), a double data rate synchronous dynamic random access memory (DDR SDRAM); and wherein the CPU is configured to control the communications device to start the convolution operation.
However, in the same field of endeavor, Putnam teaches a device comprising an accelerator coupled to a host CPU, and a DRAM which consists of two dual-rank DDR3-1600 DIMMs (Putnam Fig. 1 and 3 and Section 2.1).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen using Putnam to configure the DRAM to use two dual-rank DDR3-1600 DIMMs and, further, to connect the accelerator to a host CPU so that the CPU controls operation of the accelerator.

Therefore, the combination of Chen as modified in view of Kato and Putnam teaches a device comprising a central processing unit (CPU), a double data rate synchronous dynamic random access memory (DDR SDRAM), and a convolution operation chip that are communicatively connected.

Claims 21-26 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Putnam as applied to claim 11 above, and further in view of Kato.
Regarding claims 21-26, they are directed to a non-transitory computer readable medium storing computer instructions that, when executed by the processor of claim 11, configure the processor to transmit data to the MxN multiplication array of claim 11. The analysis of claims 11-16 applies equally to the corresponding limitations of claims 21-26. Further, claims 21-26 recite additional limitations, which are addressed below.
Regarding claims 21-26, Chen as modified in view of Putnam teaches all the limitations of claims 11-16 respectively as stated above. Further, Chen teaches transmitting a plurality of pieces of convolutional data and a plurality of convolutional parameters used for a convolution operation to a first multiplication accumulation window in an MxN multiplication accumulator array (Chen Fig. 2 and page 3 left col lines 47-49 the accelerator loads tiles of the ifmaps and filters from DRAM for processing, and the computed ofmaps are written back to DRAM).
Chen does not teach a non-transitory, computer readable medium storing computer executable instructions that when executed by a computer, configure the computer to control the operation of the MxN multiplication array.

Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen in view of Putnam using Kato and configure the system to include a non-transitory, computer readable medium storing computer executable instructions that, when executed by the CPU, would control the operation of the accelerator comprising the MxN multiplication array such that the CPU would transmit the filter and ifmap data from the off-chip DRAM to the PE array during run time.
The motivation to do so is to include a computer program for implementing the functional processing of the present invention stored on a computer-readable storage medium (Kato paragraphs [0155] and [0157]).
Response to Arguments
Applicant's arguments filed 04/01/2021, see remarks pages 22-24, with respect to the prior art rejection of claims 1-7, and 11-17 have been fully considered but they are not persuasive.
In response to applicant’s arguments with respect to the prior art rejection of claims 1-7 and 11-17, applicant amended independent claims 1 and 11 to include the features of dependent claims 8-10 and 18-20, respectively. Applicant argues that amended independent claims 1 and 11 recite that two windows are processed at the same time, while Chen describes processing only one window at a time. Applicant further argues that neither of the two alternative situations is taught by Chen when two windows are operating at the same time. One situation is when the first and second convolutional data matrices are the same. In that situation, the data does not need to be transferred twice from a buffer to the processing element and, therefore, the speed is increased and the power requirements are lowered.
Examiner respectfully disagrees. As shown in Fig. 5 of Chen, multiple PE sets are mapped on the PE array. For example, in conv1, two 11x7 PE sets are mapped on the PE array. In conv2, one 5x13 PE set and one 5x14 PE set are mapped on the PE array. Further, the description of Fig. 5 discloses that PEs of the same color receive the ifmap value in the same cycle, and that the arrows between two PE sets indicate that their psums can be accumulated together. Further, Chen page 5 right col lines 37-40 discloses that the mapping of multiple sets is described by parameters r and t. The PE array fits r × t PE sets in parallel that run r different channels of t different filters simultaneously. Therefore, the PE sets described in Fig. 5 are performing convolution operations at the same time. Further, Chen page 4 right col lines 58-62 through page 5 left col lines 29-34 teaches situations where: 1.) different ifmaps reuse the same filter (i.e. filter reuse); and 2.) different filters reuse the same ifmap (i.e. ifmap reuse), to further reduce DRAM and global buffer (GLB) accesses. Fig. 6(a) shows filter reuse, where filter 1 is used for both the ifmap 1 and ifmap 2 operations, which corresponds to the situation in which the first and second convolutional parameter matrices are the same. Fig. 6(b) shows ifmap reuse, where ifmap 1 is used for both the filter 1 and filter 2 operations, which corresponds to the situation in which the first and second convolutional data matrices are the same.
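For illustration only, the filter reuse and ifmap reuse situations discussed above can be sketched as follows. The sketch is not taken from Chen or from the application. The names `Buffer`, `conv2d`, and `run_two_windows` are hypothetical, the fetch counter is an assumed stand-in for buffer accesses, and the two windows are evaluated one after the other, so the sketch models only the sharing of a matrix between two windows and not the parallel hardware.

```python
# Illustrative sketch only. Buffer, conv2d and run_two_windows are
# hypothetical names; the fetch counter is an assumed model of buffer
# accesses and is not drawn from Chen or from the application.


def conv2d(ifmap, filt):
    """2-D 'valid' convolution in the cross-correlation form used by CNNs."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(ifmap) - fh + 1
    out_w = len(ifmap[0]) - fw + 1
    return [
        [
            sum(
                ifmap[y + i][x + j] * filt[i][j]
                for i in range(fh)
                for j in range(fw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


class Buffer:
    """Hypothetical buffer that counts how many matrices are fetched."""

    def __init__(self, matrices):
        self.matrices = matrices
        self.fetches = 0

    def fetch(self, name):
        self.fetches += 1
        return self.matrices[name]


def run_two_windows(buf, data_names, filter_names):
    """Compute two windows, fetching each distinct matrix only once."""
    fetched = {}

    def get(name):
        # A matrix shared by both windows is fetched a single time.
        if name not in fetched:
            fetched[name] = buf.fetch(name)
        return fetched[name]

    return [conv2d(get(d), get(f)) for d, f in zip(data_names, filter_names)]


MATRICES = {
    "ifmap1": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "ifmap2": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "filter1": [[1, 0], [0, 1]],
    "filter2": [[1, 1], [1, 1]],
}

# Filter reuse: the same parameter matrix serves two different data matrices.
filter_reuse_buf = Buffer(MATRICES)
filter_reuse_out = run_two_windows(
    filter_reuse_buf, ["ifmap1", "ifmap2"], ["filter1", "filter1"]
)

# Ifmap reuse: the same data matrix serves two different parameter matrices.
ifmap_reuse_buf = Buffer(MATRICES)
ifmap_reuse_out = run_two_windows(
    ifmap_reuse_buf, ["ifmap1", "ifmap1"], ["filter1", "filter2"]
)
```

Under these assumptions, each pair of windows needs three fetches rather than four, because the shared matrix is fetched once and used by both windows.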
In response to applicant’s arguments with respect to claims 2-7, 12-17, and 21-26, applicant relied on the arguments presented for claims 1 and 11, which are not persuasive for the same reasons.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARLO C WAJE whose telephone number is (571)272-5767.  The examiner can normally be reached on 7:30-4:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li, can be reached at (571) 272-4169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.W./
Carlo Waje
Examiner, Art Unit 2182
(571)272-5767


/Aimee Li/
Supervisory Patent Examiner, Art Unit 2183