Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
	Claims 1-20 are pending.

Claim Interpretation
The claim amendments were received on 03/28/2022. The claims are acceptable.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4-5, 7-8, and 10 are rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1) in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks").

Regarding claim 1, Yu teaches an operation method of a convolution circuit, the method comprising: 
receiving input feature maps ([Yu, Figure 1B; 0034] “The parameters of a CNN model are called “weights”. The first layer of a CNN reads an input image and outputs a series of feature maps”); 
processing only kernel data for a current input feature map being processed ([Yu, 0039] “A CONV layer takes a series of feature maps as input and convolves with convolutional kernels to obtain the output feature map”, [Yu, 0042] “where g_ij is the convolutional kernel applied to j-th input feature map and i-th output feature map”, [Yu, 0046] “where p is the pooling kernel size. This non-linear “down sampling” not only reduces the feature map size and the computation for later layers, but also provides a form of translation invariance”, using only the kernel data for the input feature map currently being processed is fundamental to a convolutional neural network); 
generating output feature maps corresponding to the respective input feature maps through convolution operations by performing parallel processing with a kernel unit ([Yu, Figure 1A; Figure 1B; 0033-0034] “As shown in FIG. 1A, a typical CNN consists of a number of layers that run in sequence … The first layer of a CNN reads an input image and outputs a series of feature maps. The following layers read the feature maps generated by previous layers and output new feature maps. Finally a classifier outputs the probability of each category that the input image might belong to”, [Yu, Figure 1A; 0123] “Said weight buffer is for storing weights of the ANN”); and 
outputting the output feature maps to an external memory ([Yu, claim 4] “the PE further comprises: a convolver complex, coupled to the buffer to receive weights of ANN and said data, configured for performing convolutional operations of the ANN”, discloses that the PE performs convolution operation, which produces output feature maps, [Yu, claim 6] “The DPU of claim 1, the buffer further comprises: input buffer, configured for preparing the data, instructions for said convolver complex; output buffer, for storing and outputting data results”, the computing complex including a plurality of processing elements (PEs) outputs the output feature maps to output buffer, and the output buffer outputs the feature map to the DMA (i.e. external memory). [Yu, Figure 4] also shows the PE units connected to the external memory (i.e. DDR) through the output FIFO. [Yu, 0112] “Then, PL (e.g., buffer) gets data from FIFO for subsequent operations by the computational complex. In a similar manner, the output data from PL is transmitted to DDR via another FIFO” teaches that the output data from buffer is transmitted to DDR (i.e. external memory) via output FIFO).
	Yu does not specifically teach performing parallel processing with a kernel unit and using intermediate result values for one point at a same position of the output feature maps, the generating the output feature maps including writing and reading intermediate result values using intermediate result values for one point at a same position of the output feature maps.
Chen teaches performing parallel processing with a kernel unit ([Chen, page 131, right column, last paragraph; Figure 6] “a) PE array processing passes: So far we have described a way to exploit data reuse by maximally utilizing the storage of spads and the spatial parallelism of the PE array. The PE array can run multiple 2-D convolutions from up to q × r channels of p × t filters simultaneously. Multiple ifmaps can also be processed sequentially through the array. The amount of computation done in this fashion is called a Processing Pass. In a pass, each input data are read only once from the GLB, and the psums are stored back to the GLB only once when the processing is finished.”, discloses the parallel processing), the generating the output feature maps including writing and reading intermediate result values using intermediate result values for one point at a same position of the output feature maps ([Chen, page 129, right column, 2nd paragraph] “To minimize the movement of ifmaps and filters, the goal is to maximize three forms of data reuse. 1) Convolutional Reuse: Each filter weight is reused E×F times in the same ifmap plane, and each ifmap pixel is usually reused R × S times in the same filter plane. 2) Filter Reuse: Each filter weight is reused across the batch of N ifmaps. 3) Ifmap Reuse: Each ifmap pixel is reused across M filters (to generate M ofmap channels). To minimize the movement of psums, it is desirable that the psum accumulation across C × R × S values into one ofmap value can be done as soon as possible to save both the storage space and memory R/W energy. However, maximum input data reuse cannot be achieved simultaneously with immediate psum reduction, since the psums generated by multiply and accumulations (MACs) using the same filter or ifmap value are not reducible. Thus, the RS dataflow uses a systematic approach to optimize for all data types simultaneously as follows”, Chen discloses the process of reusing the stored partial sums, [Chen, page 130, Fig 4] shows a process of convolutional reuse. (a) shows the rows of filter weight values from same rows are reused in several different processing units).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Chen and Yu, to apply Chen's generating of the output feature maps, including writing and reading intermediate result values for one point at a same position of the output feature maps, to the convolution circuit method of Yu. The suggestion and/or motivation for doing so is to process data more efficiently, because accumulating partial sums as soon as possible saves both storage space and memory read/write energy (Chen, page 129, right column, 2nd paragraph).
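For illustration only, the writing and reading of intermediate result values for one point at a same position of the output feature maps can be sketched as a read-modify-write accumulation over the input feature maps. The Python sketch below is a toy model under stated assumptions (no padding, stride 1, integer data); it is not the implementation of Yu, Chen, or the application, and the names conv_accumulate, partial_top, ifmaps, and kernels are hypothetical.

```python
def conv_accumulate(ifmaps, kernels, K):
    """Accumulate KxK convolutions over M input maps into N output maps.

    ifmaps:  list of M input maps, each an H x W list of lists
    kernels: kernels[n][m] is the KxK kernel linking input m to output n
    Returns N output maps of size (H-K+1) x (W-K+1) (no padding, stride 1).
    """
    M = len(ifmaps)
    N = len(kernels)
    H, W = len(ifmaps[0]), len(ifmaps[0][0])
    Ho, Wo = H - K + 1, W - K + 1
    # Hypothetical "partial top buffer": one intermediate value per output point.
    partial_top = [[[0] * Wo for _ in range(Ho)] for _ in range(N)]
    for m in range(M):          # one input feature map at a time
        for n in range(N):      # only the kernels for the current input map
            k = kernels[n][m]
            for y in range(Ho):
                for x in range(Wo):
                    acc = partial_top[n][y][x]    # read the intermediate value
                    for i in range(K):
                        for j in range(K):
                            acc += ifmaps[m][y + i][x + j] * k[i][j]
                    partial_top[n][y][x] = acc    # write it back to the same point
    return partial_top
```

In this sketch each output point is read and rewritten once per input feature map, so the buffer holds only one intermediate value per output position at any time.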

Regarding claim 2, Yu in view of Chen teaches wherein the kernel unit is KxK window filtering (K is a natural number) ([Yu, 0123] “The input buffer further comprises an input data buffer and a weight buffer. Said weight buffer is for storing weights of the ANN”, discloses the kernel unit (i.e. weight buffer) stores the weight of the ANN (i.e. convolution window), [Yu, 0117] “The size of convolver usually has only several options such as 3×3, 5×5, and 7×7. For example, the 2D convolvers are designed for convolution operation only over a 3×3 window”, discloses the K×K window).

Regarding claim 4, Yu in view of Chen teaches wherein the generating the output feature maps comprises storing kernels necessary for generating the output feature maps in the external memory ([Yu, Figure 4; 0112] “Then, PL (e.g., buffer) gets data from FIFO for subsequent operations by the computational complex. In a similar manner, the output data from PL is transmitted to DDR via another FIFO”, discloses that the output FIFO sends the kernel data to the DDR (i.e. external memory), and the data includes the data in, weight (i.e. kernels), and bias. [Yu, 0114] “the DMA transmit data from output FIFO to the DDR”, discloses that the kernel data is transmitted from the output FIFO to the DDR).

Regarding claim 5, Yu in view of Chen teaches further comprising repeating loading and accumulating a partial sum of the convolution operation from the external memory, or storing the partial sum in the external memory ([Chen, page 131, left column, 2nd paragraph; Figure 2] “1) Multiple 2-D Convolutions in a PE Set: If the spad size is large enough, each PE can run multiple 1-D convolution primitives simultaneously by interleaving their computation. Equivalently, this means each PE set is running multiple 2-D convolutions on different filters and channels. There are two scenarios. 1) By interleaving the computation of primitives that run on the same ifmap with different filters, the spads can buffer the same ifmap value and reuse it to compute with a weight from each filter sequentially [Fig. 6(b)]. It requires increasing the filter and psum spad size. 2) By interleaving the computation of primitives that run on different channels, the PE can accumulate through all channels sequentially on the same psum [Fig. 6(c)]. This requires increasing the ifmap and filter spad size”, Global Buffer corresponds to the external memory).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Yu and Chen, to apply Chen's repeated loading and accumulating of the partial sum, or storing of the partial sum in the external memory, to the convolution circuit method of Yu. The suggestion and/or motivation for doing so is to add flexibility to the convolution circuit, because storing partial-sum results in the external memory enables a user or a computing device to review, change, or feed the calculations back to the processing unit for further processing as needed. 

Regarding claim 7, Yu in view of Chen teaches wherein result values of each of the convolution operations are stored in the external memory in a predetermined order ([Chen, page 130, Figure 4 and 5] “Fig. 5. Mapping of the PE sets on the spatial array of 168 PEs for the CONV layers in AlexNet. For the colored PEs, the PEs with the same color receive the same ifmap value in the same cycle. The arrow between two PE sets indicates that their psums can be accumulated together”, [Chen, page 130, right column, 2nd paragraph] “An example of these two exceptions can be seen from the PE set mapping of layers CONV1–CONV5 in AlexNet onto the 12×14 PE array of Eyeriss as shown in Fig. 5. The 11×55 PE set of CONV1 is strip-mined to 11×7. The strip-mined PE set width is determined by a process that optimizes for overall energy efficiency as introduced in [32]. The 5 × 27 PE set of CONV2 is divided into two segments with dimensions 5×14 and 5 × 13, respectively, and each segment is independently mapped onto the PE array. Finally, the 3 × 13 PE set of CONV3–CONV5 can easily fit into the PE array. Except for CONV2, the PE array can fit multiple PE sets in parallel as shown in Fig. 5, and the RS dataflow further defines how to fully utilize hardware resources to minimize data movement in the dimensions beyond 2-D. This mapping strategy is realized by a custom designed NoC that is also optimized for energy efficiency (Section V-B)”, shows the predetermined order of storing result values).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Chen and Yu, to apply Chen's storing of the result values of the convolution operations in a predetermined order to the convolution circuit method of Yu. The suggestion and/or motivation for doing so is to make data processing faster, because storing data in a predetermined order makes it easier for the processing unit to find the data required for calculation. 

Regarding claim 8, Yu in view of Chen teaches wherein at least one of the convolution operations is performed while outputting at least one of the output feature maps to the external memory ([Yu, Figure 2; Figure 9; 0149] “Referring to the example 2 shown in FIG. 9, wherein the CPU and the neural network accelerator work in a pipeline manner to process image 1, image 2, . . . image n. The CPU fetches image data, while the neural network accelerator processes the data. The above proposed solution supports a parallel schedule between the CPU and the neural network accelerator”, the accelerator performs convolution operation, and the Figure 2 shows the connection between the PE units (i.e. accelerator). Figure 9 shows the pipeline of image processing of the CPU and the accelerator).

Regarding claim 10, Yu teaches a convolution circuit comprising: 
a direct memory access (DMA) processing circuit configured to read data from an external memory or output data to the external memory ([Yu, Figure 4; 0009] “a direct memory access (DMA); a direct memory access (DMA), connected to the CPU, an external memory and a programmable logic module, used for communication between the external memory and the programmable logic module”);
a kernel buffer configured to store kernel data for connecting an input feature map being processed and N output feature maps ([Yu, Figure 4; 0123] “Said weight buffer is for storing weights of the ANN”, the buffer unit of the Figure 4 encompasses the kernel buffer (i.e. weight buffer), which is the ‘Weight’);
a bottom buffer configured to store a plurality of input data corresponding to an input feature map ([Yu, Figure 4; 0123] “Said input data buffer might be a line data buffer, for storing data and holding the data with delayers in order to reuse the data”, the buffer unit of the Figure 4 encompasses the data buffer (i.e. bottom buffer), which is the ‘Data in’);
an input data load circuit configured to store the N kernel data and M input feature map data from the DMA processing circuit into the kernel buffer ([Yu, Figure 4; 0126] “DMA communicate data/instructions between the DDR and the neural network processing unit via the input FIFO, instruction FIFO and output FIFO”, discloses what type of data input FIFO receives, and [Yu, 0114] “In running a neural network, CPU needs to monitor the status of DMA in real time. When the input FIFO is not full, the DMA transmits data from DDR to the input FIFO”, discloses how input FIFO received data. The input FIFO corresponds to the input data load unit. The Weight embedded in the buffer unit of the Figure 4 corresponds to the kernel buffer);
a pipeline parallel kernel processing circuit configured to perform a convolution operation to the KxK input data by using KxK kernel weight values for each P kernel processing ([Yu, Figure 4; 0116-0120] “The computation complex comprises convolver, adder tree, NL module. The size of convolver usually has only several options such as 3×3, 5×5, and 7×7. For example, the 2D convolvers are designed for convolution operation only over a 3×3 window”, [0034; Figure 1A] “The parameters of a CNN model are called “weights” ”);
a result reception circuit configured to receive a result value of the pipeline parallel kernel processing circuit ([Yu, Figure 4; claim 6; 0121] “The DPU of claim 1, the buffer further comprises: input buffer, configured for preparing the data, instructions for said convolver complex; output buffer, for storing and outputting data results”, the ‘Data out’, which is a part of Buffer of the figure 4 of Yu discloses the result reception unit);
a partial top buffer configured to store the intermediate result values ([Yu, Figure 4; claim 6; 0121] “The Output Buffer saves the results generated from convolvers and offers intermediate results to the convolvers at proper time”, the ‘Data out’, which is a part of Buffer of the figure 4 of Yu discloses both result reception unit and a partial top buffer); and 
a control circuit configured to control the DMA processing circuit, the kernel buffer, the bottom buffer, the input data load circuit, the kernel/data supply circuit, the pipeline parallel kernel processing circuit, the result reception circuit, and the partial top buffer ([Yu, Figure 4; 0107-0114] “FIG. 4 shows further details and improvements over hardware design of FIG. 2 … CPU also controls the DMA for data communication. Specifically, under the control of CPU, DMA transmit data from the external memory (e.g., DDR) to the another FIFO unit. Then, PL (e.g., buffer) gets data from FIFO for subsequent operations by the computational complex. In a similar manner, the output data from PL is transmitted to DDR via another FIFO”, the CPU and DMA unit control all the operations of the elements in the Figure 4, by controlling data transmission through FIFO).
Yu does not specifically teach a kernel/data supply circuit configured to output P (P is a natural number of 2 or more) KxK input data of the bottom buffer and P KxK kernel data of the kernel buffer, or intermediate results written and read for one point at a same position of the output feature maps. 
Chen teaches a kernel/data supply circuit configured to output P (P is a natural number of 2 or more) KxK input data of the bottom buffer and P KxK kernel data of the kernel buffer ([Chen, page 128, Figure 2; from page 128, A. Overview to page 129, first paragraph] “Fig. 2 shows the top-level architecture and memory hierarchy of the Eyeriss system. It has two clock domains: the core clock domain for processing, and the link clock domain for communication with the off-chip DRAM through a 64-b bidirectional data bus. The two domains run independently and communicate through an asynchronous FIFO interface. The core clock domain consists of a spatial array of 168 PEs organized as a 12 × 14 rectangle, a 108-kB GLB, an RLC CODEC, and an ReLU module. To transfer data for computation, each PE can either communicate with its neighbor PEs or the GLB through an NoC, or access a memory space that is local to the PE called spads (Section V-C). Overall, there are four levels of memory hierarchy in the system (in decreasing energy per access): DRAM, GLB, inter-PE communication, and spads.”, the Global Buffer 108KB of the Figure 2 corresponds to the kernel/data supply unit); 
and intermediate results written and read for one point at a same position of the output feature maps ([Chen, page 129, right column, 2nd paragraph] “To minimize the movement of ifmaps and filters, the goal is to maximize three forms of data reuse. 1) Convolutional Reuse: Each filter weight is reused E×F times in the same ifmap plane, and each ifmap pixel is usually reused R × S times in the same filter plane. 2) Filter Reuse: Each filter weight is reused across the batch of N ifmaps. 3) Ifmap Reuse: Each ifmap pixel is reused across M filters (to generate M ofmap channels). To minimize the movement of psums, it is desirable that the psum accumulation across C × R × S values into one ofmap value can be done as soon as possible to save both the storage space and memory R/W energy. However, maximum input data reuse cannot be achieved simultaneously with immediate psum reduction, since the psums generated by multiply and accumulations (MACs) using the same filter or ifmap value are not reducible. Thus, the RS dataflow uses a systematic approach to optimize for all data types simultaneously as follows”, Chen discloses the process of reusing the stored partial sums, [Chen, page 130, Fig 4] shows a process of convolutional reuse. (a) shows the rows of filter weight values from same rows are reused in several different processing units).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Yu and Chen, to apply the kernel buffer, input buffer, and kernel/data supply unit of Chen to the convolution circuit of Yu comprising an input data load circuit, a DMA, a pipeline parallel kernel processing unit, a result reception unit, a partial top buffer, and a control unit. The suggestion and/or motivation for doing so is to improve the performance of the processing unit, because buffering the weight and input data before processing helps keep the processing unit from sitting idle and allows the kernel/data input to be synchronized with the clock cycle. 

Claim 3 is rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1), in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), and further in view of Takashi (JP 3973545 B2).

Regarding claim 3, Yu in view of Chen teaches the method of claim 2. 
Yu in view of Chen does not specifically teach further comprising storing K lines of each of the input feature maps in an internal memory of a chip.
Takashi teaches further comprising storing K lines of each of the input feature maps in an internal memory of a chip ([Takashi, 1st line of the page 8] “The one-line memories 307a to 307c are memories (FIFOs) that store data for one line of an image. In FIG. 3, one line of original image data (pixels d to f) stored in the one line memory 307c is stored in the register 305, and one line of original image data (pixels g to i) is newly stored”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Takashi, Yu, and Chen, to apply Takashi's storing of lines of each of the input feature maps to the convolution circuit method of Yu and Chen. The suggestion and/or motivation for doing so is to process data more efficiently, because storing an image line by line converts the image into linear data, and processing units, including CPUs, process linear data faster than multidimensional data. 
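For illustration only, storing K lines of an input feature map in an internal memory can be sketched as a line buffer that holds the K most recent rows and emits the KxK windows available from them. The Python sketch below is an assumption-laden toy model, not the circuit of Takashi or of the application; the names sliding_windows and line_buffer are hypothetical.

```python
from collections import deque

def sliding_windows(rows, K):
    """Yield (top_row_index, list of KxK windows) while holding only K lines."""
    line_buffer = deque(maxlen=K)   # hypothetical on-chip memory of K lines
    for r, row in enumerate(rows):
        line_buffer.append(list(row))   # the oldest line is dropped automatically
        if len(line_buffer) == K:
            width = len(row)
            # Every KxK window that can be formed from the K buffered lines.
            windows = [
                [line[x:x + K] for line in line_buffer]
                for x in range(width - K + 1)
            ]
            yield r - K + 1, windows
```

Under these assumptions the sketch never holds more than K rows at once, regardless of the height of the input feature map.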

Claim 6 is rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1), in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), and further in view of Clemons (US 20170004089 A1).

Regarding claim 6, Yu in view of Chen teaches wherein at least one of the parallel processing convolutions includes using a physically different memory for data of the at least one of the parallel processing convolutions multiplied by kernel weights, the using of the physically different memory including placing original input data in a two-dimensional plane having a height of Hi and a width of Wi ([Chen, Figure 4 and Figure 5] discloses that the PEs (processing engines) are connected in parallel, [Chen, Figure 14.5.5; 7th paragraph] “2-D Convolution PE Set: A 2-D convolution is composed of many 1-D convolution primitives, and its computation: 1) shares the same row of filter or ifmap across primitives and 2) accumulates the psums from multiple primitives together. Therefore, a PE Set, as shown in Fig. 4, is grouped to run a 2-D convolution and exploit the interprimitive convolutional reuse and psum accumulation, which avoids data accesses from GLB and DRAM. In a set, each row of filter is reused horizontally, each row of ifmap is reused diagonally, and rows of psum are accumulated vertically. The dimensions of a PE set are determined by the filter and ofmap size of a given layer. Specifically, the height and the width of the PE set are equal to the number of filter rows (R) and ofmap rows (E), respectively. In AlexNet, the PE sets are of size 11 × 55 (CONV1), 5×27 (CONV2), and 3×13 (CONV3–CONV5)”, discloses placing the original input data in a 2-D plane with height H and width W, [Chen, page 128, Fig. 2] discloses using physically different memory).
Yu in view of Chen does not specifically teach dividing the original input data by a KxK window, and storing the divided input data in a memory corresponding to a position that each data occupies in the K x K window.
Clemons teaches dividing the original input data by a KxK window, and storing the divided input data in a memory corresponding to a position that each data occupies in the K x K window ([Clemons, 0051 and Fig. 3B] “FIG. 3B illustrates the digital image 300 subdivided into a plurality of tiles, in accordance with one embodiment. Tiles are a 2D array of contiguous pixels in the digital image 300. Each tile represents a non-overlapping subset of pixels in the digital image 300. It will be appreciated that, in contrast to tiles, patches (e.g., 301, 302, 303, etc.) may overlap and do not correspond to tile boundaries, whereas tiles (e.g., 311, 312, 313, etc.) do not overlap and have fixed boundaries determined based on the size of a tile. The digital image 300 may be stored in the memory 240, with the pixel data for each tile of the digital image 300 being stored in contiguous addresses in the memory 240. Different tiles may be stored in non-contiguous sections of memory such that the digital image 300, as a whole, is not stored in a contiguous portion of the memory. Typically, the size of a tile is fixed, such as 32 pixels by 32 pixels (32×32) or 64 pixels by 64 pixels (64×64), and may be related to a memory bandwidth required to transfer the data for a tile from the memory 240 to the PISP 200”, the Fig.3B teaches dividing the image data into several different tiles based on the location of the tile, and 0051 teaches how they stores the pixel data into the memory). 
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Clemons, Yu, and Chen, to apply Clemons's storing of the divided input data in a memory corresponding to a position to the convolution circuit method of Yu and Chen. The suggestion and/or motivation for doing so is to process data more efficiently, because dividing the image data and storing it according to the position that each data occupies in the K x K window reduces the total processing time, as organized data is easier to find.  
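For illustration only, one possible reading of storing the divided input data in a memory corresponding to the position that each data occupies in the K x K window is to route each element of the Hi x Wi plane to one of K*K memories selected by its row and column offsets within the window. The Python sketch below reflects that assumed reading; it is not drawn from Clemons, Yu, Chen, or the application, and the names split_into_banks and banks are hypothetical.

```python
def split_into_banks(image, K):
    """Distribute an Hi x Wi plane into K*K banks by position inside a KxK window."""
    # One hypothetical memory bank per (row offset, column offset) in the window.
    banks = {(i, j): [] for i in range(K) for j in range(K)}
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            banks[(y % K, x % K)].append(value)   # bank chosen by window position
    return banks
```

Under this reading, the K*K values of any aligned window reside in K*K different banks, so they could be fetched at the same time.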

Claim 9 is rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1) in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), and further in view of Huang (CN 100409259 C). 

Regarding claim 9, Yu in view of Chen teaches the method of claim 1, and receiving the plurality of feature map data from the external memory ([Yu, Claim 2] “The DPU of claim 1, wherein the DMA is configured to transmit data between the external memory and the programmable logic module via FIFO”, the data includes feature maps).
Yu in view of Chen does not specifically teach wherein a plurality of feature map data are output at a same time while receiving the plurality of feature map data from the external memory.
Huang teaches wherein a plurality of feature map data are output at a same time while receiving the plurality of feature map data from the memory ([Huang, page 3, line 43-52] “In order to improve the calculating speed, fully exert the convolver of parallel calculation efficiency, when the convolution calculation, when the flow establishing to the Y register Convolver written data and read from the convolver result are carried out at the same time, so as to make the write data and read data collision … the DSP through the data bus transmits the data written in the reference picture memory. When performing a convolution calculation, isolator closed off DSP data bus access of the memory data bus, data placed in the convolver by the memory is isolated from the data read out by the convolver, so as not to conflict, can be put in to the Y data simultaneously with the read calculation result, fully plays the convolution circuit and parallel calculation, improves the calculation speed”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Huang, Yu, and Chen, to apply Huang's parallel processing of input and output to the convolution circuit method of Yu and Chen. The suggestion and/or motivation for doing so is to process data more efficiently, because receiving input and producing output concurrently reduces the total processing time.  

Claims 11, 15, and 19 are rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1) in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), and further in view of Rozman (US 5438614 A).

Regarding claim 11, Yu in view of Chen teaches wherein the DMA processing circuit manages: a read first-in, first-out (FIFO) memory configured to store a plurality of input feature map data and kernel data from the external memory ([Yu, Figure 4; 0126] “DMA communicate data/instructions between the DDR and the neural network processing unit via the input FIFO, instruction FIFO and output FIFO”, [Yu, Claim 2] “The DPU of claim 1, wherein the DMA is configured to transmit data between the external memory and the programmable logic module via FIFO”, as the buffer of the Figure 4 mentions, the data transmitted by DMA encompasses input and kernel data); and a write FIFO memory configured to store a plurality of output feature map data to be written in the external memory ([Yu, Figure 4; 0126] “DMA communicate data/instructions between the DDR and the neural network processing unit via the input FIFO, instruction FIFO and output FIFO”, [Yu, Claim 2] “The DPU of claim 1, wherein the DMA is configured to transmit data between the external memory and the programmable logic module via FIFO”, as the buffer of the Figure 4 mentions, the data transmitted by DMA encompasses input and kernel data).
Yu in view of Chen does not specifically teach that the DMA processing circuit comprises a read first-in, first-out memory and a write first-in, first-out memory. 
Rozman teaches a DMA processing unit comprising a read first-in, first-out memory and a write first-in, first-out memory ([Rozman, Figure 8C; column 12, page 39, line 32-38] “Still referring to FIG. 8C, DMA engine 742 comprises a frame flag circuit 743, a read FIFO programmable array logic (PAL) 744, a write FIFO PAL 745, a timing PAL 746, a DMA control circuit 747, a channel control PAL 748, a block count latch 749, a CPU address buffer 750, a CPU data buffer 751, a DMA control register 752, and a dual port RAM 753”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Rozman, Yu, and Chen, to apply Rozman's DMA processing unit comprising a read FIFO and a write FIFO to the convolution circuit of Yu and Chen. The suggestion and/or motivation for doing so is to buffer the data flow, because the embedded read FIFO and write FIFO queue up the bytes until the processing unit is ready for them.
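For illustration only, the buffering role of a read FIFO and a write FIFO between an external memory and a processing unit can be sketched as two queues that are filled and drained independently. The Python sketch below is a toy model under stated assumptions (a fixed FIFO depth and a single-item processing step); it is not the DMA engine of Rozman or Yu, and the names DmaModel, fill_read_fifo, drain_write_fifo, and run are hypothetical.

```python
from collections import deque

class DmaModel:
    """Toy DMA with a read FIFO (memory to core) and a write FIFO (core to memory)."""

    def __init__(self, external_memory, depth=4):
        self.mem = external_memory
        self.depth = depth            # assumed capacity of the read FIFO
        self.read_fifo = deque()
        self.write_fifo = deque()
        self.read_addr = 0

    def fill_read_fifo(self):
        # Fetch from external memory while the read FIFO has room.
        while len(self.read_fifo) < self.depth and self.read_addr < len(self.mem):
            self.read_fifo.append(self.mem[self.read_addr])
            self.read_addr += 1

    def drain_write_fifo(self, out):
        # Write queued results back to external memory in arrival order.
        while self.write_fifo:
            out.append(self.write_fifo.popleft())

def run(dma, process):
    """Process every item in external memory through the two FIFOs."""
    out = []
    while True:
        dma.fill_read_fifo()
        if not dma.read_fifo:
            break
        dma.write_fifo.append(process(dma.read_fifo.popleft()))
        dma.drain_write_fifo(out)
    return out
```

In this sketch the processing step only ever touches the two queues, which is the decoupling that the stated motivation relies on.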

Regarding claim 15, Yu in view of Chen, and further in view of Rozman teaches wherein the kernel buffer collects the K weight values from the read FIFO memory and stores the K weight values in a corresponding memory ([Yu, 0114] “the DMA transmits data from DDR to the input FIFO” shows that input FIFO stores input values, [Yu, Figure 4; 0126] “The three FIFOs are used for instructions, input data and output data respectively”, shows that the data from FIFO which is stored in buffer (i.e. corresponding memory) includes the weight data. Therefore, the Weight included in the buffer unit of the Figure 4 corresponds to the kernel buffer).

Regarding claim 19, Yu in view of Chen teaches further comprising an output data storage circuit configured to read the intermediate result values from the partial top buffer and transmit the accumulated intermediate result values to the write FIFO memory ([Yu, 0114] “When the input FIFO is not full, the DMA transmits data from DDR to the input FIFO. When the output FIFO is not empty, the DMA transmit data from output FIFO to the DDR”, and [Yu, 0126; Figure 4] “DMA communicate data/instructions between the DDR and the neural network processing unit via the input FIFO, instruction FIFO and output FIFO” shows that the output FIFO (i.e. partial top buffer) provides the intermediate result, and [Yu, 0121] “The Output Buffer saves the results generated from convolvers and offers intermediate results to the convolvers at proper time” discloses that the intermediate result is from the output buffer (i.e. result reception unit)).
Yu in view of Chen fails to teach the write FIFO memory of the DMA processing unit.
Rozman teaches the FIFO memory of the DMA processing unit ([Rozman, Figure 8C; column 12, page 39, line 32-38] “Still referring to FIG. 8C, DMA engine 742 comprises a frame flag circuit 743, a read FIFO programmable array logic (PAL) 744, a write FIFO PAL 745, a timing PAL 746, a DMA control circuit 747, a channel control PAL 748, a block count latch 749, a CPU address buffer 750, a CPU data buffer 751, a DMA control register 752, and a dual port RAM 753”).

Claim 12 is rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1) in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), and further in view of Anusha (Anusha et al, 2012, “Implementation of sobel edge detection on fpga”).

Regarding claim 12, Yu in view of Chen teaches the convolution circuit of claim 10.
Yu in view of Chen does not specifically teach wherein the kernel buffer is implemented as a dual port random access memory (DPRAM) for storing the N kernel data and outputting the P kernel data for parallel processing at the same time.
Anusha teaches wherein the kernel buffer is implemented as a dual port random access memory (DPRAM) for storing the N kernel data and outputting the P kernel data for parallel processing at the same time ([Anusha, B. FPGA Hardware Implementation, page 473; Figure 4] “The structure of 3x3 pixel generation module is shown in Fig. 3. This module consists of 3 shift register groups and two FIFO. The FIFO is used to cache a line of image data. The image data is input according to the clock signal so P1, P2, …., P9 is the 3x3 image data template. When the data is continuously input, 3x3 image data template changes. It can contain all pixels of an image. The FIFO is generated by dual-port RAM [11]. In Sobel enhancement operator module the orientation convolution kernel uses parallel processing construction”, 3x3 pixel generation module generates and stores the 3x3 convolution kernel, and it is implemented by dual-port RAM).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Anusha, Yu, and Chen, to apply the kernel buffer of Anusha, implemented as a DPRAM to store and output kernel data in parallel, to the convolution circuit of Yu and Chen. The suggestion and/or motivation for doing so is to gain the performance of parallel processing of image convolution, since dual-port RAM typically performs better in parallel computation because of its wider bandwidth.

Claim 13-14 is/are rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1), in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), in view of Rozman (US 5438614 A), and further in view of Ito (WO 2008153194 A1).

Regarding claim 13, Yu in view of Chen, and further in view of Rozman teaches wherein the kernel buffer further loads kernel data from the external memory in an order of an input feature map ([Yu, Figure 9; 0149] “Referring to the example 2 shown in FIG. 9, wherein the CPU and the neural network accelerator work in a pipeline manner to process image 1, image 2, . . . image n. The CPU fetches image data, while the neural network accelerator processes the data. The above proposed solution supports a parallel schedule between the CPU and the neural network accelerator”, Figure 9 discloses the order of the input maps and outputs, and each output is made by convolution of the input map and the kernel data loaded from the kernel buffer included in the buffer of Figure 4), and loads kernel data to a memory in an order of processing output feature maps when processing the input feature map ([Yu, Figure 9] The processing element (i.e. PE) of Yu loads kernel data using the buffer (i.e. memory). Figure 9 shows the pipeline diagram of output feature maps and processing input feature maps).
Yu in view of Chen, and further in view of Rozman fails to teach wherein a storage order of each kernel data is to store the kernel data with a row unit first and then to store the kernel data with a column unit in each row.
Ito teaches wherein a storage order of each kernel data is to store the kernel data with a row unit first and then to store the kernel data with a column unit in each row ([Ito, 0095] “For example, when the ring counter value is "1", and the height of a filter kernel (= convolution kernel) is "6", the start line of data to be referred to is a line indicated by A3. After the start address is determined, the controller 1601 outputs memory addresses while updating the value of the window counter 1607 … Upon completion of the read-out processing of the data (i.e., upon completion of the count operation corresponding to the width of the reference area) , the window counter is reset to indicate the zeroth pixel of the next row. Next, address A4 is generated based on the address value from the address converter 1605 and the count values of the column counter and window counter, and reference data of line L4 in the reference area are similarly stored in the reference data cache 1606”, the window counter and the column counter count the storage addresses of the filter kernel; the window counter is used first to count the row unit when storing the data, and then the column counter is used).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ito, Yu, Chen, and Rozman, to apply the method of Ito of storing kernel data with a row unit first and a column unit second to the convolution circuit of Yu, Chen, and Rozman. The suggestion and/or motivation for doing so is to increase the efficiency of data processing, because row-major order (i.e. row unit first) converts multidimensional data into sequential data, and processing units can process sequential data more efficiently than multidimensional data.
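For illustration only, the row-unit-first storage order discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Ito, Yu, Chen, Rozman, or the application; the names `store_row_major` and `address_of` and the 3x3 example kernel are hypothetical.

```python
def store_row_major(kernel):
    """Flatten a 2-D kernel into a sequential list: row unit first,
    then column unit within each row (row-major order)."""
    flat = []
    for row in kernel:        # row unit first
        for weight in row:    # then column unit within the row
            flat.append(weight)
    return flat


def address_of(row, col, width):
    """Sequential address of element (row, col) in row-major storage."""
    return row * width + col


# Hypothetical 3x3 kernel used only to show the resulting order.
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
flat = store_row_major(kernel)
```

In this sketch, consecutive weights of one kernel row occupy consecutive addresses, which is the sense in which row-major order turns two-dimensional data into sequential data.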

Regarding claim 14, Yu in view of Chen, in view of Rozman, and further in view of Ito teaches wherein the kernel buffer further allocates a different physical memory for each row of a kernel ([Chen, page 130, Figure 4 and 5] discloses that each row of the filter (i.e. kernel) goes through a different processing engine, and [Chen, page 134, Figure 12] discloses that the structure of the processing engine contains physical memories (i.e. SRAM, register)).

Claim 16-18 is/are rejected under 35 U.S.C. 103 over Yu (US 20180046913 A1) in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), in view of Rozman (US 5438614 A), and further in view of Huang (CN 100409259 C). 

Regarding claim 16, Yu in view of Chen, and further in view of Rozman teaches the convolution circuit of claim 11, and a bottom buffer ([Yu, Figure 4; 0123] “Said input data buffer might be a line data buffer, for storing data and holding the data with delayers in order to reuse the data”).
Yu in view of Chen, and further in view of Rozman fails to teach wherein the memory outputs all data in a kernel window at the same time while the kernel window for input data moves in the input feature map.
Huang teaches wherein the memory outputs all data in a kernel window at the same time while the kernel window for input data moves in the input feature map ([Huang, page 3, line 43-52] “In order to improve the calculating speed, fully exert the convolver of parallel calculation efficiency, when the convolution calculation, when the flow establishing to the Y register Convolver written data and read from the convolver result are carried out at the same time, so as to make the write data and read data collision”, teaches writing input data and reading the calculated output data simultaneously. The kernel window for input data moving in the input feature map corresponds to calculating the convolution).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Huang, Yu, Chen, and Rozman, to apply the parallel processing of input and output of Huang to the convolution circuit of Yu, Chen, and Rozman. The suggestion and/or motivation for doing so is to process data more efficiently, because sending out the output data while input data moves into the processor (i.e. parallel processing) will reduce the total processing time.

Regarding claim 17, Yu in view of Chen, in view of Rozman, and further in view of Huang teaches wherein the kernel/data supply circuit further reads input data corresponding to the kernel window from the bottom buffer according to a row and column index of an output feature map and reads the P kernel data for processing the data read from the kernel buffer ([Chen, page 128, Figure 2; page 130, Figure 4 and 5; paragraph attached to the Fig. 4] “Dataflow in a PE set for processing a 2-D convolution. (a) Rows of filter weights are reused across PEs horizontally. (b) Rows of ifmap values are reused across PEs diagonally. (c) Rows of psums are accumulated across PEs vertically. Reuse and accumulation of data within a PE set reduce accesses to the GLB and DRAM, saving data movement energy cost. In this example, the number of filter rows (R), ifmap rows (H), and ofmap rows (E) are 3, 5, and 3, respectively. Therefore, the PE set size is 3 × 3. Filter and ifmap values from different rows are sent to the PE set in a time-interleaved fashion; all the PEs that reuse the same value receive it at the same cycle. The psums generated from one PE are sent to its neighbor PE immediately”, Figure 2 shows Global Buffer corresponds to the kernel/data supply unit and receives input data and kernel data, and the SRAM provides input and kernel data to spatial array according to the Figure 4 and 5, and SRAM feeds the data to the spatial array according to the row of the data).

Regarding claim 18, Yu in view of Chen, in view of Rozman, and further in view of Huang teaches wherein the pipeline parallel kernel processing circuit outputs the P result values by performing a multiplication operation and an addition operation on the input data and corresponding kernel weight values delivered from the kernel/data supply circuit ([Yu, Figure 4; 0012-0015] “In addition, the PE further comprises: a convolver complex, coupled to the buffer to receive weights of ANN and said data, configured for performing convolutional operations of the ANN; adder tree, coupled to the convolver complex, configured for summing results of convolution operation; non-linear (NL) module, coupled to the adder tree, configured for applying a non-linear function to the output of adder tree … In addition, the buffer further comprises: bias shift, coupled to the input buffer, configured for shifting weights of ANN between different numerical ranges and providing said shifted weights to the adder tree, wherein the weights are quantized fixed-point numbers”, Figure 4 of Yu discloses the convolver & adder to perform convolution operation using the kernel data and the input data from the buffers, convolution operation involves multiplication).

Claim 20 is rejected under 35 U.S.C. 103 over Chung (US 20160379109 A1) in view of Huang (CN 100409259 C), and further in view of Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks").

Regarding claim 20, Chung teaches an operation method of an application processor, the method comprising: performing parallel convolution operations on each of input feature maps to extract features, the performing the parallel convolution operations including processing only kernel data for a current input feature map being processed ([Chung, 0273-0274] “Functional units in the same column perform N parallel convolution operations on slices Slice.sub.0, Slice.sub.1, . . . , Slice.sub.N-1 of input data for a single plane of the output volume … Functional units in the same row perform M parallel convolution operations on a single slice (Slice.sub.0, Slice.sub.1, . . . , Slice.sub.N-1) of input data for all M planes of the output volume”, discloses the parallel processing, and [Chung, 0258] “… A 3D input volume 4214 of dimensions L×L×D is convolved with H weight kernels (e.g., weight kernel 4216) of dimension L×L×D and stride S. Each weight kernel (e.g., weight kernel 4216) is shifted in a sliding-window-like fashion (with a shift offset defined by stride S) across the input volume (e.g., volume 4204). During each shift, each weight in to the 3D weight kernel is multiplied and added with corresponding pair-wise input elements from the overlapping region of input volume 4214”, shows that only the weight kernel data corresponding to the input image is processed); and
performing sub-sampling operations on each of result values of the parallel convolution operation to extract the features ([Chung, Fig 42; 0258] “A 3D input volume 4214 of dimensions L×L×D is convolved with H weight kernels (e.g., weight kernel 4216) of dimension L×L×D and stride S. Each weight kernel (e.g., weight kernel 4216) is shifted in a sliding-window-like fashion (with a shift offset defined by stride S) across the input volume (e.g., volume 4204). During each shift, each weight in to the 3D weight kernel is multiplied and added with corresponding pair-wise input elements from the overlapping region of input volume 4214”, discloses sub-sampling operation with window-like shifting, and [Chung, 0273-0274] “Functional units in the same column perform N parallel convolution operations on slices Slice.sub.0, Slice.sub.1, . . . , Slice.sub.N-1 of input data for a single plane of the output volume … Functional units in the same row perform M parallel convolution operations on a single slice (Slice.sub.0, Slice.sub.1, . . . , Slice.sub.N-1) of input data for all M planes of the output volume”, discloses the parallel processing of convolution operation).
Chung does not specifically teach wherein the performing of the parallel convolution operations comprises outputting intermediate result values for output feature maps to an external memory at a same time while receiving input data from the external memory, the outputting the intermediate result values including writing and reading the intermediate result values for one point at a same position of the output feature maps.
Huang teaches wherein the performing of the parallel convolution operations comprises outputting intermediate result values for output feature maps to an external memory at a same time while receiving input data from the external memory ([Huang, page 3, line 43-52] “In order to improve the calculating speed, fully exert the convolver of parallel calculation efficiency, when the convolution calculation, when the flow establishing to the Y register Convolver written data and read from the convolver result are carried out at the same time, so as to make the write data and read data collision … the DSP through the data bus transmits the data written in the reference picture memory. When performing a convolution calculation, isolator closed off DSP data bus access of the memory data bus, data placed in the convolver by the memory is isolated from the data read out by the convolver, so as not to conflict, can be put in to the Y data simultaneously with the read calculation result, fully plays the convolution circuit and parallel calculation, improves the calculation speed”, the intermediate result is stored in the intermediate result register group as shown in claim 1 of Huang).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Huang and Chung, to apply the parallel processing of input and output of Huang to the method of an application processor of Chung. The suggestion and/or motivation for doing so is to process data more efficiently, because sending out the output data while input data moves into the processor (i.e. parallel processing) will reduce the total processing time.
Chen teaches the outputting the intermediate result values including writing and reading the intermediate result values for one point at a same position of the output feature maps ([Chen, page 129, right column, 2nd paragraph] “To minimize the movement of ifmaps and filters, the goal is to maximize three forms of data reuse. 1) Convolutional Reuse: Each filter weight is reused E×F times in the same ifmap plane, and each ifmap pixel is usually reused R × S times in the same filter plane. 2) Filter Reuse: Each filter weight is reused across the batch of N ifmaps. 3) Ifmap Reuse: Each ifmap pixel is reused across M filters (to generate M ofmap channels). To minimize the movement of psums, it is desirable that the psum accumulation across C × R × S values into one ofmap value can be done as soon as possible to save both the storage space and memory R/W energy. However, maximum input data reuse cannot be achieved simultaneously with immediate psum reduction, since the psums generated by multiply and accumulations (MACs) using the same filter or ifmap value are not reducible. Thus, the RS dataflow uses a systematic approach to optimize for all data types simultaneously as follows”, Chen discloses the process of reusing the stored partial sums, [Chen, page 130, Fig 4] shows a process of convolutional reuse. (a) shows the rows of filter weight values from same rows are reused in several different processing units).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Chung, Huang, and Chen, to apply the technique of Chen, in which outputting the intermediate result values includes writing and reading the intermediate result values for one point at a same position of the output feature maps, to the method of an application processor of Chung and Huang. The suggestion and/or motivation for doing so is to improve the efficiency of the system, as processing intermediate results at the same position saves storage capacity.
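For illustration only, the reading and writing of intermediate result values for one point at the same position of an output feature map can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Chung, Huang, Chen, or the application; the names `conv2d_valid` and `accumulate_output` and the example data are hypothetical, and the sketch models a single output feature map with one kernel per input feature map.

```python
def conv2d_valid(ifmap, kernel):
    """Sliding-window multiply-and-add of one input map with one kernel
    (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(ifmap) - kh + 1
    ow = len(ifmap[0]) - kw + 1
    return [[sum(ifmap[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(ow)]
            for y in range(oh)]


def accumulate_output(ifmaps, kernels):
    """Process one input feature map at a time, using only its own kernel,
    and accumulate partial sums at the same output position."""
    psum = None
    for ifmap, kernel in zip(ifmaps, kernels):
        partial = conv2d_valid(ifmap, kernel)
        if psum is None:
            psum = [[0] * len(partial[0]) for _ in partial]
        for y in range(len(partial)):
            for x in range(len(partial[0])):
                current = psum[y][x]                  # read intermediate value
                psum[y][x] = current + partial[y][x]  # write it back, same position
    return psum


# Hypothetical data: two 2x2 input feature maps, each with a 1x1 kernel.
ifmaps = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
kernels = [[[1]], [[2]]]
result = accumulate_output(ifmaps, kernels)
```

In this sketch only one buffer the size of the output feature map is kept, because each intermediate value is overwritten in place rather than stored separately for every input feature map.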

Response to Arguments
Applicant’s arguments filed 3/10/2022 have been fully considered but they are not persuasive.
Applicant’s arguments with respect to the 35 U.S.C. 103 rejections of claim(s) 1 and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record. A new reference, Chen (Chen et al, Nov 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"), is used in the new ground of rejection.
Applicant’s arguments with respect to the 35 U.S.C. 103 rejection of claim 10 have been considered but are not persuasive. Applicant argues that the cited reference Yu fails to disclose or suggest ‘a partial top buffer configured to store intermediate result values written and read for one point at a same position of the output feature maps’.
The examiner disagrees. Yu does teach a partial top buffer configured to store intermediate result values ([Yu, Figure 4; claim 6; 0121] “The Output Buffer saves the results generated from convolvers and offers intermediate results to the convolvers at proper time”; the ‘Data out’, which is a part of the Buffer in Figure 4 of Yu, discloses both a result reception unit and a partial top buffer). The buffer stores the intermediate results from the convolver. However, Yu does not specifically teach storing intermediate result values written and read for one point at a same position of the output feature maps.
Chen teaches storing intermediate result values written and read for one point at a same position of the output feature maps ([Chen, page 129, right column, 2nd paragraph] “To minimize the movement of ifmaps and filters, the goal is to maximize three forms of data reuse. 1) Convolutional Reuse: Each filter weight is reused E×F times in the same ifmap plane, and each ifmap pixel is usually reused R × S times in the same filter plane ... To minimize the movement of psums, it is desirable that the psum accumulation across C × R × S values into one ofmap value can be done as soon as possible to save both the storage space and memory R/W energy ...”; [Chen, page 130, Fig 4] shows a process of convolutional reuse, in which (a) shows that the rows of filter weight values from the same rows are reused in several different processing units).
Yu and Chen are combinable, as both disclose an accelerator for an artificial neural network and both have buffers and processing elements. Therefore, claim 10 is rejected as unpatentable over Yu in view of Chen.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chen et al, Jan 2016, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ABDULLAH KAWSAR can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/JUN KWON/
Examiner, Art Unit 2127

/ABDULLAH AL KAWSAR/
Supervisory Patent Examiner, Art Unit 2127