Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This action is in response to the amendment filed on 09/23/2022. Claims 1, 17, 19 and 28 have been amended. The amendment has been entered. Claims 1-28 are pending in the application.

Response to Arguments
	Applicant argues on pages 11-12 of the Remarks filed on 09/23/2022 that Huynh fails to disclose at least "regroup the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items," as recited in claim 1. The Office cites paragraphs [0059]-[0061] of Huynh for disclosing the above subject matter. Office Action, pp. 5-6. Huynh in the cited portions provides a general description of how a convolutional operation is performed in a convolutional layer of a convolutional neural network. Huynh, [0059]-[0062]. But Huynh fails to disclose how the data (e.g., input data, filter data, etc.) is fetched into a memory and regrouped before being broadcast to the processing array. While the Office alleges that Huynh's classification of the input data into three channels, such as the red, green, and blue color channels, corresponds to the claimed regrouping, Huynh's input data in the cited portions is not the fetched data in the memory. …
	The examiner disagrees with applicant’s arguments for the following reasons. In light of paragraphs [0022], [0025], [0055] and [0064] of applicant’s specification, the examiner interprets “first memory” as “on-chip memory” or “on-chip buffer”. Furthermore, the claimed invention is directed to a convolutional neural network (CNN) operation.
Therefore, the examiner cited paragraphs [0066]-[0067] of Huynh as teaching “fetch one or more batches of data into the first memory;” and “broadcast the multiple work items to the processing array, wherein the first work item is transferred to two or more processing strings of the processing array.” See page 4 of the Non-Final Office Action mailed 06/28/2022 and the rejection below for details.
Huynh: [0066] “Operation of a neural network (e.g., conducting inference), as illustrated by the models discussed above, generally involves fetching input data or input activations, executing multiply-and-accumulate operations in parallel for each node in a layer, and providing output activations. Optimum performance of a neural network, measured by response time, can be achieved when a hardware architecture is capable of highly parallelized computations. Special-purpose or domain-specific neural network processors can achieve better performance than both CPUs and GPUs when executing a neural network. Neural network processors can employ a spatial architecture including a processing element (PE) array, in which the processing elements may form processing chains and can pass data directly from one processing element to another. This can significantly reduce the number of memory transactions. In some examples, the weights or inputs can be pre-loaded into the processing element array. In some examples, neural network processors can also include an on-chip buffer that can store values read from processor memory, and that can distribute values to multiple computing engines in the processor. The computing engines can further include a small, local register file (e.g., a small memory) for storing intermediate results. Having an on-chip memory hierarchy can improve the efficiency of the operation of a neural network by reducing memory latencies.”
Moreover, neither the claims nor the specification provides a specific definition of, or additional detail for, “regroup”. Therefore, it is appropriate for the examiner to rely on paragraphs [0059]-[0061] of Huynh (“N batches of C input feature maps of dimensions”; “Sliding window 524 may be shifted on all C input feature maps 522 in 3-D input 520-1 based on the strides D in the two dimensions to generate another output pixel 532 at a different location on output feature map 530-1-1 in 3-D output 530-1.”) as teaching “regroup the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items;”. Furthermore, one of ordinary skill in the art would understand that the operation of a neural network described in [0066]-[0067] could be used for the CNN method disclosed in [0059]-[0061].
See pages 5-6 of the Non-Final Office Action mailed 06/28/2022 for details.
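For illustration only, the following minimal Python sketch shows why windows produced by a sliding window overlap when the stride is smaller than the filter dimension. It is not taken from Huynh or from the application; the function name sliding_windows and the 4×4 input, 3×3 window, and stride of 1 are hypothetical values chosen for the example. Whether a window corresponds to a claimed "work item" is the examiner's interpretation and is not established by the code.

```python
def sliding_windows(height, width, r, s, stride):
    """Return, for each R x S window position, the set of (row, col) pixels it covers.

    Hypothetical helper for illustration; names and sizes are assumptions.
    """
    windows = []
    for top in range(0, height - r + 1, stride):
        for left in range(0, width - s + 1, stride):
            covered = {(top + i, left + j) for i in range(r) for j in range(s)}
            windows.append(covered)
    return windows


# Assumed example: 4 x 4 feature map, 3 x 3 window, stride 1 in both dimensions.
windows = sliding_windows(height=4, width=4, r=3, s=3, stride=1)
first = windows[0]

# Number of pixels the first window shares with each of the other windows.
overlaps = [len(first & other) for other in windows[1:]]
print(len(windows), overlaps)
```

In this assumed configuration every other window shares some, but not all, of its nine pixels with the first window, which is the sense of "partially overlaps" relied upon above.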
	In addition, applicant did not provide evidence showing the differences between the prior art and the claimed invention.
	In summary, applicant’s arguments and amendment are not persuasive. Therefore, the 35 U.S.C. 102 rejection is maintained. Additional explanation has been added to clarify the examiner’s interpretation.

Specification
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant's cooperation is requested in correcting any errors of which applicant may become aware in the specification.

Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. The entire reference is considered to provide disclosure relating to the claimed invention. The claims & only the claims form the metes & bounds of the invention. Office personnel are to give the claims their broadest reasonable interpretation in light of the supporting disclosure. Unclaimed limitations appearing in the specification are not read into the claim. Prior art was referenced using terminology familiar to one of ordinary skill in the art. Such an approach is broad in concept and can be either explicit or implicit in meaning. Examiner's Notes are provided with the cited references to assist the applicant to better understand how the examiner interprets the applied prior art. Such comments are entirely consistent with the intent & spirit of compact prosecution.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-6, 9-15, 18-24, and 27-28 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Huynh et al. (US 2021/0158132 A1), hereinafter Huynh.

Claim 1. A device for executing a convolutional neural network operation, comprising: Huynh discloses a first memory; a processing array comprising a plurality of processing strings; and a controller configured to: fetch one or more batches of data into the first memory; 
Huynh: [0066] “Operation of a neural network (e.g., conducting inference), as illustrated by the models discussed above, generally involves fetching input data or input activations, executing multiply-and-accumulate operations in parallel for each node in a layer, and providing output activations. Optimum performance of a neural network, measured by response time, can be achieved when a hardware architecture is capable of highly parallelized computations. Special-purpose or domain-specific neural network processors can achieve better performance than both CPUs and GPUs when executing a neural network. Neural network processors can employ a spatial architecture including a processing element (PE) array, in which the processing elements may form processing chains and can pass data directly from one processing element to another. This can significantly reduce the number of memory transactions. …”
Huynh: [0067] “FIG. 7 is a block diagram illustrating an example of an integrated circuit device for performing neural network operations, such as tensor operations, according to certain embodiments. The example shown in FIG. 7 includes an accelerator 702. In various examples, accelerator 702 can execute computations for a set of input data (e.g., input data 750) using a processing element array 710, an activation engine 716, and/or a pooling engine 718. In some examples, accelerator 702 may be an integrated circuit component of a processor, such as a neural network processor. The processor may have other integrated circuit components, including additional accelerator engines.”
In light of paragraphs [0022], [0025], [0055] and [0064] of applicant’s specification, the examiner interprets “first memory” as “on-chip memory” or “on-chip buffer”.
Huynh discloses regroup the fetched one or more batches of data into multiple work items, wherein a first work item partially overlaps one or more work items among the multiple work items;
Huynh: [0059-0061] “FIG. 5 illustrates an example of a model 500 for a convolution layer of a convolutional neural network used in, for example, image processing. As illustrated in the example, there may be multiple (e.g., N) 3-D inputs 520-1, . . . , and 520-N to the convolution layer. Each 3-D input may include C channels of 2-D input feature maps (with dimensions H×W). For the first convolution layer in a CNN, such as a ResNet-50, a 3-D input may include, for example, three channels of 2-D images, such as the red, green, and blue color channels. [correspond to regroup the one or more batches of data into multiple work items] Multiple (e.g., M) 3-D filters 510-1, . . . , and 510-M, each having C 2-D filters of dimensions R×S, may be convolved with the N 3-D inputs 520-1, . . . , and 520-N (e.g., N batches of C input feature maps of dimensions H×W) to generate multiple (e.g., N) 3-D outputs 530-1, . . . , and 530-N, where each of the 3-D outputs 530-1, . . . , and 530-N may include M output feature maps (also referred to as output channels). Each 3-D filter 510-1, . . . , or 510-M (with dimensions C×R×S) may be applied to a 3-D input 520-1, . . . , or 520-N (with dimensions C×H×W) to generate an output feature map (with dimensions E×F as described above with respect to FIGS. 3A and 3B) in a 3-D output 530-1, . . . , or 530-N that includes M output feature maps, and thus M 3-D filters may be used to generate the M output feature maps in a 3-D output 530-1, . . . , or 530-N for a 3-D input 520-1, . . . , or 520-N. …. 
In one example, for 3-D filter 510-1 and 3-D input 520-1, each 2-D filter 512 in the C 2-D filters in 3-D filter 510-1 may correspond to a respective input feature map 522 in 3-D input 520-1 and may be used to convolve with (e.g., filter) the corresponding input feature map 522, where each pixel in a sliding window 524 in input feature map 522 may be multiplied with a corresponding pixel in 2-D filter 512 to generate a product, and the products for all pixels in sliding window 524 may be summed to generate a partial sum. The partial sums for the C 2-D filters 512 (and corresponding input feature map 522) may be added together to generate an output pixel 532 at a location (e, f) on output feature map 530-1-1 in 3-D output 530-1. Sliding window 524 may be shifted on all C input feature maps 522 in 3-D input 520-1 based on the strides D in the two dimensions to generate another output pixel 532 at a different location on output feature map 530-1-1 in 3-D output 530-1. [correspond to wherein a first work item partially overlaps one or more work items among the multiple work items] Sliding window 524 may be repeatedly shifted together on all C input feature maps 522 until all output pixels 532 on output feature map 530-1-1 in 3-D output 530-1 are generated.”
Huynh: [0066] “Operation of a neural network (e.g., conducting inference), as illustrated by the models discussed above, generally involves fetching input data or input activations, executing multiply-and-accumulate operations in parallel for each node in a layer, and providing output activations. Optimum performance of a neural network, measured by response time, can be achieved when a hardware architecture is capable of highly parallelized computations. Special-purpose or domain-specific neural network processors can achieve better performance than both CPUs and GPUs when executing a neural network. Neural network processors can employ a spatial architecture including a processing element (PE) array, in which the processing elements may form processing chains and can pass data directly from one processing element to another. This can significantly reduce the number of memory transactions. In some examples, the weights or inputs can be pre-loaded into the processing element array. In some examples, neural network processors can also include an on-chip buffer that can store values read from processor memory, and that can distribute values to multiple computing engines in the processor. The computing engines can further include a small, local register file (e.g., a small memory) for storing intermediate results. Having an on-chip memory hierarchy can improve the efficiency of the operation of a neural network by reducing memory latencies.”
Huynh discloses broadcast the multiple work items to the processing array, wherein the first work item is transferred to two or more processing strings of the processing array.  
Huynh: [0090] “Sixteen (16) values representing the second elements (e.g., r=0, s=1) of the 16 2-D filters in the four 3-D filter may then be loaded into PE array 810. The elements in the one-dimensional vector for each input feature map may be shifted into PE array 810 and may be multiplied with the pre-loaded weights in PE array 810. The products in each column may be accumulated to generate a second partial sum vector PSUM.sub.0,1 (832) that includes four partial sum sub-vectors for the four output feature maps. Each element in the 16 2-D filters may be loaded into PE array 810 and multiplied with the elements in the one-dimensional vector to generate a partial sum vector that includes four partial sum sub-values for the four output feature maps until a partial sum vector PSUM.sub.R-1,S-1 (834) that corresponds to the element (R-1, S-1) in each 2-D filter and includes four partial sum sub-vectors for the four output feature maps is generated. The partial sum sub-vectors in partial sum vectors PSUM.sub.0,0 (830), PSUM.sub.0,1 (832), . . . , and PSUM.sub.R-1,S-1 (834) and corresponding to each respective output feature map may be accumulated to generate a respective vector 840, 842, 844, or 846 that may correspond to a flattened output feature map.”
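For illustration only, the following minimal Python sketch models the kind of weight-stationary computation described in the passage quoted above: one filter element (r, s) at a time is pre-loaded into a grid of processing elements, the same input element for a given input channel is presented to every column, the products in each column are accumulated into a partial sum, and the partial sums over all (r, s) are accumulated into the output feature maps. This is a simplified software model under stated assumptions, not Huynh's hardware or the applicant's device; the function names, the nested-list layout (C×H×W inputs, M×C×R×S filters), and the small example sizes are hypothetical. Treating a column as a "processing string" reflects the examiner's mapping, not a fact shown by the code.

```python
def weight_stationary_conv(inputs, filters, stride=1):
    """Convolve C x H x W inputs with M x C x R x S filters, one filter element at a time."""
    num_c, height, width = len(inputs), len(inputs[0]), len(inputs[0][0])
    num_m, filt_r, filt_s = len(filters), len(filters[0][0]), len(filters[0][0][0])
    out_e = (height - filt_r) // stride + 1
    out_f = (width - filt_s) // stride + 1
    outputs = [[[0] * out_f for _ in range(out_e)] for _ in range(num_m)]
    for r in range(filt_r):
        for s in range(filt_s):
            # Pre-load one weight per processing element: row = input channel, column = filter.
            pe_weights = [[filters[m][c][r][s] for m in range(num_m)] for c in range(num_c)]
            for e in range(out_e):
                for f in range(out_f):
                    # The same input element of channel c is presented to every column m.
                    row_in = [inputs[c][e * stride + r][f * stride + s] for c in range(num_c)]
                    for m in range(num_m):
                        # Products in a column are accumulated into a partial sum.
                        partial_sum = sum(row_in[c] * pe_weights[c][m] for c in range(num_c))
                        outputs[m][e][f] += partial_sum
    return outputs


def direct_conv(inputs, filters, stride=1):
    """Reference convolution computed output pixel by output pixel."""
    num_c, height, width = len(inputs), len(inputs[0]), len(inputs[0][0])
    num_m, filt_r, filt_s = len(filters), len(filters[0][0]), len(filters[0][0][0])
    out_e = (height - filt_r) // stride + 1
    out_f = (width - filt_s) // stride + 1
    return [[[sum(inputs[c][e * stride + i][f * stride + j] * filters[m][c][i][j]
                  for c in range(num_c) for i in range(filt_r) for j in range(filt_s))
              for f in range(out_f)] for e in range(out_e)] for m in range(num_m)]


# Assumed example: C = 2 channels of 3 x 3 inputs, M = 2 filters of size 2 x 2.
example_inputs = [[[c + h + w for w in range(3)] for h in range(3)] for c in range(2)]
example_filters = [[[[m + c + i - j for j in range(2)] for i in range(2)]
                    for c in range(2)] for m in range(2)]
print(weight_stationary_conv(example_inputs, example_filters)
      == direct_conv(example_inputs, example_filters))
```

The two functions produce the same outputs; they differ only in the order in which the multiply-and-accumulate operations are performed.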

Regarding Claims 10 and 19, the same ground of rejection is made as discussed above for substantially similar rationale. 

Regarding Claim 28, the same ground of rejection is made as discussed above for substantially similar rationale. 
In addition, Claim 28 recites “A terminal, comprising: a host unit; and a device for executing a convolutional neural network operation communicatively coupled to the host unit, the device comprising: a first memory; a processing array comprising a plurality of processing strings; and a controller configured to:”.

Huynh discloses “A terminal, comprising: a host unit; and a device for executing a convolutional neural network operation communicatively coupled to the host unit, the device comprising: a first memory; a processing array comprising a plurality of processing strings; and a controller configured to:” [0149] “FIG. 19 includes a block diagram illustrating an example of a host system 1900 on which a compiler 1930, such as is described herein, can run. The illustrated host system 1900 is an example of a computing device, and includes a processor 1902, a processor memory 1904, at least one storage device 1906, various Input/Output (I/O) devices 1908, and at least one network interface 1910. In the example of FIG. 19, the host system 1900 also includes an acceleration engine 1912, which is an integrated circuit device that can accelerate certain operations or computations performed by the host system 1900. In various examples, the host system 1900 can be implemented as a server in a data center, a desktop computer, a laptop computer, a tablet computer, or a smartphone, among other examples. In some examples, operations or components discussed below as performed or included in the host system 1900 can be performed or included in other computer devices. For example, the compiler 1930 can execute on the host system 1900 while the acceleration engine 1912 is located at a different host system.”

Claims 2, 11 and 20
Huynh discloses wherein the plurality of processing strings are classified into a plurality of subsets and the first work item is transferred to a first processing string in each of the plurality of subsets.  
Huynh: [0088] “FIG. 8 illustrates a simplified example of a weight-stationary convolution operation using an example of a computing engine including a processing element array 810 according to certain embodiments. Processing element array 810 may include a large number of processing elements arranged in, for example, a 64×64 array, a 64×128 array, a 128×128 array, a 256×256 array, or the like. In the example illustrated in FIG. 8, processing element array 810 includes four rows and four columns of processing elements 812. [correspond to the plurality of processing strings are classified into a plurality of subsets, the first work item is transferred to a first processing string in each of the plurality of subsets] Inputs 820 to processing element array 810 may include four (corresponding to C) input channels 822, 824, 826, and 828. Each input channel may correspond to one input feature map or one input feature map in each of N (N=1 in the example) of inputs as described above. Each input feature map in the example may include an 8×8 matrix and may be flattened into a one-dimensional vector with 64 elements. PE array 810 may generate four (corresponding to M) output feature maps, one from each column of PE array 810.”

Claim 3. The device of claim 2, further comprising Huynh discloses a second memory storing a plurality of filters of which number corresponds to a number of the subsets.  
Huynh: [0092] “According to certain embodiments, a convolution operation in a neural network layer that has a small number of input channels may be performed by multiple weight-stationary convolution operations, where an input feature map or a portion of the input feature map may be sequentially input into multiple rows, and multiple filter elements of a same filter may be loaded into the multiple rows of the processing element array at a same time to apply to the same input channel map or the same portion of the input feature map, thus improving the utilization of the processing element array. [correspond to a second memory storing a plurality of filters of which number corresponds to a number of the subsets] To avoid having more than one copy of the input feature map in the memory (e.g., memory subsystem 704) and/or to reduce the data transfer bandwidth used to move the input feature map from the memory (into the processing element array (e.g., processing element array 710), the multiple rows of the processing element array may share the input data. For example, the input data may be read from the memory once and replicated by input selector circuit 730 to input into the multiple rows of processing element array 710 to improve the utilization of processing element array 710 while reducing the memory bandwidth usage for data transfer.”
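For illustration only, the following minimal Python sketch models the idea in the passage quoted above that input data may be read from memory once and replicated into multiple rows, with each row holding a different element of the same filter. It uses a one-dimensional convolution for brevity. The class CountingMemory, the function conv1d_with_replication, and in particular the per-row offset applied during replication are assumptions made for the example; they do not describe how input selector circuit 730 is actually implemented.

```python
class CountingMemory:
    """Hypothetical memory model that counts how many times its contents are read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)


def conv1d_with_replication(memory, filter_elems):
    """1-D convolution in which every row reuses a single read of the input."""
    values = memory.read()  # the input is read from memory exactly once
    num_rows = len(filter_elems)
    num_out = len(values) - num_rows + 1
    # Replicate the input into one stream per row; row i is offset by i (an assumption).
    rows = [values[i:i + num_out] for i in range(num_rows)]
    # Each row applies its own stationary filter element; a column sums over the rows.
    return [sum(filter_elems[i] * rows[i][j] for i in range(num_rows))
            for j in range(num_out)]


memory = CountingMemory([1, 2, 3, 4, 5])
result = conv1d_with_replication(memory, filter_elems=[1, 0, -1])
print(result, memory.reads)
```

In this sketch three rows are kept busy by a single input channel while the memory is read only once, which is the utilization and bandwidth point made in the quoted paragraph.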

Claim 4. The device of claim 1, Huynh discloses wherein each of the processing strings includes a multiplier and an accumulator.  
Huynh: [0066] “Operation of a neural network (e.g., conducting inference), as illustrated by the models discussed above, generally involves fetching input data or input activations, executing multiply-and-accumulate operations in parallel for each node in a layer, and providing output activations. [correspond to multiplier and an accumulator] Optimum performance of a neural network, measured by response time, can be achieved when a hardware architecture is capable of highly parallelized computations. Special-purpose or domain-specific neural network processors can achieve better performance than both CPUs and GPUs when executing a neural network. Neural network processors can employ a spatial architecture including a processing element (PE) array, in which the processing elements may form processing chains and can pass data directly from one processing element to another. This can significantly reduce the number of memory transactions.”

Claim 5. The device of claim 3, Huynh discloses wherein each of the processing strings includes a multiplier and an accumulator, 
Huynh: [0066] “Operation of a neural network (e.g., conducting inference), as illustrated by the models discussed above, generally involves fetching input data or input activations, executing multiply-and-accumulate operations in parallel for each node in a layer, and providing output activations. [correspond to multiplier and an accumulator] Optimum performance of a neural network, measured by response time, can be achieved when a hardware architecture is capable of highly parallelized computations. Special-purpose or domain-specific neural network processors can achieve better performance than both CPUs and GPUs when executing a neural network. Neural network processors can employ a spatial architecture including a processing element (PE) array, in which the processing elements may form processing chains and can pass data directly from one processing element to another. This can significantly reduce the number of memory transactions.”
Huynh discloses wherein the processing array includes an element-wise operation processor in each of the plurality of subsets.
Huynh: [0067] “FIG. 7 is a block diagram illustrating an example of an integrated circuit device for performing neural network operations, such as tensor operations, according to certain embodiments. The example shown in FIG. 7 includes an accelerator 702. In various examples, accelerator 702 can execute computations for a set of input data (e.g., input data 750) using a processing element array 710, an activation engine 716, and/or a pooling engine 718. In some examples, accelerator 702 may be an integrated circuit component of a processor, such as a neural network processor. The processor may have other integrated circuit components, including additional accelerator engines.”

Claims 6, 15 and 24
Huynh discloses traverse the one or more batches of data in the first memory to determine a size of the one or more batches of data covers a predetermined data size corresponding to a size of each of the multiple work items.  
Huynh: [0139-0140] “Accelerator 1710 may perform instructions generated by a compiler using a neural network model, such as a ResNet-50 model. The neural network model may be represented by a data flow graph where each node (e.g., vertex) in the graph may represent an operation, and connections (e.g., edges) between the nodes may represent the data flow or data dependency. The compiler may perform shape inference on the neural network model, for example, to determine the sizes of the data used for each operation. The compiler may then traverse the data flow graph to identify operations that may not efficiently utilize the computing engines (e.g., accelerators, or more specifically, processing element arrays) of the hardware system for implementing the neural network. [correspond to traverse the one or more batches of data in the first memory to determine a size of the one or more batches of data] For example, the compiler may identify operations that use a small number of input channels, such as operations that each use no more than, for example, a half of the total number of rows in the PE array when applying one weight to each input channel. For the identified operations, the compiler may add, to the neural network model, operations for padding the input feature map for each input channel as described above with respect to, for example, FIG. 13, based on parameters of a convolution operation, such as the size of an original input feature map, the size of a filter (e.g., kernel), the stride used for the convolution, the memory alignment, and the size of the processing element array. 
[correspond to a predetermined data size corresponding to a size of each of the multiple work items] Optionally, the compiler may add to the neural network model operations for dividing the padded input feature map into multiple partitions and dividing the convolution operation into multiple sub-operations, where the sub-operations may use different partitions of the multiple partitions. In some embodiments, the compiler may add operations for discarding certain padded data or results generated using certain padded data.” See Fig. 5 for example.

Claim 9. The device of claim 1, Huynh discloses wherein each of the multiple work items has a first data size, the one or more batches of data has a plurality of channels, and each channel has a second data size covering the first data size.  
Huynh: [0059-0061] “FIG. 5 illustrates an example of a model 500 for a convolution layer of a convolutional neural network used in, for example, image processing. As illustrated in the example, there may be multiple (e.g., N) 3-D inputs 520-1, . . . , and 520-N to the convolution layer. Each 3-D input may include C channels of 2-D input feature maps (with dimensions H×W). For the first convolution layer in a CNN, such as a ResNet-50, a 3-D input may include, for example, three channels of 2-D images, such as the red, green, and blue color channels. [correspond to each of the multiple work items has a first data size, the one or more batches of data has a plurality of channel] Multiple (e.g., M) 3-D filters 510-1, . . . , and 510-M, each having C 2-D filters of dimensions R×S, may be convolved with the N 3-D inputs 520-1, . . . , and 520-N (e.g., N batches of C input feature maps of dimensions H×W) to generate multiple (e.g., N) 3-D outputs 530-1, . . . , and 530-N, where each of the 3-D outputs 530-1, . . . , and 530-N may include M output feature maps (also referred to as output channels). [correspond to each channel has a second data size covering the first data size] Each 3-D filter 510-1, . . . , or 510-M (with dimensions C×R×S) may be applied to a 3-D input 520-1, . . . , or 520-N (with dimensions C×H×W) to generate an output feature map (with dimensions E×F as described above with respect to FIGS. 3A and 3B) in a 3-D output 530-1, . . . , or 530-N that includes M output feature maps, and thus M 3-D filters may be used to generate the M output feature maps in a 3-D output 530-1, . . . , or 530-N for a 3-D input 520-1, . . . , or 520-N. ….”

Claims 12 and 21
Huynh discloses transferring a plurality of filters to the processing array, wherein a number of the plurality of filters corresponds to a number of the plurality of subsets and each of the plurality of filter is transferred to a corresponding subset among the plurality of subsets.  
Huynh: [0092] “According to certain embodiments, a convolution operation in a neural network layer that has a small number of input channels may be performed by multiple weight-stationary convolution operations, where an input feature map or a portion of the input feature map may be sequentially input into multiple rows, and multiple filter elements of a same filter may be loaded into the multiple rows of the processing element array at a same time to apply to the same input channel map or the same portion of the input feature map, thus improving the utilization of the processing element array. [correspond to transferring a plurality of filters to the processing array, wherein a number of the plurality of filters corresponds to a number of the plurality of subsets and each of the plurality of filter is transferred to a corresponding subset among the plurality of subsets] To avoid having more than one copy of the input feature map in the memory (e.g., memory subsystem 704) and/or to reduce the data transfer bandwidth used to move the input feature map from the memory (into the processing element array (e.g., processing element array 710), the multiple rows of the processing element array may share the input data. For example, the input data may be read from the memory once and replicated by input selector circuit 730 to input into the multiple rows of processing element array 710 to improve the utilization of processing element array 710 while reducing the memory bandwidth usage for data transfer.”

Claims 13 and 22
Huynh discloses performing a multiplication operation on the first work item in the two or more processing strings in parallel.  
Huynh: [0090] “Sixteen (16) values representing the second elements (e.g., r=0, s=1) of the 16 2-D filters in the four 3-D filter may then be loaded into PE array 810. The elements in the one-dimensional vector for each input feature map may be shifted into PE array 810 and may be multiplied with the pre-loaded weights in PE array 810. The products in each column may be accumulated to generate a second partial sum vector PSUM.sub.0,1 (832) that includes four partial sum sub-vectors for the four output feature maps. Each element in the 16 2-D filters may be loaded into PE array 810 and multiplied with the elements in the one-dimensional vector to generate a partial sum vector that includes four partial sum sub-values for the four output feature maps until a partial sum vector PSUM.sub.R-1,S-1 (834) that corresponds to the element (R-1, S-1) in each 2-D filter and includes four partial sum sub-vectors for the four output feature maps is generated. The partial sum sub-vectors in partial sum vectors PSUM.sub.0,0 (830), PSUM.sub.0,1 (832), . . . , and PSUM.sub.R-1,S-1 (834) and corresponding to each respective output feature map may be accumulated to generate a respective vector 840, 842, 844, or 846 that may correspond to a flattened output feature map.”
Huynh: [0091] “As shown in FIGS. 5, 6, and 8 and Equation 3, a processing element array may perform parallel computation using different columns, which may correspond to different filters or different sets of filters. The processing element array may also perform fused multiply and add operations for data reduction in the dimensions of, for example, input channels, filter height, and filter width, using the columns and rows.”

Claims 14 and 23
Huynh discloses performing an addition operation on multiplication results in the two or more processing strings in parallel.  
Huynh: [0090] “Sixteen (16) values representing the second elements (e.g., r=0, s=1) of the 16 2-D filters in the four 3-D filter may then be loaded into PE array 810. The elements in the one-dimensional vector for each input feature map may be shifted into PE array 810 and may be multiplied with the pre-loaded weights in PE array 810. The products in each column may be accumulated to generate a second partial sum vector PSUM.sub.0,1 (832) that includes four partial sum sub-vectors for the four output feature maps. Each element in the 16 2-D filters may be loaded into PE array 810 and multiplied with the elements in the one-dimensional vector to generate a partial sum vector that includes four partial sum sub-values for the four output feature maps until a partial sum vector PSUM.sub.R-1,S-1 (834) that corresponds to the element (R-1, S-1) in each 2-D filter and includes four partial sum sub-vectors for the four output feature maps is generated. The partial sum sub-vectors in partial sum vectors PSUM.sub.0,0 (830), PSUM.sub.0,1 (832), . . . , and PSUM.sub.R-1,S-1 (834) and corresponding to each respective output feature map may be accumulated to generate a respective vector 840, 842, 844, or 846 that may correspond to a flattened output feature map.”
Huynh: [0091] “As shown in FIGS. 5, 6, and 8 and Equation 3, a processing element array may perform parallel computation using different columns, which may correspond to different filters or different sets of filters. The processing element array may also perform fused multiply and add operations for data reduction in the dimensions of, for example, input channels, filter height, and filter width, using the columns and rows.”
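For illustration only, the multiply-and-accumulate flow described in the cited paragraphs [0090]-[0091] can be sketched as follows. This is a minimal sketch and not Huynh's implementation: the function name conv_by_partial_sums, the nested-list data layout, and the assumptions of stride 1 and no padding are hypothetical. Each row of the modeled array is assumed to receive one input channel and each column to hold one filter, so that the products in a column are summed and one partial sum vector is produced per filter element (r, s).

```python
def conv_by_partial_sums(inputs, filters):
    """Convolution as a sum of per-filter-element partial sums.

    inputs:  nested list [C][H][W]    (C input feature maps)
    filters: nested list [M][C][R][S] (M 3-D filters)
    Returns M flattened output feature maps, each of length E*F.
    Assumes stride 1 and no padding (illustrative assumptions).
    """
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1
    out = [[0] * (E * F) for _ in range(M)]

    for r in range(R):
        for s in range(S):
            # "Pre-load" element (r, s) of every 2-D filter:
            # one weight per (row = input channel, column = filter).
            weights = [[filters[m][c][r][s] for m in range(M)]
                       for c in range(C)]
            for e in range(E):
                for f in range(F):
                    # One input element per channel enters the rows.
                    pixels = [inputs[c][e + r][f + s] for c in range(C)]
                    for m in range(M):
                        # Column m multiplies and accumulates over rows,
                        # giving one entry of the partial sum for (r, s).
                        psum = sum(pixels[c] * weights[c][m]
                                   for c in range(C))
                        # Partial sums for all (r, s) are accumulated
                        # into the flattened output feature map.
                        out[m][e * F + f] += psum
    return out
```

With a single 3x3 input map [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and a single 2x2 filter [[1, 0], [0, 1]], the sketch returns [[6, 8, 12, 14]]. In this model the columns are independent of one another, which corresponds to the parallel addition across processing strings relied upon above.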

Claims 18 and 27
Huynh discloses generating a plurality of outputs by the plurality of processing strings in parallel.  
Huynh: [0091] “As shown in FIGS. 5, 6, and 8 and Equation 3, a processing element array may perform parallel computation using different columns, which may correspond to different filters or different sets of filters. The processing element array may also perform fused multiply and add operations for data reduction in the dimensions of, for example, input channels, filter height, and filter width, using the columns and rows.”

Allowable Subject Matter
Claims 7-8, 16-17 and 25-26 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Huynh et al (US 2021/0158132 A1) teaches a method of compiling a neural network model to generate instructions for more efficiently utilizing a processing element (PE) array for a convolution operation that uses a small number of input channels. The compiler may generate instructions for loading multiple filter elements of a filter into multiple rows of the PE array, and replicating data in an input feature map for use by the multiple rows to apply the multiple filter elements on the input feature map at the same time. At the tensor level, the compiler may generate instructions for loading multiple filter elements of a filter into multiple rows of the PE array, replicating input data read from a memory for use by the multiple rows, and discarding results generated using certain padding data.
Huynh et al (US 2020/0410036 A1) teaches a method of improving the efficiency of a neural network processor in performing a dilated convolution operation by reducing memory operations. In contrast to a case where the overlapping input data elements are read from the memory, stored at different locations in the memory, and then read again from those locations to perform the dilated convolution, the neural network processor is provided with addresses of the overlapping input data elements and uses the addresses to selectively read the overlapping input data elements from the memory and fetch them to the systolic array. The neural network processor does not need to perform additional memory write operations to write the overlapping input data elements back to the memory. Moreover, as the summation buffer can store the output data elements at pre-determined locations in the memory to reconstruct the output data array, the output data elements need not be rearranged in the memory to construct the output data array.
Tu et al (NPL: Deep convolutional neural network architecture with reconfigurable computation pattern, 2017) teaches a method of designing a DCNN acceleration architecture called deep neural architecture (DNA), with reconfigurable computation patterns for different models. The computation pattern comprises a data reuse pattern and a convolution mapping method. For massive and different layer sizes, DNA reconfigures its data paths to support a hybrid data reuse pattern.
Peemen et al (NPL: Memory-centric accelerator design for convolutional neural networks, 2013) teaches a memory-centric accelerator to improve performance without increasing memory bandwidth. This accelerator uses specialized memories that support the data movement patterns and optimized scheduling for data locality. This combination allows the required buffer size to be minimized and data reuse to be maximized.
These references taken either alone or in combination with the prior art of record fail to disclose instructions, including:
Claims 7, 16 and 25
fetch an additional batch of data into the first memory when the size of the one or more batches of data is determined not to cover a predetermined data size corresponding to the size of each of the multiple work items.  

Claims 8, 17 and 26
wherein the controller is further configured to: deallocate a portion of the one or more batches of data when the portion of the one or more batches of data is determined not to be used in a predetermined time period.  

in combination with the remaining elements and features of the claimed invention. The dependent claims would be allowable for at least their dependence on the independent claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHUEN-MEEI GAN whose telephone number is (469)295-9127. The examiner can normally be reached Monday-Friday 9:00 am to 4:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rehana Perveen, can be reached at 571-272-3676. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/CHUEN-MEEI GAN/Primary Examiner, Art Unit 2148