DETAILED ACTION
This action is in response to communications filed on 11/12/2020 in which claims 1-8, 10-17 and 19-20 are amended; and claims 1-20 are still pending. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of prior-filed U.S. Application No. 16/168,778, filed on October 23, 2018, which is a continuation of International Application No. PCT/CN2017/099991, filed August 31, 2017, is acknowledged.

Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 10/24/2019, 03/09/2020, and 05/19/2020 are being considered by the examiner. Only the non-patent documents provided in English have been considered, as noted in the annotated IDS documents. For the foreign patent documents, only the English abstracts have been considered because the full documents are not available in English.
Drawings
The drawings were received on 06/09/2015.  These drawings are acceptable.




Response to Arguments
Applicant's arguments filed 11/12/2020 have been fully considered.

Regarding the objection to the specification, applicant has submitted the appropriate changes and the objection made in the previous action has been withdrawn.

Regarding the rejection of claims under 35 USC § 112(b), the applicant has removed the problematic terms and the rejection made in the previous office action has been withdrawn.

Regarding the rejection of claims under 35 USC § 102 and 35 USC § 103, the applicant’s arguments have been fully considered.
The applicant argues that the prior art made of record fails to disclose the distribution and broadcasting of data using the same input data block and different weight blocks as recited by the amended claim limitation. Applicant’s arguments with respect to claims rejected under 35 USC § 102 and 35 USC § 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. (NPL: “C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-level Parallelization”, hereinafter ‘Song’), in view of Lu et al. (US Pat. No. 10,073,816, hereinafter ‘Lu’), in further view of Lan et al. (NPL: “A Library for Deep Learning Processor”, hereinafter ‘Lan’).

Regarding independent claim 1 limitations, Song teaches a convolution operation method performed by a processing device:
comprising a main processing circuit and a plurality of basic processing circuits, the convolution operation method comprising: (Song teaches the processing device comprising a host as the main processing unit and a plurality of basic processing circuits depicted in Fig. 2, in Sec. 3 1st para: …Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. And in Fig. 2:
[Figure: Song, Fig. 2]
)
receiving, by the main processing circuit, an input data block and a weight data block, wherein: the input data block comprises input data arranged as a four-dimensional data block; and the weight data block comprising weight data arranged as another four­dimensional data block; (Song teaches receiving  by main unit to execute instructions for receiving input data and weight via the external memory to the computing accelerator device depicted in Fig. 2 and in Sec. 3 1st para:… Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input. And then, CU reads instructions one by one, loads data and weights to on-chip buffer, and computing. The accelerator performs forward propagation layer by layer and finally output the results to the external memory; where the input and weight data blocks of Din, kx, ky, s are arranged to be divided/portioned as data block, as four dimensional kernels associated with the data blocks, as depicted in Fig. 1 and 2 for processing convolution operations, in Sec. 3: 1st ¶: … An input cube is convolved with Dout groups of kernels (Din×k×k) at stride s. Each kernel is shifted in a sliding-window (with an offset s) across the multiple input maps. During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up...

[Figure: media_image2.png]

Where the input is arranged by mapping the using the kernels to divide the data for processing, in Sec. 4.2.1: … Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data. Since the length and height of the data are not dividable by ks, '0's are padded at the boundary. Fig. 5(b) is the mapping result from the small kernel windows (ks× ks, represented by dx,y) [the input data block comprises input data arranged as a four-dimensional data block] to the on-chip buffer in sequence. Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks [and the weight data block comprising weight data arranged as another four­dimensional data block]. Each small window of weight is represented by wi/9. Specific steps to do the partitioning are described in Algorithm 1…)
dividing, by the main processing circuit, the weight data block into a plurality of basic data blocks; (Song teaches dividing, by the main processing circuit instruction, the weight data block into a plurality of basic data blocks depicted as segments of the input weight into a plurality of basic data blocks for distribution to the neural Processing units (PE) as depicted in Fig. 2, in Sec. 3 1st para :… During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input…; and by mapping by the main processing circuit the partitioned block kernels to the PE for operation of size dimension ks by ks, of dimension g by g pieces, for processing by the PE for the arithmetic operations depicted in Fig. 5 in Sec. 4.2.1: … Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks… The mapping scheme is same to intra-kernel parallelism (section 4.1.2), but the basic unit is a small kernel window (ks×ks). When Tin is bigger than the size of small kernel window (ks×ks), we map multiple small windows to PE in one operation.)
distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits, … (Song teaches dividing, by instruction of the main processing circuit, the weight data into a plurality of basic data blocks, depicted as segments of the input weights, for distribution to the neural processing units (PEs), which are considered the basic processing circuits, as depicted in Fig. 2, in Sec. 3 1st para: … During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input… and distributing the partitioned kernel blocks, g by g pieces each of size ks by ks, to the PEs for performing the arithmetic operations depicted in Fig. 5, in Sec. 4.2.1: … Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks… The mapping scheme is same to intra-kernel parallelism (section 4.1.2), but the basic unit is a small kernel window (ks×ks). When Tin is bigger than the size of small kernel window (ks×ks), we map multiple small windows to PE in one operation.)
broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits, …(Song teaches dividing, by the main processing circuit instruction, the input data block into a plurality parts depicted as segments of the input data into a plurality of parts for broadcasting to the neural Processing units (PE), considered the basic processing circuits as depicted in Fig. 2, in Sec. 3 1st para :… During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input…)
performing, by each of the plurality of basic processing circuits, operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result; (Song teaches the basic processing circuits as the PEs performing convolution operations on the weight, as the distributed blocks, and data input, as the broadcasted blocks, using multiplier and adder operators depicted in Fig. 2, in Sec. 3 1st para: … In this paper, we primarily discuss convolution operation, which typically makes 90% of the computational workload of a CNN[12]. Fig. 1 illustrates the basic pattern of convolution. An input cube is convolved with Dout groups of kernels (Din×k×k) at stride s. Each kernel is shifted in a sliding-window (with an offset s) across the multiple input maps. During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up… And as depicted in Fig. 2 and Fig. 6:
[Figures: Song, Fig. 2 and Fig. 6]

For obtaining an operational result using the basic processing circuits to obtain a final result, in Sec. 4.2.1: … The data (Fig. 5a) multiplied to the first sub-piece of weights (Fig. 5c) is starting from d1,1 to d55,55, from left to right and then top to bottom. Second one is starting from d1,2 to d56,56. The same calculation method applied to the following 7 sub-kernels. Thus the last piece of weights is multiplied to the data from d3,3 to d57,57. Ultimately, there will be 9 output maps as shown in Fig. 5(d), with map size to be 55×55. So the final result is to add the 9 maps together, which is exactly the same with the original computing method with that of big kernels. The mapping scheme is same to intra-kernel parallelism (section 4.1.2), but the basic unit is a small kernel window (ks×ks). When Tin is bigger than the size of small kernel window (ks×ks), we map multiple small windows to PE in one operation…)
providing, by the plurality of basic processing circuits, the respective operation results to the main processing circuit; and, calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits. (Song teaches the PEs, as the basic processing circuits, providing operational results to the external memory of the main processing circuit as the operation results of the basic processing circuits, as depicted in Fig. 2, in Sec. 3 1st para: … During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input… Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input. And then, CU reads instructions one by one, loads data and weights to on-chip buffer, and computing. The accelerator performs forward propagation layer by layer and finally output the results to the external memory [calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits].)
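For illustration only, the kernel partitioning described in the Sec. 4.2.1 passage of Song quoted above (partial output maps from each small weight piece added together to give the final result) can be sketched as follows. This is a minimal sketch and not Song's implementation: the function names and toy data are hypothetical, and it assumes the (zero-padded) kernel side is a multiple of the small window size ks and that the stride equals ks.

```python
def conv2d(x, k, stride):
    """Direct 2-D sliding-window convolution without padding."""
    size = len(k)
    out = (len(x) - size) // stride + 1
    return [[sum(k[p][q] * x[i * stride + p][j * stride + q]
                 for p in range(size) for q in range(size))
             for j in range(out)]
            for i in range(out)]


def conv2d_partitioned(x, k, ks):
    """Same result, computed as g*g partial maps that are added together."""
    g = len(k) // ks                       # weight pieces per kernel side
    out = (len(x) - len(k)) // ks + 1      # assumption: stride equals ks
    total = [[0] * out for _ in range(out)]
    for a in range(g):                     # row index of the weight piece
        for b in range(g):                 # column index of the weight piece
            for i in range(out):
                for j in range(out):
                    # Piece (a, b) meets the small data window (i + a, j + b).
                    total[i][j] += sum(
                        k[a * ks + u][b * ks + v]
                        * x[(i + a) * ks + u][(j + b) * ks + v]
                        for u in range(ks) for v in range(ks))
    return total


# Hypothetical toy data: 8x8 input, 4x4 kernel, ks = 2 (g = 2, four pieces).
x = [[(7 * r + 3 * c) % 11 for c in range(8)] for r in range(8)]
k = [[p - 2 * q + 1 for q in range(4)] for p in range(4)]
assert conv2d_partitioned(x, k, 2) == conv2d(x, k, 2)
```

Under these assumptions the sum of the partial maps equals the output of the unpartitioned kernel, which is the property the quoted passage relies on.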
While Song teaches the processing of data blocks by distributing data in a parallel processing environment, Song does not expressly teach distribution of the data block elements as recited in the claim limitations:
wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks;
wherein each of the plurality of basic processing circuits receives the same portion of the input data block;
Lu teaches parallel processing operations for processing distributed data blocks as recited in the claim 1 limitations:
wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks; (Lu teaches distributing data block elements using a scatter to at least two basic processing node circuits that receive different basic data blocks (e.g. B(1) and B(2)) as depicted in Fig. 5c:

[Figure: Lu, Fig. 5C]

In 8:10-18: The different sections in FIGS. 5A-5C can also be implemented in different ways. Hardware parallelism (e.g., parallel hardware data paths), time division multiplexing and packet switching are possible examples. Using FIG. 5B as an example, in hardware parallelism, the input to the node might be fully parallel, with the node receiving both A1 and A2 at the same time on parallel data paths. The A1 data paths lead to sub-node 1 and the A2 data paths lead to sub-node 2, thus implementing the scatter.)
wherein each of the plurality of basic processing circuits receives the same portion of the input data block; (Lu teaches the broadcasting of data block elements to at least two basic processing node circuits that receive the same input data blocks (e.g. A) as depicted in Fig. 5c:

[Figure: Lu, Fig. 5C]

In 8:10-18: The different sections in FIGS. 5A-5C can also be implemented in different ways. Hardware parallelism (e.g., parallel hardware data paths), time division multiplexing and packet switching are possible examples. Using FIG. 5B as an example, in hardware parallelism, the input to the node might be fully parallel, with the node receiving both A1 and A2 at the same time on parallel data paths. The A1 data paths lead to sub-node 1 and the A2 data paths lead to sub-node 2, thus implementing the scatter.)
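For illustration only, the two distribution patterns at issue in these limitations (different basic data blocks scattered to different circuits, and the same input portion broadcast to every circuit) can be sketched as follows. This is a minimal sketch under stated assumptions and not Lu's hardware: the function names are hypothetical, a dot product stands in for the per-circuit operation, and round-robin assignment is an assumption of the sketch.

```python
def dot(a, b):
    """Per-circuit operation used in this sketch: a plain dot product."""
    return sum(p * q for p, q in zip(a, b))


def scatter(blocks, num_circuits):
    """Assign each block to exactly one circuit (round-robin, an assumption)."""
    assigned = [[] for _ in range(num_circuits)]
    for idx, block in enumerate(blocks):
        assigned[idx % num_circuits].append((idx, block))
    return assigned


def run(input_portion, weight_blocks, num_circuits):
    """Scatter the weight blocks, broadcast the input, collect the results."""
    results = [None] * len(weight_blocks)
    for circuit_blocks in scatter(weight_blocks, num_circuits):
        received = list(input_portion)     # broadcast: every circuit gets the same copy
        for idx, block in circuit_blocks:  # scatter: each circuit sees only its own blocks
            results[idx] = dot(received, block)
    return results                         # gathered back in the original block order


weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
outputs = run([1, 2, 3], weights, num_circuits=2)
```

In this sketch the two circuits receive different weight blocks (indices 0 and 2 versus 1 and 3) while operating on an identical copy of the input.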


It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for distributing data elements using broadcast and scatter operations as disclosed by Lu with the method for performing convolution operations in parallel computing environments disclosed by Song.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to provide an improvement to enable the acceleration of the mathematical calculations by implementing parallel functions in the hardware system (Lu, 4:14-27).
	
While Song teaches processing the data as four-dimensional kernels, it does not expressly note the four dimensions of the data block for processing the weight and input data as separate block formats that have three- or two-dimensional forms. Lan teaches the use of four dimensions to receive data blocks as recited by the limitations:
receiving, by the main processing circuit, an input data block and a weight data block, wherein: the input data block comprises input data arranged as a four-dimensional data block; and the weight data block comprising weight data arranged as another four-dimensional data block; (Lan teaches receiving tensors as the input data and weight arranged as four-dimensional data blocks, with the parameter structure of the input data as depicted in Table 1 and the weight filter parameters as depicted in Table 2, in pg. 290: Left Col.: … The data format is an enumeration type variable used to indicate the data layout of the tensor. The order of these letters implies the data arrangement of the tensor. For example, NCHW indicates that the W stride is 1, the H stride is W, the C stride is H × W, and the N stride is C×H ×W. The data type indicates the data type of elements in the tensor.
[Table: Lan, Table 1 (media_image5.png)]

For simplicity, we only provide a 4D-tensor data structure. For example, a 2D-tensor can be regarded as a 4D-tensor which has the parameters H and W of 1. Filter. The synaptic weight is a unique concept in neural networks. In DLPlib, synaptic weights are represented as a filter, which represents the learned synapses data of convolution and fully-connected operations…. Table 2 shows the parameters of a convolutional filter. The four dimensions, OC, IC, Kh and Kw are used to indicate the number of output feature maps, the number of input feature maps, the height and the width of the kernels respectively.
[Table: Lan, Table 2 (media_image6.png)]
)
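For illustration only, the NCHW stride relationship stated in the Lan passage quoted above (W stride 1, H stride W, C stride H × W, N stride C × H × W) can be sketched as follows. This is a minimal sketch; the function names are hypothetical and are not part of DLPlib.

```python
def nchw_strides(n, c, h, w):
    """Element strides of a 4D tensor stored in NCHW order."""
    return {"N": c * h * w, "C": h * w, "H": w, "W": 1}


def flat_offset(index, dims):
    """Map a 4D index (n, c, h, w) to its offset in the flat NCHW buffer."""
    strides = nchw_strides(*dims)
    i_n, i_c, i_h, i_w = index
    return (i_n * strides["N"] + i_c * strides["C"]
            + i_h * strides["H"] + i_w * strides["W"])


dims = (2, 3, 4, 5)                       # hypothetical N, C, H, W
strides = nchw_strides(*dims)             # N: 60, C: 20, H: 5, W: 1
last = flat_offset((1, 2, 3, 4), dims)    # offset of the final element
```

With these dimensions the tensor holds 2 × 3 × 4 × 5 = 120 elements, and the last index maps to offset 119.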

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for using four-dimensional data structures for representing the data and weight information as disclosed by Lan with the method for performing convolution operations in parallel computing environments disclosed by Song.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to provide an improvement to enable optimization of the data structures for using deep learning processors (e.g., the deep learning library for GPU, cuDNN) (Lan, Abstract).
	
Regarding claim 2, the rejection of claim 1 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 1,
wherein the input data in the input data block are arranged with H number of data in the first dimension, W number of data in the second dimension, C number of data in the third dimension, and N number of data in the fourth dimension; and the weight data in the weight data block are arranged with KH number of data in the first dimension, KW number of data in the second dimension, C number of data in the third dimension, and M number of data in the fourth dimension. (Song teaches receiving  by main unit to execute instructions for receiving input data and weight via the external memory to the computing accelerator device depicted in Fig. 2 and in Sec. 3 1st para:… Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input. And then, CU reads instructions one by one, loads data and weights to on-chip buffer, and computing. The accelerator performs forward propagation layer by layer and finally output the results to the external memory; where the input and weight data blocks of Din, kx, ky, s are arranged to be divided/portioned as data block, as four dimensional kernels associated with the data blocks, as depicted in Fig. 1 and 2 for processing convolution operations, in Sec. 3: 1st ¶: … An input cube is convolved with Dout groups of kernels (Din×k×k) at stride s. Each kernel is shifted in a sliding-window (with an offset s) across the multiple input maps. During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up...

[Figure: media_image2.png]

Where the input is arranged by mapping the data using the kernels to divide the data for processing, in Sec. 4.2.1: … Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data. Since the length and height of the data are not dividable by ks, '0's are padded at the boundary. Fig. 5(b) is the mapping result from the small kernel windows (ks× ks, represented by dx,y) [the input data block comprises input data arranged as a four-dimensional data block] to the on-chip buffer in sequence. Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks [and the weight data block comprising weight data arranged as another four-dimensional data block]. Each small window of weight is represented by wi/9. Specific steps to do the partitioning are described in Algorithm 1…)
Additionally, Lan does teach claim 2 limitation:
wherein the input data in the input data block are arranged with H number of data in the first dimension, W number of data in the second dimension, C number of data in the third dimension, and N number of data in the fourth dimension; and the weight data in the weight data block are arranged with KH number of data in the first dimension, KW number of data in the second dimension, C number of data in the third dimension, and M number of data in the fourth dimension. (Lan teaches the input data and weight as four-dimensional data blocks, with the parameter structure of the input data as depicted in Table 1, where the C is the H; the N and M are the W number of data in the third and fourth dimension set to 1 as the dimensional parameters, or M can be I or C, as depicted in Table 2, in pg. 290: Left Col.: … The data format is an enumeration type variable used to indicate the data layout of the tensor. The order of these letters implies the data arrangement of the tensor. For example, NCHW indicates that the W stride is 1, the H stride is W, the C stride is H × W, and the N stride is C×H ×W. The data type indicates the data type of elements in the tensor.
[Table: Lan, Table 1 (media_image5.png)]

For simplicity, we only provide a 4D-tensor data structure. For example, a 2D-tensor can be regarded as a 4D-tensor which has the parameters H and W of 1. Filter. The synaptic weight is a unique concept in neural networks. In DLPlib, synaptic weights are represented as a filter, which represents the learned synapses data of convolution and fully-connected operations…. Table 2 shows the parameters of a convolutional filter. The four dimensions, OC, IC, Kh and Kw are used to indicate the number of output feature maps, the number of input feature maps, the height and the width of the kernels respectively.
[Table: Lan, Table 2 (media_image6.png)]
)
The Song, Lu, and Lan references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing methods for performing convolution operations in parallel computing environments.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for using four-dimensional data structures for representing the data and weight information as disclosed by Lan with the method for performing convolution operations in parallel computing environments disclosed by Song.


	
 

Regarding claim 3, the rejection of claim 2 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 2,
weight data block includes M convolution kernels, and dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into M basic data blocks each comprising a convolution kernel.  (Song teaches dividing, by the main processing circuit instruction, the weight data block into the M plurality of basic data blocks included in the data block  as convolution kernels with a kernel size depicted as segments of the input weight into a plurality of basic data blocks for distribution to the neural Processing units (PE), considered the basic processing circuits as depicted in Fig. 1 and Fig. 2, in Sec. 3 1st para :… During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator. Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input… )

Regarding claim 4, the rejection of claim 2 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 2,
wherein the weight data block includes M convolution kernels, each convolution kernel including C weight matrices, wherein dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C.  (Song teaches multi-dimensional representation of input information using M convolution kernels as the total operations included in the layers of kernel-level operations  into M*C feature maps denoted as Din input number of feature maps, as depicted in Fig. 1 for computing convolution operation using neural processors (NP)  as depicted in Fig. 1, in Sec. 1; 3rd  & 4th  paras: Therefore, a general purpose neural processor (NP) like[8] and [9] is thought as a promising solution to offer both flexibility and efficiency. Such a NP has many good features. First, it reuses the limited hardware resources in a time multiplexing way to increase hardware and power utility. Second, it relies on multi-aspect data tiling methods to exploit data locality and relieve the pressure to on-chip memory… Taking a typical CNN illustrated in Fig. 1 for example, the forward propagation of a CNN includes repetitive layers of kernel-level operations, like convolution and pooling, which are the critical tasks to accelerate for NPs. Generally, there are two major types of data-level parallelism to exploit in such kernel operations: inter-kernel and intra-kernel parallelization. Fig. 1:
[Figure: Song, Fig. 1]
)
Examiner notes that paragraph [0109] of the specification is used to obtain the broadest reasonable interpretation, as depicted in Fig. 4c, of the number of kernels in each layer of operations modeled as a 3-D tensor. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
	Additionally, Lan teaches claim 4 limitation:
wherein dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C. (Lan teaches performing the convolutions as a 4-dimensional tensor across the input using a divided set of filters using matrix computations, in pg. 290: Sec. 3.3: In order to balance the flexibility, DLPlib also provides basic vector/matrix computations which allow users to implement new and more complex operations[18]. In addition, DLPlib provides a series of functions to catenate, split and reshape the data. An operator takes m (m > 0) input tensors and n (n > 0) output tensors. For some operators, a set of attributes are provided to describe their computational behavior (e.g., Conv., Pooling). Conv. Convolution is the most important layer in convolutional neural networks (CNNs). It takes a 4D-tensor as input, and outputs a 4D-tensor. The output is computed by using a set of filters convolving across through the input…; where the weight filter is M*C, with IC as the number of input feature maps as depicted in Table 2, in pg. 290: Left Col. … For simplicity, we only provide a 4D-tensor data structure. For example, a 2D-tensor can be regarded as a 4D-tensor which has the parameters H and W of 1. Filter. The synaptic weight is a unique concept in neural networks. In DLPlib, synaptic weights are represented as a filter, which represents the learned synapses data of convolution and fully-connected operations…. Table 2 shows the parameters of a convolutional filter. The four dimensions, OC, IC, Kh and Kw are used to indicate the number of output feature maps, the number of input feature maps, the height and the width of the kernels respectively.)
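For illustration only, the division of a four-dimensional filter (OC, IC, Kh, Kw) into OC multiplied by IC weight matrices can be sketched as follows. This is a minimal sketch, assuming the filter is stored as nested lists; the function names split_filter and make_filter are hypothetical and are not taken from Lan, DLPlib, or the claims, and reading OC as M and IC as C follows the mapping applied in this rejection.

```python
def make_filter(oc, ic, kh, kw):
    """Build a zero-filled filter indexed as weights[oc][ic][row][col] (hypothetical helper)."""
    return [[[[0.0] * kw for _ in range(kh)] for _ in range(ic)] for _ in range(oc)]


def split_filter(weights):
    """Split a 4-D filter into a flat list of Kh x Kw weight matrices.

    One matrix is produced per (output map, input map) pair, so the
    list holds OC * IC basic blocks.
    """
    return [weights[o][i]
            for o in range(len(weights))
            for i in range(len(weights[o]))]


# Example: OC = 4 output maps, IC = 3 input maps, 2 x 2 kernels.
blocks = split_filter(make_filter(4, 3, 2, 2))
```

In this example the list holds 4 × 3 = 12 blocks, each a 2 × 2 matrix.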

Regarding claim 5, the rejection of claim 3 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 3,
distributing the plurality of basic data blocks to the plurality of basic processing circuits includes: distributing, by the main processing circuit, one or more of the M multiple convolution kernels to at least one basic processing circuit when a number of the basic processing circuits is less than M; and distributing, by the main processing circuit, each of the convolution kernels to a separate basic processing circuit when the number of the basic processing circuits is equal to or larger than M. (Song teaches distributing at least one weight and input data of the M convolution kernels processed by the neural Processing Units (PE), considered the basic processing circuits as depicted in Fig. 2 and Fig. 6, in Sec. 3, 1st para.: … During each shift, every weight belonging to the kernel is multiplied to the according input element in the input maps and then added-up. And then an optional pooling operation (defined by p and sp) is used to subsample the convolved output. Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU)… Once the instructions are ready, the raw image data and weights of pre-trained model are injected into the external memory as the input…; and mapping multiple convolution kernels captured over the small kernel window to the PE basic processing circuits based on the window size associated with the convolution kernel window, in Sec. 4.2.1: The mapping scheme is same to intra-kernel parallelism (section 4.1.2), but the basic unit is a small kernel window (ks×ks). When Tin is bigger than the size of small kernel window (ks×ks), we map multiple small windows to PE in one operation. And in Sec. 4.2.2, 2nd para.: As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication)…;)
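For illustration only, the two distribution cases recited in claim 5 can be sketched as follows. This is a minimal sketch under stated assumptions: the round-robin assignment and the function name distribute_kernels are hypothetical choices made for the example, and neither Song nor the claims specify a particular assignment order.

```python
def distribute_kernels(num_kernels, num_circuits):
    """Assign kernel indices 0..num_kernels-1 to processing circuits.

    Round-robin is an assumed policy. When there are fewer circuits
    than kernels, some circuits receive more than one kernel; when
    there are at least as many circuits as kernels, each kernel lands
    on its own circuit and the remaining circuits stay empty.
    """
    assignment = [[] for _ in range(num_circuits)]
    for kernel in range(num_kernels):
        assignment[kernel % num_circuits].append(kernel)
    return assignment


fewer_circuits = distribute_kernels(5, 2)   # M = 5 kernels, 2 circuits
enough_circuits = distribute_kernels(2, 4)  # M = 2 kernels, 4 circuits
```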

Regarding claim 6, the rejection of claim 1 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 1,
wherein each basic data block has a same size and broadcasting at least a portion of the input data block to the plurality of basic processing circuits includes: sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block; and (Song teaches broadcasting each respective parts of data in the input data to the basic processing circuits, as the input data broadcasted to the  PE circuits as depicted in Fig. 6  includes sliding as the shift of the pixel window that have the same k X k size, as depicted in Fig. 6, for performing the operations for each mul-operations, in Sec. 4.2.2, 2nd para. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation... )
extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits. (Song teaches the main processing circuit instructing the access to the memory for accessing (e.g. extracting) the input block and weight for computing the output map, as the respective part of the data to be broadcasted using memory access instructions for performing the operations for each mul-operations, in Sec. 4.2.2, 2nd para. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation...;And in Sec. 4.2.1:3rd & 4th para: Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data. Since the length and height of the data are not dividable by ks, '0's are padded at the boundary. Fig. 5(b) is the mapping result from the small kernel windows (ks× ks, represented by dx,y) to the on-chip buffer in sequence. Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks. Each small window of weight is represented by wi/9. Specific steps to do the partitioning are described in Algorithm 1… For example, the original big kernel (11×11) is partitioned into 9 small sub-kernels (4×4) in Fig. 5(c), so the code within the first outer loop of Algorithm 1 accomplishes 1/9 computing tasks of the original big kernel. The data (Fig. 5a) multiplied to the first sub-piece of weights (Fig. 5c) is starting from d1,1 to d55,55, from left to right and then top to bottom.)
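For illustration only, sliding an operation window over an input data block and extracting the portion inside the window at each position can be sketched as follows. This is a minimal sketch, assuming a two-dimensional input stored as nested lists; the function name sliding_windows and the stride parameter are hypothetical and are not taken from Song or the claims.

```python
def sliding_windows(data, size, stride=1):
    """Yield ((top, left), window) for every position of a size x size window.

    The window moves left to right and then top to bottom, and each
    yielded window is the portion of the input inside it.
    """
    rows, cols = len(data), len(data[0])
    for top in range(0, rows - size + 1, stride):
        for left in range(0, cols - size + 1, stride):
            window = [row[left:left + size] for row in data[top:top + size]]
            yield (top, left), window


data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
windows = list(sliding_windows(data, size=2))
```

Each extracted window has the same size as a basic weight block, so it can be sent unchanged to every processing circuit holding a block of that size.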

Regarding claim 7, the rejection of claim 1 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 1,
wherein: performing, by each of the plurality of basic processing circuits, the operations on the portion of the input data block broadcast to that basic processing circuit and one (Song teaches performing operations with the distributed weight basic block data by each of the PE basic processing circuits in parallel as depicted in Fig. 2 and Fig. 6, in 4.2.2: …In this section, we proposed an improvement for the mapping scheme of inter-kernel parallelism to increase the data reuse rate. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation. In our design, to better reuse both data or weight for less memory access, each time we move to the same pixel in the next output map or the next pixel in the same output map to calculate the 1/(k×k) partial sum instead of the complete sum. In this case, the output buffer is used to store the partial sums, and it requires additional “add-and-store” operations to accumulate the partial sums to obtain the final result.
Where each PE performs parallel operations, as depicted in Fig. 6, to generate a convolution result, using multiplication operations on element values of part of the data corresponding to a kernel partitioning position of the distributed weight basic block using multiply and accumulate operations as depicted in Fig. 5; wherein the main processing circuit executes instructions for accumulation of the plurality of multiplication results as depicted in Fig. 5(a-d) using add operations to accumulate the plurality of multiplication results obtained for sorting each partitioned plurality of convolution results and accumulate a calculation result of the convolution operation as depicted in Fig. 5(d), in Sec. 4.2.1: … Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data. Since the length and height of the data are not dividable by ks, '0's are padded at the boundary. Fig. 5(b) is the mapping result from the small kernel windows (ks× ks, represented by dx,y) to the on-chip buffer in sequence. Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks. Each small window of weight is represented by wi/9. … For example, the original big kernel (11×11) is partitioned into 9 small sub-kernels (4×4) in Fig. 5(c), so the code within the first outer loop of Algorithm 1 accomplishes 1/9 computing tasks of the original big kernel. The data (Fig. 5a) multiplied to the first sub-piece of weights (Fig. 5c) is starting from d1,1 to d55,55, from left to right and then top to bottom. Second one is starting from d1,2 to d56,56. The same calculation method applied to the following 7 sub-kernels. Thus the last piece of weights is multiplied to the data from d3,3 to d57,57. Ultimately, there will be 9 output maps as shown in Fig. 5(d), with map size to be 55×55. So the final result is to add the 9 maps together, which is exactly the same with the original computing method with that of big kernels.
)
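For illustration only, the kernel partitioning quoted from Song Sec. 4.2.1 (zero-padding a large kernel, splitting it into g×g sub-kernels of size ks×ks, and adding the resulting partial maps) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a small 5×5 kernel with stride 2 and ks = 2 rather than the 11×11 AlexNet example, it assumes square inputs and kernels, and the function names are hypothetical and are not Song's Algorithm 1.

```python
def conv2d(x, w, stride):
    """Direct strided convolution (no kernel flip) of square x with square w."""
    k = len(w)
    n = (len(x) - k) // stride + 1
    return [[sum(x[i * stride + u][j * stride + v] * w[u][v]
                 for u in range(k) for v in range(k))
             for j in range(n)] for i in range(n)]


def pad(m, size):
    """Copy square matrix m into a size x size matrix, zero-filling new cells."""
    return [[m[r][c] if r < len(m) and c < len(m) else 0
             for c in range(size)] for r in range(size)]


def partitioned_conv2d(x, w, stride, ks):
    """Same convolution, computed as a sum of g*g partial maps from ks x ks sub-kernels."""
    k = len(w)
    g = -(-k // ks)                      # ceil(k / ks) pieces per side
    n = (len(x) - k) // stride + 1       # output map size
    wp = pad(w, g * ks)                  # zero-padded kernel
    xp = pad(x, (n - 1) * stride + g * ks)  # zero-padded input
    total = [[0] * n for _ in range(n)]
    for a in range(g):
        for b in range(g):
            # Partial map for sub-kernel (a, b), accumulated into the total.
            for i in range(n):
                for j in range(n):
                    total[i][j] += sum(
                        xp[i * stride + a * ks + u][j * stride + b * ks + v]
                        * wp[a * ks + u][b * ks + v]
                        for u in range(ks) for v in range(ks))
    return total


x = [[(3 * r + 5 * c) % 7 for c in range(9)] for r in range(9)]
w = [[(r + 2 * c) % 5 - 2 for c in range(5)] for r in range(5)]
direct = conv2d(x, w, stride=2)
partitioned = partitioned_conv2d(x, w, stride=2, ks=2)
```

Because the padded kernel entries are zero, the sum of the nine partial maps equals the direct result, which mirrors the statement that adding the 9 maps together gives the same result as the original big-kernel computation.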

Regarding claim 8, the rejection of claim 1 is incorporated and Song in combination with Lu and Lan teaches the convolution operation method of claim 1,
wherein: performing, by each of the plurality of basic processing circuits, the operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain the operation result further includes: performing, by each of the plurality of basic processing circuits, multiplication operations on element values of a part the portion of the input data block and element values at corresponding positions of the one or more  (Song teaches performing operations with the distributed weight basic block data by each of the PE basic processing circuits in parallel as depicted in Fig. 2 and Fig. 6, in 4.2.2 for performing multiplication operations of the weight and data on the repetitive parts and accumulating multiplication results to obtain respective and sending  the result to the memory of the main processing circuit as the external memory for convolutional result obtaining the sent convolutional results to produce Do maps that are sorted by the main processing circuit executed instructions for obtaining a final result as depicted in Fig. 6, in 4.2.2 : …In this section, we proposed an improvement for the mapping scheme of inter-kernel parallelism to increase the data reuse rate. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation. In our design, to better reuse both data or weight for less memory access, each time we move to the same pixel in the next output map or the next pixel in the same output map to calculate the 1/(k×k) partial sum instead of the complete sum. 
In this case, the output buffer is used to store the partial sums, and it requires additional “add-and-store” operations to accumulate the partial sums to obtain the final result.
using multiplication operations on element values of a part of data corresponding to a kernel partitioning position of the distributed weight basic block using multiply and accumulate operations as depicted in Fig. 5; wherein the main processing circuit executes instructions for accumulation of the plurality of multiplication results as depicted in Fig. 5(a-d) using add operations to accumulate the plurality of multiplication results obtained for sorting each partitioned plurality of convolution results and accumulate a calculation result of the convolution operation as depicted in Fig. 5(d), in Sec. 4.2.1: … Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data. Since the length and height of the data are not dividable by ks, '0's are padded at the boundary. Fig. 5(b) is the mapping result from the small kernel windows (ks× ks, represented by dx,y) to the on-chip buffer in sequence. Fig. 5(c) shows the layout of the corresponding weights, one more line of '0' is also padded. The weights are partitioned into g×g pieces, with size = ks×ks. Each small window of weight is represented by wi/9. … For example, the original big kernel (11×11) is partitioned into 9 small sub-kernels (4×4) in Fig. 5(c), so the code within the first outer loop of Algorithm 1 accomplishes 1/9 computing tasks of the original big kernel. The data (Fig. 5a) multiplied to the first sub-piece of weights (Fig. 5c) is starting from d1,1 to d55,55, from left to right and then top to bottom. Second one is starting from d1,2 to d56,56. The same calculation method applied to the following 7 sub-kernels. Thus the last piece of weights is multiplied to the data from d3,3 to d57,57. Ultimately, there will be 9 output maps as shown in Fig. 5(d), with map size to be 55×55. So the final result is to add the 9 maps together, which is exactly the same with the original computing method with that of big kernels.
)

Regarding claim 9, the rejection of claim 1 is incorporated and Song in combination with Lu and Lan further teaches the convolution operation method of claim 1,
wherein the processing device further includes branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the method further includes: transmitting, by the branch processing circuits, data among the main processing circuit and the plurality of basic processing circuits. (Song teaches the processing device system comprising branch circuits as direct memory access circuits that are used to transmit data from the main circuitry to the plurality of basic processing PE circuits and support memory access and control instruction data transfer, as depicted in Fig. 2, in Sec. 3, 1st para.: Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator… And then, CU reads instructions one by one, loads data and weights to on-chip buffer, and computing.)

Regarding independent claim 10 limitations, Song teaches a processing device (Song teaches a processing device system comprising a host platform as the main processing unit and a plurality of basic processing circuits depicted in Fig. 2, in Sec. 3, 1st para.: …Fig. 2 is a typical architecture of state-of-the-art deep learning accelerators [8, 13, 14], which consists of four main components: one input data buffer, one output data buffers, one weight buffer, a computational block (neural Processing Unit, PE) and a logic Control Unit (CU). There is always a compiler, executed on host platform, that automatically translate network specification (numbers of layers, kernel size etc.) written by domain experts into a code segment, which can be mapped, scheduled and executed on the accelerator…)
The claim limitations are similar to claim 1 limitations and are therefore rejected under the same rationale.

Regarding claims 11-14, the rejection of claim 10 is incorporated. The claim limitations are similar to the claim 2-5 limitations respectively, therefore the claims are rejected under the same rationale.

Regarding claim 15 the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 6 and are rejected under the same rationale. 

Regarding claim 16 the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 7 and are rejected under the same rationale. 

Regarding claim 17 the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 8 and are rejected under the same rationale.

Regarding claim 18 the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 9 and are rejected under the same rationale.

Regarding claim 19, the rejection of claim 10 is incorporated and Song in combination with Lu and Lan further teaches the processing device of claim 10,
wherein the main processing circuit includes one or any combination of a vector arithmetic unit circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit, or a data rearrangement circuit. (Song teaches the main processor including a DMA circuit as depicted in Fig. 2 and an accumulator circuit, a matrix transposition circuit, and a data rearrangement circuit for accumulating the output maps, computing the matrix computation of the output feature maps and sorting the partitioned convolutional matrix results with the respective circuits as depicted in Fig. 5(d), Fig. 2 and Fig. 6, in Sec. 4.2.2: …In this section, we proposed an improvement for the mapping scheme of inter-kernel parallelism to increase the data reuse rate. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation. In our design, to better reuse both data or weight for less memory access, each time we move to the same pixel in the next output map or the next pixel in the same output map to calculate the 1/(k×k) partial sum instead of the complete sum. In this case, the output buffer is used to store the partial sums, and it requires additional “add-and-store” operations to accumulate the partial sums to obtain the final result.
Where each PE performs parallel operations, as depicted in Fig. 6, to generate a convolution result, using multiplication operations on element values of part of the data corresponding to a kernel partitioning position of the distributed weight basic block using multiply and accumulate operations as depicted in Fig. 5; wherein the main processing circuit executes instructions for accumulation of the plurality of multiplication results as depicted in Fig. 5(a-d) using add operations to accumulate the plurality of multiplication results obtained for sorting each partitioned plurality of convolution results and accumulate a calculation result of the convolution operation as depicted in Fig. 5(d), in Sec. 4.2.1: … Fig. 5 shows the details to partition a kernel. Taking Alexnet Conv1 for example, Fig. 5(a) shows the raw data... The data (Fig. 5a) multiplied to the first sub-piece of weights (Fig. 5c) is starting from d1,1 to d55,55, from left to right and then top to bottom. Second one is starting from d1,2 to d56,56. The same calculation method applied to the following 7 sub-kernels. Thus the last piece of weights is multiplied to the data from d3,3 to d57,57. Ultimately, there will be 9 output maps as shown in Fig. 5(d), with map size to be 55×55. So the final result is to add the 9 maps together, which is exactly the same with the original computing method with that of big kernels.
)
Regarding claim 20, the rejection of claim 10 is incorporated and Song in combination with Lu and Lan further teaches the processing device of claim 10,
wherein each of the basic processing circuits includes of an inner-product arithmetic unit circuit or an accumulator circuit. (Song teaches performing inner product operations with the distributed weight basic block and input data by each of the PE basic processing circuits in parallel to compute the feature maps, as depicted in Fig. 2 and Fig. 6, in Sec. 4.2.2: …In this section, we proposed an improvement for the mapping scheme of inter-kernel parallelism to increase the data reuse rate. As shown in Fig. 6, there are Din input maps and Dout output maps with Din×Dout groups of weight, and kernel size is k. Prior designs go through the whole kernel window at first, which means the kernel window will not shift to right or downward until it accomplished a complete pixels of the output map (need k×k×Din times of multiplication). They have to reload both data and weight repetitively on each mul-operation. In our design, to better reuse both data or weight for less memory access, each time we move to the same pixel in the next output map or the next pixel in the same output map to calculate the 1/(k×k) partial sum instead of the complete sum. In this case, the output buffer is used to store the partial sums, and it requires additional “add-and-store” operations to accumulate the partial sums to obtain the final result.)
Additionally, Lu teaches claim 20 limitation:
wherein each of the basic processing circuits includes of an inner-product arithmetic unit circuit or an accumulator circuit. (Lu teaches accumulator circuits, in 8:42-46: …Each IPE 630 includes multiple atomic processing elements (APEs) 640. Each APE 640 uses multiply-accumulate circuits (MACs)…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Song and Lu for the same reasons disclosed above.

	


	
	Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  


The prior art made of record and not relied upon that is considered pertinent to applicant's disclosure is listed below:
Jiao et al. (US Pat No. 8644643) teaches the graphics processor for convolution filter using inner product and AMU circuits. 

Suda et al. (NPL: “Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks”): teaches a method for CNN accelerator optimization using parallel computing resources and an OpenCL implementation of matrix multiplication operations using MAC; interfacing IPs as branch circuits for communication operations between the host and accelerator board; and the use of the inner product.

Chellapilla et al. (NPL: “High Performance Convolutional Neural Networks for Document Processing”): teaches M as the number of inputs associated with computing the matrix multiplications resulting in the feature map kernels.

Pande et al. (NPL: “Matrix Convolution using Parallel Programming”): teaches parallel processing of the matrix convolution operations using convolution filters to perform multiplication operations.

Tsai et al. (NPL: “Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks”): teaches the processing data type as 3-dimensional objects with a notation for the number of 3-D objects as the N and K values, as depicted in Fig. 2.

Wasserman et al. (US Patent No. 7,737,994): teaches the processing of large kernel convolutions using graphic accelerators to optimize the computations time using parallel computing. 

Goyal et al. (US Pub. No. 2017/0316312): teaches deep learning processor for processing matrix-matrix multiplication operations for a convolutional network. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached on Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/O.O.A./Examiner, Art Unit 2126                                                                                                                                                                                                        
/BABOUCARR FAAL/Primary Examiner, Art Unit 2184