DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of prior-filed U.S. Application No. 16/168,778, filed on October 23, 2018, which is a continuation application of International Application No. PCT/CN2017/099991, filed August 31, 2017, is acknowledged.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 1/13/2022 is being considered by the examiner.

Response to Arguments
Applicant's arguments filed 05/13/2022 have been fully considered.

Regarding the rejection of claims under 35 USC § 103, the applicant’s arguments have been fully considered. 
The applicant argues that the prior art made of record fails to disclose the limitations of original claim 6 that have been incorporated into claim 1. The examiner finds the arguments not persuasive, and the rejection made in the previous office action is maintained.
Specifically, the applicant argues:
the "sliding" element (originally recited in claim 6 and now incorporated into claim 1), fail to disclose or suggest that element because the concepts of "window size" and "stride" relied on by the Examiner are for a pooling process to "significantly reduce[] the quantity of data produced by the convolutional processes." Bo, para. [0038]. This is different from "sliding ... an operation window ... in the input data block" and "extracting ... the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits [to perform a convolution operation]," as recited in claim 1 … This is because Bo's pooling operation is performed after the convolution operation, so the window size is not necessarily the same as any "basic data block" that is used during the performance of the convolution operation.
The examiner notes that applicant appears to be arguing limitations that are not expressly claimed. Specifically, the claims do not limit the operations to a specific order or to a specific operation (e.g., convolution operations performed prior to pooling, or the convolution operations in a convolutional neural network (CNN) including only the convolution itself and not the other operations performed by the CNN layers). Additionally, applicant has admitted that sliding to extract information is used by the convolution layer to provide inputs for performing the pooling operations, as shown by the statement “the "sliding" element (originally recited in claim 6 and now incorporated into claim 1), fail to disclose or suggest that element because the concepts of "window size" and "stride" relied on by the Examiner are for a pooling process to "significantly reduce[] the quantity of data produced by the convolutional processes.”” Thus the notion of there being an extraction after performing a pooling operation contradicts what the office action describes and amounts to a mere allegation of patentability. If the convolution needs to extract the data as claimed in order to perform pooling, so that the next layer of the neural network receives a reduced dataset, that does not support the allegation that the operations occur after the pooling process is completed, as asserted in applicant’s remarks.
For context regarding how convolution operations work, Ngo (“FPGA Hardware Acceleration of Inception Style Parameter Reduced Convolution Neural Networks”, hereinafter ‘Kal’) teaches what one of ordinary skill in the art would understand regarding how windows are used to extract information, in Pg. 6-8, Sec. 2.2.2: …Convolutional Neural Network (CNN) belong to the feed-forward class of ANNs, and are most mainly used for vision applications… The convolution equation 2.1, where y[m,n] is the output matrix map of same dimensions as the input map x, m and n are the coordinates of the "center" pixel of the interest region, and L, K denote the kernel dimensions. The convolution kernel "slides" across the input until the entire image has been traversed, as depicted in 2.4a… The convolution kernel values stay constant throughout the frame traversal with the output map containing correlation values of how strongly a particular input region displays the kernel feature. Each layer of the CNN distinguish multiple features with independent convolutions, each with a distinct kernel.

[Figure: media_image1.png]

A common simplification used to learn basic CNN operation is to consider input image and kernels as two-dimensional arrays, but in practice multichannel inputs such as RGB or the combine output feature maps, form three-dimensional data volumes that utilize similarly three-dimensional convolution kernels
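The sliding-kernel behavior described in the Ngo passage above can be illustrated with a minimal sketch. It is not taken from Ngo or from the application; the function name `slide_kernel`, the zero padding at the borders, the odd kernel dimensions, and the absence of a kernel flip are all assumptions made for illustration only.

```python
def slide_kernel(x, kernel):
    """Slide `kernel` over the 2-D input `x`; the output has the shape of `x`.

    Assumptions (not from the cited text): zero padding at the borders,
    odd kernel dimensions, and no kernel flip (cross-correlation).
    """
    rows, cols = len(x), len(x[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    y = [[0] * cols for _ in range(rows)]
    for m in range(rows):          # m, n: coordinates of the "center" pixel
        for n in range(cols):
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    r, c = m + i - r_off, n + j - c_off
                    if 0 <= r < rows and 0 <= c < cols:  # outside the map counts as zero
                        acc += x[r][c] * kernel[i][j]
            y[m][n] = acc
    return y
```

The kernel values stay constant while the window moves, and each output value is the product-sum of the kernel with the input region centered at (m, n).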
Regarding the claim limitations, the rejection relies on Boesch et al. (US Pat. No. US 2018/0189642, hereinafter ‘Bo’) in view of Yang et al. (NPL: “A Systematic Approach to Blocking Convolutional Neural Networks”, hereinafter ‘Yang’). Bo teaches the process of sliding the window to extract different kernels within the input data, where the window is part of the process for extracting the convolution kernels used to perform the pooling operations. The convolution and pooling are both considered convolution operations within a convolutional neural network, and thus the teachings of Bo in combination with Yang are within the scope of the claims as recited.
As noted in the office action, Bo discloses the limitations wherein each basic data block has a same size and broadcasting at least a portion of the input data block to the plurality of basic processing circuits includes: sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block; and (in 0038-0039: As evident in FIG. 1H, a large quantity of data is generated during the convolutional layering process. In addition, each kernel map (i.e., each filtered image) has nearly as many values in it as the original image… The pooling process introduces the concepts of "window size" and "stride." [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] The window size is the dimensions of a window such that a single, maximum value within the window will be selected in the pooling process. A window may be formed having dimensions of m-pixels by n-pixels wherein "m" and "n" are integers, but in most cases, "m" and "n" are equal [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block]… the pooling window is conceptually overlayed onto each portion of the kernel map. The "stride" [i.e. sliding, by the main processing circuit, an operation window…] represents how much the pooling window is moved after each pooling act. If the stride is set to "two," then the pooling window is moved by two pixels after each pooling act. If the stride is set to "three," then the pooling window is moved by three pixels [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] after each pooling act.)
extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits. (in 0036-0039: …For each kernel, a distinct kernel map is created. The plurality of created kernel maps may be envisioned as a stack of kernel maps having a depth equal to the number of filters (i.e., kernels) that are applied. The stack of kernel maps may also be called a stack of filtered images… In the convolutional process of the CNN system 10, a single unknown image is convolved to create a stack of filtered images. The depth of the stack is the same as, or is otherwise based on, the number of filters (i.e., kernels) that are applied to the unknown image… The pooling process [i.e. extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits] introduces the concepts of "window size" and "stride." The window size is the dimensions of a window such that a single, maximum value within the window will be selected in the pooling process. A window may be formed having dimensions of m-pixels by n-pixels wherein "m" and "n" are integers…)
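For reference, the "window size" and "stride" concepts quoted from Bo above can be sketched as follows. This is an illustrative sketch only, not Bo's implementation; the function name `max_pool` and the choice to drop windows that would extend past the edge of the kernel map are assumptions.

```python
def max_pool(kernel_map, window, stride):
    """Select the maximum value inside an m-by-n window moved by `stride` pixels.

    Assumption (not from Bo): windows that would extend past the edge of
    the kernel map are skipped rather than padded.
    """
    m, n = window
    rows, cols = len(kernel_map), len(kernel_map[0])
    pooled = []
    for r in range(0, rows - m + 1, stride):      # move the window down by `stride`
        pooled_row = []
        for c in range(0, cols - n + 1, stride):  # move the window right by `stride`
            # Extract the portion of the map inside the window, keep its maximum.
            pooled_row.append(
                max(kernel_map[r + i][c + j] for i in range(m) for j in range(n))
            )
        pooled.append(pooled_row)
    return pooled
```

With a 2-by-2 window and a stride of two, a 4-by-4 kernel map is reduced to 2-by-2, which shows how pooling reduces the quantity of data passed to the next layer.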
Yang teaches performing convolution operations in which the same convolution kernel is broadcast with a partitioned kernel to compute an output map, which is considered part of performing the convolution operations for given data blocks, where the shared block is indexed over the window covered by the three-dimensional space, as shown by Yang in Fig. 2:

[Figure: media_image2.png]

Figure 2: Multicore partitioning. Top: kernel partitioning broadcasts a shared input to separate cores [i.e. broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits, wherein each of the plurality of basic processing Atty. Dkt. No. 10015-01-0002-US-CON2Reply to Final Office Action of-3 - LIU et al.March 1, 2021Application No. 16/663,174circuits receives the same portion of the input data block], each of which processes a disjoint subset of the kernels [i.e. distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits, wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks that include portions of the weight data belonging to different convolution kernels] to produce a dis-joint slab of the output (in the K dimension). Bottom: input partitioning broadcasts all kernels across cores which each process a different subset of the input to produce a disjoint subset of the output, shown here in the Y dimension.)
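The two partitioning schemes named in the Yang Figure 2 caption above can be contrasted in a short sketch. It is not Yang's code: the helper names (`conv_valid`, `split`, `kernel_partitioned`, `input_partitioned`), the use of unpadded two-dimensional windows, equally sized kernels, and the overlapping input rows handed to each core in the input-partitioned case are all assumptions made for illustration.

```python
def conv_valid(x, k):
    """Unpadded 2-D sliding-window product-sum of input x with kernel k."""
    kr, kc = len(k), len(k[0])
    return [[sum(x[r + i][c + j] * k[i][j] for i in range(kr) for j in range(kc))
             for c in range(len(x[0]) - kc + 1)]
            for r in range(len(x) - kr + 1)]

def split(items, parts):
    """Split a non-empty list into at most `parts` contiguous, disjoint chunks."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def kernel_partitioned(x, kernels, cores):
    """Every core receives the same input; each handles a disjoint kernel subset."""
    out = []
    for core_kernels in split(kernels, cores):
        out.extend(conv_valid(x, k) for k in core_kernels)  # shared input x
    return out

def input_partitioned(x, kernels, cores):
    """Every core receives all kernels; each produces a disjoint band of output rows."""
    kr = len(kernels[0])
    out_rows = len(x) - kr + 1
    out = [[] for _ in kernels]
    for rows in split(list(range(out_rows)), cores):
        tile = x[rows[0]:rows[-1] + kr]  # input rows this core needs (assumed overlap)
        for idx, k in enumerate(kernels):
            out[idx].extend(conv_valid(tile, k))
    return out
```

Both functions return the same stack of output maps; they differ only in which operand is shared across cores and which is divided among them.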
Additionally, Yang teaches the DCNN using a plurality of windows in the convolutional layer operations, in pg. 2, Left Col.: …a CNN is more clearly thought of as a specialized class of image processing pipelines, rather than as a biological neural model. The operations in this pipeline (convolution, local response normalization, pooling, and fully connected layers) correspond to the different “layers" used in the network… A convolutional layer (Conv) corresponds to a filter bank. In the standard case of 3D input [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] and output, a convolutional layer maps a C×X×Y input to a K×X×Y output using K shift-invariant 3D stencils [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block; and], where each stencil is of the size Fw×Fh×C (i.e., a set of K 3-dimensional convolutions) [i.e. extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits]…
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Yang and Bo in order to execute memory hierarchy and computing instructions regarding how the data is partitioned into an optimal set of processing blocks (Yang, Introduction 1st and 2nd paras.). Doing so will help automatically derive optimized blockings for common networks that improve the energy efficiency of hardware implementations by up to an order of magnitude (Yang, Abstract; and Introduction 1st para.).

In conclusion, what distinguishes one convolution operation in a neural network from another, and the order in which the operations must occur, are not expressly required by the applicant’s claim limitations, and thus the claim rejection made in the previous office action has been maintained for the independent claims and dependent claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 7-10 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Boesch et al. (US Pat. No. US 2018/0189642, hereinafter ‘Bo’) in view of Yang et al. (NPL: “A Systematic Approach to Blocking Convolutional Neural Networks”, hereinafter ‘Yang’).
Regarding independent claim 1 limitations, Bo teaches a convolution operation method performed by a processing device:
a main processing circuit and a plurality of basic processing circuits, the convolution operation method comprising: receiving, by the main processing circuit, an input data block and a weight data block, (in 0081: In a third embodiment, a system includes a system on chip (SoC) [i.e. a main processing circuit and a plurality of basic processing circuits] having a configurable accelerator framework that is configurable at run time to perform deep convolutional neural network (DCNN) operations. The configurable accelerator framework has a stream switch, and the stream switch includes a plurality of input ports, a plurality of output ports, and a plurality of stream link structures… In the embodiment, the configurable accelerator framework [including claimed main processing circuit and a plurality of basic processing circuits, the convolution operation method comprising] includes a plurality of convolution accelerators. Each one of the plurality of convolution accelerators is configurable at run time to unidirectionally receive input data [i.e. receiving, by the main processing circuit, an input data block] via at least two of the plurality of stream switch output ports. In addition, each one of the plurality of convolution accelerators is also configurable at run time to unidirectionally communicate output data via an input port of the stream switch…; And in 0310-0311: The DCNN training module 106 executes or runs the deep learning framework with the provided training images to train the DCNN. This training generates DCNN weights and metadata. These weights define the trained neural network [i.e. a weight data block]. The DCNN weights and metadata are provided to the SoC configuration tool 108. The SoC configuration tool 108 utilizes the DCNN weights to generate and acceptably optimize the SoC configurable topology [i.e. a weight data block]… Other operations that the SoC configuration tool 108 performs include performing network description to network topology operations; memory management and buffer placement; DMA descriptor chains generation; and acceptably optimal mapping and scheduling of DCNN execution on configurable accelerator framework and DSP clusters. The SoC configuration tool 108 outputs a SoC configuration file and the DCNN SoC weights. The SoC configuration file identifies the various configurations of the SoC 110 (e.g., the configuration of one or more DSPs 138 on the SoC 110 and a configurable accelerator framework) for implementing the DCNN on the SoC 110.)
wherein: the input data block comprises input data; and the weight data block comprises weight data arranged as plurality of convolution kernels; (as depicted in Fig. 6D: 

[Figure: media_image3.png (Bo, Fig. 6D)]

In 0229-0249: FIG. 6 includes FIGS. 6A-6D… In an exemplary embodiment, the first CA input data port 602 is arranged to pass a stream of batch data into the CA 600A, the second CA input data port 604 is arranged to pass a stream of kernel data into the CA 600A, and the third CA input data port 606 is arranged to pass a stream of feature data into the CA 600A… Some data that is input to a convolutional neural network is called feature data. Feature data in many cases consists of two or more channels of two-dimensional data structure. In some neural networks that perform image… In some embodiments, the kernels are derived from the training of the neural network. Kernels may have any two-dimensions, but in some cases, kernel data may have a dimension in the range of 3-pixels-by-3-pixels (3x3) up to 11-pixels-by-11-pixels (11x11). In these cases, the kernel depth [i.e. wherein: … the weight data block comprises weight data arranged as plurality of convolution kernel] is often identical to the number of channels of the feature data set [i.e. wherein: the input data block comprises input data… arranged as plurality of convolution kernel] that will be processed… The feature data and the kernel data [i.e. wherein: the input data block comprises input data; and the weight data block comprises weight data arranged as plurality of convolution kernel] may use a selected number representation such as fixed-point numbers, floating-point numbers, or some other convention… To process a kernel of a convolution layer, each value (i.e., each pixel) of the input feature at a first position (e.g., upper right corner, upper left corner, or some other position) is multiplied with each corresponding value of the kernel [i.e. wherein: the input data block comprises input data; and the weight data block comprises weight data arranged as plurality of convolution kernel], and the products are summed to generate one output result.)
dividing, by the main processing circuit, the weight data block into a plurality of basic data blocks each including a portion of the weight data belonging to one of the plurality of convolution kernels; (As depicted in Fig. 6D, Kernel 0-1 in the data block is distributed to the MAC units as the claimed plurality of basic data blocks of the claimed plurality of convolution kernels, to determine partial sums; in 0248-0249: To process a kernel of a convolution layer, each value (i.e., each pixel) of the input feature at a first position (e.g., upper right corner, upper left corner, or some other position) is multiplied with each corresponding value of the kernel, and the products are summed to generate one output result… The MAC process is repeated for each pixel in the horizontal direction and the vertical direction to generate the output for one kernel… As the kernel traverses feature pixels in the horizontal and vertical directions (e.g., FIGS. 1F, 1G), the kernel advances in the horizontal direction, the vertical direction, or the horizontal and vertical directions by a selected amount of displacement [i.e. dividing, by the main processing circuit, the weight data block into a plurality of basic data blocks each including a portion of the weight data belonging to one of the plurality of convolution kernels]. Horizontal and vertical displacement is selected by a designer of the neural network that is implemented by the CA 600. The displacement, which is also called the "stride," may be between one pixel and several pixels in horizontal direction, vertical direction, or horizontal and vertical directions…)
distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits, wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks that include portions of the weight data belonging to different convolution kernels; (As depicted in Figs. 6B-D, in 0261-0290: The processing executed in each CA 600 may be directed according to one or more configuration registers. The configuration registers may be embodied in the CA configuration logic 626, the CAF control registers 402, or in other registers associated with each CA 600. In some embodiments, the configuration registers may be accessed and programmed by a host processor such as applications processor 128, a DSP of DSP cluster 122, a command passed into the stream switch 500 and processed by message/command logic 512, or by some other circuitry… Depending on the geometry of feature data, the kernel size, and the available MAC units 620, a CA 600 may process multiple kernels in parallel. When a convolution process starts, a CA 600 will accept a sufficient amount of feature data lines and the kernels required to perform the convolution process [i.e. distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits] before the process is started… In cases where small kernels are processed in parallel (e.g., the case discussed herein of 4 times 3x3), two or more MAC units (e.g., two or more MAC clusters) may select a same feature row of the fifth CA internal buffer 618 feature line buffer. For example, when the exemplary embodiment is considered, the first, second, and third MAC clusters will handle the first kernel [i.e. wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks that include portions of the weight data belonging to different convolution kernels]; the fourth, fifth, and sixth clusters will handle the second kernel [i.e. wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks that include portions of the weight data belonging to different convolution kernels], and so on…)
broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits, wherein each of the plurality of basic processing circuits receives the same portion of the input data block; (As depicted in Figs. 6B-D, in 0261-0290: The processing executed in each CA 600 may be directed according to one or more configuration registers. The configuration registers may be embodied in the CA configuration logic 626, the CAF control registers 402, or in other registers associated with each CA 600. In some embodiments, the configuration registers may be accessed and programmed by a host processor such as applications processor 128, a DSP of DSP cluster 122, a command passed into the stream switch 500 and processed by message/command logic 512, or by some other circuitry… Depending on the geometry of feature data, the kernel size, and the available MAC units 620, a CA 600 may process multiple kernels in parallel. When a convolution process starts, a CA 600 will accept a sufficient amount of feature data lines [i.e. broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits] and the kernels required to perform the convolution process before the process is started… In cases where small kernels are processed in parallel (e.g., the case discussed herein of 4 times 3x3), two or more MAC units (e.g., two or more MAC clusters) may select a same feature row [i.e. broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits, wherein each of the plurality of basic processing circuits receives the same portion of the input data block] of the fifth CA internal buffer 618 feature line buffer. For example, when the exemplary embodiment is considered, the first, second, and third MAC clusters will handle the first kernel; the fourth, fifth, and sixth clusters will handle the second kernel, and so on. In cases where larger kernels are processed, all MAC units (e.g., all clusters) may be used to calculate the result of one single kernel, and only one kernel is handled at a time.)
performing, by each of the plurality of basic processing circuits, operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result; (As depicted in Figs. 6B-D, with the claimed operation results as partial sums, in 0150-0152: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations, which enable the reuse of a data stream at multiple block instances [i.e. performing, by each of the plurality of basic processing circuits, operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result]… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle). An adder tree accumulates MAC results for each kernel column [i.e. performing, by each of the plurality of basic processing circuits, operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result]… Kernel sets are partitioned in batches processed sequentially and intermediate results [i.e. an operation result] can be stored in the SoC global memory 126 [i.e. providing, by the plurality of basic processing circuits, the respective operation results to the main processing circuit]…)
providing, by the plurality of basic processing circuits, the respective operation results to the main processing circuit; (in 0150-0152: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations, which enable the reuse of a data stream at multiple block instances… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle). An adder tree accumulates MAC results for each kernel column… Kernel sets are partitioned in batches processed sequentially and intermediate results can be stored in the SoC global memory 126 [i.e. providing, by the plurality of basic processing circuits, the respective operation results to the main processing circuit]… The CAF 400 may be configured for image processing, audio processing, prediction analysis (e.g., games of skill, marketing data, crowd behavior prediction, weather analysis and prediction, genetic mapping, disease diagnosis, and other scientific, commercial, and such processing) or some other type of processing; particularly processing that includes convolutional operations [i.e. providing, by the plurality of basic processing circuits, the respective operation results to the main processing circuit]…)
and calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits. (in 0046-0047:After processing in the ReLU layer, data in the normalized output map may be averaged in order to predict whether or not the feature of interest characterized by the kernel [i.e. a convolution operation result according to the operation results provided by the plurality of basic processing circuits] is found or is not found in the unknown image [i.e. and calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits.]. In this way, each value in a normalized output map is used as a weighted "vote" that indicates whether or not the feature is present in the image. In some cases, several features (i.e., kernels) are convolved, and the predictions are further combined to characterize the image more broadly…After the convolution process produces a kernel map (i.e., a feature image), the kernel map is passed through a pooling layer, and a normalization (i.e., ReLU) layer. All of the values in the output maps are averaged (i.e., sum and divide), and the output value from the averaging is used as a prediction of whether or not the unknown image contains the particular feature found in the known image… . Considering the entire CNN, a two-dimensional image is input to the CNN and produces a set of votes at its output [i.e. a convolution operation result according to the operation results provided by the plurality of basic processing circuits]. The set of votes at the output are used to predict whether the input image either does or does not contain the object of interest that is characterized by the features [i.e. and calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits.].)
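The averaging step quoted from Bo's paragraphs 0046-0047 above, in which the values of a normalized output map are averaged into a single "vote," can be sketched as follows. This is a hedged illustration rather than Bo's implementation; the function names `relu` and `vote` are hypothetical, and treating the normalization as a plain element-wise ReLU follows the "(i.e., ReLU)" wording of the quotation.

```python
def relu(output_map):
    """Normalize a map by replacing negative values with zero."""
    return [[max(0, v) for v in row] for row in output_map]

def vote(output_map):
    """Average (sum and divide) all values of the normalized map into one vote."""
    values = [v for row in relu(output_map) for v in row]
    return sum(values) / len(values)
```

A higher vote indicates that the feature characterized by the kernel appears more strongly in the input.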
While Bo teaches the use of parallel circuits in processing data blocks as three-dimensional kernels that can be distributed based on processing instructions, as noted above, Bo does not expressly teach the distributed data blocks as multi-processing partitioning including broadcast partitioned subsets and shared input blocks using a plurality of basic processing circuits.
Yang does expressly teach the distributed data blocks as multi-processing partitioning including broadcast partitioned subsets and shared input blocks using a plurality of basic processing circuits. (As depicted in Fig. 2:

[Figure: media_image2.png]

Figure 2: Multicore partitioning. Top: kernel partitioning broadcasts a shared input to separate cores [i.e. broadcasting, by the main processing circuit, at least a portion of the input data block to the plurality of basic processing circuits, wherein each of the plurality of basic processing circuits receives the same portion of the input data block], each of which processes a disjoint subset of the kernels [i.e. distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits, wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks that include portions of the weight data belonging to different convolution kernels] to produce a disjoint slab of the output (in the K dimension). Bottom: input partitioning broadcasts all kernels across cores which each process a different subset of the input to produce a disjoint subset of the output, shown here in the Y dimension.)
The Bo and Yang references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing methods for information processing by performing convolution operations in parallel computing environments.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for computing convolutional neural network (CNN) computations using partitioned data blocks as disclosed by Yang with the method of performing convolution operations using kernels and partitioned data blocks as disclosed by Bo.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Yang and Bo in order to execute memory hierarchy and computing instructions regarding how the data is partitioned into an optimal set of processing blocks (Yang, Introduction, 1st and 2nd paras.). Doing so would help automatically derive optimized blockings for common networks that improve the energy efficiency of hardware implementations by up to an order of magnitude (Yang, Abstract; Introduction, 1st para.).

Bo further teaches, wherein each basic data block has a same size and broadcasting at least a portion of the input data block to the plurality of basic processing circuits includes: sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block; and (in 0038-0039: As evident in FIG. 1H, a large quantity of data is generated during the convolutional layering process. In addition, each kernel map (i.e., each filtered image) has nearly as many values in it as the original image… The pooling process introduces the concepts of "window size" and "stride." [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] The window size is the dimensions of a window such that a single, maximum value within the window will be selected in the pooling process. A window may be formed having dimensions of m-pixels by n-pixels wherein "m" and "n" are integers, but in most cases, "m" and "n" are equal [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block]… the pooling window is conceptually overlayed onto each portion of the kernel map. The "stride" [i.e. sliding, by the main processing circuit, an operation window…] represents how much the pooling window is moved after each pooling act. If the stride is set to "two," then the pooling window is moved by two pixels after each pooling act. If the stride is set to "three," then the pooling window is moved by three pixels [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] after each pooling act.)
extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits. (in 0036-0039:… For each kernel, a distinct kernel map is created. The plurality of created kernel maps may be envisioned as a stack of kernel maps having a depth equal to the number of filters (i.e., kernels) that are applied. The stack of kernel maps may also be called a stack of filtered images… In the convolutional process of the CNN system 10, a single unknown image is convolved to create a stack of filtered images. The depth of the stack is the same as, or is otherwise based on, the number of filters (i.e., kernels) that are applied to the unknown image… The pooling process [i.e. extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits] introduces the concepts of "window size" and "stride." The window size is the dimensions of a window such that a single, maximum value within the window will be selected in the pooling process. A window may be formed having dimensions of m-pixels by n-pixels wherein "m" and "n" are integers…)
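For illustration only, the "window size" and "stride" concepts quoted above may be sketched as a window that is moved across a two-dimensional block, with the data inside the window extracted at each position. This is a minimal model under stated assumptions, not an implementation from Bo: the function name `sliding_windows` and the sample data are hypothetical, and the maximum taken at the end simply mirrors the max-pooling selection described in the quoted passage.

```python
# Illustrative sketch only; the function name and sample data are assumptions.

def sliding_windows(data, win_h, win_w, stride):
    """Yield (row, col, window) for each window position in a 2D list."""
    rows, cols = len(data), len(data[0])
    for r in range(0, rows - win_h + 1, stride):
        for c in range(0, cols - win_w + 1, stride):
            # Extract the portion of the data that lies inside the window.
            window = [row[c:c + win_w] for row in data[r:r + win_h]]
            yield r, c, window


data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

# A 2x2 window with a stride of two visits four non-overlapping positions.
positions = [(r, c) for r, c, _ in sliding_windows(data, 2, 2, stride=2)]

# Max pooling keeps a single maximum value from each window.
max_pooled = [max(max(row) for row in w)
              for _, _, w in sliding_windows(data, 2, 2, stride=2)]
```

With a stride of one instead of two, the same function yields nine overlapping 2x2 windows for this 4x4 block.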

The examiner notes that the multicasting of data streams disclosed by Bo corresponds to the claimed broadcasting (as depicted in Figs. 6B-D, in 0150-0152: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations [i.e. for broadcasting to the plurality of basic processing circuits], which enable the reuse of a data stream at multiple block instances… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle). An adder tree accumulates MAC results for each kernel column….)
Bo teaches the processing instructions by the main processing circuit for executing broadcasting instructions for deep convolutional neural networks (DCNN) modeled over a sequence of neural network layers, as noted above and in 0004: A DCNN is a computer-based tool that processes large quantities of data and adaptively "learns" by conflating proximally related features within the data, making broad predictions about the data, and refining the predictions based on reliable conclusions and new conflations. The DCNN is arranged in a plurality of "layers," and different types of predictions are made at each layer.
Additionally, Yang teaches the DCNN using a plurality of windows in the convolutional layer operations, in pg. 2, Left Col.: …, a CNN is more clearly thought of as a specialized class of image processing pipelines, rather than as a biological neural model. The operations in this pipeline (convolution, local response normalization, pooling, and fully connected layers) correspond to the different “layers” used in the network… A convolutional layer (Conv) corresponds to a filter bank. In the standard case of 3D input [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block] and output, a convolutional layer maps a C×X×Y input to a K×X×Y output using K shift-invariant 3D stencils [i.e. sliding, by the main processing circuit, an operation window that has a same size as each basic data block in the input data block; and], where each stencil is of the size Fw×Fh×C (i.e., a set of K 3-dimensional convolutions) [i.e. extracting, by the main processing circuit, the portion of the input data block within the operation window at each sliding position for broadcasting to the plurality of basic processing circuits]…
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Bo and Yang for the same reasons disclosed above.

Regarding claim 7, the rejection of claim 1 is incorporated and Bo in combination with Yang further teaches the convolution operation method of claim 1, wherein: performing, by each of the plurality of basic processing circuits, the operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain the operation result further includes: performing, by each of the plurality of basic processing circuits, multiplication operations on element values of the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results; (As depicted by Fig. 6D [i.e. including using the MAC basic processing circuits for performing, by each of the plurality of basic processing circuits, multiplication operations on element values of the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results], in 0150-0151: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations [i.e. broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain the operation result further includes], which enable the reuse of a data stream at multiple block instances. Linked lists control the fully autonomous processing of an entire convolution layer. Multiple accelerators grouped or chained together handle varying sizes for feature maps data and multiple kernels in parallel.
… Each CA 600 includes a line buffer to fetch up to a predetermined number (e.g., 12) of feature map data words in parallel with a single memory access… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle) [i.e. performing, by each of the plurality of basic processing circuits, multiplication operations on element values of the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results]. An adder tree accumulates MAC results for each kernel column. The overlapping, column based calculation of the MAC operations allows an acceptably optimal reuse of the feature maps data for multiple MACs, thus reducing power consumption associated with redundant memory accesses.)
and providing, by each of the basic processing circuits, the plurality of multiplication results to the main processing circuit; (As depicted in Fig. 6D, 0150-0152: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations, which enable the reuse of a data stream at multiple block instances [i.e. and providing, by each of the basic processing circuits, the plurality of multiplication results to the main processing circuit]… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle). An adder tree accumulates MAC results for each kernel column… Kernel sets are partitioned in batches processed sequentially and intermediate results [i.e. the plurality of multiplication results] can be stored in the SoC global memory 126 [i.e. and providing, by each of the basic processing circuits, the plurality of multiplication results to the main processing circuit;]…)
and calculating, by the main processing circuit, the convolution operation result includes accumulating, by the main processing circuit, the plurality of multiplication results provided by each of the basic processing circuits to obtain a convolution result for each basic processing circuit; and sorting, by the main processing circuit, results to obtain the convolution operation result. (As depicted in Fig. 6D, in 0235-0237: … The second CA internal buffer 612 is physically or virtually arranged in line with the second CA input data interface 604 to automatically store streamed kernel data until the kernel data is passed to a dedicated fourth CA internal buffer 616 that is dedicated to storing kernel buffer data… Once stored in the feature line buffer, the feature data is applied to the plurality of CA MAC units 620. Feature and kernel buffer data applied to the CA MAC units 620 is mathematically combined according to the convolutional operations described herein, and the resulting output products from the CA MAC units [i.e. and calculating, by the main processing circuit, the convolution operation result includes accumulating, by the main processing circuit, the plurality of multiplication results provided by each of the basic processing circuits to obtain a convolution result for each basic processing circuit;] 620 are passed to the CA adder tree 622. The CA adder tree 622 mathematically combines (e.g., sums) the incoming MAC unit data and batch data passed through the first CA input data port. In some cases, the CA 600A also includes an optional CA bus port interface 624. The CA bus port interface 624, when it is included, may be used to pass data [i.e. the plurality of multiplication results provided by each of the basic processing circuits to obtain a convolution result for each basic processing circuit; and sorting, by the main processing circuit, results to obtain the convolution operation result] into or out from the CA 600A from SoC global memory 126 [i.e. and sorting, by the main processing circuit, results to obtain the convolution operation result] or some other location. In some cases, the applications processor 128, a DSP of the DSP cluster 122, or some other processor directs the passage of data, commands, or other…; the host memory sorting process for arranging DCNN topologies, in 0067: The exemplary architecture may include, in the SoC, one or more (e.g., four) static random access memory (SRAM) banks or some other architecture memory with multi-byte (e.g., 1 Mbyte) memory, one or more dedicated bus ports [i.e. instructions for sorting, by the main processing circuit, results to obtain the convolution operation result], and coarse-grained, fine-grained, or coarse- and fine-grained power gating logic. The exemplary architecture is arranged to sustain, without the need to access external memory, acceptably high throughput for convolutional stages fitting DCNN topologies such as AlexNet without pruning or larger topologies [i.e. instructions for sorting, by the main processing circuit, results to obtain the convolution operation result], and in some cases, particularly larger topologies if fewer bits are used for activations and/or weights...)

Regarding claim 8, the rejection of claim 1 is incorporated and Bo in combination with Yang further teaches the convolution operation method of claim 1,
wherein: performing, by each of the plurality of basic processing circuits, the operations on the portion of the input data block broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain the operation result further includes: performing, by each of the plurality of basic processing circuits, multiplication operations on element values of a part the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results; (As depicted by Fig. 6D [i.e. including using the MAC basic processing circuits for performing, by each of the plurality of basic processing circuits, multiplication operations on element values of a part the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results], in 0150-0151: The CAF 400 allows for the definition of a selectable number of concurrent, virtual processing chains at run time. The CAF 400 also includes a full featured back pressure mechanism to control data flow to the various components of the framework. The CAF 400 is arranged for stream multicasting operations [i.e. broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain the operation result], which enable the reuse of a data stream at multiple block instances. Linked lists control the fully autonomous processing of an entire convolution layer. Multiple accelerators grouped or chained together handle varying sizes for feature maps data and multiple kernels in parallel.
… Each CA 600 includes a line buffer to fetch up to a predetermined number (e.g., 12) of feature map data words in parallel with a single memory access… In each CA (600), a register-based kernel buffer provides multiple read ports (e.g., 36), while multiple fixed-point multiply-accumulate (MAC) units (e.g., 36 16-bit MAC units) perform multiple MAC operations per clock cycle (e.g., up to 36 operations per clock cycle) [i.e. performing, by each of the plurality of basic processing circuits, multiplication operations on element values of the portion of the input data block and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results]. An adder tree accumulates MAC results for each kernel column. The overlapping, column based calculation of the MAC operations allows an acceptably optimal reuse of the feature maps data for multiple MACs, thus reducing power consumption associated with redundant memory accesses.)
accumulating, by each of the basic processing circuits, the plurality of multiplication results to obtain a convolution result; and sending, by each of the basic processing circuits, the convolution result to the main processing circuit; and calculating, by the main processing circuit, the convolution operation result further includes: sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result. (As depicted in Fig. 6D, in 0235-0237: … The second CA internal buffer 612 is physically or virtually arranged in line with the second CA input data interface 604 to automatically store streamed kernel data until the kernel data is passed to a dedicated fourth CA internal buffer 616 that is dedicated to storing kernel buffer data… Once stored in the feature line buffer, the feature data is applied to the plurality of CA MAC units 620. Feature and kernel buffer data applied to the CA MAC units 620 is mathematically combined according to the convolutional operations described herein, and the resulting output products from the CA MAC units [i.e. accumulating, by each of the basic processing circuits, the plurality of multiplication results to obtain a convolution result] 620 are passed to the CA adder tree 622. The CA adder tree 622 mathematically combines (e.g., sums) the incoming MAC unit data and batch data passed through the first CA input data port. In some cases, the CA 600A also includes an optional CA bus port interface 624. The CA bus port interface 624, when it is included, may be used to pass data [i.e. and calculating, by the main processing circuit, the convolution operation result further includes: sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result.] into or out from the CA 600A from SoC global memory 126 [i.e. sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result.] or some other location. In some cases, the applications processor 128, a DSP of the DSP cluster 122, or some other processor directs the passage of data, commands, or other…; the host memory sorting process for arranging DCNN topologies, in 0067: The exemplary architecture may include, in the SoC, one or more (e.g., four) static random access memory (SRAM) banks or some other architecture memory with multi-byte (e.g., 1 Mbyte) memory, one or more dedicated bus ports [i.e. instructions for sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result.], and coarse-grained, fine-grained, or coarse- and fine-grained power gating logic. The exemplary architecture is arranged to sustain, without the need to access external memory, acceptably high throughput for convolutional stages fitting DCNN topologies such as AlexNet without pruning or larger topologies [i.e. instructions for sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result.], and in some cases, particularly larger topologies [i.e. instructions for sorting, by the main processing circuit, a plurality of convolution results provided by the plurality of basic processing circuits to obtain the convolution operation result.] if fewer bits are used for activations and/or weights...)
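For illustration only, the two divisions of labor recited in claims 7 and 8 may be sketched as follows: in one, each unit returns its element-wise products and a central step accumulates them; in the other, each unit accumulates locally and returns a single value. This is a minimal model under stated assumptions, not an implementation from Bo or from the application: the function name `multiply`, the sample data, and the treatment of "sorting" as ordering results by kernel name are all hypothetical.

```python
# Illustrative sketch only; names, data, and the ordering step are assumptions.

def multiply(window, kernel):
    """Element-wise products of values at corresponding positions (flattened)."""
    return [a * b for a, b in zip(window, kernel)]


window = [1, 2, 3, 4]                      # same input portion sent to every unit
kernels = {"k0": [1, 0, 0, 1],             # a different kernel for each unit
           "k1": [2, 2, 2, 2]}

# Variant A: units return raw products; the central step accumulates them.
products = {name: multiply(window, k) for name, k in kernels.items()}
main_accumulated = {name: sum(p) for name, p in products.items()}

# Variant B: each unit accumulates locally and returns one value.
unit_accumulated = {name: sum(multiply(window, k)) for name, k in kernels.items()}

# Results are then arranged in a fixed order (modeled here as kernel-name order).
ordered = [main_accumulated[name] for name in sorted(main_accumulated)]
```

Both variants produce the same per-kernel sums; they differ only in where the accumulation takes place and in how much data each unit returns.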

Regarding claim 9, the rejection of claim 1 is incorporated and Bo in combination with Yang further teaches the convolution operation method of claim 1,
wherein the processing device further includes branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the method further includes: transmitting, by the branch processing circuits, data among the main processing circuit and the plurality of basic processing circuits. (As depicted in Fig. 6D, in 0235-0237: … The second CA internal buffer 612 is physically or virtually arranged in line with the second CA input data interface 604 to automatically store streamed kernel data until the kernel data is passed to a dedicated fourth CA internal buffer 616 that is dedicated to storing kernel buffer data… Once stored in the feature line buffer, the feature data is applied to the plurality of CA MAC units 620. Feature and kernel buffer data applied to the CA MAC units 620 is mathematically combined according to the convolutional operations described herein, and the resulting output products from the CA MAC units [i.e. processing device further includes branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the method further includes:] 620 are passed to the CA adder tree 622. The CA adder tree 622 mathematically combines (e.g., sums) the incoming MAC unit data and batch data passed through the first CA input data port. In some cases, the CA 600A also includes an optional CA bus port interface 624. The CA bus port interface 624 [i.e. device further includes branch processing circuits configured to connect the main processing], when it is included, may be used to pass data into or out from the CA 600A [includes the plurality of basic processing circuits] from SoC global memory 126 [passing data circuitries as branch processing circuits wherein the processing device further includes branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the method further includes: transmitting, by the branch processing circuits, data among the main processing circuit and the plurality of basic processing circuits.] or some other location. In some cases, the applications processor 128, a DSP of the DSP cluster 122, or some other processor directs the passage of data, commands, or other…; the host memory sorting process for arranging DCNN topologies, in 0067: The exemplary architecture may include, in the SoC [as part of the main processing circuit], one or more (e.g., four) static random access memory (SRAM) banks or some other architecture memory with multi-byte (e.g., 1 Mbyte) memory, one or more dedicated bus ports [i.e. bus ports as branch processing circuits wherein the processing device further includes branch processing circuits configured to connect the main processing circuit to the plurality of basic processing circuits, and the method further includes: transmitting, by the branch processing circuits, data among the main processing circuit and the plurality of basic processing circuits.], and coarse-grained, fine-grained, or coarse- and fine-grained power gating logic...)

Regarding independent claim 10, Bo in combination with Yang teaches a processing device comprising claim limitations that are similar to the claim 1 limitations, and the claim is rejected under the same rationale.

Regarding claim 16, the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 7 and are rejected under the same rationale. 

Regarding claim 17, the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 8 and are rejected under the same rationale.

Regarding claim 18, the rejection of claim 10 is incorporated and the claim limitations are similar to the limitation in claim 9 and are rejected under the same rationale.

Regarding claim 19, the rejection of claim 10 is incorporated and Bo in combination with Yang further teaches the processing device of claim 10,
wherein the main processing circuit includes one or any combination of a vector arithmetic unit circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit, or a data rearrangement circuit. (Bo in 0073: In some embodiments of the exemplary architecture, each 32-bit DSP is arranged to perform any one or more instructions of a set of specific instructions (e.g., Min, Max, Sqrt, Mac, Butterfly, Average, 2-4 SIMD ALU [i.e. an arithmetic logic unit (ALU) circuit]) to accelerate and support the convolutional operations of a DCNN. A dual load with 16b saturated MAC [i.e. an accumulator circuit], advanced memory buffer addressing modes, and zero latency loop control executed in a single cycle while an independent two-dimensional (2D) DMA [i.e. a direct memory access (DMA) circuit] channel allows the overlap of data transfers. The DSP's perform pooling (e.g., max pooling, average pooling, etc.) [i.e. a data rearrangement circuit], nonlinear activation, cross-channel response normalization, and classification representing a selected fraction of the total DCNN computation in an architecture that is flexible and amenable to future algorithmic evolutions. DSP's in the exemplary architecture can operate in parallel with CA's and data transfers,…; and as depicted in Fig. 6D, in 0261-0264: The processing executed in each CA 600 may be directed according to one or more configuration registers. The configuration registers may be embodied in the CA configuration logic 626, the CAF control registers 402, or in other registers associated with each CA 600... The CA 600 may optionally filter kernel batches if, for example, multiple accelerators are supplied with data through a single DMA controller 406 [i.e. a direct memory access (DMA) circuit]. The CA 600 may optionally enable and perform kernel decompression [i.e. a data rearrangement circuit].
Depending on the geometry of feature data, the kernel size, and the available MAC units 620 [i.e. accumulator circuit], a CA 600 may process multiple kernels in parallel…)

Regarding claim 20, the rejection of claim 10 is incorporated and Bo in combination with Yang further teaches the processing device of claim 10,
wherein each of the basic processing circuits includes of an inner-product arithmetic unit circuit or an accumulator circuit. (Bo teaches instructions for performing convolution, pooling and classifier operations including an accumulator circuit as depicted in Fig. 6D, in 0261-0264: The processing executed in each CA 600 may be directed according to one or more configuration registers. The configuration registers may be embodied in the CA configuration logic 626, the CAF control registers 402, or in other registers associated with each CA 600... Depending on the geometry of feature data, the kernel size, and the available MAC units 620 [i.e. each of the basic processing circuits includes of… an accumulator circuit], a CA 600 may process multiple kernels in parallel…)

Claims 2-5 and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Boesch et al. (US Pat. No. US 2018/0189642, hereinafter ‘Bo’) in view of Yang et al. (NPL: “A Systematic Approach to Blocking Convolutional Neural Networks”, hereinafter ‘Yang’), and in further view of Lan et al. (NPL: “A Library for Deep Learning Processor”, hereinafter ‘Lan’).

	
Regarding claim 2, the rejection of claim 1 is incorporated and Bo in combination with Yang further teaches the convolution operation method of claim 1, wherein: the input data in the input data block are arranged as a first four-dimensional data block, … and the weight data in the weight data block are arranged as a second four-dimensional data block... (Bo teaches the use of an input data and weight kernel data as four dimensional data blocks as depicted in Fig. 6D; and Yang teaches the use of the data blocks as depicted in Fig. 2)
While Bo and Yang teach the use of data blocks having four dimensions for data flow in a memory hierarchy for processing in a parallel computing environment, the notations as claimed are not expressly taught by the Bo and Yang references, as recited by the limitations: … the input data in the input data block are arranged as a first four-dimensional data block, with H number of data in a first dimension, W number of data in a second dimension, C number of data in a third dimension, and N number of data in a fourth dimension of the first four-dimensional data block; …
Lan teaches the notations as claimed (in Table 1 [i.e. the input data in the input data block are arranged as a first four-dimensional data block, with H number of data in a first dimension, W number of data in a second dimension, C number of data in a third dimension, and N number of data in a fourth dimension of the first four-dimensional data block] and Table 2 [i.e. with KH number of data in a first dimension, KW number of data in a second dimension, C number of data in a third dimension, and M number of data in a fourth dimension of the second four-dimensional data block], in pg. 290, Left Col.: … The data format is an enumeration type variable used to indicate the data layout of the tensor. The order of these letters implies the data arrangement of the tensor. For example, NCHW indicates that the W stride is 1, the H stride is W, the C stride is H×W, and the N stride is C×H×W. The data type indicates the data type of elements in the tensor.
[img-media_image4.png]
For simplicity, we only provide a 4D-tensor data structure. For example, a 2D-tensor can be regarded as a 4D-tensor which has the parameters H and W of 1. Filter. The synaptic weight is a unique concept in neural networks. In DLPlib, synaptic weights are represented as a filter, which represents the learned synapses data of convolution and fully-connected operations… Table 2 shows the parameters of a convolutional filter. The four dimensions, OC, IC, Kh and Kw are used to indicate the number of output feature maps, the number of input feature maps, the height and the width of the kernels respectively.
[img-media_image5.png]
)
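For illustration only, the NCHW stride arithmetic quoted from Lan (W stride of 1, H stride of W, C stride of H×W, N stride of C×H×W) may be sketched as follows. This is a minimal model of that arithmetic under stated assumptions and is not code from Lan's library: the function names `nchw_strides` and `nchw_offset` and the sample shape are hypothetical.

```python
# Illustrative sketch only; function names and the sample shape are assumptions.

def nchw_strides(n, c, h, w):
    """Element strides of a contiguous tensor stored in NCHW order."""
    # n does not affect any stride; it is accepted so the full shape can be passed.
    return {"N": c * h * w, "C": h * w, "H": w, "W": 1}


def nchw_offset(index, shape):
    """Flat offset of element (n, c, h, w) within a contiguous NCHW tensor."""
    n_i, c_i, h_i, w_i = index
    s = nchw_strides(*shape)
    return n_i * s["N"] + c_i * s["C"] + h_i * s["H"] + w_i * s["W"]


shape = (2, 3, 4, 5)                  # N=2, C=3, H=4, W=5
strides = nchw_strides(*shape)
last_offset = nchw_offset((1, 2, 3, 4), shape)
```

For this shape the tensor holds 2×3×4×5 = 120 elements, and the last index maps to flat offset 119, which is consistent with a dense layout.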
The Bo, Yang, and Lan references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing methods for performing convolution operations in parallel computing environments.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for using four dimensional data structures for representing the data and weight information as disclosed by Lan with the method for performing convolution operations in parallel computing environments disclosed by Bo.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to optimize the data structures used with deep learning processors (e.g., the deep learning library for GPU, cuDNN) (Lan, Abstract).

	
Regarding claim 3, the rejection of claim 2 is incorporated, and Bo in combination with Yang and Lan further teaches the convolution operation method of claim 2, wherein a quantity of the plurality of convolution kernels is equal to M, and dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into M basic data blocks each comprising a convolution kernel. (in para. [0071]: A plurality of CA's may be grouped to achieve larger computational entities, which provides flexibility to neural network designers by enabling choices for desirable balancing of available data bandwidth, power, and available processing resources. Kernel sets may be partitioned in batches and processed sequentially, and intermediate results may be stored in on-chip memory. Various kernel sizes (e.g., up to 12x12), various batch sizes (e.g., up to 16), and parallel kernels (e.g., up to 4 [i.e. dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into M basic data blocks each comprising a convolution kernel]) can be handled by a single CA instance, and any size kernel can be accommodated with the accumulator input [i.e. a quantity of the plurality of convolution kernels is equal to M, and dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into M basic data blocks each comprising a convolution kernel]; and dividing the weight kernels into M = 4 basic data blocks for performing convolution operations, as depicted in Fig. 6D)

Regarding claim 4, the rejection of claim 2 is incorporated, and Bo in combination with Yang and Lan further teaches the convolution operation method of claim 2, wherein a quantity of the plurality of convolution kernels is equal to M, each convolution kernel including C weight …, wherein dividing the weight data block into a plurality of basic data blocks includes: dividing, by the main processing circuit, the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C. (M = 4, C = 3, and the first quantity of data blocks is 12, processed by the 12 MAC units, as depicted in Fig. 6D; where the kernels can be represented as matrix values for computing the feature maps as depicted in Figs. 1H, in )
Additionally, Lan teaches the claim 4 limitation of using the weight kernel as a matrix:
dividing, by the main processing circuit, the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C. (Lan teaches performing the convolutions as a 4-dimensional tensor across the input using a divided set of filters using matrix computations, in pg. 290: Sec. 3.3: In order to balance the flexibility, DLPlib also provides basic vector/matrix computations [i.e. the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C] which allow users to implement new and more complex operations [18]. In addition, DLPlib provides a series of functions to catenate, split and reshape the data. An operator takes m (m > 0) input tensors and n (n > 0) output tensors. For some operators, a set of attributes are provided to describe their computational behavior (e.g., Conv., Pooling). Conv. Convolution is the most important layer in convolutional neural networks (CNNs). It takes a 4D-tensor as input, and outputs a 4D-tensor. The output is computed by using a set of filters convolving across through the input…; where the claimed M multiplied by C is disclosed as the IC number of feature maps, as depicted in Table 2, in pg. 290: Left Col. … For simplicity, we only provide a 4D-tensor data structure. For example, a 2D-tensor can be regarded as a 4D-tensor which has the parameters H and W of 1. Filter. The synaptic weight [i.e. the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C] is a unique concept in neural networks. In DLPlib, synaptic weights are represented as a filter, which represents the learned synapses data of convolution and fully-connected operations…. Table 2 shows the parameters of a convolutional filter. The four dimensions, OC, IC, Kh and Kw are used to indicate the number of output feature maps [i.e. the weight data block into a first quantity of basic data blocks each comprising a weight matrix, wherein the first quantity is equal to M multiplied by C], the number of input feature maps, the height and the width of the kernels respectively.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Bo, Yang, and Lan for the same reasons disclosed above.
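For illustration only, the two divisions recited in claims 3 and 4 (a weight data block split into M basic data blocks, one per convolution kernel, or into M multiplied by C basic data blocks, one per KH × KW weight matrix) can be sketched as follows. The function names, the nested-list representation, and the KH = KW = 2 kernel size are hypothetical choices made for this sketch; M = 4 and C = 3 follow the values the examiner reads from Fig. 6D. The sketch is not taken from Bo, Yang, Lan, or the application.

```python
def make_weight_block(m, c, kh, kw):
    """Build an M x C x KH x KW weight block as nested lists with dummy values."""
    return [
        [
            [[((mi * c + ci) * kh + hi) * kw + wi for wi in range(kw)]
             for hi in range(kh)]
            for ci in range(c)
        ]
        for mi in range(m)
    ]


def split_into_kernels(block):
    """Claim 3 style: one basic data block per convolution kernel (M blocks)."""
    return list(block)


def split_into_matrices(block):
    """Claim 4 style: one basic data block per KH x KW weight matrix (M*C blocks)."""
    return [matrix for kernel in block for matrix in kernel]


# M = 4 kernels, C = 3 channels; the 2 x 2 kernel size is an arbitrary example.
weights = make_weight_block(m=4, c=3, kh=2, kw=2)
kernel_blocks = split_into_kernels(weights)
matrix_blocks = split_into_matrices(weights)
```

Here `kernel_blocks` holds 4 basic data blocks and `matrix_blocks` holds 4 × 3 = 12, matching the count of 12 discussed for claim 4.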
	
Regarding claim 5, the rejection of claim 3 is incorporated, and Bo in combination with Yang and Lan further teaches the convolution operation method of claim 3,
wherein distributing the plurality of basic data blocks to the plurality of basic processing circuits includes: distributing, by the main processing circuit, one or more of the M multiple convolution kernels to at least one basic processing circuit when a number of the basic processing circuits is less than M (Bo teaches distribution of a kernel, when the number of kernels = 4, to the grouping of 3 MAC units [i.e. the basic processing circuits] per row processing group using different input blocks and kernel 0, as depicted in Fig. 6D); and distributing, by the main processing circuit, each of the convolution kernels to a separate basic processing circuit when the number of the basic processing circuits is equal to or larger than M. (Bo teaches distribution of the kernels when M = 4, i.e., the number of kernels equals the 4 MAC units per column processing group using the same input block, as depicted in Fig. 6D as Kernels 0-3)
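For illustration only, the two distribution cases recited in claim 5 can be sketched as follows. The round-robin assignment and the function name `distribute_kernels` are assumptions made for this sketch; neither the claim nor Bo is stated here to use round-robin ordering. The sketch shows only that, with fewer circuits than kernels, at least one circuit receives more than one kernel, while with at least M circuits each kernel goes to a separate circuit.

```python
def distribute_kernels(num_kernels, num_circuits):
    """Assign kernel indices to basic processing circuits (round-robin, assumed).

    If num_circuits < num_kernels, some circuits receive several kernels.
    If num_circuits >= num_kernels, each kernel lands on its own circuit and
    any remaining circuits stay idle.
    """
    assignment = {circuit: [] for circuit in range(num_circuits)}
    for kernel in range(num_kernels):
        assignment[kernel % num_circuits].append(kernel)
    return assignment


# M = 4 kernels on 3 circuits: one circuit must take two kernels.
fewer_circuits = distribute_kernels(num_kernels=4, num_circuits=3)

# M = 4 kernels on 4 circuits: one kernel per circuit.
enough_circuits = distribute_kernels(num_kernels=4, num_circuits=4)
```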

Regarding claims 11-14, the claim limitations are similar to the limitations of claims 2-5, respectively; therefore, the claims are rejected under the same rationale.

	
	Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

The prior art made of record and not relied upon, considered pertinent to applicant's disclosure, is listed below:
Tashev et al. (US Pat. No. 10528147): teaches the scope of kernels as disclosed for processing convolutional neural network operations, in 3:55-4:45: “…The convolution sublayer may include multiple filters (also referred to as kernels or activation functions…”

Towal (US Pat. No. 10528147): teaches, in 9:55-10:6: “…In conventional systems, convolution may be specified for linear filtering of an image. Specifically, the convolution output is the weighted sum of input pixels. The matrix of weights may be referred to as the convolution kernel, or filter. The convolution may be obtained by a matrix multiply of a linearized image and a linearized filter…”

Jiao et al. (US Pat No. 8644643) teaches the graphics processor for convolution filter using inner product and AMU circuits. 

Ngo (NPL: “FPGA Hardware Acceleration of Inception Style Parameter Reduced Convolution Neural Networks”): teaches that “[a] common simplification used to learn basic CNN operation is to consider input image and kernels as two-dimensional arrays, but in practice multichannel inputs such as RGB or the combine output feature maps, form three-dimensional data volumes that utilize similarly three-dimensional convolution kernels. This enables CNNs to distinguish higher order correlations such as colour and textures, and numerical relations. …While this divide-and-conquer methodology provides greater functional density, it comes at a cost; by storing the unused network portions off chip, memory bandwidth is fast becoming the limiting factor, unable to keep up with the raw computation power of modern FPGAs. The majority of CNN solutions take inspiration from the SIMD vector processors of the GPUs commonly used for ANN development. Leveraging the GPU architectural reliance on program counter and instruction issues in order to emulate any ANN by breaking the topology down to an ambiguous massive series of floating point calculations.”

Suda et al. (NPL: “Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks”): teaches a method for CNN accelerator optimization using parallel computing resources and an OpenCL implementation of matrix multiplication operations using MAC; interfacing IPs as branch circuits for communications operations between the host and accelerator board; and the use of the inner product.

Chellapilla et al. (NPL: “High Performance Convolutional Neural Networks for Document Processing”): teaches M as the number of inputs associated with computing the matrix multiplications resulting in the feature map kernels.

Pande et al. (NPL: “Matrix Convolution using Parallel Programming”): teaches parallel processing of the matrix convolution operations using convolution filters to perform multiplication operations.

Tsai et al. (NPL: “Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks”): teaches the processing data type as 3-dimensional objects with a notation for the number of 3-D objects as the N and K values, as depicted in Fig. 2.

Wasserman et al. (US Patent No. 7,737,994): teaches the processing of large kernel convolutions using graphic accelerators to optimize the computation time using parallel computing.

Goyal et al. (US Pub. No. 2017/0316312): teaches deep learning processor for processing matrix-matrix multiplication operations for a convolutional network. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516.  The examiner can normally be reached on Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/O.O.A./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129