DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 07/17/2018 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: reference character 828 in Fig. 8.  Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The specification is objected to under 37 C.F.R. 1.74, which requires the detailed description to refer to the different parts of the figures by use of reference letters or reference numerals. Implicit in this rule is that the detailed description correctly reference the figures. In this application the figures and detailed description are inconsistent as explained below.
A. Paragraph [0035] refers to a bus 235 in Fig. 2; however, Fig. 2 does not include a reference character 235, and the bus is labelled as reference character 132 instead.
Claim Objections
Claims 1-12 and 20 are objected to because of the following:
A. In claim 1 lines 19-20, “each of the processing cycles” should read “each of the plurality of processing cycles” instead for consistency of claim terminology. Claim 2 recites a similar limitation in line 5 and is objected to for the same reason. Claim 20 recites a similar limitation in lines 20-21 and is objected to for the same reason. Claims 2-12 inherit the same deficiency as claim 1 by reason of dependence.
B. In claim 3 lines 8-9, “kernel coefficient” should read “the kernel coefficient” instead.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 13-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 13 recites “the neural engine circuits” in line 8, “the plurality of neural engine circuits” in lines 9-10, and “the processing cycles” in line 15. There is insufficient antecedent basis for these limitations in the claim. Claim 14 recites “the processing cycles” in line 5 and is rejected for the same reason. Claim 15 recites “the neural engine circuits” in lines 3-4, 6-7, and 8 and is rejected for the same reason. Claims 14-19 inherit the same deficiency as claim 13 by reason of dependence.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 7-8, 11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US-PGPUB 2018/0129935 A1), hereinafter Kim, in view of Talpes et al. (US-PGPUB 2019/0026249 A1), hereinafter Talpes.
Regarding claim 1, Kim teaches a neural processor circuit, comprising (Kim Fig. 3 and paragraph [0065] neural processor circuit – convolutional neural network (CNN) system 100):
a plurality of neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data (Kim Fig. 3 and paragraph [0068] a plurality of neural engine circuits – MAC cores 121-12i; “the MAC computator 120 may include a plurality of MAC cores 121 to 12i. As described in relation to FIG. 2, each of the plurality of MAC cores 121 to 12i may use a plurality of kernels to perform convolution computations on the input tile Din_T”; Fig. 4 and paragraphs [0072-0073] work unit of input data and kernel data – input tile Din_T and kernel KER_1 to KER_M);
a data buffer between the plurality of neural engine circuits and a system memory external to the neural processor circuit, the data buffer configured to store at least a portion of the input data received from the system memory for sending to the neural engine circuits and to store output data received from the neural engine circuits, the portion of the input data comprising the work unit of the input data (Kim Fig. 3 data buffer – input buffer device 110 and output buffer device 130; system memory – external memory 101; Fig. 3 shows external memory 101 is external to the CNN system 100; paragraph [0066] “the input buffer device 110 may load the part Din_T of the input data from the external memory 101 … the part Din_T of the input data loaded to the input buffer device 110 is called as an input tile”; paragraph [0072] “the input buffer device 110 may load an input tile Din_T that is a part of input data Din. At this point, the input tile Din_T may have a size of TnxTwxTh”; paragraph [0069] “The output buffer device 130 may load the part Dout_T of the output data of the convolution computation”; paragraph [0074] “The generated output tile Dout_T may be loaded to the output buffer device 130”); and
a kernel fetcher circuit between the plurality of neural engine circuits and the system memory, the kernel fetcher circuit configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits (Kim Fig. 3 kernel fetcher – weight kernel buffer device 140; paragraph [0070] “The weight kernel buffer device 140 may load, from the external memory 101 parameters necessary for convolution computation … and may provide the loaded parameters to the MAC computator 120”; paragraphs [0068, 0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140 to perform convolution computations on the input tile Din_T loaded to the input buffer device 110”),
wherein at least one of the neural engine circuits is configured to:
receive a plurality of matrix elements of a matrix as at least the portion of the input data from the data buffer over a plurality of processing cycles (Kim Figs. 2 and 4 and paragraphs [0055-0056, 0073] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”; “the multiplier of the MAC core L1_1 may perform a multiplication on input values of the input data and corresponding weight values … Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation”),
receive a plurality of  (Kim Figs. 2 and 4; paragraph [0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140”), and 
perform multiplication between the matrix and the  (Kim Figs. 2 and 4; paragraphs [0054-0056, 0073] one output channel of the output data – shaded portion of Dout).
Kim does not explicitly teach receive a plurality of vector elements of a vector from the kernel fetcher circuit, each of the vector elements extracted as the corresponding kernel to the at least one neural engine circuit in each of the processing cycles, and perform multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of the output data.
However, in the same field of endeavor, Talpes discloses formatting a weight matrix as a vector of weight values to be processed by a matrix processor (Talpes paragraph [0021]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes and format at least one of the kernels KER_1 to KER_M as a vector of weight values.
The motivation to do so is that, by formatting the kernels as a vector, the values of the vector are stored consecutively within the weight buffer, which allows the entire vector of weight values to be loaded using minimal processing resources (Talpes paragraph [0015]).
Therefore, the combination of Kim as modified in view of Talpes teaches receive a plurality of vector elements of a vector from the kernel fetcher circuit, each of the vector elements extracted as the corresponding kernel to the at least one neural engine circuit in each of the processing cycles, and perform multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of the output data.
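For illustration only, the proposed combination can be pictured with a minimal Python sketch. The function names (`flatten_kernel`, `mac_over_cycles`) and the sample values are hypothetical and appear in neither Kim nor Talpes. The sketch assumes that one weight of the flattened kernel is consumed per processing cycle and that the kernel is flattened in row-major order.

```python
def flatten_kernel(kernel):
    """Store a KxK kernel as one vector of consecutive weight values (row-major)."""
    return [w for row in kernel for w in row]


def mac_over_cycles(inputs, weights):
    """Multiply-accumulate, consuming one input value and one weight per cycle."""
    acc = 0
    for x, w in zip(inputs, weights):  # each loop iteration models one cycle
        acc += x * w
    return acc


# Hypothetical 2x2 kernel and 2x2 input window, chosen only for illustration.
kernel = [[1, 0], [0, -1]]
window = [[5, 6], [8, 9]]

weights = flatten_kernel(kernel)                            # [1, 0, 0, -1]
result = mac_over_cycles(flatten_kernel(window), weights)   # 5*1 + 9*(-1) = -4
```

Because the weights sit in one contiguous list, the whole kernel can be handed to the multiply-accumulate loop in a single pass, which is the storage property the rejection relies on.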

	Regarding claim 3, Kim as modified in view of Talpes teaches all the limitations of claim 1 as stated above. Further, Kim as modified in view of Talpes teaches wherein the at least one of the neural engine circuits is further configured to:
receive a plurality of bias elements of a bias vector  (Kim Fig. 2 and paragraphs [0047, 0057] plurality of bias elements of a bias vector – Mx1x1 bias values);
perform, using multiply-add circuits and accumulators in the at least one neural engine circuit, multiply-accumulate operations on the bias elements  (Kim Fig. 2 and paragraphs [0056-0057] “the MAC core L1_1 may use an adder, a multiplier, a register or the like to perform the above-described convolution computation. For example, the multiplier of the MAC core L1_1 may perform a multiplication on input values of the input data and corresponding weight values. The adder may perform an addition on the result of the multiplication and previous computation results stored in the register. The register may store results of the addition. Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation” where the adder, the multiplier, and the register of the MAC core L1_1 correspond to the multiply-add circuits and accumulators in the at least one neural engine circuit).
	The combination of Kim as modified in view of Talpes thus far does not teach wherein the at least one of the neural engine circuits is further configured to: receive a plurality of bias elements of a bias vector from the data buffer during a processing cycle; receive a kernel coefficient from the kernel fetcher circuit during the processing cycle; and perform, using multiply-add circuits and accumulators in the at least one neural engine circuit, multiply-accumulate operations on the bias elements and the kernel coefficient as part of the convolution operation.
However, in the same field of endeavor, Talpes discloses receiving the bias through a data line from a data formatter and receiving a multiplication identity element. Further, Talpes discloses that the bias parameter is multiplied against the identity element to preserve the bias parameter and the multiplication result (e.g., the bias parameter) is added to the dot-product result as part of the convolution (Talpes paragraphs [0052, 0064]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to further modify Kim in view of Talpes and configure the data buffer to store the bias values such that the bias values are received through the input data line, and further to configure the weight kernel buffer device to include an identity element and perform multiply-accumulate operations on the bias values and the identity element as part of the convolution operation. Configuring the data buffer, instead of the weight buffer, to store the bias values would not change the result of the convolution, as the bias vector values would still be added to the Dout. Further, the reason to include an identity element is to preserve the bias parameter and to perform the bias addition using multiply-accumulate operations.
Therefore, the combination of Kim as modified in view of Talpes teaches wherein the at least one of the neural engine circuits is further configured to: receive a plurality of bias elements of a bias vector from the data buffer during a processing cycle; receive a kernel coefficient from the kernel fetcher circuit during the processing cycle; and perform, using multiply-add circuits and accumulators in the at least one neural engine circuit, multiply-accumulate operations on the bias elements and the kernel coefficient as part of the convolution operation.
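The identity-element rationale can be sketched as follows. This is an illustrative Python model only: the names `mac` and `convolve_with_bias`, the value 1 used as the multiplicative identity, and the placement of the bias in a final cycle are assumptions made for the sketch and are not taken from Kim or Talpes.

```python
IDENTITY = 1  # assumed multiplicative identity supplied on the weight path


def mac(acc, data, weight):
    """One multiply-accumulate step: the same datapath is used for data and bias."""
    return acc + data * weight


def convolve_with_bias(inputs, weights, bias):
    """Accumulate the dot product, then fold in the bias through the same MAC."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc = mac(acc, x, w)
    # Bias arrives on the data path and is multiplied by the identity element,
    # so the product equals the bias and no separate adder stage is required.
    return mac(acc, bias, IDENTITY)


total = convolve_with_bias([1, 2, 3], [4, 5, 6], 7)  # 32 + 7 = 39
```

Routing the bias through the multiplier with an identity weight leaves the bias value unchanged while reusing the existing multiply-accumulate hardware for the addition.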

Regarding claim 7, Kim as modified in view of Talpes teaches all the limitations of claim 1 as stated above. Further, Kim as modified in view of Talpes teaches wherein the at least one of the neural engine circuits is further configured to:
receive a second plurality of matrix elements of a second matrix from the data buffer over multiple processing cycles (Kim Figs. 2 and 4 and paragraphs [0055-0056] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”; “the multiplier of the MAC core L1_1 may perform a multiplication on input values of the input data and corresponding weight values … Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation” where the other input values correspond to the second plurality of matrix elements of a second matrix; paragraph [0075] “on other input tiles of the input data Din, the above-described convolution computations may be recursively performed and results of the recursive performances may be combined to generate the output data Dout” where the other input tiles correspond to the second plurality of matrix elements of a second matrix; in conjunction with Fig. 2, although not shown, the other input values/input tiles correspond to a second NxKxK portion of Din);
	receive a third plurality of matrix elements of a third matrix from the kernel fetcher circuit over the multiple processing cycles (Kim Fig. 2 third plurality of matrix elements of a third matrix – a second kernel of KER_1 to KER_M); and
perform multiplication between the second matrix and the third matrix as a convolution operation on the second matrix elements and the third matrix elements producing multiple output channels of the output data (Kim Fig. 2 and paragraphs [0054-0055] multiple output channels of the output data – M channel of output data Dout; “Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation”; paragraph [0075] “on other input tiles of the input data Din, the above-described convolution computations may be recursively performed and results of the recursive performances may be combined to generate the output data Dout”; in conjunction with Fig. 2, multiplication between other input tiles (unshaded portion of Din) and the kernel KER_1 to KER_M is recursively performed to generate the unshaded portion of Dout).

Regarding claim 8, Kim as modified in view of Talpes teaches all the limitations of claim 7 as stated above. Further, Kim as modified in view of Talpes teaches wherein two or more of the neural engine circuits are configured to (Kim Fig. 7 two or more of the neural engine circuits – MAC cores 221-22i):
receive the second matrix elements from the data buffer (Kim Fig. 7 and paragraph [0098] “the CNN system 200 may further include elements for other respective input tiles, or may recursively perform computation operations on each input tile on the basis of the elements illustrated in FIG. 7”; paragraph [0103] “Each of the plurality of MUXes 251 to 25i may select any one of data values from the connected input buffers to provide the data values to the MAC cores 221 to 22i of the MAC computator 220”; when the operation in Fig. 7 is recursively performed, the second input tile/input values would be received from the data buffer);
receive the third matrix elements from the kernel fetcher circuit (Kim Fig. 7 shows receiving the kernel weights from the weight kernel buffer device); and
perform multiplication between the second matrix and the third matrix as the convolution operation on the second matrix elements and the third matrix elements, using multiply-add circuits and accumulators in each of the two or more neural engine circuits producing one or more output channels of the multiple output channels of the output data (Kim Fig. 7 and paragraphs [0104-0105] “Each of the plurality of MAC cores 221 to 22i of the MAC computator 220 may perform multiplications and additions (i.e. convolution computations) on the basis of a received data value and the sparse weight kernel SW”; “The output buffer device 230 includes a plurality of output buffers, and each of the output buffers may store or accumulate output data from the plurality of MAC cores 221 to 22i. For example, the MAC computator may perform a convolution computation for the input tile Din_T by using a first sparse weight kernel. Hereafter, the MAC computator 220 may perform a convolution computation for the input tile Din_T by using a second sparse weight kernel different from the first sparse weight kernel. A result of the convolution computation using the first sparse weight kernel may be a first channel of an output tile Dout_T, and a result of the convolution computation using the second sparse weight kernel may be a second channel of the output tile Dout_T. In other words, the output buffer device 230 may store or accumulate, as different channels of the output tile Dout_T, the results of convolution computations performed using a plurality of sparse weight kernels. In short, when a convolution computation is performed using M sparse weight kernels with respect to one input tile Din_T, the output tile Dout_T may have M channels”; paragraph [0056] the MAC core L1_1 may use an adder, a multiplier, a register or the like to perform the above-described convolution computation, which corresponds to the multiply-add circuits and accumulators).

Regarding claim 11, Kim as modified in view of Talpes teaches all the limitations of claim 1 as stated above. Further, Kim as modified in view of Talpes teaches wherein the at least one of the neural engines is further configured to (Kim Fig. 2):
receive a first set of elements from the data buffer (Kim Fig. 2 and paragraph [0057] first set of elements – Dout);
receive a second set of elements from the kernel fetcher circuit (Kim Figs. 2, 8-9 and paragraph [0057] second set of elements – bias); and
perform element-wise addition between the first set of elements and the second set of elements as a portion of convolution operation on the first set of elements and the second set of elements producing one or more output channels of the output data (Kim paragraph [0047] “The number of biases of each layer is (the number of output channels). In other words, for the first layer L1, since the number of output channels is 20, the number of biases used in the first layer L1 is 20. Similarly, the number of biases used in the third layer L3 is 50, and the number of biases used in the fifth layer L5 is 500”; paragraph [0057] “A bias may be added to the output data Dout with a size of the number M of the channels”).
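As a rough illustration of the mapping above, the per-channel bias addition can be modeled in a few lines of Python. The function name `add_bias`, the nested-list layout of the output data, and the sample values are hypothetical choices for this sketch; Kim states only that one bias exists per output channel and is added to Dout.

```python
def add_bias(dout, biases):
    """Add one bias value to every element of its corresponding output channel.

    dout   : list of M channels, each a 2D list of output values
    biases : list of M bias values, one per output channel
    """
    if len(dout) != len(biases):
        raise ValueError("expected exactly one bias per output channel")
    return [
        [[value + bias for value in row] for row in channel]
        for channel, bias in zip(dout, biases)
    ]


# Two hypothetical 2x2 output channels with one bias each.
biased = add_bias([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], [10, -1])
```

The length check mirrors the statement in Kim paragraph [0047] that the number of biases equals the number of output channels.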

Regarding claim 20, Kim teaches an electronic device, comprising:
a neural processor circuit including a plurality of neural engine circuits, a data buffer and a kernel fetcher circuit, the neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data (Kim Fig. 3 and paragraph [0065] neural processor circuit – convolutional neural network (CNN) system 100; plurality of neural engine circuits – MAC cores 121-12i; data buffer – input buffer device 110 and output buffer device 130; kernel fetcher circuit – weight kernel buffer device 140; paragraph [0068] “the MAC computator 120 may include a plurality of MAC cores 121 to 12i. As described in relation to FIG. 2, each of the plurality of MAC cores 121 to 12i may use a plurality of kernels to perform convolution computations on the input tile Din_T”; Fig. 4 and paragraphs [0072-0073] work unit of input data and kernel data – input tile Din_T and kernel KER_1 to KER_M):
a system memory external to the neural processor circuit (Kim Fig. 3 system memory – external memory 101),
wherein the data buffer is configured to store at least a portion of the input data received from the system memory for sending to the neural engine circuits, the portion of the input data comprising the work unit of the input data, and store output data received from the neural engine circuits (Kim Fig. 3 data buffer – input buffer device 110 and output buffer device 130; system memory – external memory 101; Fig. 3 shows external memory 101 is external to the CNN system 100; paragraph [0066] “the input buffer device 110 may load the part Din_T of the input data from the external memory 101 … the part Din_T of the input data loaded to the input buffer device 110 is called as an input tile”; paragraph [0072] “the input buffer device 110 may load an input tile Din_T that is a part of input data Din. At this point, the input tile Din_T may have a size of TnxTwxTh”; paragraph [0069] “The output buffer device 130 may load the part Dout_T of the output data of the convolution computation”; paragraph [0074] “The generated output tile Dout_T may be loaded to the output buffer device 130”),
wherein the kernel fetcher circuit is configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits (Kim Fig. 3 kernel fetcher – weight kernel buffer device 140; paragraph [0070] “The weight kernel buffer device 140 may load, from the external memory 101 parameters necessary for convolution computation … and may provide the loaded parameters to the MAC computator 120”; paragraphs [0068, 0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140 to perform convolution computations on the input tile Din_T loaded to the input buffer device 110”), and
wherein at least one of the neural engine circuits is configured to:
receive a plurality of matrix elements of a matrix as at least the portion of the input data from the data buffer over a plurality of processing cycles (Kim Figs. 2 and 4 and paragraphs [0055-0056, 0073] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”; “the multiplier of the MAC core L1_1 may perform a multiplication on input values of the input data and corresponding weight values … Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation”),
receive a plurality of  (Kim Figs. 2 and 4; paragraph [0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140”), and 
perform multiplication between the matrix and the  (Kim Figs. 2 and 4; paragraphs [0054-0056, 0073] one output channel of the output data – shaded portion of Dout).
Kim does not explicitly teach receive a plurality of vector elements of a vector from the kernel fetcher circuit, each of the vector elements extracted as the corresponding kernel to the at least one neural engine circuit in each of the processing cycles, and perform multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of the output data.
However, in the same field of endeavor, Talpes discloses formatting a weight matrix as a vector of weight values to be processed by a matrix processor (Talpes paragraph [0021]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes and format at least one of the kernels KER_1 to KER_M as a vector of weight values.
The motivation to do so is that, by formatting the kernels as a vector, the values of the vector are stored consecutively within the weight buffer, which allows the entire vector of weight values to be loaded using minimal processing resources (Talpes paragraph [0015]).
Therefore, the combination of Kim as modified in view of Talpes teaches receive a plurality of vector elements of a vector from the kernel fetcher circuit, each of the vector elements extracted as the corresponding kernel to the at least one neural engine circuit in each of the processing cycles, and perform multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of the output data.

Claims 2 and 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Talpes as applied to claim 1 above, and further in view of Ren et al. (NPL – “On Vectorization of Deep Convolutional Neural Networks for Vision Tasks”), hereinafter Ren.
Regarding claim 2, Kim as modified in view of Talpes teaches all the limitations of claim 1 as stated above. Further, Kim as modified in view of Talpes teaches wherein the at least one of the neural engine circuits is further configured to: perform, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements  (Kim Fig. 2 and paragraphs [0055-0056] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”).
The combination of Kim and Talpes thus far does not teach wherein the at least one of the neural engine circuits is further configured to: perform, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements corresponding to each column of the matrix and each of the vector elements during each of the processing cycles.
However, in the same field of endeavor, Ren teaches performing convolution by formatting/reorganizing the input data as a plurality of column vectors such that the convolution operation is performed by multiply-accumulate operations on the columns of the matrix and a kernel vector (Ren Fig. 2 and pages 2-3 first paragraph of Matlab practice section).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes using Ren and format the input data as column vectors as shown in Fig. 2 of Ren such that the convolution operation is performed using multiply-accumulate operations on the columns of the matrix and the kernel vector of weight values.
The motivation to reorganize the input data as column vectors is to provide more efficient memory access (Ren page 2 vectorizing convolution section). Furthermore, Talpes also discloses formatting the input data as vectors to allow for loading the data elements using minimal processing resources (Talpes paragraph [0015]).
Therefore, the combination of Kim as modified in view of Talpes and Ren teaches wherein the at least one of the neural engine circuits is further configured to: perform, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements corresponding to each column of the matrix and each of the vector elements during each of the processing cycles.
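The column-vector reorganization relied upon above can be sketched in Python as follows. The helper names `im2col` and `convolve_columns`, the stride of 1, the absence of padding, and the sample image and kernel values are all assumptions made for illustration; they are not reproduced from Ren's figure.

```python
def im2col(image, k):
    """Reorganize a 2D input so each k-by-k window becomes one column vector.

    Assumes stride 1 and no padding. Columns are returned as Python lists,
    ordered left-to-right, top-to-bottom over the window positions.
    """
    rows, cols = len(image), len(image[0])
    columns = []
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            columns.append(
                [image[r + i][c + j] for i in range(k) for j in range(k)]
            )
    return columns


def convolve_columns(columns, kernel_vector):
    """Multiply-accumulate each column against the flattened kernel vector."""
    return [sum(x * w for x, w in zip(col, kernel_vector)) for col in columns]


# Hypothetical 3x3 input and flattened 2x2 kernel.
image = [[1, 0, 2], [0, 1, 0], [2, 0, 1]]
columns = im2col(image, 2)
outputs = convolve_columns(columns, [1, 2, 3, 4])
```

Each entry of `outputs` is one output value of the convolution, so the whole operation reduces to repeated multiply-accumulate passes over contiguous column data.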

Regarding claim 4, Kim as modified in view of Talpes teaches all the limitations of claim 1 as stated above.
Further, Kim and Talpes teach wherein two or more of the neural engine circuits are configured to (Kim Fig. 7 two or more of the neural engine circuits – MAC cores 221-22i): receive another plurality of (Kim paragraph [0098] “the CNN system 200 may further include elements for other respective input tiles, or may recursively perform computation operations on each input tile on the basis of the elements illustrated in FIG. 7”, therefore, the MAC cores would receive elements of another tile from the data buffer); 
receive another plurality of matrix elements of another matrix from the kernel fetcher circuit over multiple processing cycles (Kim Fig. 7 shows the MAC cores receiving kernel weight from the weight kernel buffer device; Fig. 2 another plurality of matrix elements of another matrix – matrix elements of one of KER_2 to KER_M); and 
perform multiplication between the other matrix  (Kim paragraph [0055] “The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M. At this point, the number of channels of the output data Dout may be the same as the number (i.e. M) of the plurality of kernels KER_1 to KER_M”).
The combination of Kim and Talpes thus far does not teach wherein two or more of the neural engine circuits are configured to: receive another plurality of vector elements of another vector from the data buffer over at least one processing cycle; and perform multiplication between the other matrix and the other vector as a convolution operation on the other vector elements and the other matrix elements to produce multiple output channels of the output data.
However, in the same field of endeavor, Ren teaches performing convolution by reorganizing the input data as a plurality of column vectors where each column corresponds to a subset of the input data that is required for a convolution operation and multiplying each column vector with a matrix of kernels where each kernel is arranged as a row. Further, Ren discloses generating multiple output channels of the output data when the kernel matrix and a column vector are multiplied as a convolution operation (Ren Fig. 2 where each column of the reorganized input corresponds to a plurality of vector elements of another vector and map1, map2, map3 correspond to multiple output channels of the output data).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes using Ren and configure two or more MAC cores of Kim to perform the convolution operation disclosed in Fig. 2 of Ren. For example, configuring the MAC core 122 to receive the elements [2, 3, 5, 6] in the second column of the reorganized input matrix and the kernel matrix and perform matrix vector multiplication to generate the second column of the 3x2x2 output data, and configuring the MAC core 12i to receive the elements [5, 6, 7, 9] in the last column of the reorganized input matrix and the kernel matrix and perform matrix vector multiplication to generate the last column of the 3x2x2 output data.
The motivation to use two or more MAC cores is to perform the convolution operations in parallel (Kim paragraph [0068]). The motivation to format the input data and kernels as shown in Fig. 2 of Ren is to provide a more efficient memory access (Ren page 2 vectorizing convolution section).
Therefore, the combination of Kim as modified in view of Talpes and Ren teaches wherein two or more of the neural engine circuits are configured to: receive another plurality of vector elements of another vector from the data buffer over at least one processing cycle; receive another plurality of matrix elements of another matrix from the kernel fetcher circuit over multiple processing cycles; and perform multiplication between the other matrix and the other vector as a convolution operation on the other vector elements and the other matrix elements to produce multiple output channels of the output data.
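For illustration only, the column-wise reorganization and matrix-vector multiplication discussed in the claim 4 analysis above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Ren or Kim: the 3x3 input holding the values 1 through 9, the three 2x2 kernel rows, and the function names im2col and matvec are all hypothetical. Under the assumed input, the second column of the reorganized input is [2, 3, 5, 6], consistent with the example elements cited above.

```python
def im2col(image, k):
    """Reorganize a 2-D input into columns, one per k x k window (stride 1)."""
    rows, cols = len(image), len(image[0])
    columns = []
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            # Flatten the k x k window starting at (r, c) in row-major order.
            columns.append([image[r + i][c + j] for i in range(k) for j in range(k)])
    return columns


def matvec(kernel_matrix, column):
    """Multiply the kernel matrix (one flattened kernel per row) by one column."""
    return [sum(w * x for w, x in zip(row, column)) for row in kernel_matrix]


# Hypothetical 3x3 single-channel input (assumed values, not from any reference).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Hypothetical kernel matrix: three 2x2 kernels, each flattened into a row.
kernel_matrix = [
    [1, 0, 0, 1],  # kernel1
    [0, 1, 1, 0],  # kernel2
    [1, 1, 1, 1],  # kernel3
]

columns = im2col(image, 2)

# Each column could be handled by a separate compute core in parallel;
# here the columns are simply processed one after another.
outputs = [matvec(kernel_matrix, col) for col in columns]

# Regroup per kernel: one output map (channel) per kernel row.
maps = [[out[ch] for out in outputs] for ch in range(len(kernel_matrix))]
```

In this sketch each matrix-vector product yields one value for each of the three output maps, so a single column contributes to multiple output channels at once.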

Regarding claim 5, Kim as modified in view of Talpes and Ren teaches all the limitations of claim 4 as stated above. Further, Kim as modified in view of Talpes and Ren teaches wherein each of the two or more neural engine circuits is further configured to: perform, as part of the convolution operation, multiply-accumulate operations on the other vector elements and a subset of the other matrix elements corresponding to a row of the other matrix; and produce, after completion of the multiply-accumulate operations, an output channel of the multiple output channels of the output data (Kim Fig. 7 each of the two or more neural engine circuits – MAC cores 221 to 22i; Kim paragraph [0055] “The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M. At this point, the number of channels of the output data Dout may be the same as the number (i.e. M) of the plurality of kernels KER_1 to KER_M”; Ren Fig. 2 the other vector elements – elements of each column vector of the reorganized input; subset of the other matrix elements corresponding to a row of the other matrix – one of kernel1, kernel2, and kernel3 rows; output channel of the multiple output channels of the output data – map1, map2, map3 output; see also claim 4 analysis).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Talpes as applied to claim 11 above, and further in view of Lai et al. (NPL – “CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs”), hereinafter Lai.
Regarding claim 12, Kim as modified in view of Talpes teaches all the limitations of claim 11 as stated above. Further, Kim as modified in view of Talpes teaches wherein the at least one of the neural engines is further configured to (Kim Fig. 2):
(Kim Fig. 2 and paragraph [0056] at least one accumulator of the at least one neural engine circuit – register of the MAC core L1_1; paragraph [0057] at least one bias value – bias); and 
perform the portion of convolution operation as multiply-accumulate operations on the first set of elements and the second set of elements using multiply-add circuits with bypassed multipliers and the at least one  (Kim paragraph [0047] “The number of biases of each layer is (the number of output channels). In other words, for the first layer L1, since the number of output channels is 20, the number of biases used in the first layer L1 is 20. Similarly, the number of biases used in the third layer L3 is 50, and the number of biases used in the fifth layer L5 is 500”; paragraph [0056] “the MAC core L1_1 may use an adder, a multiplier, a register or the like to perform the above-described convolution computation”; paragraph [0057] “A bias may be added to the output data Dout with a size of the number M of the channels”).
The combination of Kim and Talpes thus far does not teach wherein the at least one of the neural engine circuits is further configured to: pre-load at least one accumulator of the at least one neural engine circuit with at least one bias value over a processing cycle; and perform the portion of convolution operation as multiply-accumulate operations on the first set of elements and the second set of elements using multiply-add circuits with bypassed multipliers and the at least one pre-loaded accumulator of the at least one neural engine circuit.
However, in the same field of endeavor, Lai discloses performing neural network computations in which at least one accumulator is initialized with at least one bias value and performing multiply-accumulate (MAC) operations using the initialized accumulator (Lai page 4 section 4.2 first paragraph and Fig. 5).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes using Lai and initialize at least one accumulator register of the MAC core with the corresponding bias value and use the initialized at least one accumulator register and the adder to perform the portion of convolution operation as multiply-accumulate operations on the first set of elements and the second set of elements.
The motivation for pre-loading the accumulator is to decrease the total number of load instructions during the convolution operation (Lai page 4 section 4.2 first paragraph and Fig. 5).
Therefore, the combination of Kim as modified in view of Talpes and Lai teaches wherein the at least one of the neural engine circuits is further configured to: pre-load at least one accumulator of the at least one neural engine circuit with at least one bias value over a processing cycle; and perform the portion of convolution operation as multiply-accumulate operations on the first set of elements and the second set of elements using multiply-add circuits with bypassed multipliers and the at least one pre-loaded accumulator of the at least one neural engine circuit.
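For illustration only, the accumulator initialization discussed in the claim 12 analysis above may be sketched as follows. This is a minimal sketch, assuming a single accumulator and scalar inputs; the function names and the sample values are hypothetical and are not drawn from Lai or Kim. The sketch compares an accumulator pre-loaded with the bias against an accumulator that starts at zero and receives the bias in a separate final step.

```python
def mac_with_preloaded_bias(inputs, weights, bias):
    """Multiply-accumulate with the accumulator initialized to the bias."""
    acc = bias  # pre-load: the bias is already in the accumulator before any MAC
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


def mac_then_add_bias(inputs, weights, bias):
    """Multiply-accumulate from zero, then add the bias as a separate step."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc + bias  # extra addition after the accumulation loop


# Hypothetical sample values.
inputs = [1, 2, 3]
weights = [4, 5, 6]
bias = 10

preloaded = mac_with_preloaded_bias(inputs, weights, bias)
separate = mac_then_add_bias(inputs, weights, bias)
```

Both variants produce the same value; the pre-loaded form simply folds the bias into the accumulation rather than handling it in a separate step afterward.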

Claims 13, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Talpes and Gauria et al. (US Patent No. 10,310,768 B1), hereinafter Gauria.
Regarding claim 13, Kim teaches a method of operating a neural processor circuit, comprising (Kim Fig. 3 and paragraph [0065] neural processor circuit – convolutional neural network (CNN) system 100):
instructing, receive a portion of input data from a system memory external to the neural processor circuit (Kim Fig. 3 system memory – external memory 101; fig. 3 shows external memory 101 is external to the CNN system 100; paragraph [0066] “the input buffer device 110 may load the part Din_T of the input data from the external memory 101 … the part Din_T of the input data loaded to the input buffer device 110 is called as an input tile”);
storing the portion of the input data in a data buffer of the neural processor circuit (Kim Fig. 3 data buffer – input buffer device 110 and output buffer device 130; paragraph [0066] “the input buffer device 110 may load the part Din_T of the input data from the external memory 101 … the part Din_T of the input data loaded to the input buffer device 110 is called as an input tile”; paragraph [0072] “the input buffer device 110 may load an input tile Din_T that is a part of input data Din”);
instructing,  (Kim Fig. 3-4, 7 and paragraph [0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140 to perform convolution computations on the input tile Din_T loaded to the input buffer device 110”; paragraph [0103] “Each of the plurality of MUXes 251 to 25i may select any one of data values from the connected input buffers to provide the data values to the MAC cores 221 to 22i of the MAC computator 220”; plurality of matrix elements of a matrix – elements of the input data matrix Din_T);
instructing,  (Kim Fig. 3 kernel fetcher circuit – kernel weight buffer device 140; paragraph [0070] “The weight kernel buffer device 140 may load, from the external memory 101 parameters necessary for convolution computation … and may provide the loaded parameters to the MAC computator 120”; plurality of neural engine circuits – MAC cores 121-12i);
instructing, extracted as a corresponding kernel to the at least one neural engine circuit in each of the processing cycles (Kim Fig. 3 and paragraph [0070] “The weight kernel buffer device 140 may load, from the external memory 101 parameters necessary for convolution computation … and may provide the loaded parameters to the MAC computator 120”; paragraphs [0068 and 0073] “The MAC core 121 may use a plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140 to perform convolution computations on the input tile Din_T loaded to the input buffer device 110”); and
performing, by the at least one neural engine circuit, multiplication between the matrix and  (Kim Figs. 2 and 4; paragraphs [0054-0056, 0073] one output channel of the output data – shaded portion of Dout).
Kim does not explicitly teach instructing, by a first rasterizer circuit in a data reader of the neural processor circuit, to cause the data reader to receive a portion of input data from a system memory external to the neural processor circuit; instructing, by a second rasterizer circuit in the data buffer, to cause the data buffer to send a plurality of matrix elements of a matrix as the portion of the input data to at least one of the neural engine circuits; instructing, by a third rasterizer circuit in a kernel fetcher circuit between the plurality of neural engine circuits and the system memory, to cause the kernel fetcher circuit to receive one or more kernels from the system memory; instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send to the at least one neural engine circuit a plurality of vector elements of a vector, each of the vector elements extracted as a corresponding kernel to the at least one neural engine circuit in each of the processing cycles; and performing, by the at least one neural engine circuit, multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of output data.
However, in the same field of endeavor, Talpes discloses formatting a weight matrix as a vector of weight values to be processed by a matrix processor (Talpes paragraph [0021]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes and format at least one of the kernels KER_1 to KER_M as a vector of weight values.
The motivation to do so is that, by formatting the kernels as a vector, the values of the vector are stored consecutively within the weight buffer, which allows the entire vector of weight values to be loaded using minimal processing resources (Talpes paragraph [0015]).
Therefore, the combination of Kim as modified in view of Talpes teaches cause the kernel fetcher circuit to send to the at least one neural engine circuit a plurality of vector elements of a vector, each of the vector elements extracted as a corresponding kernel to the at least one neural engine circuit in each of the processing cycles; and performing, by the at least one neural engine circuit, multiplication between the matrix and the vector as a convolution operation to produce at least one output channel of output data.
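For illustration only, formatting a kernel as a vector of consecutively stored weight values, as discussed above, may be sketched as follows. This is a minimal sketch under stated assumptions: the 2x2 kernel, the 2x2 input window, and the helper name flatten are hypothetical and do not reproduce any structure from Talpes or Kim.

```python
def flatten(matrix):
    """Store the values of a 2-D matrix consecutively in row-major order."""
    return [value for row in matrix for value in row]


# Hypothetical 2x2 kernel and 2x2 input window (assumed values).
kernel = [[1, 2],
          [3, 4]]
window = [[5, 6],
          [7, 8]]

# The kernel becomes one contiguous vector of weight values.
weight_vector = flatten(kernel)

# One convolution output value is the dot product of the weight vector
# with the matching flattened input window.
result = sum(w * x for w, x in zip(weight_vector, flatten(window)))
```

Because the weights sit in one contiguous sequence, the whole kernel can be read in a single pass over the vector rather than by separate row-by-row accesses.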
The combination of Kim as modified in view of Talpes does not teach instructing, by a first rasterizer circuit in a data reader of the neural processor circuit, to cause the data reader to receive a portion of input data from a system memory external to the neural processor circuit; instructing, by a second rasterizer circuit in the data buffer, to cause the data buffer to send a plurality of matrix elements of a matrix as the portion of the input data to at least one of the neural engine circuits; instructing, by a third rasterizer circuit in a kernel fetcher circuit between the plurality of neural engine circuits and the system memory, to cause the kernel fetcher circuit to receive one or more kernels from the system memory; instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send to the at least one neural engine circuit a plurality of vector elements of a vector, each of the vector elements extracted as a corresponding kernel to the at least one neural engine circuit in each of the processing cycles.
However, in the same field of endeavor, Gauria discloses a circuit for performing a convolution operation that includes an input data pipeline and a kernel data pipeline. The input data pipeline includes a data reader that includes a first rasterizer circuit, a second rasterizer circuit and a data buffer. Further, Gauria discloses that the kernel data pipeline includes similar circuitry as the input data pipeline (Gauria Fig. 4 and column 5 line 3 to column 7 line 20 where a data reader corresponds to circuits 150, 151, and 152 and first rasterizer circuit corresponds to at least circuit 150; “The iteration circuits 150 and 160 are generally operational to generate respective sequence of tiles used in the current convolution. In various embodiments, an initial part of the iteration circuits 150 and 160 generate a sequence of the output tiles (or corresponding input blocks) to produce. Next, each two-dimensional or higher-dimensional input block may be broken down into a sequence of input tiles used to produce each output tile. The iteration circuits 150 and 160 may communicate with the control circuit 140 to make sure that data is available before proceeding”; “The address generators 151 and 161 may be operational to fetch data from the memory circuit 92 via the signal MEM into local buffers. The address generator 151 may present addresses in the signal ADDR_A for the input data”; “Each circuit 152 and 162 may implement a buffer write (BWR) circuit. The buffer write circuits 152 and 162 are generally operational to receive data from the memory circuit 92 via the signal MEM”; where circuits 153, 154, and 155 correspond to the data buffer and circuit 153 and/or 155 corresponds to the second rasterizer circuit in the data buffer “Each circuit 153 and 163 may implement a buffer read (BRD) circuit. The buffer read circuits 153 and 163 are generally operational to cause data to be read out of the respective local buffer circuits”; “Each circuit 154 and 164 may implement a local buffer (BUF) circuit”; “The read data circuits 155 and 165 are generally operational to send the data read out of the respective pipelines 142 and 144 to the mathematics circuit 146”; and where circuit 144 corresponds to a kernel fetcher circuit and circuits 163-165 correspond to a third rasterizer circuit in the kernel fetcher circuit).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes using Gauria and configure the circuit of Kim to include additional circuitry such as a data reader that includes a first rasterizer circuit, a second rasterizer circuit within the data buffer, and a third rasterizer circuit within the kernel fetcher circuit as part of the input data and kernel data pipeline for reading input data and kernel data from the system memory and loading the input data and kernel data to the MAC cores in order to produce the correct sequence of input data and kernel data for the convolution operation and for tracking which input tile and kernel data are in the corresponding data and kernel buffers (Gauria col 6 lines 25-40).
Therefore, the combination of Kim as modified in view of Talpes and Gauria teaches instructing, by a first rasterizer circuit in a data reader of the neural processor circuit, to cause the data reader to receive a portion of input data from a system memory external to the neural processor circuit; instructing, by a second rasterizer circuit in the data buffer, to cause the data buffer to send a plurality of matrix elements of a matrix as the portion of the input data to at least one of the neural engine circuits; instructing, by a third rasterizer circuit in a kernel fetcher circuit between the plurality of neural engine circuits and the system memory, to cause the kernel fetcher circuit to receive one or more kernels from the system memory; instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send to the at least one neural engine circuit a plurality of vector elements of a vector, each of the vector elements extracted as a corresponding kernel to the at least one neural engine circuit in each of the processing cycles.

Regarding claim 17, Kim as modified in view of Talpes and Gauria teaches all the limitations of claim 13 as stated above. Further, Kim as modified in view of Talpes and Gauria teaches further comprising:
instructing, by the second rasterizer circuit, to cause the data buffer to send a second plurality of matrix elements of a second matrix to the at least one neural engine circuit over multiple processing cycles (Kim Figs. 2 and 4 and paragraphs [0055-0056] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”; “the multiplier of the MAC core L1_1 may perform a multiplication on input values of the input data and corresponding weight values … Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation” where other input values – corresponds to the second plurality of matrix elements of a second matrix; paragraph [0075] “on other input tiles of the input data Din, the above-described convolution computations may be recursively performed and results of the recursive performances may be combined to generate the output data Dout” where other input tiles corresponds to the second plurality of matrix elements of a second matrix; in conjunction with Fig. 2 although not shown, the other input values/input tiles corresponds to a second NxKxK portion of Din);
instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send a third plurality of matrix elements of a third matrix to the at least one neural engine circuit over the multiple processing cycles (Kim Fig. 2 third plurality of matrix elements of a third matrix – a second kernel of KER_1 to KER_M); and
performing multiplication between the second matrix and the third matrix as a convolution operation on the second matrix elements and the third matrix elements producing multiple output channels of the output data (Kim Fig. 2 and paragraphs [0054-0055] multiple output channels of the output data – M channel of output data Dout; “Thereafter, other input values may be input to the MAC core L1_1 and recursively perform the above-described computation to perform a convolution computation”; paragraph [0075] “on other input tiles of the input data Din, the above-described convolution computations may be recursively performed and results of the recursive performances may be combined to generate the output data Dout”; in conjunction with Fig. 2, multiplication between other input tiles (unshaded portion of Din) and the kernel KER_1 to KER_M is recursively performed to generate the unshaded portion of Dout).

Regarding claim 19, Kim as modified in view of Talpes and Gauria teaches all the limitations of claim 13 as stated above. Further, Kim as modified in view of Talpes and Gauria teaches further comprising:
instructing, by the second rasterizer circuit, to cause the data buffer to send a first set of elements to the at least one neural engine circuit (Kim Fig. 2 and paragraph [0057] first set of elements – Dout);
instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send a second set of elements to the at least one neural engine circuit (Kim Fig. 2, 8-9 and paragraph [0057] second set of elements – bias); and
performing element-wise addition between the first set of elements and the second set of elements as a convolution operation on the first set of elements and the second set of elements producing one or more output channels of the output data (Kim paragraph [0047] “The number of biases of each layer is (the number of output channels). In other words, for the first layer L1, since the number of output channels is 20, the number of biases used in the first layer L1 is 20. Similarly, the number of biases used in the third layer L3 is 50, and the number of biases used in the fifth layer L5 is 500”; paragraph [0057] “A bias may be added to the output data Dout with a size of the number M of the channels”).
Claims 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Talpes and Gauria as applied to claim 13 above, and further in view of Ren et al. (NPL – “On Vectorization of Deep Convolutional Neural Networks for Vision Tasks”), hereinafter Ren.
Regarding claim 14, Kim as modified in view of Talpes and Gauria teaches all the limitations of claim 13 as stated above. Further, Kim as modified in view of Talpes and Gauria teaches wherein performing multiplication between the matrix and the vector as a convolution operation comprising: performing, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements  (Kim Fig. 2 and paragraphs [0055-0056] “The MAC core L1_1 may multiply a kernel of a KxK size by each piece of overlapping data of the input data Din. The MAC core L1_1 may accumulate data values multiplied for each channel of the input data Din to generate one output data value (i.e. a data value of 1x1x1). The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M”).
The combination of Kim, Talpes, and Gauria thus far does not teach wherein performing multiplication between the matrix and the vector as a convolution operation comprising: performing, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements corresponding to each column of the matrix and each of the vector elements during each of the processing cycles.
However, in the same field of endeavor, Ren teaches performing convolution by formatting/reorganizing the input data as a plurality of column vectors such that the convolution operation is performed by multiply-accumulate operations on the columns of the matrix and a kernel vector (Ren Fig. 2 and pages 2-3 first paragraph of Matlab practice section).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes and Gauria using Ren and format the input data as column vectors as shown in Fig. 2 of Ren such that the convolution operation is performed using multiply-accumulate operations on the columns of the matrix and the kernel vector of weight values.
The motivation to reorganize the input data as column vectors is to provide a more efficient memory access (Ren page 2 vectorizing convolution section). Furthermore, Talpes also discloses formatting the input data as vectors to allow for loading the data elements using minimal processing resources (Talpes paragraph [0015]).
Therefore, the combination of Kim as modified in view of Talpes, Gauria and Ren teaches wherein performing multiplication between the matrix and the vector as a convolution operation comprising: performing, as part of the convolution operation, multiply-accumulate operations on a subset of the matrix elements corresponding to each column of the matrix and each of the vector elements during each of the processing cycles.

Regarding claim 15, Kim as modified in view of Talpes and Gauria teaches all the limitations of claim 13 as stated above.
Further, Kim as modified in view of Talpes and Gauria teaches further comprising:
instructing, by the second rasterizer circuit, to cause the data buffer to send another plurality of  (Kim paragraph [0098] “the CNN system 200 may further include elements for other respective input tiles, or may recursively perform computation operations on each input tile on the basis of the elements illustrated in FIG. 7”, therefore, the MAC cores would receive elements of another tile from the data buffer; two or more of the neural engine circuits – MAC cores 221-22i); 
instructing, by the third rasterizer circuit, to cause the kernel fetcher circuit to send another plurality of matrix elements of another matrix to the two or more of the neural engine circuits over multiple processing cycles (Kim Fig. 7 shows the MAC cores receiving kernel weight from the weight kernel buffer device; Fig. 2 another plurality of matrix elements of another matrix – matrix elements of one of KER_2 to KER_M); and
performing, by the two or more of the neural engine circuits, multiplication between the other matrix  (Kim paragraph [0055] “The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M. At this point, the number of channels of the output data Dout may be the same as the number (i.e. M) of the plurality of kernels KER_1 to KER_M”).
The combination of Kim, Talpes, and Gauria thus far does not teach further comprising: instructing, by the second rasterizer circuit, to cause the data buffer to send another plurality of vector elements of another vector to two or more of the neural engine circuits over at least one processing cycle; and performing, by the two or more of the neural engine circuits, multiplication between the other matrix and the other vector as a convolution operation on the other vector elements and the other matrix elements to produce multiple output channels of the output data.
However, in the same field of endeavor, Ren teaches performing convolution by reorganizing the input data as a plurality of column vectors where each column corresponds to a subset of the input data that is required for a convolution operation and multiplying each column vector with a matrix of kernels where each kernel is arranged as a row. Further, Ren discloses generating multiple output channels of the output data when the kernel matrix and a column vector are multiplied as a convolution operation (Ren Fig. 2 where each column of the reorganized input corresponds to a plurality of vector elements of another vector and map1, map2, map3 correspond to multiple output channels of the output data).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Kim in view of Talpes and Gauria using Ren and configure two or more MAC cores of Kim to perform the convolution operation disclosed in Fig. 2 of Ren. For example, configuring the MAC core 122 to receive the elements [2, 3, 5, 6] in the second column of the reorganized input matrix and the kernel matrix and perform matrix vector multiplication to generate the second column of the 3x2x2 output data, and configuring the MAC core 12i to receive the elements [5, 6, 7, 9] in the last column of the reorganized input matrix and the kernel matrix and perform matrix vector multiplication to generate the last column of the 3x2x2 output data.
The motivation to use two or more MAC cores is to perform the convolution operations in parallel (Kim paragraph [0068]). The motivation to format the input data and kernels as shown in Fig. 2 of Ren is to provide a more efficient memory access (Ren page 2 vectorizing convolution section).
Therefore, the combination of Kim as modified in view of Talpes, Gauria and Ren teaches further comprising: instructing, by the second rasterizer circuit, to cause the data buffer to send another plurality of vector elements of another vector to two or more of the neural engine circuits over at least one processing cycle; and performing, by the two or more of the neural engine circuits, multiplication between the other matrix and the other vector as a convolution operation on the other vector elements and the other matrix elements to produce multiple output channels of the output data.

Regarding claim 16, Kim as modified in view of Talpes, Gauria and Ren teaches all the limitations of claim 15 as stated above. Further, Kim as modified in view of Talpes, Gauria and Ren teaches further comprising: performing, as part of the convolution operation, multiply-accumulate operations on the other vector elements and a subset of the other matrix elements corresponding to a row of the other matrix; and producing, after completion of the multiply-accumulate operations, an output channel of the multiple output channels of the output data (Kim Fig. 7 each of the two or more neural engine circuits – MAC cores 221 to 22i; Kim paragraph [0055] “The MAC core L1_1 may recursively perform such a computation operation to generate the output data Dout for each of the plurality of kernels KER_1 to KER_M. At this point, the number of channels of the output data Dout may be the same as the number (i.e. M) of the plurality of kernels KER_1 to KER_M”; Ren Fig. 2 the other vector elements – elements of each column vector of the reorganized input; subset of the other matrix elements corresponding to a row of the other matrix – one of kernel1, kernel2, and kernel3 rows; output channel of the multiple output channels of the output data – map1, map2, map3 output; see also claim 15 analysis).
Allowable Subject Matter
Claims 6, 9-10, and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if rewritten to overcome the claim objections and/or 35 U.S.C. 112(b) rejection discussed above.
The following is a statement of reasons for the indication of allowable subject matter:
Claim 6 is directed to a neural processor circuit comprising, among other things, wherein the data buffer is further configured to: interleave the multiple output channels of the output data to generate one output channel of the output data. Claim 9 is directed to a neural processor circuit comprising, among other things, wherein the at least one of the neural engines is further configured to: receive a first set of elements from the data buffer; receive a second set of elements from the data buffer; and perform element-wise multiplication between the first set of elements and the second set of elements as a portion of convolution operation on the first set of elements and the second set of elements producing one or more output channels of the output data.
The combination of Kim, Talpes, Ren and Lai is the closest prior art found. Kim, Talpes, Ren and Lai teach the claimed subject matter in accordance with the claim mappings discussed above. However, none of the prior art references cited discloses a neural engine circuit configured to receive a first set of elements and a second set of elements from the data buffer and to perform element-wise multiplication between the first set of elements and the second set of elements as a portion of convolution operation on the first set of elements and the second set of elements producing one or more output channels of the output data, as recited in claim 9.
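For illustration only, the distinction between element-wise multiplication and a multiply-accumulate operation may be sketched as follows. The sketch is a generic example and does not represent the applicant's circuit or any cited reference; the function name elementwise_multiply and the sample values are hypothetical.

```python
def elementwise_multiply(first_set, second_set):
    """Multiply corresponding elements; the products are NOT summed."""
    if len(first_set) != len(second_set):
        raise ValueError("both sets must contain the same number of elements")
    return [a * b for a, b in zip(first_set, second_set)]


# Hypothetical data: two sets of elements of equal length.
products = elementwise_multiply([1, 2, 3], [4, 5, 6])
print(products)       # [4, 10, 18]  (one product per element position)
print(sum(products))  # 32           (a multiply-accumulate would reduce to this)
```

The element-wise result keeps one value per element position, whereas a multiply-accumulate operation reduces the same products to a single accumulated value.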
Furthermore, none of the prior art references cited discloses a data buffer configured to: interleave the multiple output channels of the output data to generate one output channel of the output data, as recited in claim 6. Lim et al. (US Patent No. 9,858,636 B1) discloses a neural engine circuit that includes an output buffer configured to store output values of a convolution operation that produces multiple output channels in an interleaved manner; however, Lim fails to disclose that the multiple output channels are interleaved to generate one output channel. Lim also discloses combining two or more output channels, such as merging the results of 8-bit pixel data convolution into a 16-bit data output using a post-processor; however, the combining/merging is performed by the post-processor 428 and not by the output buffer 524 in Fig. 5.
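For illustration only, interleaving multiple output channels to generate one output channel may be sketched as follows. The sketch assumes a simple round-robin ordering of elements; the actual ordering used by the claimed data buffer is not stated in this action, and the function name interleave_channels and the sample values are hypothetical.

```python
def interleave_channels(channels):
    """Interleave equal-length channels element by element into one channel.

    Round-robin ordering is an assumption made for this sketch.
    """
    if len({len(channel) for channel in channels}) > 1:
        raise ValueError("all channels must have the same length")
    return [value for group in zip(*channels) for value in group]


# Hypothetical data: two output channels of three elements each.
channel_a = [1, 2, 3]
channel_b = [10, 20, 30]
print(interleave_channels([channel_a, channel_b]))  # [1, 10, 2, 20, 3, 30]
```

In this sketch every input value survives unchanged in the single output channel, which differs from a merge that combines two 8-bit results into one 16-bit value.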
Claim 10 would be allowable for at least the same reason as claim 9 by reason of dependence. Claim 18 recites substantially the same limitations as claim 9 and would be allowable for the same reason as claim 9.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Du et al. (US-PGPUB US 2018/0137407 A1) relates to a convolution device comprising a control unit, a data buffer that includes a memory controller to control the buffer device for reading and writing data, a coefficient retrieving controller for retrieving filter coefficients from external memory, and a convolution module comprising a plurality of convolution units (Fig. 1 and paragraphs [0030]-[0043]).
Tanaka et al. (US-PGPUB US 2021/0149983 A1) relates to performing a convolution operation in which the multiply-accumulate operation includes pre-loading the bias vector b into the output vector and then performing the multiplication and accumulation operations. Tanaka also discloses that the bias vector b may instead be added to the output vector at the end of the multiply-accumulate operation.
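For illustration only, the two bias-handling orders described for Tanaka may be sketched as follows. The sketch is not taken from Tanaka; the function names mac_bias_preloaded and mac_bias_trailing and the sample values are hypothetical, and integer arithmetic is assumed so that both orders yield identical results.

```python
def mac_bias_preloaded(weight_rows, inputs, bias):
    """Pre-load the bias vector into the output vector, then accumulate."""
    output = list(bias)  # output vector starts as a copy of b
    for i, row in enumerate(weight_rows):
        for weight, value in zip(row, inputs):
            output[i] += weight * value
    return output


def mac_bias_trailing(weight_rows, inputs, bias):
    """Accumulate from zero, then add the bias vector at the end."""
    output = [0] * len(weight_rows)
    for i, row in enumerate(weight_rows):
        for weight, value in zip(row, inputs):
            output[i] += weight * value
    return [acc + b for acc, b in zip(output, bias)]


# Hypothetical data: a 2x2 weight matrix, a 2-element input and bias vector.
weights = [[1, 2], [3, 4]]
inputs = [5, 6]
bias = [10, 20]
print(mac_bias_preloaded(weights, inputs, bias))  # [27, 59]
print(mac_bias_trailing(weights, inputs, bias))   # [27, 59]
```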
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767. The examiner can normally be reached 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.W./
Carlo Waje
Examiner, Art Unit 2182
(571)272-5767




/MICHELLE T BECHTOLD/
Primary Examiner, Art Unit 2183