DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant’s amendment filed on December 10, 2021 has been considered and entered. 
Accordingly, claims 1-20 are pending in this application. Claims 1, 14-16, 18, and 20 are currently amended; claims 2-13, 17, and 19 are original.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 6-11, 14, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US Patent No. 11,200,490 B2), hereinafter Park, in view of Huang et al. (US 2020/0410337 A1), hereinafter Huang.
Regarding claim 1, Park teaches
a plurality of processing elements, wherein each processing element of the plurality of processing elements includes a corresponding convolution processor unit that is configured to perform  (Park Fig. 3 plurality of processing elements – neural engines 314A-314N; Fig. 4 convolution processor unit – MAC 404, and Computation core 416; Figs. 9-10 and col 14 lines 24-65 “FIG. 10B is a conceptual diagram illustrating a group convolution mode in which independent convolution operations are executed in parallel by neural engines 314A through 314N”);
obtain a portion of data elements in a convolution data matrix for processing by the same corresponding convolution processor unit, wherein the portion of data elements for the same corresponding convolution processor unit includes different elements from a plurality of different convolution groups included in the convolution data matrix (Park Figs. 9-10 and col 12 lines 44-62; portion of data elements – Cin0-Cin5);
determine multiplication results by multiplying each data element of the portion of data elements in the convolution data matrix with a corresponding data element in a corresponding groupwise convolution weight matrix among a plurality of convolution weight matrices, wherein the portion of data elements in the convolution data matrix that are multiplied by the same corresponding convolution processor unit belong to a plurality of different channels of the convolution data matrix and the plurality of different convolution groups within the convolution data matrix (col 7 lines 60-65 “The neural engine 314 receives the input data 322, performs multiply-accumulate operations (e.g., convolution operations) on the input data 322 based on stored kernel data”; col 8 lines 39-41 “The input value and the corresponding kernel coefficient are multiplied in each of MAD circuits to generate a processed value 412”); and
for each specific channel of the plurality of different channels, sum together at least some of the multiplication results belonging to the same specific channel to determine a corresponding channel convolution result data element (Park col 7 lines 65-67; col 8 lines 42-51; col 9 lines 4-10).
	Park does not explicitly teach wherein each processing element of the plurality of processing elements includes a corresponding convolution processor unit that is configured to perform a portion of a groupwise convolution; and wherein the plurality of processing elements is configured to sum together different portions of the channel convolution result data elements from a group of different convolution processor units included in the plurality of processing elements to determine a corresponding groupwise convolution result data element for each convolution group of the plurality of different convolution groups.
	However, in the same field of endeavor, Huang discloses splitting convolution operations into sub-operations to be performed by multiple accelerators. Further, Huang discloses summing together different portions of the sub-operations result data elements from a group of different convolution processor units included in the plurality of accelerators to determine a convolution result data element (Huang Fig. 20 and paragraph [0143] "The partial sum feature maps for the output feature maps generated by the K accelerators are not the final output feature maps of the convolution operation, and additional accumulation may be needed to generate the final output feature maps"). Further, Huang discloses a process and arrangement of adding partial results using an adder by taking in as inputs a partial result from another processing element using an input data bus and a result of the same processing element that includes the adder to generate a new partial sum that can be input to another processing element using an output data bus (Huang Figs. 7-8 and paragraphs [0090, 0101]).
	Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang and configure the system of Park to split the group convolution into sub-operations as taught by Huang. For example, assigning the sub-operations performed on Cin0 and Cin3 to a first neural engine 314A, assigning the sub-operations performed on Cin1 and Cin4 to a second neural engine 314B, and assigning the sub-operations performed on Cin2 and Cin5 to a third neural engine 314C. Further, configure the neural engine to include an adder to perform the summation of the different sub-operation results to determine the result Cout0-Cout3 as shown in Fig. 9 of Park by configuring each of the adders to receive the partial sum feature map from a downstream neural engine and the partial sum feature map of the corresponding neural engine where the adder is located consistent with the adder arrangement shown in figures 7 and 8 to generate a new partial feature map that can be input to an upstream accelerator. In other words, chain each neural engine 314A-314C to receive the partial sum feature map generated by the previous neural engine and add it with the partial sum feature map of the current neural engine until all partial sum feature maps are accumulated to generate the final output feature map Cout0-Cout3. As discussed, Huang discloses performing additional accumulation to generate the final output feature maps, and one of ordinary skill in the art would have been capable of applying the known technique of using an adder to perform accumulation to a known device that was ready for improvement and the results would have been predictable to one of ordinary skill in the art. See MPEP 2141.III.D. Furthermore, the motivation to split the convolution operations into multiple sub-operations to be performed by multiple neural engines is to reduce latency for making an inference (Huang paragraph [0129]).
	Therefore, the combination of Park as modified in view of Huang teaches wherein each processing element of the plurality of processing elements includes a corresponding convolution processor unit that is configured to perform a portion of a groupwise convolution; and wherein the plurality of processing elements is configured to sum together different portions of the channel convolution result data elements from a group of different convolution processor units included in the plurality of processing elements to determine a corresponding groupwise convolution result data element for each convolution group of the plurality of different convolution groups.
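The channel assignment discussed in the claim 1 analysis can be illustrated with a minimal Python sketch. This is not taken from Park or Huang; it only models the arithmetic for a simplified 1x1 group convolution at a single spatial position. The layout (six input channels Cin0-Cin5 in two groups of three, four outputs Cout0-Cout3), the sample values, and the function names are assumptions made for illustration.

```python
# Assumed layout: 2 groups, 3 input channels per group, 2 output channels per group.
GROUPS = 2
CH_PER_GROUP = 3
OUT_PER_GROUP = 2
NUM_OUT = GROUPS * OUT_PER_GROUP


def group_conv_reference(x, w):
    """Full group convolution on one engine.

    x[c] is the value of input channel c; w[o][k] is the weight applied by
    output o to the k-th channel of its own group.
    """
    out = []
    for o in range(NUM_OUT):
        g = o // OUT_PER_GROUP  # group that output o belongs to
        out.append(sum(x[g * CH_PER_GROUP + k] * w[o][k]
                       for k in range(CH_PER_GROUP)))
    return out


def engine_partial(x, w, k):
    """Partial sums from the engine that holds channel position k of every
    group (k=0 -> Cin0 and Cin3, k=1 -> Cin1 and Cin4, k=2 -> Cin2 and Cin5)."""
    out = []
    for o in range(NUM_OUT):
        g = o // OUT_PER_GROUP
        out.append(x[g * CH_PER_GROUP + k] * w[o][k])
    return out


x = [1, 2, 3, 4, 5, 6]                              # Cin0..Cin5 (sample values)
w = [[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 1]]    # weights for Cout0..Cout3

# One partial result per hypothetical engine (314A, 314B, 314C in the example).
partials = [engine_partial(x, w, k) for k in range(CH_PER_GROUP)]

# Element-wise accumulation of the three partial results.
summed = [sum(column) for column in zip(*partials)]
```

Under these assumptions, the accumulated partial results equal the result of running the whole group convolution on a single engine, which is the property the proposed split relies on.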


	Regarding claim 2, Park as modified in view of Huang teaches:
a point-to-point connection between a first processing element of the plurality of processing elements and a second processing element of the plurality of processing elements, wherein the point-to-point connection is configured to provide at least a result of the corresponding convolution processor unit of the first processing element to a reduction unit component of the second processing element, wherein the second processing element is configured to reduce at least the provided result of the corresponding convolution processor unit of the first processing element with a result of the corresponding convolution processor unit of the second processing element and output a reduced result (Huang Figs. 7-8 show input data bus 824 for receiving the partial sum p_in from a downstream processing element to connect the two processing elements together. Configuring each of the neural engines 314A-N of Park would include an input data bus for receiving the partial result from a downstream neural engine and the adder would add the partial result generated by the corresponding neural engine to obtain a new partial feature map, where the adder corresponds to the reduction unit component; the new partial feature map corresponds to the reduced result, and the input data bus 824 for receiving the partial sum p_in from a downstream processing element corresponds to the point-to-point connection between the first processing element of the plurality of processing elements and the second processing element. See also claim 1 analysis. The reason to combine is the same as claim 1); and
	a communication bus connecting together at least the first processing element and the second processing element (Park col 6 line 64 – col 7 line 5; communication bus – data line used to send instructions to other components of the neural processor circuit 218).
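The chained reduction mapped above (a partial sum arriving on p_in, an adder acting as the reduction unit component, and a reduced result leaving on p_out) can be sketched as follows. The class and function names are hypothetical, the sample vectors are illustrative, and the model is a software analogy only; it is not the circuit of Huang Figs. 7-8.

```python
class ProcessingElement:
    """Hypothetical processing element: holds the result of its own
    convolution processor unit and an adder used as the reduction unit."""

    def __init__(self, local_result):
        self.local_result = list(local_result)

    def reduce(self, p_in=None):
        """Return p_out: the local result added element-wise to p_in.

        The first element in the chain has no incoming partial sum, so it
        forwards its local result unchanged.
        """
        if p_in is None:
            return list(self.local_result)
        return [a + b for a, b in zip(p_in, self.local_result)]


def run_chain(elements):
    """Pass the running partial sum from one element to the next, modeling
    one point-to-point connection per neighboring pair."""
    partial = None
    for element in elements:
        partial = element.reduce(partial)
    return partial


# Illustrative partial results for three chained elements.
chain = [
    ProcessingElement([1, 0, 8, 4]),
    ProcessingElement([0, 2, 10, 5]),
    ProcessingElement([6, 3, 0, 6]),
]
final = run_chain(chain)
```

In this sketch each element only ever sees the output of its immediate predecessor, which is the sense in which the connection is point-to-point rather than a shared bus.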

	Regarding claim 3, Park as modified in view of Huang teaches wherein the reduction unit component includes an adder (Huang Figs. 7-8 adder – adder 825. The reason to combine is the same as claim 1).

	Regarding claim 6, Park as modified in view of Huang teaches all the limitations of claim 2 as stated above. Further, Park as modified in view of Huang teaches wherein the first and second processing elements are configured to receive a convolution operation instruction via the communication bus (Park col 6 line 64 – col 7 line 5).

	Regarding claim 7, Park as modified in view of Huang teaches all the limitations of claim 2 as stated above. Further, Park as modified in view of Huang teaches wherein the reduction unit component is configured to add together the result of the corresponding convolution processor unit of the first processing element and the result of the corresponding convolution processor unit of the second processing element (see claim 1 analysis. The reason to combine is the same as claim 1).

	Regarding claim 8, Park as modified in view of Huang teaches all the limitations of claim 2 as stated above. Further, Park as modified in view of Huang teaches the system further comprising a second point-to-point connection configured to send the reduced result to a third processing element of the plurality of processing elements, and wherein the second point-to-point connection connects the second processing element to the third processing element (Park third processing element – third neural engine of the neural engines 314; Huang Figs. 7-8 and 20, paragraph [0174] second point-to-point connection – p_out data bus 826; further, the modification to claim 1 to chain each of the neural 

	Regarding claim 9, Park as modified in view of Huang teaches all the limitations of claim 8 as stated above. Further, Park as modified in view of Huang teaches wherein the third processing element includes a second reduction unit component and the second reduction unit component is connected to the second point-to-point connection (Huang Figs. 7-8 and 20, paragraph [0174]. See also claim 8 analysis. The reason to combine is the same as claim 1).

	Regarding claim 10, Park as modified in view of Huang teaches all the limitations of claim 2 as stated above. Further, Park as modified in view of Huang teaches wherein the result of the corresponding convolution processor unit of the first processing element is a first  (Park Fig. 9 and col 12 lines 44-62).
	Park does not explicitly teach wherein the result of the corresponding convolution processor unit of the first processing element is a first vector of channel convolution result data elements and the result of the corresponding convolution processor unit of the second processing element is a second vector of channel convolution result data elements.
	However, in the same field of endeavor, Huang discloses flattening the result data elements output by the PE array to generate a vector of result data elements (Huang Figs. 7-9 and 20 and paragraphs [0106-0108]).
	Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang using Huang and flatten the result
	Therefore, the combination of Park as modified in view of Huang teaches wherein the result of the corresponding convolution processor unit of the first processing element is a first vector of channel convolution result data elements and the result of the corresponding convolution processor unit of the second processing element is a second vector of channel convolution result data elements.
	
	Regarding claim 11, Park as modified in view of Huang teaches all the limitations of claim 1 as stated above. Further, Park as modified in view of Huang teaches wherein the corresponding convolution processor unit of at least one of the plurality of processing elements includes a plurality of calculation units (Park Fig. 4 plurality of calculation units – MAD0-N circuits).

	Regarding claim 14, Park as modified in view of Huang teaches all the limitations of claim 11 as stated above. Further, Park as modified in view of Huang teaches wherein each calculation unit of the plurality of calculation units is configured to receive a plurality of data elements corresponding to a same channel of a different convolution group of the convolution data matrix and a plurality of corresponding weight elements corresponding to a same channel of a groupwise convolution weight matrix among the plurality of convolution weight matrices (Park col 8 line 65 – col 9 line 7, col 14 lines 51-58. Using the neural engine 314A to process a portion of the first group and the second group in parallel would result in the MAD circuits receiving data elements from a same channel of a different convolution group and corresponding weight elements. For example, with reference to Fig. 9, Cin0 and Cin3 correspond to the same, i.e. first, channel of the first group and the second group respectively).

	Regarding claim 17, Park as modified in view of Huang teaches wherein the convolution data matrix is a three-dimensional machine learning data matrix (Park Figs. 6, 9).

	Regarding claim 18, Park teaches a method comprising:
	determining a first processing result using a first convolution processor unit of a first processing element, wherein the first processing result includes channel convolution result data elements corresponding to a plurality of different channels and a plurality of different convolution groups within a convolution data matrix, and determining the first processing result includes obtaining a portion of data elements in the convolution data matrix belonging to the plurality of different channels and the plurality of different convolution groups (Park Fig. 9 first processing result – Cout0-Cout3; Fig. 4 a first convolution processor unit – MAC 404 and Computation core 416; a first processing element – neural engine 314A; plurality of different channels – input channels Cin0-Cin5; plurality of different convolution groups – group 1 and group 2; convolution data matrix – input data; col 12 lines 44-62; col 14 lines 35-58).
	Park does not explicitly teach providing the first processing result to a reduction unit component of a second processing element via a first point-to-point connection; determining a second processing result using a second convolution processor unit of the second processing element; providing the second processing result to the reduction unit component of the second processing element; summing together the channel convolution result data elements of the first processing result with corresponding channel convolution result data elements of the second processing result to create a reduced result including a plurality of different groupwise convolution result data elements for the plurality of different convolution groups; and sending the reduced result to a third processing element via a second point-to-point connection.
	However, in the same field of endeavor, Huang discloses splitting convolution operations into sub-operations to be performed by multiple accelerators. Further, Huang discloses summing together different portions of the sub-operations result data elements from a group of different convolution processor units included in the plurality of accelerators to determine a convolution result data element (Huang Fig. 20 and paragraph [0143] "The partial sum feature maps for the output feature maps generated by the K accelerators are not the final output feature maps of the convolution operation, and additional accumulation may be needed to generate the final output feature maps"). Further, Huang discloses a process and arrangement of adding partial results using an adder by taking in as inputs a partial result from another processing element using an input data bus and a result of the same processing element that includes the adder to generate a new partial sum that can be input to another processing element using an output data bus (Huang Figs. 7-8 and paragraphs [0090, 0101]).
	Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang and configure the system of Park to split the group convolution performed by neural engine 314A into sub-operations to be performed by multiple neural engines 314 as taught by Huang. For example, assigning the sub-operations performed on Cin0 and Cin3 to a first neural engine 314A, assigning the sub-operations performed on Cin1 and Cin4 to a second neural engine 314B, and assigning the sub-operations performed on Cin2 and Cin5 to a third neural engine 314C. Further, configure the neural engine to include an adder to perform the summation of the different sub-operation results to determine the result Cout0-Cout3 as shown in Fig. 9 of Park by configuring each of the adders to receive the partial sum feature map from a downstream neural engine using a p_in input bus such as input bus 824 and the partial sum feature map of the corresponding neural engine where the adder is located consistent with the adder arrangement shown in figures 7 and 8 to generate a new partial feature map that can be input to an upstream accelerator using a p_out data bus such as data bus 826. In other words, chain each neural engine 314A-314C to receive the partial sum
Therefore, the combination of Park as modified in view of Huang teaches providing the first processing result to a reduction unit component of a second processing element via a first point-to-point connection; determining a second processing result using a second convolution processor unit of the second processing element; providing the second processing result to the reduction unit component of the second processing element; summing together the channel convolution result data elements of the first processing result with corresponding channel convolution result data elements of the second processing result to create a reduced result including a plurality of different groupwise convolution result data elements for the plurality of different convolution groups; and sending the reduced result to a third processing element via a second point-to-point connection.
	
	Regarding claim 20, Park teaches a system comprising:
	a first processing element including a first convolution processor unit and […] (Park Figs. 3-4 first processing element – neural engine 314A; first convolution processor unit – MAC 404 and computation core 416 of 314A);
a second processing element including a second convolution processor unit and […] (Park Figs. 3-4 second processing element – neural engine 314B; second convolution processor unit – MAC 404 and computation core 416 of 314B);
a third processing element including a third convolution processor unit and […] (Park Figs. 3-4 third processing element – neural engine 314C or 314N; third convolution processor unit – MAC 404 and computation core 416 of 314C or 314N);
a communication bus connecting together at least the first processing element, the second processing element, and the third processing element (Park col 6 line 64 – col 7 line 5; communication bus – data line used to send instructions to other components of the neural processor circuit 218).
	Further, Park discloses performing group convolution and different configurations of the neural engines 314 for performing group convolutions including processing a first group and a second group in parallel using a neural engine 314A (Park Figs. 9-10 and col 12 line 44 – col 15 line 35).
	Park does not explicitly teach the first, second, and third processing element includes a corresponding first, second, and third reduction unit; a first point-to-point connection between the first reduction unit component of the first processing element and the second reduction unit component of the second processing element, wherein the first point-to-point connection is configured to provide at least a first output result of the first reduction unit component to the second reduction unit component, and wherein the second reduction unit component is configured to output a second output result including a plurality of different groupwise convolution result data elements for a plurality of different convolution groups by summing together at least the first output result with channel convolution results of the second convolution processor unit, wherein the channel convolution results of the second convolution processor unit correspond to a plurality of different channels and the plurality of different convolution groups within a convolution data matrix, and the second convolution processor unit is configured to obtain for determining the channel convolution results of the second convolution processor unit a portion of data elements in the convolution data matrix belonging to the plurality of different channels and the plurality of different convolution groups; a second point-to-point connection between the second reduction unit component of the second processing element and the third reduction unit component of the third processing element, wherein the second point-to-point connection is configured to provide at least the second output result of the second reduction unit component to the third reduction unit component.
	However, in the same field of endeavor, Huang discloses splitting convolution operations into sub-operations to be performed by multiple accelerators. Further, Huang discloses summing together different portions of the sub-operations result data elements from a group of different convolution processor units included in the plurality of accelerators to determine a convolution result data element (Huang Fig. 20 and paragraph [0143] "The partial sum feature maps for the output feature maps generated by the K accelerators are not the final output feature maps of the convolution operation, and additional accumulation may be needed to generate the final output feature maps"). Further, Huang discloses a process and arrangement of adding partial results using an adder by taking in as inputs a partial result from another processing element using an input data bus and a result of the same processing element that includes the adder to generate a new partial sum that can be input to another processing element using an output data bus (Huang Figs. 7-8 and paragraphs [0090, 0101]).
	Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang and configure the system of Park to split the group convolution performed by neural engine 314A into sub-operations to be performed by multiple neural engines 314 as taught by Huang. For example, assigning the sub-operations performed on Cin0 and Cin3 to a first neural engine 314A, assigning the sub-operations performed on Cin1 and Cin4 to a second neural engine 314B, and assigning the sub-operations performed on Cin2 and Cin5 to a third
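Therefore, the combination of Park as modified in view of Huang teaches a system comprising the first, second, and third processing element includes a corresponding first, second, and third reduction unit; a first point-to-point connection between the first reduction unit component of the first processing element and the second reduction unit component of the second processing element, wherein the first point-to-point connection is configured to provide at least a first output result of the first reduction unit component to the second reduction unit component, and wherein the second reduction unit component is configured to output a second output result including a plurality of different groupwise convolution result data elements for a plurality of different convolution groups by summing together at least the first output result with channel convolution results of the second convolution processor unit, wherein the channel convolution results of the second convolution processor unit correspond to a plurality of different channels and the plurality of different convolution groups within a convolution data matrix, and the second convolution processor unit is configured to obtain for determining the channel convolution results of the second convolution processor unit a portion of data elements in the convolution data matrix belonging to the plurality of different channels and the plurality of different convolution groups; a second point-to-point connection between the second reduction unit component of the second processing element and the third reduction unit component of the third processing element, wherein the second point-to-point connection is configured to provide at least the second output result of the second reduction unit component to the third reduction unit component.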

Claims 12 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Park in view of Huang as applied to claim 11 above, and further in view of Liu et al. (US-PGPUB 2021/0124560 A1), hereinafter Liu.
Regarding claim 12, Park as modified in view of Huang teaches all the limitations of claim 11 as stated above. Further, Park as modified in view of Huang teaches wherein each calculation unit of the plurality of calculation units includes a different  (Park Fig. 4 and col 8 lines 33-48. It is noted that although the specific structure of the MAD circuits is not shown, it is implied that each MAD circuit includes a multiply circuit and add circuit for performing the multiply and add operations).
Park does not explicitly teach that each calculation unit includes a different vector multiply unit and a different vector adder unit.

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang using Liu and configure each MAD circuit of Park similar to Fig. 8 or 14 of Liu.
The motivation to do so is because the vector multiply and accumulate (VMAC) unit of Liu advantageously performs 2/4 MAC operations in a single VMAC processing cycle, which reduces hardware power consumption, hardware costs, and processing latency (Liu paragraphs [0109, 0143]).
Therefore, the combination of Park as modified in view of Huang and Liu teaches wherein each calculation unit of the plurality of calculation units includes a different vector multiply unit and a different vector adder unit.

Regarding claim 15, Park as modified in view of Huang teaches all the limitations of claim 14 as stated above. Further, Park teaches wherein at least one of the plurality of processing elements further includes a data input unit (Park Fig. 4 and col 8 lines 6-10 data input unit – input buffer circuit 402).
Further, Huang teaches that each row of PE array processes one input data set comprising multiple input data elements, such as a one-dimensional vector representing a flattened multi-dimensional matrix (Huang paragraph [0101]).
Park does not explicitly teach wherein at least one of the plurality of processing elements further includes a data input unit configured to: process the plurality of data elements corresponding to the same channel of the different convolution group of the convolution data matrix into a data input vector, wherein the data input vector includes data elements corresponding to a two-dimensional sub-matrix of the convolution data matrix.

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang using Liu and configure the neural engine to use a register such as register 220 of Liu for storing the input data for the neural network and for flattening the input data submatrices of the input data matrix into a data input vector. The substitution of one known element for another yields predictable results to one of ordinary skill in the art. The predictable result is a memory for storing the input data and flattening the input data submatrices of the input data matrix into a data input vector for operation by the MAC circuit/computation core. See MPEP 2141.III.B.
Therefore, the combination of Park as modified in view of Huang and Liu teaches wherein at least one of the plurality of processing elements further includes a data input unit configured to: process the plurality of data elements corresponding to the same channel of the different group of the convolution data matrix into a data input vector, wherein the data input vector includes data elements corresponding to a two-dimensional sub-matrix of the convolution data matrix.

Regarding claim 16, Park as modified in view of Huang teaches all the limitations of claim 14 as stated above. Further, Park teaches wherein at least one of the plurality of processing elements further includes a channel weight input unit (Park Fig. 4 and col 8 lines 19-32 channel weight input unit – kernel extract circuit 432).
Park does not explicitly teach wherein the channel weight input unit is configured to: process the plurality of corresponding weight elements corresponding to the same channel of the groupwise convolution weight matrix among the plurality of convolution weight matrices into a weight input vector, wherein the weight input vector includes data elements corresponding to a two-dimensional sub-matrix of the groupwise convolution weight matrix.
However, in the same field of endeavor, Liu teaches a convolution processor unit that includes a channel weight input unit configured to process the plurality of corresponding weight elements of the convolution weight matrix among the plurality of convolution weight matrices into a weight input vector, wherein the weight input vector includes data elements corresponding to a two-dimensional sub-matrix of the groupwise convolution weight matrix (Liu Figs. 7, 9, and 12B and paragraphs [0104, 0133] channel weight input unit – register 230).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify Park in view of Huang using Liu and configure the neural engines to use a register such as register 230 of Liu for storing the channel weight data for the neural network and for flattening the channel weight data submatrices of the weight data matrix into a weight data vector. The substitution of one known element for another yields predictable results to one of ordinary skill in the art. The predictable result is a memory for storing the weight data and flattening the weight data submatrices of the weight data matrix into a weight data vector for operation by the MAC circuit/computation core. See MPEP 2141.III.B.
Therefore, the combination of Park as modified in view of Huang and Liu teaches wherein at least one of the plurality of processing elements further includes a channel weight input unit configured to: process the plurality of corresponding weight elements corresponding to the same channel of the groupwise convolution weight matrix among the plurality of convolution weight matrices into a weight input vector, wherein the weight input vector includes data elements corresponding to a two-dimensional sub-matrix of the groupwise convolution weight matrix.
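The flattening recited in the claim 15 and claim 16 limitations (a two-dimensional sub-matrix processed into a data input vector or a weight input vector) can be illustrated with a short Python sketch. The row-major ordering, the 3x3 window size, the sample values, and the function names are assumptions chosen for illustration; neither Park nor Liu is quoted as specifying them.

```python
def flatten(submatrix):
    """Flatten a 2D sub-matrix into a 1D vector (row-major order assumed)."""
    return [value for row in submatrix for value in row]


def vector_mac(data_vector, weight_vector):
    """Element-wise multiply (vector multiply) followed by a sum of the
    products (vector add), yielding one channel convolution result."""
    products = [d * w for d, w in zip(data_vector, weight_vector)]
    return sum(products)


# One 3x3 window of a single input channel and the matching 3x3 weights.
data_window = [[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]]
weights = [[1, 0, -1],
           [1, 0, -1],
           [1, 0, -1]]

data_input_vector = flatten(data_window)
weight_input_vector = flatten(weights)
result = vector_mac(data_input_vector, weight_input_vector)
```

Once both operands are vectors of equal length, the per-window convolution reduces to a single multiply-and-sum over paired elements, which is why the same flattening is applied to the data and the weights.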

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Park in view of Huang as applied to claim 3 above, and further in view of Lee (US-PGPUB 20090077449 A1).
Regarding claim 4, Park as modified in view of Huang teaches all the limitations of claim 3 as stated above.
Park does not explicitly teach wherein the adder is a vector adder.
However, in the same field of endeavor, Lee discloses a vector adder that is used for performing an addition operation and produces a vector output (Lee Fig. 7 and paragraph [0061]).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Park in view of Huang using Lee and replace the adder with the vector adder of Lee to sum together the partial sum feature maps. The substitution of one known adder for performing accumulation for another adder such as a vector adder yields predictable results to one of ordinary skill in the art. The predictable result is an adder circuit for performing a summation operation. See MPEP 2141.III.B.
Therefore, the combination of Park as modified in view of Huang and Lee teaches wherein the adder is a vector adder.

Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Park in view of Huang as applied to claims 3 and 18 above respectively, and further in view of Park et al. (US-PGPUB 20180315155 A1), hereinafter Park ‘155.
Regarding claim 5, Park as modified in view of Huang teaches all the limitations of claim 3 as stated above.
Park does not explicitly teach wherein the reduction unit component further includes a multiplexer.
However, in the same field of endeavor, Park ‘155 teaches a reduction unit component, i.e. a channel merger circuit 506 that includes a multiplexer 528 (Park ‘155 Fig. 5).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Park in view of Huang using Park ‘155 and configure the reduction unit to further include a multiplexer.
The motivation to do so is to provide different modes for reducing/merging the output stream of the neural engines, such as dual-convolution mode, cascade mode, and parallel mode (Park ‘155 paragraphs [0004] and [0086]).
Therefore, the combination of Park as modified in view of Huang and Park ‘155 teaches wherein the reduction unit component further includes a multiplexer.

Regarding claim 19, Park as modified in view of Huang teaches all the limitations of claim 18 as stated above. Further, Park as modified in view of Huang teaches wherein the reduction unit component includes an adder (Huang Figs. 7-8 adder - adder 825. The reason to combine is the same as claim 18).
Park does not explicitly teach wherein the reduction unit component includes a multiplexer.
However, in the same field of endeavor, Park ‘155 teaches a reduction unit component, i.e. a channel merger circuit 506 that includes a multiplexer 528 (Park ‘155 Fig. 5).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Park in view of Huang using Park ‘155 and configure the reduction unit to include a multiplexer.

Therefore, the combination of Park as modified in view of Huang and Park ‘155 teaches wherein the reduction unit component includes a multiplexer.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Park in view of Huang and Liu as applied to claim 12 above, and further in view of Fowers et al. (US-PGPUB 2019/0325296 A1), hereinafter Fowers.
Regarding claim 13, Park as modified in view of Huang and Liu teaches all the limitations of claim 12 as stated above.
Park does not explicitly teach wherein each of the different vector adder units includes a different adder tree.
However, in the same field of endeavor, Fowers discloses a dot product unit comprising multipliers and accumulators arranged as an adder tree (Fowers Fig. 3 and paragraph [0050]).
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Park in view of Huang and Liu using Fowers and configure each of the vector adder units as an adder tree accumulator circuit as shown in Fig. 3 of Fowers using simple substitution. The substitution of one known element, such as the accumulator arranged as an adder tree shown in Fig. 3 of Fowers, for another accumulator circuit yields predictable results to one of ordinary skill in the art. The predictable result is an accumulator circuit for performing a summation operation. See MPEP 2141.III.B.
Therefore, Park as modified in view of Huang, Liu and Fowers teaches wherein each of the different vector adder units includes a different adder tree.
Response to Arguments
In view of the amendments made, the objection to the drawings and the 35 U.S.C. 112(b) rejection of claims 1-19 have been withdrawn.
Applicant’s arguments, see remarks pages 1-3, filed 12/10/2021, with respect to the rejection of claims 1-20 under 35 U.S.C. 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of amendments made and newly found prior art.
In response to applicant’s arguments with respect to the 35 U.S.C. 103 rejection of claims 1-20, applicant argued that Figure 20 and paragraphs [0141-0143] of Huang teach that each of its accelerators is assigned a single group and do not teach performing portions of multiple groupwise convolutions in parallel within a same processing element. Therefore, applicant argued, Huang does not teach that the corresponding convolution processor unit is configured to: “obtain a portion of data elements in a convolution data matrix for processing by the same corresponding convolution processor unit, wherein the portion of data elements for the same corresponding convolution processor unit includes different elements from a plurality of different convolution groups included in the convolution data matrix”; determine multiplication results by multiplying each data element of the portion of data elements in the convolution data matrix with a corresponding data element in a corresponding groupwise convolution weight matrix among a plurality of convolution weight matrices, wherein the portion of data elements in the convolution data matrix that are multiplied “by the same corresponding convolution processor unit” belong to a plurality of different channels “of the convolution data matrix” and “the plurality of different convolution groups” within the convolution data matrix; and wherein the plurality of processing elements is configured to sum together “different portions” of the channel convolution result data elements from a group of different convolution processor units included in the plurality of processing elements to determine a “corresponding groupwise convolution result data element for each 
Examiner agrees that Huang does not teach performing portions of multiple groupwise convolutions in parallel within a same processing element; therefore, Huang does not teach obtaining a portion of data elements in a convolution data matrix wherein the portion of data elements for the same corresponding convolution processor unit includes “different elements from a plurality of different convolution groups” included in the convolution data matrix. However, Park discloses an apparatus and a method for performing group convolutions and discloses performing convolution on a first group and a second group in parallel using the same neural engine 314A (col 14 lines 51-58). In conjunction with Fig. 9 of Park, performing the convolution operation on group 1 and group 2 in parallel using the same neural engine 314A would include obtaining a portion of data elements in Cin0-Cin5 of the input data 322/convolution data matrix, wherein the portion of data elements for the same corresponding convolution processor unit includes different elements from a plurality of different convolution groups, i.e. group 1 and group 2 included in the input data 322. Further, as shown in Fig. 9 of Park, the portion of data elements in the convolution data matrix that are multiplied by the neural engine 314 belong to a plurality of different channels, input channels Cin0-Cin5, of the convolution data matrix and the plurality of different convolution groups, group 1 and group 2, within the input data. Furthermore, the combination of Park and Huang teaches splitting the group convolutions, performing portions of the group convolutions in multiple neural engines, and accumulating the partial results to generate the final output.
Therefore, the combination of Park and Huang teaches summing together different portions of the channel convolution result data elements from a group of different convolution processor units included in the plurality of processing elements to determine a corresponding groupwise convolution result data element for each convolution group of the plurality of different convolution groups.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767.  The examiner can normally be reached on 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached on (571) 270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.



/C.W./
Carlo Waje
Examiner, Art Unit 2182
(571)272-5767


/EMILY E LAROCQUE/
Primary Examiner, Art Unit 2182