DETAILED ACTION

Status of Application
Claims 1-20 are pending in the present application.


Information Disclosure Statement
The information disclosure statement (IDS) submitted on 02/18/2022 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.


Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
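For illustration only, the three-prong test above can be read as a conjunction of three conditions. The following is a minimal sketch, assuming a hypothetical `Limitation` record whose fields are filled in by a human reader; the class, field, and function names are not drawn from the MPEP, and the sketch does not model the rebuttable presumptions.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    """Hypothetical record describing how one claim limitation is drafted."""
    text: str
    uses_means_or_placeholder: bool      # prong (A): "means", "step", or a generic placeholder
    has_functional_language: bool        # prong (B): e.g. linked by "for" or "configured to"
    recites_sufficient_structure: bool   # prong (C) is met when this is False


def meets_three_prong_test(lim: Limitation) -> bool:
    """Return True only when prongs (A), (B), and (C) are all satisfied."""
    prong_a = lim.uses_means_or_placeholder
    prong_b = lim.has_functional_language
    prong_c = not lim.recites_sufficient_structure
    return prong_a and prong_b and prong_c
```

Under these assumptions, a limitation drafted as a generic placeholder plus functional language, with no structure recited, satisfies all three prongs; reciting sufficient structure defeats prong (C).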
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “data input unit”; “weight input unit” in claims 7 and 8, respectively.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 6-10 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al (hereinafter Huang), U.S. Publication No. 2020/0090030 A1, in view of Kopinsky, U.S. Publication No. 2020/0218978 A1, in view of Smelyanskiy, U.S. Publication No. 2016/0179540 A1.
	Referring to claim 1, Huang discloses a processor system, comprising:	
memory [paragraph 36, fig. 3B, perform depthwise convolution over data in HRAM 120] configured to store data elements [fig. 3B, see data elements of input arrays] of a plurality of channels [fig. 3B, each row of fig. 3B corresponding to a channel; also see paragraph 37, “The term ‘input array’ refers to a channel of a cuboid in an input image”] of a portion of a convolution data matrix [figs. 3A-3B, input arrays of an input image (shown in fig. 3A to be a convolution data matrix)], wherein each memory stores at least one data element from each of the plurality of channels [fig. 3B, data in HRAM 120; each input array storing at least one data element from each channel];
second memory [paragraph 27, HRAM 120; the DSPs 140 read its corresponding coefficients from the flash memory 150 via the flash control interface 130 and temporarily store them in HRAM 120. During the convolution operation, the DSPs 140 instruct the MAC circuits 111 via the control bus 142 according to the programs in the data/program internal memory 141 to perform related multiplications and accumulations over the image data and coefficients in HRAM 120] configured to store data elements [fig. 3B, data elements of filters] of a plurality of convolution weight matrices [fig. 3B, filter matrices] including a separate convolution weight matrix [fig. 3B, filter Kd(1), Kd(2), Kd(M)] for each of the plurality of channels [fig. 3B, each row corresponding to a channel; see also paragraph 37]; and
a hardware channel convolution processor unit [fig. 1A] configured to:
for each data element, multiply the data element with a corresponding data element in the second memory to determine a corresponding multiplication result in multiplication results [paragraphs 27, 36-37, figs. 3A-3B, During the convolution operation…perform related multiplications and accumulations over the image data and coefficients in HRAM 120; performing convolution operation to produce output results (see fig. 3B)]; and
for each specific channel of the plurality of channels, sum together ones of the multiplication results corresponding to the specific channel to determine one corresponding channel convolution result data element in a corresponding channel convolution result matrix [paragraphs 27, 36-37, figs. 3A-3B, During the convolution operation…perform related multiplications and accumulations over the image data and coefficients in HRAM 120; performing convolution operation to produce output arrays (see fig. 3B); see stacked output array].
Huang does not explicitly disclose a first group of registers.
However, Kopinsky discloses a first group of registers [figs. 4A-4B, paragraph 97, embodiments of the invention may load (or broadcast, as commonly referred to in the art) a specific, non-zero input data element (e.g., 1I1) to occupy all values of one or more vector registers (e.g., element 4B of FIG. 2); see convolution data matrix shown as INPUT (I) in fig. 4A being loaded into input vector registers].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Kopinsky in the invention of Huang, to implement a first group of registers, in order to provide efficient compressed convolution so as to allow commercially available CPUs to provide NN execution performance that may be competitive with accelerators, GPUs and TPUs [Kopinsky, paragraphs 2, 8].
The modified Huang does not explicitly disclose a second group of registers;
wherein each register of the second group of registers stores at least one data element from each of the plurality of convolution weight matrices.
However, Smelyanskiy discloses a second group of registers [paragraph 161, This example code may begin by broadcasting all kernel weights into spare vector registers];
wherein each register of the second group of registers stores at least one data element from each of the plurality of convolution weight matrices [paragraph 161, This example code may begin by broadcasting all kernel weights into spare vector registers].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Smelyanskiy in the invention of the modified Huang, to implement a second group of registers; wherein each register of the second group of registers stores at least one data element from each of the plurality of convolution weight matrices, in order to provide efficient execution of calculations [Smelyanskiy, paragraph 34] while having as many instructions execute as fast as possible [Smelyanskiy, paragraph 39].
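For illustration only, the two operations recited for the hardware channel convolution processor unit of claim 1 (an element-wise multiply, followed by a per-channel sum) can be sketched as below. This is a minimal sketch, assuming plain nested lists stand in for the stored data and weight elements; the function name and data layout are hypothetical and are not taken from Huang, Kopinsky, or Smelyanskiy.

```python
def channel_convolution_element(data, weights):
    """Compute one result element per channel.

    data[c] and weights[c] are equally sized 2D lists holding the data
    elements and the separate weight matrix for channel c (assumed layout).
    """
    results = []
    for channel_data, channel_weights in zip(data, weights):
        # Multiply each data element with its corresponding weight element.
        products = [
            d * w
            for data_row, weight_row in zip(channel_data, channel_weights)
            for d, w in zip(data_row, weight_row)
        ]
        # Sum only the products that belong to this channel.
        results.append(sum(products))
    return results
```

With two 2x2 channels, `data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]` and `weights = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]`, the sketch returns `[5, 26]`: products are never summed across channels.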
Referring to claim 2, the modified Huang discloses the system of claim 1, wherein a total count of the stored data elements of the first group of registers is the same as a total count of the stored data elements of the second group of registers [Huang, figs. 3B, 5A, the examiner notes that Huang is not limited to different dimensions of the input array and filter and that the dimension can be the same such as shown in fig. 5A; when the dimensions are the same, the number of elements in the input array and filter are the same].
Referring to claim 6, the modified Huang discloses the system of claim 1, wherein the convolution data matrix is a three-dimensional machine learning data matrix [Huang, fig. 3A].
Referring to claim 7, the modified Huang discloses the system of claim 1, further comprising a data input unit configured to: process the data elements stored in the first group of registers by channel into a plurality of data input vectors, wherein each of the plurality of data input vectors includes data elements corresponding to a two-dimensional sub-matrix of the convolution data matrix [Kopinsky, figs. 4A-4B, input vector registers in fig. 4B which correspond to bolded submatrix in INPUT(I) in fig. 4A].
Referring to claim 8, the modified Huang discloses the system of claim 1, further comprising a weight input unit configured to: process the data elements stored in the second group of registers into a plurality of weight input vectors, wherein each of the plurality of weight input vectors includes data elements corresponding to one of the plurality of convolution weight matrices [Smelyanskiy, paragraph 161, This example code may begin by broadcasting all kernel weights into spare vector registers; the kernel weights corresponding to kernel filters/matrices].
Referring to claim 9, the modified Huang discloses the system of claim 1, wherein each of the plurality of convolution weight matrices is a 3x3, 5x5, 7x7, 9x9, or 11x11 matrix [Huang, fig. 3B, input array matrices dimensions; Kopinsky, paragraph 70, “It may be appreciated that the matrices depicted in FIG. 3 may be of any dimension that may be appropriate to a specific application”]. 
Referring to claim 10, the modified Huang discloses the system of claim 1, wherein the data elements stored in the first group of registers are 4-bit, 8-bit, 2-byte, or 4-byte elements [Kopinsky, paragraphs 132-133].
Claims 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over Huang, in view of Kopinsky, in view of Smelyanskiy, as applied to claim 1 above, and further in view of Culurciello et al (hereinafter Culurciello), U.S. Publication No. 2018/0341495 A1.
Referring to claim 3, the modified Huang does not explicitly disclose the system of claim 1, wherein the hardware channel convolution processor unit comprises a plurality of calculation units and each calculation unit of the plurality of calculation units is configured to receive a plurality of data elements of the first group of registers corresponding to a same channel of the convolution data matrix and a plurality of corresponding data elements of the second group of registers corresponding to the separate convolution weight matrix for the same channel of the convolution data matrix.
However, Culurciello discloses wherein the hardware channel convolution processor unit comprises a plurality of calculation units and each calculation unit of the plurality of calculation units is configured to receive a plurality of data elements of the first group of registers corresponding to a same channel of the convolution data matrix and a plurality of corresponding data elements of the second group of registers corresponding to the separate convolution weight matrix for the same channel of the convolution data matrix [paragraphs 54, 56, One operating mode for the accelerators 100 and 200 is the cooperative mode. In the cooperative mode, the 16 words of a vector in a trace are split up among a group of 16 multiply-accumulate (MAC) units in each vMAC. This splits up the computation of a single output among several MACs because each MAC receives a different 16-bit word portion of the 256-bit vector of map trace data received from the maps cache 132, with the 16 MACs in each vMAC of the accelerators 100 and 200 each receiving a different 16-bit word; the examiner notes that maps cache contains the convolution data matrix; paragraph 57, Each MAC requires maps and kernels to process a layer. Maps can be shared across MACs to some extent. In that case, however, they require different kernels; this suggests that when maps are not shared across MACs, as in the case of assigning different portions to each of the plurality of processing elements, the kernels are not different, e.g., the kernel is the same (same subset of the set of convolution weight matrices)].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Culurciello in the invention of the modified Huang, to implement wherein the hardware channel convolution processor unit comprises a plurality of calculation units and each calculation unit of the plurality of calculation units is configured to receive a plurality of data elements of the first group of registers corresponding to a same channel of the convolution data matrix and a plurality of corresponding data elements of the second group of registers corresponding to the separate convolution weight matrix for the same channel of the convolution data matrix, in order to provide improvements to the computational efficiency of existing CNN models and other CNN models [Culurciello, paragraph 25].
Referring to claim 4, the modified Huang discloses the system of claim 3, wherein each calculation unit of the plurality of calculation units includes a different vector multiply unit and a different vector adder unit [Culurciello, paragraphs 54, 56].
Referring to claim 5, the modified Huang discloses the system of claim 4, wherein each of the different vector adder units includes a different adder tree [Culurciello, paragraph 10, Each vMAC includes a plurality of multiply-accumulate units (MACs), each MAC including a multiplier unit configured to multiply a first word that forms a portion of the at least one vector in the map trace in the first memory cache by a second word that forms a portion of the at least one vector in the kernel trace in the second memory cache to produce an intermediate product, and an adder unit that adds the intermediate product to a third word to generate a sum of the intermediate product and the third word as an output].
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Huang, in view of Kopinsky, in view of Smelyanskiy, as applied to claim 1 above, and further in view of Tirumalai et al (hereinafter Tirumalai), U.S. Publication No. 2002/0007484 A1.
Referring to claim 11, the modified Huang does not explicitly disclose the system of claim 1, wherein a total count of the stored data elements of each of the first group of registers is a multiple of a cache line size.
However, Tirumalai discloses wherein a total count of the stored data elements of each of the first group of registers is a multiple of a cache line size [paragraph 55, Thus, if equal numbers of the first and second arrays' elements were to be loaded in the present example, one prefetch of the second variable's elements would be required for every two of the first variable's elements. If all such variables in a loop are so analyzed, a least common multiple may be calculated. In the present example, if the cache line size is 8 bytes long, the least common multiple would be two].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Tirumalai in the invention of the modified Huang, to implement wherein a total count of the stored data elements of each of the first group of registers is a multiple of a cache line size, in order to provide prefetches that are performed in an efficient manner [Tirumalai, paragraphs 54, 55].
Claims 12 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Culurciello et al (hereinafter Culurciello), U.S. Publication No. 2018/0341495 A1.
	Referring to claim 12, Culurciello discloses a method, comprising: 
receiving a convolution operation instruction specifying a convolution data matrix and a set of convolution weight matrices [paragraphs 64-66, instruction set that the accelerators 100 and 200 use to control the operation of various elements the control core 104 and the compute core 120 in one or more compute clusters 102 is provided below; The accelerators 100 and 200 execute instructions that are 32 bits wide and are designed to process the traces of map data (convolution data matrix) and kernel data (convolution weight matrices)]; 
assigning a different portion of the convolution data matrix to each of a plurality of processing elements [paragraphs 54, 56, One operating mode for the accelerators 100 and 200 is the cooperative mode. In the cooperative mode, the 16 words of a vector in a trace are split up among a group of 16 multiply-accumulate (MAC) units in each vMAC. This splits up the computation of a single output among several MACs because each MAC receives a different 16-bit word portion of the 256-bit vector of map trace data received from the maps cache 132, with the 16 MACs in each vMAC of the accelerators 100 and 200 each receiving a different 16-bit word; the examiner notes that maps cache contains the convolution data matrix (see paragraph 36, The maps cache 132 stores map trace data (also referred to as “maps”) corresponding to a contiguous set of map data in a contiguous regions of memory in the original input to the CNN)]; 
transmitting a plurality of data elements corresponding to the different assigned portion of the convolution data matrix to each of the plurality of processing elements [paragraphs 54, 56, One operating mode for the accelerators 100 and 200 is the cooperative mode. In the cooperative mode, the 16 words of a vector in a trace are split up among a group of 16 multiply-accumulate (MAC) units in each vMAC. This splits up the computation of a single output among several MACs because each MAC receives a different 16-bit word portion of the 256-bit vector of map trace data received from the maps cache 132, with the 16 MACs in each vMAC of the accelerators 100 and 200 each receiving a different 16-bit word]; 
receiving from the plurality of processing elements channel convolution result data elements of a channel convolution result matrix determined using hardware channel convolution processor units of the plurality of processing elements [paragraphs 54, 56, 58, The partial results are accumulated together using a reduce operation that is performed by the separate gather adder 258. The partial results from the vMAC are latched into the shift register 252 which feeds one partial result per cycle to one input of the gather adder 258. The output from the gather adder 258 is truncated to 16 bits and written back to the maps cache 132; also see output result matrix 512 shown in fig. 5]; and
storing the channel convolution result matrix to a memory location [paragraph 58, The output from the gather adder 258 is truncated to 16 bits and written back to the maps cache 132].
Culurciello does not explicitly disclose broadcasting to each of the plurality of processing elements assigned a same channel of the convolution data matrix a same subset of the set of convolution weight matrices.
However, Culurciello discloses: paragraph 57, Each MAC requires maps and kernels to process a layer. Maps can be shared across MACs to some extent. In that case, however, they require different kernels; this suggests that when maps are not shared across MACs, as in the case of assigning different portions to each of the plurality of processing elements, the kernels are not different, e.g., the kernel is the same (same subset of the set of convolution weight matrices).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Culurciello, to implement broadcasting to each of the plurality of processing elements assigned a same channel of the convolution data matrix a same subset of the set of convolution weight matrices, in order to provide improvements to the computational efficiency of existing CNN models and other CNN models [Culurciello, paragraph 25].
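For illustration only, the cooperative mode cited from Culurciello for claim 12 (a 256-bit map vector split into sixteen 16-bit words, one per MAC, with the partial results reduced by a gather adder) can be sketched as below. This is a minimal sketch under stated assumptions: words are treated as unsigned and ordered least-significant first, each MAC is given one kernel word, and the 16-bit truncation of the output is omitted. None of these details, nor the function names, are confirmed by the cited paragraphs.

```python
WORD_BITS = 16
WORDS_PER_VECTOR = 16  # 16 words x 16 bits = 256 bits


def split_into_words(vector_256):
    """Split a 256-bit integer into sixteen 16-bit words (assumed LSB-first)."""
    mask = (1 << WORD_BITS) - 1
    return [(vector_256 >> (WORD_BITS * i)) & mask for i in range(WORDS_PER_VECTOR)]


def cooperative_output(map_vector, kernel_words):
    """Each MAC multiplies its own map word; the partial results are then summed."""
    map_words = split_into_words(map_vector)
    # One partial result per MAC: a different map word goes to each unit.
    partials = [m * k for m, k in zip(map_words, kernel_words)]
    # A plain sum stands in for the reduce performed by the gather adder.
    return sum(partials)
```

For a map vector whose words are 1 through 16 and a kernel word of 1 at every MAC, the sketch yields 136, the sum of the sixteen partial results.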
Referring to claim 14, the modified Culurciello discloses the method of claim 12, wherein the convolution data matrix is a three-dimensional machine learning data matrix [Culurciello, fig. 5, Input matrix] and each of the set of convolution weight matrices is a two-dimensional matrix [Culurciello, fig. 5, see Kernel Trace 508 with the slice at top which is a two dimensional matrix].
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Culurciello, in view of Huang et al (hereinafter Huang), U.S. Publication No. 2020/0090030 A1.
Referring to claim 13, the modified Culurciello does not explicitly disclose the method of claim 12, wherein the convolution data matrix and the channel convolution result matrix are stored using a channel-first layout format.
However, Huang discloses wherein the convolution data matrix [fig. 3A] and the channel convolution result matrix [fig. 3B, 3D depthwise output array] are stored using a channel-first layout format [both matrices are stored such that the data element at width, height, and channel location (1,1,1) is stored adjacent to the data element at width, height, and channel location (1,1,2) of the same matrix, see figs. 3A and 3B].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Huang in the invention of the modified Culurciello, to implement wherein the convolution data matrix and the channel convolution result matrix are stored using a channel-first layout format, in order to achieve high energy efficiency and low area complexity [Huang, paragraph 2].
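For illustration only, the adjacency relied on for claim 13 (the data element at width, height, and channel location (1,1,1) stored next to the element at (1,1,2)) corresponds to a linear layout in which the channel index varies fastest. The following is a minimal sketch, assuming 1-based locations and assuming that width varies next-fastest after channel; the latter ordering and the function name are hypothetical and are not stated in the rejection.

```python
def channel_first_offset(w, h, c, width, channels):
    """Map a 1-based (width, height, channel) location to a 0-based offset.

    The channel index varies fastest, so consecutive channels at the same
    (w, h) location occupy adjacent offsets.
    """
    return ((h - 1) * width + (w - 1)) * channels + (c - 1)
```

With `width=4` and `channels=3`, locations (1,1,1) and (1,1,2) map to offsets 0 and 1, while (2,1,1) maps to offset 3, one full channel group later.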
Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Culurciello, in view of Huang et al (hereinafter Huang), U.S. Publication No. 2020/0090030 A1, in view of Kopinsky, U.S. Publication No. 2020/0218978 A1, and further in view of Smelyanskiy, U.S. Publication No. 2016/0179540 A1.
Referring to claim 15, the modified Culurciello does not explicitly disclose the method of claim 12, wherein each of the plurality of processing elements: 
stores in a first memory data elements of a plurality of channels of a portion of the convolution data matrix, wherein each register of the first group of registers stores at least one data element from each of the plurality of channels; and 
stores in a memory data elements of a subset of the set of convolution weight matrices including a separate convolution weight matrix for each of the plurality of channels.
However, Huang discloses stores in a first memory [paragraph 36, fig. 3B, perform depthwise convolution over data in HRAM 120] data elements [fig. 3B, see data elements of input arrays] of a plurality of channels [fig. 3B, each row of fig. 3B corresponding to a channel; also see paragraph 37, “The term ‘input array’ refers to a channel of a cuboid in an input image”] of a portion of the convolution data matrix [figs. 3A-3B, input arrays of an input image (shown in fig. 3A to be a convolution data matrix)], wherein each memory stores at least one data element from each of the plurality of channels [fig. 3B, data in HRAM 120; each input array storing at least one data element from each channel]; and 
stores in a second memory [paragraph 27, HRAM 120; the DSPs 140 read its corresponding coefficients from the flash memory 150 via the flash control interface 130 and temporarily store them in HRAM 120. During the convolution operation, the DSPs 140 instruct the MAC circuits 111 via the control bus 142 according to the programs in the data/program internal memory 141 to perform related multiplications and accumulations over the image data and coefficients in HRAM 120] data elements [fig. 3B, data elements of filters] of a subset of the set of convolution weight matrices [fig. 3B, filter matrices] including a separate convolution weight matrix [fig. 3B, filter Kd(1), Kd(2), Kd(M)] for each of the plurality of channels [fig. 3B, each row corresponding to a channel; see also paragraph 37].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Huang in the invention of modified Culurciello, to implement stores in a first memory data elements of a plurality of channels of a portion of the convolution data matrix, wherein each register of the first group of registers stores at least one data element from each of the plurality of channels; and stores in a memory data elements of a subset of the set of convolution weight matrices including a separate convolution weight matrix for each of the plurality of channels, in order to achieve high energy efficiency and low area complexity [Huang, paragraph 2].
The modified Culurciello does not explicitly disclose a first group of registers.
However, Kopinsky discloses a first group of registers [figs. 4A-4B, paragraph 97, embodiments of the invention may load (or broadcast, as commonly referred to in the art) a specific, non-zero input data element (e.g., 1I1) to occupy all values of one or more vector registers (e.g., element 4B of FIG. 2); see convolution data matrix shown as INPUT (I) in fig. 4A being loaded into input vector registers].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Kopinsky in the invention of the modified Culurciello, to implement a first group of registers, in order to provide efficient compressed convolution so as to allow commercially available CPUs to provide NN execution performance that may be competitive with accelerators, GPUs and TPUs [Kopinsky, paragraphs 2, 8].
The modified Culurciello does not explicitly disclose a second group of registers;
wherein each register of the second group of registers stores at least one data element from each of the subset of the set of convolution weight matrices.
However, Smelyanskiy discloses a second group of registers [paragraph 161, This example code may begin by broadcasting all kernel weights into spare vector registers];
wherein each register of the second group of registers stores at least one data element from each of the subset of the set of convolution weight matrices [paragraph 161, This example code may begin by broadcasting all kernel weights into spare vector registers].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Smelyanskiy in the invention of the modified Culurciello, to implement a second group of registers; wherein each register of the second group of registers stores at least one data element from each of the subset of the set of convolution weight matrices, in order to provide efficient execution of calculations [Smelyanskiy, paragraph 34] while having as many instructions execute as fast as possible [Smelyanskiy, paragraph 39].
Referring to claim 16, the modified Culurciello discloses the method of claim 15, wherein each of the hardware channel convolution processor units of the plurality of processing elements: 
for each data element in the first group of registers, multiplies the data element in the first group of registers with a corresponding data element in the second group of registers to determine a corresponding multiplication result in multiplication results [Huang, paragraphs 27, 36-37, figs. 3A-3B, During the convolution operation…perform related multiplications and accumulations over the image data and coefficients in HRAM 120; performing convolution operation to produce output results (see fig. 3B)]; and 
for each specific channel of the plurality of channels, sums together ones of the multiplication results corresponding to the specific channel to determine one corresponding channel convolution result data element in the corresponding channel convolution result matrix [Huang, paragraphs 27, 36-37, figs. 3A-3B, During the convolution operation…perform related multiplications and accumulations over the image data and coefficients in HRAM 120; performing convolution operation to produce output arrays (see fig. 3B); see stacked output array].
Referring to claim 17, the modified Culurciello discloses the method of claim 16, wherein each of the hardware channel convolution processor units of the plurality of processing elements comprises a plurality of calculation units and each calculation unit receives a plurality of data elements of the first group of registers corresponding to a same channel of the convolution data matrix and a plurality of corresponding data elements of the second group of registers corresponding to the separate convolution weight matrix for the same channel of the convolution data matrix [Culurciello, paragraphs 54, 56, One operating mode for the accelerators 100 and 200 is the cooperative mode. In the cooperative mode, the 16 words of a vector in a trace are split up among a group of 16 multiply-accumulate (MAC) units in each vMAC. This splits up the computation of a single output among several MACs because each MAC receives a different 16-bit word portion of the 256-bit vector of map trace data received from the maps cache 132, with the 16 MACs in each vMAC of the accelerators 100 and 200 each receiving a different 16-bit word; the examiner notes that the maps cache contains the convolution data matrix; paragraph 57, Each MAC requires maps and kernels to process a layer. Maps can be shared across MACs to some extent. In that case, however, they require different kernels].
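For illustration only, the cooperative mode quoted above (a 256-bit vector split into sixteen 16-bit words, with each MAC receiving a different word) may be sketched as below. This is a minimal sketch under stated assumptions, not Culurciello's implementation: the helper names are hypothetical, words are treated as unsigned and ordered least significant first, and the final summation of the per-MAC products into a single output is an assumed reduction step.

```python
WORD_BITS = 16          # width of each word, per the quoted passage
WORDS_PER_VECTOR = 16   # 16 words x 16 bits = 256-bit vector


def pack_words(words):
    """Pack sixteen unsigned 16-bit words into one 256-bit integer (assumed order: word 0 lowest)."""
    vector = 0
    for index, word in enumerate(words):
        vector |= (word & 0xFFFF) << (WORD_BITS * index)
    return vector


def split_vector(vector_256):
    """Split a 256-bit integer into sixteen 16-bit words, one per MAC."""
    mask = (1 << WORD_BITS) - 1
    return [(vector_256 >> (WORD_BITS * i)) & mask for i in range(WORDS_PER_VECTOR)]


def cooperative_output(map_vector, kernel_vector):
    """Each of the 16 MACs multiplies its own map word by its own kernel word.

    The partial products are then summed into a single output (assumed reduction).
    """
    map_words = split_vector(map_vector)
    kernel_words = split_vector(kernel_vector)
    partial_products = [m * k for m, k in zip(map_words, kernel_words)]
    return sum(partial_products)


map_vector = pack_words(list(range(1, 17)))   # words 1, 2, ..., 16
kernel_vector = pack_words([2] * 16)          # every kernel word is 2
single_output = cooperative_output(map_vector, kernel_vector)
```

The sketch shows only how the computation of one output is divided among the 16 MACs; it does not model the maps cache, traces, or timing.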

Allowable Subject Matter
Claims 18-20 are allowed.
The following is a statement of reasons for the indication of allowable subject matter:  The prior art of record taken alone or in combination fails to teach and/or fairly suggest saving in the first group of registers data elements that overlap between the first portion of the convolution data matrix and a second portion of the convolution data matrix; and storing in the first group of registers a three-dimensional slice of data elements of a plurality of channels of the second portion of the convolution data matrix, wherein the data elements of the three-dimensional slice are different from the data elements of the first portion of the convolution data matrix, in combination with other recited limitations in claim 18.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Ovsiannikov et al., U.S. Publication No. 2019/0392287 A1, discloses “a first weight register, a second weight register, an activations buffer, a first multiplier, and a second multiplier, the processor being configured to perform a first convolution of an array of activations with a first kernel of weights, the performing of the first convolution including: broadcasting a first subarray of the array of activations to: the first tile, and the second tile; forming a first tensor product, the first tensor product being a tensor product of a first subarray of the first kernel of weights with the first subarray of the array of activations; storing the first tensor product in the memory; broadcasting a second subarray of the array of activations to: the first tile, and the second tile; forming a second tensor product, the second tensor product being a tensor product of a second subarray of the first kernel of weights with the second subarray of the array of activations; and adding the first tensor product and the second tensor product” [paragraph 25].

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARLEY J ABAD whose telephone number is (571) 270-3425. The examiner can normally be reached Mon-Thurs 8 AM - 7 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Idriss Alrobaye, can be reached at (571) 270-1023. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Farley Abad/           Primary Examiner, Art Unit 2181