DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/01/2018, 08/12/2019 and 08/06/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Specification
The disclosure is objected to because of the following informalities:  
Paragraph [0048] states the following in relation to figure 2: “storing in persistent storage 128 or for passing the data to network interface 110….” However, in figure 2, persistent storage is labeled 228 and the network interface is labeled 210.
Appropriate correction is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Huang et al. US 2018/0173571 A1 (“Huang”).
Regarding claim 1 Huang teaches a neural processor circuit, comprising: a plurality of neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data(Huang, paras. 0107-0125, see also fig. 9, “[A]n integrated circuit on a chip is provided for performing matrix-matrix and/or matrix-vector multiplication operations. FIG. 9 illustrates an example of the chip for convolution computation, in accordance with embodiments of the invention. A computing unit of the chip may perform a plurality of parallel operations in response to the instructions associated with the CNN model. The computing unit may comprise a plurality of calculation circuits for performing operations in CNN. The computing unit may comprise a plurality of multipliers and accumulators to perform convolutions of input values with a plurality of kernels involved in a convolution layer. The same computing unit may be used for different convolution layers.” Huang teaches: the computing unit may comprise a plurality of calculation circuits for performing operations in CNN. The computing unit may comprise a plurality of multipliers and accumulators to perform convolutions of input values with a plurality of kernels involved in a convolution layer (i.e. a plurality of neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data)); 
a data buffer between the plurality of neural engine circuits and a system memory external to the neural processor circuit, the data buffer configured to store at least a portion of the input data received from the system memory for sending to the neural engine circuits and to store output data received from the neural engine circuits, the portion of the input data comprising the work unit of the input data(Huang, para. 0060, see also fig.2, “In some cases, the CNN model data may be transferred from the main memory to an on-chip RAM 209 whereas the input data may be transferred to an input data buffer on the chip. Typically, both of the input data and the CNN model data are transferred and stored into contiguous regions of the on-chip RAM.”); 
and a kernel fetcher circuit between the plurality of neural engine circuits and the system memory, the kernel fetcher circuit configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits (Huang, paras. 0107-0125, see also fig. 9, “[A]n integrated circuit on a chip is provided for performing matrix-matrix and/or matrix-vector multiplication operations. FIG. 9 illustrates an example of the chip for convolution computation, in accordance with embodiments of the invention. A computing unit of the chip may perform a plurality of parallel operations in response to the instructions associated with the CNN model. The computing unit may comprise a plurality of calculation circuits for performing operations in CNN. The computing unit may comprise a plurality of multipliers and accumulators to perform convolutions of input values with a plurality of kernels involved in a convolution layer. The same computing unit may be used for different convolution layers. Datapaths may be controlled by one or more multiplexers to determine the input feature maps and kernels to be fetched and supplied to the computing unit….” & see Huang, paras. 0126-0128, “In some cases, the input data may be stored in a buffer while the parameters and bias are stored in the RAM. The one or more multiplexers 1001 may receive a set of control signals to determine which address space to fetch the parameters, bias and/or input data/input feature map.” Huang teaches: the input data may be stored in a buffer while the parameters and bias are stored in the RAM. The one or more multiplexers 1001 may receive a set of control signals to determine which address space to fetch the parameters (i.e. a kernel fetcher circuit between the plurality of neural engine circuits and the system memory). Datapaths may be controlled by one or more multiplexers to determine kernels to be fetched and supplied to the computing unit (i.e. the kernel fetcher circuit configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits));
wherein at least one of the neural engine circuits is configured to: receive a plurality of sub-channels of the portion of the input data from the data buffer (Huang, para. 0038, “In some cases, a convolution layer may be a depth wise separable convolution. In such scenario, a convolution layer may be factorized into a depth wise convolution and a 1 x 1 pointwise convolution to combine the outputs of the depth wise convolution. The convolution layer may be split into a layer for filtering (i.e., depth wise convolution layer) and a layer for combining (i.e., pointwise convolution layer). In some cases, in a depth wise convolution layer, a single filter may be applied to each input channel, and in a pointwise convolution layer, a 1 x 1 convolution may be performed to combine the output of the depth wise layer.” Huang teaches: in a depth wise convolution layer, a single filter may be applied to each input channel (i.e. receive a plurality of sub-channels of the portion of the input data from the data buffer)), 
receive a kernel of the one or more kernels from the kernel fetcher circuit, the kernel being decomposed into a corresponding sub-kernel for each sub-channel of the portion of the input data, perform a convolution operation on each sub-channel of the portion of the input data and the corresponding sub-kernel, and accumulate corresponding outputs of each sub-channel portion of the convolution operation to generate a single channel of the output data(Huang, paras. 0131-0133, see also figs. 6,7,10, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations. In some cases, the plurality of adders may form a two-level adder network. The same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes…[i]n the example illustrated in FIG. 11, the input feature map may have eight channels, as further illustrated in FIG. 6. In each cycle, a portion of the input features as stored in four rows and eight slices 1107 are used. The parameters for one layer include four kernels each having 2x2 parameters across eight channels, as further illustrated in FIG. 7. In each cycle, a portion of the parameters as stored in four rows and eight slices 1109 are used. In some cases, in each cycle, 1 point of a kernel across all channels of all filters are used, multiply with four points in the input feature map[.] The input features in 1107 and the parameters in 1109 may be fetched and supplied to the 128 multipliers with each parameter feeding into four multipliers. Each of the multipliers may include a first input to receive a value of the input data and a second input to receive a kernel parameter/weight. The multipliers may perform multiplication operation of integer or fixed-point inputs. For example, the multiplier may be 8-bit fixed-point multipliers. 
A first level adder or accumulator such as adder 0 may be used for summing products from outputs of multipliers 1-4. The adder/accumulator may be 4-input adder/accumulator.” Huang teaches: the same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes; the parameters for one layer include four kernels each having 2x2 parameters across eight channels (i.e. the kernel being decomposed into a corresponding sub-kernel for each sub-channel of the portion of the input data). The input features in 1107 and the parameters in 1109 may be fetched and supplied to the 128 multipliers with each parameter feeding into four multipliers. Each of the multipliers may include a first input to receive a value of the input data and a second input to receive a kernel parameter/weight. The multipliers may perform multiplication operation of integer or fixed-point inputs. A first level adder or accumulator such as adder 0 may be used for summing products from outputs of multipliers 1-4 (i.e. perform a convolution operation on each sub-channel of the portion of the input data and the corresponding sub-kernel, and accumulate corresponding outputs of each sub-channel portion of the convolution operation to generate a single channel of the output data)).
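For illustration only, the final limitation of claim 1 (convolving each sub-channel with its corresponding sub-kernel and accumulating the partial results into a single output channel) can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the stride-1 no-padding window, and the sample values are hypothetical, and the code reflects neither the applicant's circuit nor Huang's multiplier and adder network.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(x: Matrix, k: Matrix) -> Matrix:
    """Stride-1, no-padding sliding-window product (cross-correlation, as used in CNNs)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + di][j + dj] * k[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def conv_accumulate(sub_channels: List[Matrix], sub_kernels: List[Matrix]) -> Matrix:
    """Convolve each sub-channel with its own sub-kernel, then sum into one output channel."""
    if not sub_channels or len(sub_channels) != len(sub_kernels):
        raise ValueError("need one sub-kernel per sub-channel")
    out = conv2d_valid(sub_channels[0], sub_kernels[0])
    for x, k in zip(sub_channels[1:], sub_kernels[1:]):
        partial = conv2d_valid(x, k)
        # Element-wise accumulation of the partial result into the running output.
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, partial)]
    return out


# Hypothetical sample data: two 3x3 sub-channels and two 2x2 sub-kernels.
sub_channels = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
sub_kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
result = conv_accumulate(sub_channels, sub_kernels)
print(result)  # [[8, 10], [14, 16]]
```

The two partial results here are [[6, 8], [12, 14]] and [[2, 2], [2, 2]], and their element-wise sum is the single output channel.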
Regarding claim 2, Huang teaches the neural processor circuit of claim 1, wherein: the data buffer is further configured to de-interleave a channel of the portion of the input data into the plurality of sub-channels of the portion of the input data, and the at least one neural engine circuit is further configured to receive the plurality of sub-channels of the portion of the input data over a plurality of processing cycles (Huang, paras. 0131-0132, see also figs. 6,7, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations. In some cases, the plurality of adders may form a two-level adder network. The same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes…[i]n the example illustrated in FIG. 11, the input feature map may have eight channels, as further illustrated in FIG. 6. In each cycle, a portion of the input features as stored in four rows and eight slices 1107 are used. The parameters for one layer include four kernels each having 2x2 parameters across eight channels, as further illustrated in FIG. 7.”).
Regarding claim 3, Huang teaches the neural processor circuit of claim 1, wherein the kernel comprises padded zeroes and a size of the kernel with the padded zeroes is a multiple of two in each dimension of the kernel (Huang, para. 0046, “The spatial size of the output volume can be computed as a function of the input volume size W, the kernel field size of the convolution layer neurons K, the stride with which they are applied S and the amount of zero padding P… [i]n general, setting zero padding to be P=(K-1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially.” Huang teaches: the kernel field size of the convolution layer neurons K; setting zero padding to be P=(K-1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially (i.e. wherein the kernel comprises padded zeroes and a size of the kernel with the padded zeroes is a multiple of two in each dimension of the kernel)).
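For illustration only, the relationship among W, K, S and P described in the quoted passage can be checked numerically. The sketch below assumes the commonly used output-size formula (W - K + 2P)/S + 1, which is not reproduced in the excerpt quoted above; the function names are hypothetical.

```python
def conv_output_size(w: int, k: int, s: int, p: int) -> int:
    """Spatial output size under the assumed standard formula (W - K + 2P) / S + 1."""
    span = w - k + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("kernel does not tile the padded input evenly")
    return span // s + 1


def same_padding(k: int) -> int:
    """Zero padding P = (K - 1) / 2, as in the quoted passage; requires an odd K."""
    if k % 2 == 0:
        raise ValueError("P = (K - 1) / 2 is an integer only for odd K")
    return (k - 1) // 2


# With stride 1 and P = (K - 1) / 2, the output size equals the input size.
print(conv_output_size(32, 3, 1, same_padding(3)))   # 32
print(conv_output_size(128, 5, 1, same_padding(5)))  # 128
```

Note that the quoted passage concerns zero padding of the input volume, which is the quantity P in this sketch.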
Regarding claim 4, Huang teaches the neural processor circuit of claim 1, wherein the at least one neural engine is further configured to: receive one or more channels of the portion of the input data from the data buffer; receive another kernel of the one or more kernels from the kernel fetcher circuit, the other kernel being decomposed into a plurality of sub-kernels (Huang, paras. 0107-0125, see also fig. 9, “[A]n integrated circuit on a chip is provided for performing matrix-matrix and/or matrix-vector multiplication operations. FIG. 9 illustrates an example of the chip for convolution computation, in accordance with embodiments of the invention. A computing unit of the chip may perform a plurality of parallel operations in response to the instructions associated with the CNN model. The computing unit may comprise a plurality of calculation circuits for performing operations in CNN. The computing unit may comprise a plurality of multipliers and accumulators to perform convolutions of input values with a plurality of kernels involved in a convolution layer….” & see Huang, paras. 0140-0142, see also fig. 12B, “For example, the input feature map may have 16 channels and arranged into 32 slices. The input feature map may be convolved with one kernel containing eight parameters for each channel. The kernel shape can be 1x8, 8x1, 2x4 or 4x2. The parameters may be arranged into four slices in the same manner as shown in the previous example. In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0,16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier.” Huang teaches: for example, the input feature map may have 16 channels and arranged into 32 slices; in a clock cycle, four rows and 32 slices of the input feature map may be fetched (i.e. 
receive one or more channels of the portion of the input data) the integrated circuit may further comprise other components, the components may include buffers for efficient reuse of input or intermediate data. In some embodiments, the Resize Buffer is approximately 24 KB. In general, the size of the buffer can be in any range, such as from 100 kB to 500 kB (i.e. from the data buffer) with one kernel containing eight parameters for each channel. The kernel shape can be lx8, 8xl, 2x4 or 4x2 in a clock cycle, four slices of parameters may be fetched (i.e. receive another kernel of the one or more kernels from the kernel fetcher circuit, the other kernel being decomposed into a plurality of sub-kernels)); perform another convolution operation on the one or more channels of the portion of input data and the sub-kernels to generate multiple sub-channel outputs for each channel of the portion of the input data; and store the sub-channel outputs for each channel of the portion of the input data in the data buffer(Huang, paras. 0140-0142, “In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0, 16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map… Accu0 after a clock cycle is the convolution of the first channel of the input feature map with the first channel of a kernel (i.e., a first kernel) across a row of the kernel. In the next clock cycle, the results of the Accu0 would be HOW1C0*K0RlS1C0 for i=0-15. 
The number of multiplication is determined by the kernel size. In the depicted example, because the kernel contains eight parameters, the Accu0 may sum up across the entire kernel for 8 cycles in order to output a convolution result. The convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory….” Huang teaches: the configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map (i.e. perform another convolution operation on the one or more channels of the portion of input data and the sub-kernels to generate multiple sub-channel outputs for each channel of the portion of the input data) the convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory (i.e. store the sub-channel outputs for each channel of the portion of the input data in the data buffer)).   
Regarding claim 5, Huang teaches the neural processor circuit of claim 4, wherein each sub-channel output of the sub-channel outputs is generated using a different accumulator of a plurality of accumulators in the at least one neural engine (Huang, paras. 0140-0142, see also fig. 12B, “For example, the input feature map may have 16 channels and arranged into 32 slices. The input feature map may be convolved with one kernel containing eight parameters for each channel. The kernel shape can be 1x8, 8x1, 2x4 or 4x2. The parameters may be arranged into four slices in the same manner as shown in the previous example. In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0,16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map.”).
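For illustration only, the per-accumulator arrangement described in the quoted passage (128 multipliers, each feeding its own accumulator, summed over the eight cycles of an eight-parameter kernel) can be modeled as below. This is a rough sketch under stated assumptions: the mapping of parameter p to multipliers p, p+16, ..., p+112 is inferred from the "multipliers 0, 16, 32, ..., 112" example, and all names and sample values are hypothetical rather than taken from Huang.

```python
NUM_MULTIPLIERS = 128
PARAMS_PER_CYCLE = 16  # assumed: each parameter feeds 128 / 16 = 8 multipliers
NUM_CYCLES = 8         # one cycle per parameter of an eight-parameter kernel


def multipliers_fed_by(param_index: int) -> list:
    """Multipliers that share one parameter under the assumed modulo-16 mapping."""
    return [m for m in range(NUM_MULTIPLIERS) if m % PARAMS_PER_CYCLE == param_index]


def run_accumulators(inputs: list, params: list) -> list:
    """inputs[c][m] is the value at multiplier m in cycle c; params[c][p] is parameter p."""
    acc = [0] * NUM_MULTIPLIERS
    for cycle_inputs, cycle_params in zip(inputs, params):
        for m in range(NUM_MULTIPLIERS):
            # Accumulator m sums only the products of multiplier m.
            acc[m] += cycle_inputs[m] * cycle_params[m % PARAMS_PER_CYCLE]
    return acc


# Hypothetical data: every input value is 1 and every parameter is 2.
inputs = [[1] * NUM_MULTIPLIERS for _ in range(NUM_CYCLES)]
params = [[2] * PARAMS_PER_CYCLE for _ in range(NUM_CYCLES)]
acc = run_accumulators(inputs, params)
print(multipliers_fed_by(0))  # [0, 16, 32, 48, 64, 80, 96, 112]
print(acc[2])                 # 16
```

Under this model each of the 128 accumulators holds a separate result after eight cycles, which is the sense in which a different accumulator produces each output.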
Regarding claim 6, Huang teaches the neural processor circuit of claim 4, wherein the data buffer is further configured to interleave the sub-channel outputs for each channel of the portion of the input data to produce a channel output having a size in accordance with a size of the other kernel (Huang, paras. 0140-0142, see also fig. 12B, “In the exemplary configuration, 128 accumulators are used for summing the multiplication results produced by 16 …”).
Regarding claim 7, Huang teaches the neural processor circuit of claim 4, wherein two or more of the sub-kernels comprise padded zeroes across at least one dimension of the two or more sub-kernels (Huang, para. 0106, see also fig. 8, “FIG. 8 illustrates examples of padding the slices to accommodate kernels of different sizes and number of channels. In the example described above, a memory access query may take up four rows and eight slices of data. In the case when the input data is image data 801 with dimension of 128x128 pixel and three channels, the input data may be padded with a row of zeros such that the input data with original dimension of 128x128x3 is transformed to 128x64x8 which is aligned with a 4-row query configuration. In the example when the parameters are from K kernels each is 5x5 in size across eight channels 803 (i.e., 5x5x3), the parameters may be arranged and padded with zeros such that the parameters data are transformed to 5x3x8 to be aligned with the 4-row query configuration.” Huang teaches: in the example described above, a memory access query may take up four rows. In the example when the parameters are from K kernels each is 5x5 in size across eight channels 803 (i.e., 5x5x3), the parameters may be arranged and padded with zeros such that the parameters data are transformed to 5x3x8 to be aligned with the 4-row query configuration (i.e. wherein two or more of the sub-kernels comprise padded zeroes across at least one dimension of the two or more sub-kernels)).
Regarding claim 8, Huang teaches the neural processor circuit of claim 1, wherein the at least one neural engine is further configured to: receive another plurality of sub-channels of the portion of the input data from the data buffer (Huang, paras. 0131-0132, see also figs. 6,7, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations….”); receive another kernel of the one or more kernels from the kernel fetcher circuit, the other kernel being decomposed into a plurality of sub-kernels (Huang, paras. 0131-0133, see also figs. 6,7,10, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations. In some cases, the plurality of adders may form a two-level adder network. The same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes…[i]n the example illustrated in FIG. 11, the input feature map may have eight channels, as further illustrated in FIG. 6. In each cycle, a portion of the input features as stored in four rows and eight slices 1107 are used. The parameters for one layer include four kernels each having 2x2 parameters across eight channels, as further illustrated in FIG. 7.”); perform another convolution operation on each sub-channel of the another plurality of sub-channels of the portion of the input data and the sub-kernels to generate multiple sub-channel outputs for each sub-channel of the portion of the input data; and store the sub-channel outputs for each sub-channel of the portion of the input data in the data buffer (Huang, paras. 0140-0142, “In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0, 16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map… Accu0 after a clock cycle is the convolution of the first channel of the input feature map with the first channel of a kernel (i.e., a first kernel) across a row of the kernel. In the next clock cycle, the results of the Accu0 would be HOW1C0*K0RlS1C0 for i=0-15. The number of multiplication is determined by the kernel size. In the depicted example, because the kernel contains eight parameters, the Accu0 may sum up across the entire kernel for 8 cycles in order to output a convolution result. The convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory….” Huang teaches: the configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map (i.e. perform another convolution operation on each sub-channel of the another plurality of sub-channels of the portion of the input data and the sub-kernels to generate multiple sub-channel outputs for each sub-channel of the portion of the input data). The convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory (i.e. store the sub-channel outputs for each sub-channel of the portion of the input data in the data buffer)).
Regarding claim 9, Huang teaches the neural processor circuit of claim 8, wherein each sub-channel output of the sub-channel outputs is generated using a different accumulator of a plurality of accumulators in the at least one neural engine circuit (Huang, paras. 0140-0142, see also fig. 12B, “For example, the input feature map may have 16 channels and arranged into 32 slices. The input feature map may be convolved with one kernel containing eight parameters for each channel. The kernel shape can be 1x8, 8x1, 2x4 or 4x2. The parameters may be arranged into four slices in the same manner as shown in the previous example. In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0,16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map.”).
Regarding claim 10, Huang teaches the neural processor circuit of claim 8, wherein the plurality of sub-kernels comprise a subset of repeated sub-kernels, and two or more of the plurality of sub-kernels comprise padded zeroes across at least one dimension of the two or more sub-kernels (Huang, para. 0106, see also fig. 8, “FIG. 8 illustrates examples of padding the slices to accommodate kernels of different sizes and number of channels. In the example described above, a memory access query may take up four rows and eight slices of data. In the case when the input data is image data 801 with dimension of 128x128 pixel and three channels, the input data may be padded with a row of zeros such that the input data with original dimension of 128x128x3 is transformed to 128x64x8 which is aligned with a 4-row query configuration. In the example when the parameters are from K kernels each is 5x5 in size across eight channels 803 (i.e., 5x5x3), the parameters may be arranged and padded with zeros such that the parameters data are transformed to 5x3x8 to be aligned with the 4-row query configuration.” Huang teaches: in the example described above, a memory access query may take up four rows. In the example when the parameters are from K kernels each is 5x5 in size across eight channels 803 (i.e., 5x5x3), the parameters may be arranged and padded with zeros such that the parameters data are transformed to 5x3x8 to be aligned with the 4-row query configuration (i.e. a subset of repeated sub-kernels, and two or more of the plurality of sub-kernels comprise padded zeroes across at least one dimension of the two or more sub-kernels)).
Regarding claim 11, Huang teaches the neural processor circuit of claim 8, wherein the data buffer is further configured to interleave the sub-channel outputs for each sub-channel of the portion of the input data to produce the output data(Huang, paras. 0131-0132, see also figs. 6,7, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations. In some cases, the plurality of adders may form a two-level adder network. The same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes…[i]n the example illustrated in FIG. 11, the input feature map may have eight channels, as further illustrated in FIG. 6. In each cycle, a portion of the input features as stored in four rows and eight slices 1107 are used. The parameters for one layer include four kernels each having 2x2 parameters across eight channels, as further illustrated in FIG. 7. In each cycle, a portion of the parameters as stored in four rows and eight slices 1109 are used.”).  
Regarding claim 12, Huang teaches the neural processor circuit of claim 1, wherein at least one of the neural engine circuits is further configured to: receive one or more patches of the portion of the input data from the data buffer over a processing cycle; receive a plurality of kernels from the kernel fetcher circuit over the processing cycle (Huang, para. Huang teaches: In each cycle, the computing unit may be able to handle a plurality of input values in parallel. In the depicted example, the computing unit may be capable of handling 128 input feature map data and 128 parameters in parallel (i.e. receive one or more patches of the portion of the input data from the data buffer over a processing cycle; receive a plurality of kernels from the kernel fetcher circuit over the processing cycle)); and perform convolution operations on each of the one or more patches of the portion of the input data and the plurality of kernels to produce multiple output channels of the output data (Huang, paras. 0140-0142, “In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0, 16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map… The convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory….” Huang teaches: the convolution operations will be applied across the entire input feature map. The output data point may be saved in a temporary memory (i.e. perform convolution operations on each of the one or more patches of the portion of the input data and the plurality of kernels to produce multiple output channels of the output data)).
Regarding claim 13, Huang teaches the neural processor circuit of claim 12, wherein the at least one neural engine circuit is further configured to: perform multiply-accumulate operations on one of the one or more patches of the portion of the input data and multiple kernels of the plurality of kernels producing the multiple output channels of the output data in the accumulators (Huang, paras. 0140-0142, see also fig. 12B, “For example, the input feature map may have 16 channels and arranged into 32 slices. The input feature map may be convolved with one kernel containing eight parameters for each channel. The kernel shape can be 1x8, 8x1, 2x4 or 4x2. The parameters may be arranged into four slices in the same manner as shown in the previous example. In a clock cycle, four rows and 32 slices of the input feature map and four slices of parameters may be fetched and supplied to the 128 multipliers with each parameter feeding into eight multipliers (e.g., multipliers 0,16, 32, 48, 64, 80, 96, 112) and each input value feeding into one multiplier. An accumulator such as Accu2 1206 may be used for summing products from outputs of multiplier 2. The configuration may comprise 128 accumulators each of which is configured to sum products from a multiplier. The sum result produced by each accumulator is a convolution result of a filter applied to a channel of the input feature map.”).
Referring to independent claim 14, it is rejected on the same basis as independent claim 1 since they are analogous claims.
Referring to dependent claims 15-19, they are rejected on the same basis as dependent claims 2, 4, 6, 8, 11, 12 and 13, since they are analogous claims.
Regarding claim 20, Huang teaches an electronic device, comprising: a neural processor circuit including a plurality of neural engine circuits, a data buffer and a kernel fetcher circuit, the neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data (Huang, paras. 0107-0125, see also fig. 9, “[A]n integrated circuit on a chip is provided for performing matrix-matrix and/or matrix-vector multiplication operations. FIG. 9 illustrates an example of the chip for convolution computation, in accordance with embodiments of the invention. A computing unit of the chip may perform a plurality of parallel operations in response to the instructions associated with the CNN model. The computing unit may comprise a plurality of calculation circuits for performing operations in CNN” & see Huang, para. 0060, see also fig. 2, “In some cases, the CNN model data may be transferred from the main memory to an on-chip RAM 209 whereas the input data may be transferred to an input data buffer on the chip” & see Huang, paras. 0107-0125, see also fig. 9, “Datapaths may be controlled by one or more multiplexers to determine the input feature maps and kernels to be fetched and supplied to the computing unit….” Huang teaches: an integrated circuit on a chip whose computing unit may comprise a plurality of calculation circuits for performing operations in CNN (i.e. a neural processor circuit including a plurality of neural engine circuits); an input data buffer on the chip (i.e. a data buffer); datapaths controlled by one or more multiplexers to determine the input feature maps and kernels to be fetched (i.e. a kernel fetcher circuit); and FIG. 9 illustrates an example of the chip for convolution computation (i.e. the neural engine circuits configured to perform convolution operations on at least a work unit of input data and kernel data)); and 
a system memory external to the neural processor circuit (Huang, para. 0060, see also fig. 2, “In some cases, the CNN model data may be transferred from the main memory to an on-chip RAM 209 whereas the input data may be transferred to an input data buffer on the chip.” Huang teaches: transferred from the main memory (i.e. a system memory external to the neural processor circuit)), 
wherein the data buffer is configured to store at least a portion of the input data received from the system memory for sending to the neural engine circuits, the portion of the input data comprising the work unit of the input data, and store output data received from the neural engine circuits (Huang, para. 0060, see also fig. 2, “In some cases, the CNN model data may be transferred from the main memory to an on-chip RAM 209 whereas the input data may be transferred to an input data buffer on the chip. Typically, both of the input data and the CNN model data are transferred and stored into contiguous regions of the on-chip RAM.”), 
wherein the kernel fetcher circuit is configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits (Huang, paras. 0107-0125, see also fig. 9, “[A]n integrated circuit on a chip is provided for performing matrix-matrix and/or matrix-vector multiplication operations. FIG. 9 illustrates an example of the chip for convolution computation, in accordance with embodiments of the invention. A computing unit of the chip may perform a plurality of parallel operations in response to the instructions associated with the CNN model. The computing unit may comprise a plurality of calculation circuits for performing operations in CNN….” Huang teaches: datapaths may be controlled by one or more multiplexers to determine kernels to be fetched and supplied to the computing unit (i.e. the kernel fetcher circuit configured to receive one or more kernels from the system memory, and send a corresponding kernel to the neural engine circuits)), and 
wherein at least one of the neural engine circuits is configured to: receive a plurality of sub-channels of the portion of the input data from the data buffer (Huang, para. 0038, “In some cases, a convolution layer may be a depth wise separable convolution. In such scenario, a convolution layer may be factorized into a depth wise convolution and a 1 x 1 pointwise convolution to combine the outputs of the depth wise convolution. The convolution layer may be split into a layer for filtering (i.e., depth wise convolution layer) and a layer for combining (i.e., pointwise convolution layer). In some cases, in a depth wise convolution layer, a single filter may be applied to each input channel, and in a pointwise convolution layer, a 1 x 1 convolution may be performed to combine the output of the depth wise layer.” Huang teaches: in a depth wise convolution layer, a single filter may be applied to each input channel (i.e. receive a plurality of sub-channels of the portion of the input data from the data buffer)), 
receive a kernel of the one or more kernels from the kernel fetcher circuit, the kernel being decomposed into a corresponding sub-kernel for each sub-channel of the portion of the input data, perform a convolution operation on each sub-channel of the portion of the input data and the corresponding sub-kernel, and accumulate corresponding outputs of each sub-channel portion of the convolution operation to generate a single channel of the output data (Huang, paras. 0131-0133, see also figs. 6, 7, 10, and 11, “As shown in FIG. 11, the computing unit may comprise 128 multipliers 1101 connected to a plurality of adders 1103 for convolution operations. In some cases, the plurality of adders may form a two-level adder network. 
The same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes…[i]n the example illustrated in FIG. 11, the input feature map may have eight channels, as further illustrated in FIG. 6. In each cycle, a portion of the input features as stored in four rows and eight slices 1107 are used. The parameters for one layer include four kernels each having 2x2 parameters across eight channels, as further illustrated in FIG. 7. In each cycle, a portion of the parameters as stored in four rows and eight slices 1109 are used. In some cases, in each cycle, 1 point of a kernel across all channels of all filters are used, multiply with four points in the input feature map[.] The input features in 1107 and the parameters in 1109 may be fetched and supplied to the 128 multipliers with each parameter feeding into four multipliers. Each of the multipliers may include a first input to receive a value of the input data and a second input to receive a kernel parameter/weight. The multipliers may perform multiplication operation of integer or fixed-point inputs. For example, the multiplier may be 8-bit fixed-point multipliers. A first level adder or accumulator such as adder 0 may be used for summing products from outputs of multipliers 1-4. The adder/accumulator may be 4-input adder/accumulator.” Huang teaches: the same computing unit configuration may be used for processing input feature map and parameters with variable sizes such as different number of channels and/or different kernel sizes; the parameters for one layer include four kernels each having 2x2 parameters across eight channels (i.e. the kernel being decomposed into a corresponding sub-kernel for each sub-channel of the portion of the input data); the input features in 1107 and the parameters in 1109 may be fetched and supplied to the 128 multipliers with each parameter feeding into four multipliers; each of the multipliers may include a first input to receive a value of the input data and a second input to receive a kernel parameter/weight; the multipliers may perform multiplication operation of integer or fixed-point inputs; and a first level adder or accumulator such as adder 0 may be used for summing products from outputs of multipliers 1-4 (i.e. perform a convolution operation on each sub-channel of the portion of the input data and the corresponding sub-kernel, and accumulate corresponding outputs of each sub-channel portion of the convolution operation to generate a single channel of the output data)).
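For illustration only, the final limitation of claim 20 (convolving each sub-channel with its corresponding sub-kernel and accumulating the results into a single output channel) can be sketched in Python as follows. The function names `conv2d_valid` and `accumulate_channels` are hypothetical and appear in neither the claims nor Huang. The sketch assumes stride 1 and no padding, and it does not model the 128-multiplier, two-level adder datapath of Huang's FIG. 11.

```python
def conv2d_valid(channel, kernel):
    """Convolve one sub-channel with one sub-kernel (stride 1, no padding).

    Assumption: "convolution" is computed as a sliding dot product, as is
    conventional for CNN layers.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [
        [
            sum(
                channel[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def accumulate_channels(sub_channels, sub_kernels):
    """Sum the per-sub-channel results element-wise into one output channel."""
    partials = [conv2d_valid(c, k) for c, k in zip(sub_channels, sub_kernels)]
    out_h, out_w = len(partials[0]), len(partials[0][0])
    return [
        [sum(p[i][j] for p in partials) for j in range(out_w)]
        for i in range(out_h)
    ]


# Hypothetical data: eight 3x3 sub-channels and eight 2x2 sub-kernels, all ones.
sub_channels = [[[1] * 3 for _ in range(3)] for _ in range(8)]
sub_kernels = [[[1, 1], [1, 1]] for _ in range(8)]
single_output_channel = accumulate_channels(sub_channels, sub_kernels)
```

With these all-ones inputs, each sub-channel contributes 4 at every output position, so the accumulated 2x2 output channel holds 32 at every position.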
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Guo, Kaiyuan, et al. "Angel-Eye: A complete design flow for mapping CNN onto customized hardware." 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2016 (details an FPGA hardware design in which an array of processing elements (PEs) is disclosed).
US 2018/0189641 A1 (details a hardware accelerator with a system on a chip).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke whose telephone number is (571)270-1806. The examiner can normally be reached 10AM-7PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Adam Clark Standke
Assistant Examiner
Art Unit 2129



/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129