Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on September 26, 2022, in which claims 1, 3, 8, and 19 are currently amended. Claims 1-24 are currently pending.

Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-24 under 35 U.S.C. 102, based on the amendment, have been considered and are persuasive. However, the arguments are moot in view of the new ground of rejection set forth below.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

	Claims 1-15, 19, and 21-24 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Goyal (US 20170316312 A1) and Boesch (US 20180189642 A1).

	 Regarding claim 1, Goyal teaches A system comprising: a neural network model memory adapted to store a neural network model comprising a plurality of layers, each layer having at least one dimension and comprising a plurality of synaptic weights;([Abstract] "A hardware-based programmable deep learning processor (DLP) is proposed, wherein the DLP comprises with a plurality of accelerators...one or more vector floating point units (VectorFPUs) each configured to perform floating point vector operations, and a data engine configured to retrieve and store multi-dimensional data to both on-chip and external memories." [¶0007] "FIG. 2 depicts an example of a neural network, which includes a plurality of layers in accordance with some embodiments.")
	a plurality of neural cores, each neural core comprising([Abstract] "Specifically, the DLP includes a plurality of tensor engines configured to perform operations for pattern recognition and classification based on a neural network." tensor engine interpreted as synonymous with neural core.)
	a computation unit, the computation unit adapted to apply a plurality of synaptic weights to a plurality of input activations to produce a plurality of output activations, the computation unit having a plurality of vector units, and([¶0020] "In the example of FIG. 1, the system 100 includes a hardware-based programmable deep learning processor (DLP) 102, wherein the DLP 102 further includes at least a plurality of tensor engines (TEs) 104, which are dedicated hardware blocks/components each including one or more microprocessors and on-chip memory units storing software instructions programmed by a user for various machine learning operations." [¶0022] "As shown by the example of FIG. 2, there are three stages in the processing pipeline for each layer of a fully connected (FC) neural network—multiplication of neuron inputs Xi of a layer with weights Wij, addition of multiplication results and bias vector Bj, and application of an activation function to produce an output Yj" [¶0023] "For pattern recognition and classification, e.g., image pattern recognition, a convolutional neural network for convolution operations on input data may have three types of layers—one or more convolutional layers, each of which is configured to apply one or more local filters and/or a non-linear activation function to data from the input layer...each of which is configured to perform a linear or multi-layer perceptron (MLP) operation on the FC neural network and apply a non-linear activation function to output from the neuron." DLP interpreted as synonymous with computation unit)
	an activation memory adapted to store the input activations and the output activations;(FIG. 1 106 [¶0020] "The DLP 102 further includes an on-system/on-chip memory (OSM) 106 and one or more deep learning controllers (DLCs) 108 configured to access a plurality of external memory resources (e.g., DRAMs) through multiple input/output channels via memory controller(s)." [¶0024] "Here, each of the plurality of tensor engines 104 is fully programmable and is configured to retrieve and process input data from the OSM 106").
	However, Goyal does not explicitly teach wherein the system is adapted to partition the plurality of cores into a plurality of partitions based on a comparison of at least a portion of dimensions of the layer and a quantity of the vector units.

	Boesch, in the same field of endeavor, teaches wherein the system is adapted to partition the plurality of cores into a plurality of partitions based on a comparison of at least a portion of dimensions of the layer and a quantity of the vector units.([¶0071] "A plurality of CA's may be grouped to achieve larger computational entities, which provides flexibility to neural network designers by enabling choices for desirable balancing of available data bandwidth, power, and available processing resources. Kernel sets may be partitioned in batches and processed sequentially, and intermediate results may be stored in on-chip memory. Various kernel sizes (e.g., up to 12×12), various batch sizes (e.g., up to 16), and parallel kernels (e.g., up to 4) can be handled by a single CA instance, and any size kernel can be accommodated with the accumulator input." [¶0265] "FIG. 6D is a block diagram illustrating an exemplary convolution operation. In the block diagram, a 3-pixel-by-3-pixel (3×3) kernel having a stride of one (1) is convolved. The following acts are performed." [¶0266] "At the start of a line in first cycle, a first CA MAC unit 620 of each of three clusters (i.e., three 1st MAC units) performs calculations of the first column for the first output value of the first line." [¶0270] "The calculation sequence illustrated in FIG. 6D performs such that on every cycle, one 3×3 pixel kernel batch is convolved, which provides a significant reuse of data values fetched from fifth CA internal buffer 618 (e.g., the feature line buffer). FIG. 6D shows how four 3×3 pixel kernels with 36 MAC operations per cycle are performed using only a single access to the feature strip buffer per calculated output value." Boesch teaches that the number of MAC units is equal to the number of columns in each 3x3 filter and that the cores (convolution accelerators) are partitioned according to filters.).

	Goyal and Boesch are both directed toward accelerating convolutional neural networks and are therefore analogous art in the same field of endeavor. It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Goyal with the teachings of Boesch by substituting the Convolution Accelerators and corresponding MAC units of Boesch for the tensor engines and corresponding vector FPUs of Goyal. Boesch provides additional motivation for the combination ([¶0120] "The high-performance, energy efficient hardware accelerated DCNN processor described herein includes an energy efficient set of DCNN hardware convolution accelerators that support kernel decompression, fast data throughput, and efficient mathematical operation. The processor also includes an on-chip reconfigurable data transfer fabric that improves data reuse and reduces on-chip and off-chip memory traffic, and a power efficient array of DSPs that support complete, real-world computer vision applications."). This motivation for combination also applies to the remaining claims which depend on this combination.
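	For illustration only, the claimed comparison between a layer dimension and a quantity of vector units can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Goyal, or Boesch: the function name partition_cores, its parameters, and the rule that a partition needs ceil(feature_dim / num_vector_units) cores are all hypothetical.

```python
from math import ceil


def partition_cores(num_cores, feature_dim, num_vector_units):
    """Group core indices into partitions (hypothetical scheme).

    Assumption: each core handles up to `num_vector_units` features at once,
    so a layer with `feature_dim` features needs
    ceil(feature_dim / num_vector_units) cores per partition.
    """
    if num_cores <= 0 or feature_dim <= 0 or num_vector_units <= 0:
        raise ValueError("all arguments must be positive")

    # The comparison: does the feature dimension fit in one core's vector units?
    if feature_dim <= num_vector_units:
        cores_per_partition = 1
    else:
        cores_per_partition = ceil(feature_dim / num_vector_units)

    # A partition cannot be larger than the whole array of cores.
    cores_per_partition = min(cores_per_partition, num_cores)

    cores = list(range(num_cores))
    # The last partition may be smaller when num_cores is not a multiple.
    return [
        cores[i:i + cores_per_partition]
        for i in range(0, num_cores, cores_per_partition)
    ]
```

	Under these assumptions, eight cores with 32 vector units each and a layer having 64 features would be grouped into four partitions of two cores.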

	Regarding claim 2, the combination of Goyal and Boesch teaches The system of claim 1, further comprising: at least one controller operatively coupled to the neural network model memory and to the plurality of cores, the at least one controller being adapted to, for each layer of the neural network model (Goyal [¶0020] "The DLP 102 further includes an on-system/on-chip memory (OSM) 106 and one or more deep learning controllers (DLCs) 108 configured to access a plurality of external memory resources (e.g., DRAMs) through multiple input/output channels via memory controller(s).")
	configure the plurality of cores to implement the layer, and(Goyal [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104...The MatrixMul engine 408 of each tensor engine 104 is then configured to perform a matrix-matrix multiplication on its corresponding portion of the partitioned matrix" configuring the plurality of cores to implement the layer interpreted as synonymous with partitioning cores based on layer size.)
	provide input activations for the layer to the plurality of cores.(Goyal [¶0024] "Here, each of the plurality of tensor engines 104 is fully programmable and is configured to retrieve and process input data from the OSM 106 and/or the external memory resources via the DLCs 108," input data and input activations are interpreted as synonymous.  DLC and controller are interpreted as synonymous).
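	For illustration only, the partitioning described in Goyal ¶0027 (a large matrix split into smaller portions, each multiplied by a separate tensor engine) can be sketched as follows. This is a minimal sketch, not Goyal's implementation: the row-wise split, the function names, and the sequential loop standing in for parallel engines are assumptions.

```python
def split_rows(matrix, num_engines):
    """Split rows into at most num_engines contiguous, near-equal portions."""
    if num_engines <= 0:
        raise ValueError("num_engines must be positive")
    base, extra = divmod(len(matrix), num_engines)
    portions, start = [], 0
    for engine in range(num_engines):
        size = base + (1 if engine < extra else 0)
        if size:  # engines with no rows receive no portion
            portions.append(matrix[start:start + size])
        start += size
    return portions


def matmul(a, b):
    """Plain matrix-matrix product of a (m x k) and b (k x n), as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def partitioned_matmul(a, b, num_engines):
    """Each hypothetical engine multiplies its row portion of a by b.

    The per-engine results are stacked in order, so the output equals
    matmul(a, b). Engines are simulated one after another here.
    """
    result = []
    for portion in split_rows(a, num_engines):
        result.extend(matmul(portion, b))
    return result
```

	The partitioned result matches the unpartitioned product for any positive number of engines, including the case where there are more engines than rows.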
	
	Regarding claim 3, the combination of Goyal and Boesch teaches The system of claim 2, further comprising a network on a chip coupled to the plurality of cores. (Boesch [¶0162] "Continuing in the description of FIG. 4A, in addition to the stream switch 500, the CAF 400 may also include a system bus interface module 404. The system bus interface module 404 provides an interface to other modules of SoC 110...each DMA controller 406, external device interface 408, processing module 410, and convolution accelerator 600 has an interface to the configuration network with a defined set of configuration registers (e.g., formed in CAF control registers 402).").
	
	Regarding claim 4, the combination of Goyal and Boesch teaches The system of claim 3, wherein input activations are provided to the plurality of cores via the network. (Goyal [¶0023] "For pattern recognition and classification, e.g., image pattern recognition, a convolutional neural network for convolution operations on input data may have three types of layers—one or more convolutional layers, each of which is configured to apply one or more local filters and/or a non-linear activation function to data from the input layer...each of which is configured to perform a linear or multi-layer perceptron (MLP) operation on the FC neural network and apply a non-linear activation function to output from the neuron.").
	
	Regarding claim 5, the combination of Goyal and Boesch teaches The system of claim 3, wherein configuring the plurality of cores comprises distributing parameters to the plurality of cores via the network. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form, wherein corresponding weights for the vector are also read once in blocks of size B at a time first from the first column and then from the second column, etc. Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408" Weights are interpreted as synonymous with parameters. Storing is interpreted as synonymous with distributing. MatrixMul engine 408 is an aspect of the plurality of cores, see FIG. 4).
	
	Regarding claim 6, the combination of Goyal and Boesch teaches The system of claim 5, wherein configuring the plurality of cores further comprises distributing instructions to the plurality of cores via the network. (Goyal [¶0018] "In addition, the DLP runs a complete pipeline of deep learning processing/operations offloaded from a host/computing device" [¶0020] "DLP 102 further includes at least a plurality of tensor engines (TEs) 104, which are dedicated hardware blocks/components each including one or more microprocessors and on-chip memory units storing software instructions programmed by a user for various machine learning operations." Offloading instructions from a host/computing device is interpreted as synonymous with distributing instructions to the DLP, where they can be further distributed to the user programmable cores.).
	
	Regarding claim 7, the combination of Goyal and Boesch teaches The system of claim 1, wherein the plurality of partitions for each layer is further determined based on spatial dimensions of the input activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3 [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" The size of the layer is interpreted as being fully based on the spatial dimensions and the feature dimensions. Goyal FIG. 3 shows each layer having three dimensions: height and width (the spatial dimensions from the input image) and a feature dimension. N1 is interpreted as representing the layer input (the input activations), and N2 as representing the layer output. Goyal explicitly teaches that the distributed matrix depends on the spatial dimensions and feature dimensions. N1 and N2 are taught as dense matrices that may be explicitly partitioned among tensor engines.).
	
	Regarding claim 8, the combination of Goyal and Boesch teaches The system of claim 1, wherein the plurality of partitions for each layer is further determined based on spatial dimensions and a size of the feature dimensions of the input activations for that layer. (Boesch [¶0071] "A plurality of CA's may be grouped to achieve larger computational entities, which provides flexibility to neural network designers by enabling choices for desirable balancing of available data bandwidth, power, and available processing resources. Kernel sets may be partitioned in batches and processed sequentially, and intermediate results may be stored in on-chip memory. Various kernel sizes (e.g., up to 12×12), various batch sizes (e.g., up to 16), and parallel kernels (e.g., up to 4) can be handled by a single CA instance, and any size kernel can be accommodated with the accumulator input." [¶0265] "FIG. 6D is a block diagram illustrating an exemplary convolution operation. In the block diagram, a 3-pixel-by-3-pixel (3×3) kernel having a stride of one (1) is convolved. The following acts are performed." [¶0266] "At the start of a line in first cycle, a first CA MAC unit 620 of each of three clusters (i.e., three 1st MAC units) performs calculations of the first column for the first output value of the first line." [¶0270] "The calculation sequence illustrated in FIG. 6D performs such that on every cycle, one 3×3 pixel kernel batch is convolved, which provides a significant reuse of data values fetched from fifth CA internal buffer 618 (e.g., the feature line buffer). FIG. 6D shows how four 3×3 pixel kernels with 36 MAC operations per cycle are performed using only a single access to the feature strip buffer per calculated output value." Boesch teaches that the number of MAC units is equal to the number of columns in each 3x3 filter and that the cores (convolution accelerators) are partitioned according to filters.)
	and wherein the comparison includes a comparison of the size of the feature dimensions with the quantity of the vector units.(Boesch [¶0252] "As indicated in Table 5, feature data and kernel data dimensions vary depending on the processing layer, the type of network, and for other reasons. Based on these network-centric variables, conventional processing hardware with processing capabilities that are fixed at design time cannot be used to implement neural networks. In contrast, flexible buffering capabilities of each CA 600, which permit in some cases buffers to be changed in dimension, may be exploited to expand the level of parallelism in a neural network." FIG. 6A shows that the local buffers of each convolution accelerator are directly coupled to the plurality of MAC units which are interpreted as synonymous with the vector units.).
	
	Regarding claim 9, the combination of Goyal and Boesch teaches The system of claim 1, wherein the plurality of partitions for each layer is further determined based on spatial dimensions of the output activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3 [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" The size of the layer is interpreted as being fully based on the spatial dimensions and the feature dimensions. Goyal FIG. 3 shows each layer having three dimensions: height and width (the spatial dimensions from the input image) and a feature dimension. N1 is interpreted as representing the layer input (the input activations), and N2 as representing the layer output. Goyal explicitly teaches that the distributed matrix depends on the spatial dimensions and feature dimensions. N1 and N2 are taught as dense matrices that may be explicitly partitioned among tensor engines.).
	
	Regarding claim 10, the combination of Goyal and Boesch teaches The system of claim 1, wherein the plurality of partitions for each layer is further determined based on spatial dimensions and feature dimensions of the output activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3 [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" The size of the layer is interpreted as being fully based on the spatial dimensions and the feature dimensions. Goyal FIG. 3 shows each layer having three dimensions: height and width (the spatial dimensions from the input image) and a feature dimension. N1 is interpreted as representing the layer input (the input activations), and N2 as representing the layer output. Goyal explicitly teaches that the distributed matrix depends on the spatial dimensions and feature dimensions. N1 and N2 are taught as dense matrices that may be explicitly partitioned among tensor engines.).
	
	Regarding claim 11, the combination of Goyal and Boesch teaches The system of claim 1, wherein the plurality of partitions for each layer is further determined based on one or more of spatial dimensions of the input activations, feature dimensions of the input activations, spatial dimensions of the output activations, or feature dimensions of the output activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3 [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" The size of the layer is interpreted as being fully based on the spatial dimensions and the feature dimensions. Goyal FIG. 3 shows each layer having three dimensions: height and width (the spatial dimensions from the input image) and a feature dimension. N1 is interpreted as representing the layer input (the input activations), and N2 as representing the layer output. Goyal explicitly teaches that the distributed matrix depends on the spatial dimensions and feature dimensions. N1 and N2 are taught as dense matrices that may be explicitly partitioned among tensor engines.).
	
	Regarding claim 12, the combination of Goyal and Boesch teaches The system of claim 11, wherein the plurality of partitions for each layer is further determined by a dimension of the plurality of cores. (Goyal [¶0024] "The DLP 102 is configured to distribute the sub-tasks among the tensor engines 104 under both scenarios where the number of sub-tasks is greater than the number of tensor engines 104 and where the number of sub-tasks is fewer than the number of tensor engines 104" The partitioning and distributing of the sub-tasks is explicitly taught as being with respect to the number of tensor engines. Tensor engine is interpreted as being synonymous with neural core.).
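	For illustration only, the two scenarios quoted from Goyal ¶0024 (more sub-tasks than tensor engines, and fewer sub-tasks than tensor engines) can be sketched as follows. Goyal does not state the assignment policy in the quoted passage; the round-robin rule and the name distribute_subtasks are assumptions made for this sketch.

```python
def distribute_subtasks(num_subtasks, num_engines):
    """Assign sub-task indices to engine indices round-robin (assumed policy).

    Returns a dict mapping each engine index to its list of sub-task indices.
    Engines left without work map to an empty list.
    """
    if num_subtasks < 0 or num_engines <= 0:
        raise ValueError("need num_subtasks >= 0 and num_engines > 0")
    assignment = {engine: [] for engine in range(num_engines)}
    for task in range(num_subtasks):
        assignment[task % num_engines].append(task)
    return assignment
```

	With five sub-tasks and two engines, each engine receives several sub-tasks; with two sub-tasks and four engines, two engines remain idle. In both cases the assignment depends on the number of engines.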
	
	Regarding claim 13, the combination of Goyal and Boesch teaches The system of claim 1, wherein the cores within each of the plurality of partitions are configured to compute partial sums. (Goyal [¶0026] "Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408 as a partial sum to the corresponding output value" [¶0027] "MatrixMul engine 408 of each tensor engine 104 is then configured to perform a matrix-matrix multiplication on its corresponding portion of the partitioned matrix" MatrixMul engine 408 is part of the tensor engine, which is interpreted as synonymous with core.).
	
	Regarding claim 14, the combination of Goyal and Boesch teaches The system of claim 13, wherein the partial sums are aggregated to compute a result for the associated layer. (Goyal [¶0026] "Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408 as a partial sum to the corresponding output value" The corresponding output value is interpreted as synonymous with the result for an associated layer.).
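	For illustration only, the block-wise accumulation quoted from Goyal ¶0026 (blocks of size B read from each column of a column-major weight matrix, multiplied element-wise with the matching block of the vector, summed, and added as a partial sum to the corresponding output value) can be sketched as follows. The function name blocked_vecmat and the list-of-columns representation are assumptions for this sketch and do not appear in Goyal.

```python
def blocked_vecmat(x, w_cols, block):
    """Compute y[j] = sum_i x[i] * W[i][j] one block of size `block` at a time.

    x      : input vector of length N1
    w_cols : weight matrix in column-major form, a list of N2 columns,
             each of length N1
    block  : block size B
    """
    if block <= 0:
        raise ValueError("block must be positive")
    if any(len(col) != len(x) for col in w_cols):
        raise ValueError("every column must have the same length as x")

    y = [0.0] * len(w_cols)  # one running output value per column
    for start in range(0, len(x), block):
        x_blk = x[start:start + block]
        for j, col in enumerate(w_cols):
            w_blk = col[start:start + block]
            # Element-wise multiply, sum, and add as a partial sum to y[j].
            y[j] += sum(a * b for a, b in zip(x_blk, w_blk))
    return y
```

	The final output values are the same for any block size, because each output value is simply the aggregate of its partial sums.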
	
	Regarding claim 15, the combination of Goyal and Boesch teaches The system of claim 14, wherein the partial sums are transmitted via a network for aggregation. (Goyal [¶0027] "MatrixMul engine 408 of each tensor engine 104 is then configured to perform a matrix-matrix multiplication on its corresponding portion of the partitioned matrix" [¶0026] "During the entire process, the weight matrix is read N/T times and the input matrix is read K/T times while the output matrix is written/stored only once to the memory." [¶0027] "the MatrixMul engine 408 in each tensor engine 104 is configured to achieve efficient vector-matrix multiplication by minimizing or avoiding data movement for multiplication between a sparse vector and a dense or sparse matrix, wherein only data that corresponds to non-zero values in the sparse vector is loaded into memory 406" See FIG. 1. Data movement is interpreted as synonymous with transmitted via a network. Iterative multiplication and addition to the partial sum for the corresponding output is interpreted as synonymous with aggregation.).
	
	 Regarding claim 19, Goyal teaches A method comprising: reading a neural network model comprising a plurality of layers, each layer having at least one dimension and comprising a plurality of synaptic weights;(FIG. 2 [¶0026] "The weight matrix W of N1×N2 is stored in column major form, wherein corresponding weights for the vector are also read once in blocks of size B at a time first from the first column and then from the second column, etc." See FIG. 2 for layers, FIG. 3 for dimensions.)
	configuring the plurality of cores to implement the layer, and([¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104...The MatrixMul engine 408 of each tensor engine 104 is then configured to perform a matrix-matrix multiplication on its corresponding portion of the partitioned matrix" configuring the plurality of cores to implement the layer interpreted as synonymous with partitioning cores based on layer size.)
	providing to the plurality of cores input activations for the layer,([¶0024] "Here, each of the plurality of tensor engines 104 is fully programmable and is configured to retrieve and process input data from the OSM 106 and/or the external memory resources via the DLCs 108," input data and input activations are interpreted as synonymous.  DLC and controller are interpreted as synonymous)
	and applying the synaptic weights associated with the layer to the input activations to produce a plurality of output activations.([¶0026] " The weight matrix W of N1×N2 is stored in column major form, wherein corresponding weights for the vector are also read once in blocks of size B at a time first from the first column and then from the second column, etc. Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408" weights are interpreted as synonymous with parameters.  Storing is interpreted as synonymous with distributing. MatrixMul engine 408 is an aspect of the plurality of cores, see FIG. 4).
	However, Goyal does not explicitly teach for each layer of the neural network model comparing at least a portion of dimensions of a layer and a quantity of vector units;
	partitioning a plurality of cores into a plurality of partitions based on the comparison.

	Boesch, in the same field of endeavor, teaches for each layer of the neural network model comparing at least a portion of dimensions of a layer and a quantity of vector units([¶0071] "A plurality of CA's may be grouped to achieve larger computational entities, which provides flexibility to neural network designers by enabling choices for desirable balancing of available data bandwidth, power, and available processing resources. Kernel sets may be partitioned in batches and processed sequentially, and intermediate results may be stored in on-chip memory. Various kernel sizes (e.g., up to 12×12), various batch sizes (e.g., up to 16), and parallel kernels (e.g., up to 4) can be handled by a single CA instance, and any size kernel can be accommodated with the accumulator input." [¶0265] "FIG. 6D is a block diagram illustrating an exemplary convolution operation. In the block diagram, a 3-pixel-by-3-pixel (3×3) kernel having a stride of one (1) is convolved. The following acts are performed." [¶0266] "At the start of a line in first cycle, a first CA MAC unit 620 of each of three clusters (i.e., three 1st MAC units) performs calculations of the first column for the first output value of the first line." [¶0270] "The calculation sequence illustrated in FIG. 6D performs such that on every cycle, one 3×3 pixel kernel batch is convolved, which provides a significant reuse of data values fetched from fifth CA internal buffer 618 (e.g., the feature line buffer). FIG. 6D shows how four 3×3 pixel kernels with 36 MAC operations per cycle are performed using only a single access to the feature strip buffer per calculated output value." Boesch teaches that the number of MAC units is equal to the number of columns in each 3x3 filter and that the cores (convolution accelerators) are partitioned according to filters.)
	partitioning a plurality of cores into a plurality of partitions based on the comparison([¶0252] "As indicated in Table 5, feature data and kernel data dimensions vary depending on the processing layer, the type of network, and for other reasons. Based on these network-centric variables, conventional processing hardware with processing capabilities that are fixed at design time cannot be used to implement neural networks. In contrast, flexible buffering capabilities of each CA 600, which permit in some cases buffers to be changed in dimension, may be exploited to expand the level of parallelism in a neural network." FIG. 6A shows that the local buffers of each convolution accelerator are directly coupled to the plurality of MAC units which are interpreted as synonymous with the vector units.).

	Goyal and Boesch are both directed toward accelerating convolutional neural networks and are therefore analogous art in the same field of endeavor. It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Goyal with the teachings of Boesch by substituting the Convolution Accelerators and corresponding MAC units of Boesch for the tensor engines and corresponding vector FPUs of Goyal. Boesch provides additional motivation for the combination ([¶0120] "The high-performance, energy efficient hardware accelerated DCNN processor described herein includes an energy efficient set of DCNN hardware convolution accelerators that support kernel decompression, fast data throughput, and efficient mathematical operation. The processor also includes an on-chip reconfigurable data transfer fabric that improves data reuse and reduces on-chip and off-chip memory traffic, and a power efficient array of DSPs that support complete, real-world computer vision applications."). This motivation for combination also applies to the remaining claims which depend on this combination.

	Regarding claim 21, the combination of Goyal and Boesch teaches The method of claim 19, wherein configuring the plurality of cores comprises distributing parameters to the plurality of cores via a network. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form, wherein corresponding weights for the vector are also read once in blocks of size B at a time first from the first column and then from the second column, etc. Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408" Weights are interpreted as synonymous with parameters. Storing is interpreted as synonymous with distributing. MatrixMul engine 408 is an aspect of the plurality of cores, see FIG. 4).
	
	 Regarding claim 22, the combination of Goyal and Boesch teaches The method of claim 19, wherein configuring the plurality of cores comprises distributing instructions to the plurality of cores via a network. (Goyal [¶0018] "In addition, the DLP runs a complete pipeline of deep learning processing/operations offloaded from a host/computing device" [¶0020] "DLP 102 further includes at least a plurality of tensor engines (TEs) 104, which are dedicated hardware blocks/components each including one or more microprocessors and on-chip memory units storing software instructions programmed by a user for various machine learning operations." Offloading instructions from the host/computing device is interpreted as synonymous with distributing instructions to the DLP, where they can be further distributed to the user-programmable cores.)

	 Regarding claim 23, the combination of Goyal and Boesch teaches The method of claim 19, wherein the plurality of partitions for each layer is further determined based on one or more of spatial dimensions of the input activations, feature dimensions of the input activations, spatial dimensions of the output activations, or feature dimensions of the output activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3  [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" The size of the layer is interpreted as being fully based on the spatial dimensions and the feature dimensions.  Goyal FIG. 3 shows each layer having three dimensions: height and width (the spatial dimensions from the input image) and a feature dimension.  The input activations are interpreted as the layer input represented by N1; N2 represents the layer output.  Goyal explicitly teaches that the distributed matrix depends on the spatial dimensions and feature dimensions.  N1 and N2 are taught as dimensions of dense matrices that may be explicitly partitioned among tensor engines.)
	
	 Regarding claim 24, the combination of Goyal and Boesch teaches The system of claim 23, wherein the plurality of partitions for each layer is further determined by a dimension of the plurality of cores. (Goyal [¶0024] "The DLP 102 is configured to distribute the sub-tasks among the tensor engines 104 under both scenarios where the number of sub-tasks is greater than the number of tensor engines 104 and where the number of sub-tasks is fewer than the number of tensor engines 104" The partitioning and distributing of the sub-tasks is explicitly taught as being with respect to the number of tensor engines.  Tensor engine is interpreted as being synonymous with neural core.)
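
	For illustration only, the partitioning and distribution described in the passages of Goyal quoted above (¶0024 and ¶0027) can be sketched as follows. This is a minimal sketch and is not taken from Goyal or Boesch: the function names split_columns and assign_round_robin, the column-wise split, and the round-robin assignment policy are all assumptions chosen for the example. It shows only that a partition count and an engine count can be handled whether there are more or fewer sub-tasks than engines.

```python
# Hypothetical illustration; names and the round-robin policy are assumptions,
# not the method of any cited reference.

def split_columns(n_cols, n_parts):
    """Split n_cols matrix columns into n_parts contiguous (start, stop) ranges.

    Ranges differ in length by at most one column.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(n_cols, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts each take one additional column.
        stop = start + base + (1 if part < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def assign_round_robin(sub_tasks, n_engines):
    """Assign sub-tasks to engines 0..n_engines-1 in round-robin order.

    Works when there are more sub-tasks than engines (engines receive
    several sub-tasks) and when there are fewer (some engines stay idle).
    """
    if n_engines <= 0:
        raise ValueError("n_engines must be positive")
    assignment = {engine: [] for engine in range(n_engines)}
    for index, task in enumerate(sub_tasks):
        assignment[index % n_engines].append(task)
    return assignment


if __name__ == "__main__":
    portions = split_columns(10, 4)
    print(portions)
    print(assign_round_robin(portions, 2))  # more sub-tasks than engines
    print(assign_round_robin(portions, 6))  # fewer sub-tasks than engines
```

	In this sketch the number of portions depends on the matrix dimension passed to split_columns, and the assignment depends on the number of engines, which mirrors the two dependencies discussed for claims 23 and 24.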

	Claims 16-18 and 20 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Goyal, Boesch, and Huang (US20190180170A1).

	 Regarding claim 16, the combination of Goyal and Boesch teaches The system of claim 2.
	However, the combination of Goyal and Boesch does not explicitly teach that the at least one controller is further adapted to, upon computation of output activations of a layer, redistribute the output activations among the plurality of cores.

	Huang, in the same field of endeavor, teaches the at least one controller is further adapted to, upon computation of output activations of a layer, redistribute the output activations among the plurality of cores. ([Abstract] "Performing the task can include computing an intermediate result using the first array of processing engines, copying the intermediate result to the second set of memory banks, and computing a final result using the second array of processing engines, where the final result corresponds to an outcome of performing the task." FIG. 5).

	Goyal and Boesch are both directed towards accelerating convolutional neural networks.  Therefore, Goyal and Boesch are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Goyal with the teachings of Boesch by substituting the tensor engines and corresponding vector FPUs in Goyal with the Convolution Accelerator and corresponding MAC units in Boesch.  Boesch provides additional motivation for the combination ([¶0120] "The high-performance, energy efficient hardware accelerated DCNN processor described herein includes an energy efficient set of DCNN hardware convolution accelerators that support kernel decompression, fast data throughput, and efficient mathematical operation. The processor also includes an on-chip reconfigurable data transfer fabric that improves data reuse and reduces on-chip and off-chip memory traffic, and a power efficient array of DSPs that support complete, real-world computer vision applications.").  This motivation for combination also applies to the remaining claims which depend on this combination.

	 Regarding claim 17, the combination of Goyal, Boesch, and Huang teaches The system of claim 16, wherein the redistribution is via a network. (Goyal FIG. 1 [¶0019] "FIG. 1 depicts an example of a diagram of a system 100 configured to support hardware-based deep learning processing. Although the diagrams depict components as functionally separate, such depiction is merely for illustrative purposes. It will be apparent that the components portrayed in this figure can be arbitrarily combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent that such components, regardless of how they are combined or divided, can execute on the same host or multiple hosts, and wherein the multiple hosts can be connected by one or more networks.").
	
	 Regarding claim 18, the combination of Goyal, Boesch, and Huang teaches The system of claim 16, wherein the redistribution is determined based on one or more of spatial dimensions of the input activations, feature dimensions of the input activations, spatial dimensions of the output activations, or feature dimensions of the output activations for that layer. (Goyal [¶0026] "The weight matrix W of N1×N2 is stored in column major form," FIG. 3  [¶0027] "For scalable matrix-matrix multiplication, the DLP 102 is configured to partition a large dense or sparse matrix into smaller portions and distribute the portions of the matrix across multiple tensor engines 104. In some embodiments, separate Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) Format can be adopted for the corresponding portion of the large matrix distributed to each of the tensor engines 104" Huang teaches redistributing data through a DMA engine based on the output of a first calculation.  Goyal teaches distributing through a DMA engine based on spatial dimensions of the layer.)

	 Regarding claim 20, the combination of Goyal and Boesch teaches The method of claim 19, further comprising: computing partial sums within each partition; (Goyal [¶0026] "Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408 as a partial sum to the corresponding output value" [¶0027] "MatrixMul engine 408 of each tensor engine 104 is then configured to perform a matrix-matrix multiplication on its corresponding portion of the partitioned matrix" MatrixMul engine 408 is part of the tensor engine, which is interpreted as synonymous with core.)
	aggregating the partial sums to compute the output activations. (Goyal [¶0026] "Each time a block of weights is read from the weight matrix, they are multiplied element-wise with the block of the vector, summed, and added by the MatrixMul engine 408 as a partial sum to the corresponding output value" The corresponding output value is interpreted as synonymous with output activation.)
	However, the combination of Goyal and Boesch does not explicitly teach transmitting the partial sums among cores within each partition.

	Huang, in the same field of endeavor, teaches transmitting the partial sums among cores within each partition; ([Abstract] "Performing the task can include computing an intermediate result using the first array of processing engines, copying the intermediate result to the second set of memory banks, and computing a final result using the second array of processing engines, where the final result corresponds to an outcome of performing the task." FIG. 5).

	Goyal, Boesch, and Huang are all directed towards distributed training of a neural network.  Therefore, Goyal, Boesch, and Huang are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network system of Goyal and Boesch with that of Huang by redistributing the output activations after computation.  It would be obvious to one of ordinary skill in the art that, in order to obtain the activation results in a distributed system, said results should be distributed.  The combination would have been obvious because a person of ordinary skill in the art would be able to determine this from Huang ([¶0065] “FIG. 4 illustrates an example of the effect of storing the weight values for a neural network on-chip instead of in off-chip memory.”).
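
	For illustration only, the blocked multiplication and partial-sum accumulation described in the passage of Goyal ¶0026 quoted above for claim 20 can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Goyal: the function name blocked_matvec, the list-of-columns representation of the column-major weight matrix, and the parameter name block_size (standing in for the block size B) are hypothetical.

```python
# Hypothetical illustration of block-wise partial sums; not code from any
# cited reference.

def blocked_matvec(x, w_columns, block_size):
    """Multiply a length-N1 vector by an N1 x N2 matrix stored by columns.

    x          -- list of N1 numbers (the input vector)
    w_columns  -- list of N2 columns, each a list of N1 weights
    block_size -- number of elements read per block (assumed block size B)

    Each block of weights is multiplied element-wise with the matching
    block of the vector, summed, and added as a partial sum to the output
    value for that column.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n1 = len(x)
    outputs = [0.0] * len(w_columns)
    for col_index, column in enumerate(w_columns):
        if len(column) != n1:
            raise ValueError("each column must have as many rows as x")
        for start in range(0, n1, block_size):
            x_block = x[start:start + block_size]
            w_block = column[start:start + block_size]
            partial_sum = sum(a * b for a, b in zip(x_block, w_block))
            outputs[col_index] += partial_sum  # aggregate partial sums
    return outputs


if __name__ == "__main__":
    vector = [1, 2, 3]
    columns = [[1, 0, 1], [2, 2, 2]]
    print(blocked_matvec(vector, columns, block_size=2))
```

	The result does not depend on the block size chosen; the block size only changes how many partial sums are accumulated into each output value.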
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Lu (“Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs”, 2017) is directed towards partitioning convolution filters among processing elements based on the dimension of the filter.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124