DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 04/28/2017.
This action is in response to arguments and/or remarks filed on 02/04/2021. In the current amendments, claims 1, 22 and 36 have been amended. Claims 1-38 are pending and have been examined. 


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-7, 9-10, 12-16, 18, 20, 22-23, 25, 27-30, 32, 34-35 and 38 are rejected under 35 U.S.C. 103 as being unpatentable over Gokhale (“Nn-X- a hardware accelerator for convolutional neural networks”, hereinafter: Gokhale) in view of Liu et al. (“Sparse Convolutional Neural Networks”, hereinafter: Liu) in view of Mahale et al. (“VOP: Architecture of a Processor for Vector Operations in On-line Learning of Neural Networks”, hereinafter: Mahale) and further in view of Glorot et al. (“Deep Sparse Rectifier Neural Networks”, hereinafter: Glorot).
Regarding claim 1 (Currently Amended) 
Gokhale teaches a hardware-based programmable deep learning processor (DLP), (Fig. 3.1, “A block diagram of the nn-X system. nn-X is composed of a coprocessor, a host processor and external memory” pg. 10 section 3 “In this implementation, the coprocessor is on-chip”)
comprising: an on-system memory (OSM) (pg. 13 section 3.2 “The convolution engine is implemented as fully pipelined logic and uses memory[corresponds to on-system memory] to cache incoming data. This cache is needed for pipelined implementation of the convolution operation [14]. For a row of width W and a k x k convolution filter, the size of this cache is W x k x 2 bytes”)
and one or more controllers configured to access a plurality of external memory resources (Fig. 3.1, “A block diagram of the nn-X system. nn-X is composed of a coprocessor, a host processor and external memory”) via direct memory access (DMA); (pg. 19 “Figure 4.2 shows the components involved in a DMA transaction.” Section 4.1 and 4.2 has DMA transactions initiated by host processor for data transfer between external memory and coprocessor)
a plurality of programmable tensor engines (Fig. 3.1, “The coprocessor has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity” since the convolution operations are two-dimensional array/matrix operation this corresponds to 2D tensor see section 3.2.1 “The convolution engine is implemented as fully pipelined logic and uses memory to cache incoming data. This cache is needed for pipelined implementation of the convolution operation [14]. For a row of width W and a k x k convolution filter, the size of this cache is W x k x 2 bytes”)
configured to perform a plurality of operations (Examiner notes each collection has convolution engine performing convolution operations which corresponds to plurality of operations see Fig. 3.2 “A collection comprises a router and three operators. The router has “all-to-all” connections forming a crossbar switch” also see section 3.2.1) on input data to generate deep learning processing results for (pg. 2 Fig. 1 “Architecture of a typical convolutional neural network for object recognition: a convolutional feature extractor followed by a classifier (like a multi-layer perceptron) for generic multi-class object recognition. Once trained, the network can parse arbitrarily large input images, generating a classification map as the output.”)
wherein each of the tensor engines (Fig. 3.1, “The coprocessor has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity” since the convolution operations are two-dimensional array/matrix operation this corresponds to 2D tensor) further comprises a plurality types of hardware engines (Examiner notes there are several hardware engines that include convolution engine from pg. 13 section 3 which states “The convolution engine is implemented as fully pipelined logic and uses memory to cache incoming data.” and the DMA engines from pg. 19 “Xilinx provides soft DMA IP (called DMA engines) to perform transactions over the AXI bus from memory to the programmable logic… The DMA engines attach themselves to these HP ports. The engines include a buffer to store data until it is required by the coprocessor.”) to accelerate the operations on data at each layer of the neural network, (Fig. 4.2 “The DMA engines[corresponds to hardware engines], which are soft IP and are implemented in the programmable logic, receive data from memory and store it in a buffer. From here, data is transferred to the coprocessor which is also in the programmable logic.”)
…
a data engine configured to prefetch the input data from the OSM and/or the external memory resources. (Examiner notes that Gokhale teaches retrieving data[corresponds to fetching] and subsequently storing it in a buffer in that way it is transferred for later use. See pg. 20 “Fig. 4.2 Components involved in performing a DMA[corresponds to main memory] transaction. Data is stored in memory by the host processor. From here, a DMA transaction is initiated by the host. The DMA engines, which are soft IP and are implemented in the programmable logic, receive data from memory and store it in a buffer. From here, data is transferred to the coprocessor which is also in the programmable logic.”)
Gokhale does not teach wherein the types of hardware engines include: one or more matrix multiplier engines each configured to perform a plurality of dense and/or sparse vector-matrix and matrix-matrix multiplication operations; 
one or more convolutional network engines each configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices, and wherein the one or more convolutional network engines each is configured to reduce a number of computations on zero values of the vectors and/or matrices;
one or more vector floating point units each configured to perform a vector operation in floating point format.
Liu teaches wherein the types of hardware engines include: one or more matrix multiplier engines each configured to perform a plurality of dense and/or sparse vector-matrix (abstract “We also propose an efficient sparse matrix multiplication algorithm on CPU[corresponds to hardware engines] for Sparse Convolutional Neural Networks (SCNN) models.”)
and matrix-matrix multiplication operations; (Fig. 3 shows multiplying a dense matrix and a sparse matrix[corresponds to matrix-matrix multiplication] with input A: 8X12 dense matrix and B: 12x8 sparse matrix see pg. 810 “Figure 3: An example that illustrates how our algorithm generates code for multiplying a dense matrix and a sparse matrix”)
…
and wherein the one or more convolutional network engines each is configured to reduce a number of computations on zero values of the vectors and/or matrices; (pg. 809 section 4.1 “To avoid extra storage and calculation of zero values, the non-zero elements in a sparse matrix are typically stored continuously, with their locations indexed in some specific structure. This leads to indirect jumping memory access when traversing the matrix, which is much slower than the continuous direct access used in the dense case… We propose an efficient, sparse-dense matrix multiplication algorithm for executing the sparse convolutional kernels.”)
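Gokhale and Liu are analogous art because they are both directed to convolutional neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale to incorporate the teaching of Liu to include an efficient sparse matrix multiplication algorithm.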
One of ordinary skill in the art would have been motivated to make this modification in order to reduce the amount of computation required to process images, by sparse decompositions of the convolutional kernels as disclosed by Liu (pg. 807 left col first paragraph).
Gokhale in view of Liu does not teach … one or more convolutional network engines each configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices, 
…
one or more vector floating point units each configured to perform a vector operation in floating point format.
Mahale teaches one or more vector floating point units each configured to perform a vector operation in floating point format. (Pg. 395 left col “As we target VOP to support large set of applications involving vector operations, we introduce support for floating point data. We replace the fixed point units with pipe-lined floating point units. We merge adder and subtracter into a single unit that performs both operations by flipping sign bit of one of the operands.” Also see pg. 394 left col “We extend this work to VOPs with 64 bit double precision floating point units suitable for data sets with larger data ranges.”)
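Gokhale, Liu and Mahale are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu to incorporate the teaching of Mahale to include vector operations in floating point.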
One of ordinary skill in the art would have been motivated to make this modification in order to reduce computation complexity during update of synaptic weights as disclosed by Mahale (pg. 396 left col).
Gokhale in view of Liu with Mahale does not teach one or more convolutional network engines each configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices. 
Glorot teaches one or more convolutional network engines each configured to perform a plurality of convolution operations by applying a function to increase a sparsity of the vectors and/or matrices. (pg. 318 left col “The rectifier activation function allows a network to easily obtain sparse representations. For example, after uniform initialization of the weights, around 50% of hidden units continuous output values are real zeros, and this fraction can easily increase with sparsity-inducing regularization.”)
Gokhale, Liu, Mahale and Glorot are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale to incorporate the teaching of Glorot to include rectifying nonlinearities as alternatives to the hyperbolic tangent or sigmoid in deep artificial neural networks.
One of ordinary skill in the art would have been motivated to make this modification in order to apply an “L1 regularizer on the activation values to promote sparsity and prevent potential numerical problems with unbounded activation” as disclosed by Glorot (pg. 2 left col second paragraph).
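The sparsity effect relied on in the Glorot citation above can be illustrated with a minimal sketch. It assumes weights and inputs drawn uniformly from a symmetric interval, which is an illustrative choice and not necessarily the initialization used by Glorot or by the claimed engines; the names relu and zero_fraction are hypothetical.

```python
import random

def relu(x):
    """Rectifier: negative pre-activations become exact zeros."""
    return x if x > 0.0 else 0.0

def zero_fraction(num_units=2000, num_inputs=32, seed=0):
    """Fraction of hidden units whose rectified output is exactly zero.

    Assumption: weights and inputs are uniform on [-1, 1] (illustrative only).
    """
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(num_inputs)]
    zeros = 0
    for _ in range(num_units):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(num_inputs)]
        pre_activation = sum(w * x for w, x in zip(weights, inputs))
        if relu(pre_activation) == 0.0:
            zeros += 1
    return zeros / num_units

fraction = zero_fraction()
```

Because the pre-activations are symmetric about zero under this assumption, roughly half of the units are expected to output an exact zero, which is the kind of sparsity a downstream engine could exploit by skipping zero operands.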

DM2\7720785 3 Atty. Docket No.: R2197-03102
Regarding claim 22
	Claim 22 recites analogous limitations to independent claim 1 and therefore is rejected on the same ground as independent claim 1. 

Regarding claim 2
Gokhale in view of Liu with Mahale and Glorot teaches claim 1.
Gokhale further teaches wherein: the DLP is configured to multiplex the data prefetched from the OSM and/or the external memory (Examiner notes that Gokhale teaches retrieving data[corresponds to fetching] and subsequently storing it in a buffer in that way it is transferred for later use. See pg. 20 “Fig. 4.2 Components involved in performing a DMA[corresponds to main memory] transaction. Data is stored in memory by the host processor. From here, a DMA transaction is initiated by the host. The DMA engines, which are soft IP and are implemented in the programmable logic, receive data from memory and store it in a buffer. From here, data is transferred to the coprocessor which is also in the programmable logic.”) resources to each of the tensor engines via a crossbar. (Pg. 12 Fig. 3.2 “A collection comprises a router and three operators. The router has “all-to-all” connections forming a crossbar switch. The configuration bus forms the select line for the mux-demux combination.”)

Regarding claim 3
Gokhale in view of Liu with Mahale and Glorot teaches claim 2.
Gokhale further teaches wherein: each of the plurality of tensor engines (Fig. 3.1, “The coprocessor has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity” since the convolution operations are two-dimensional array/matrix operation this corresponds to 2D tensor) further includes a programmable CPU having its own instruction RAM and data RAM configured to store instructions from a host and the retrieved data from the OSM and/or the external memory resources, respectively. (Fig. 3.1, “The coprocessor has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity” also the ARM Cortex has separate instruction RAM and data RAM configured to store instructions from the host, and the host runs the Linux operating system; see section 3.1 “Two ARM Cortex-A9 CPUs function as the host processor for the nn-X implementation described here. The host runs the Linux operating system.”)
Regarding claim 4
Gokhale in view of Liu with Mahale and Glorot teaches claim 3.
Gokhale further teaches wherein: the DLP is configured to accept a plurality of instructions from the host and submit the instructions to the tensor engines (Pg. 11 section 3.1 “Two ARM Cortex-A9 CPUs function as the host processor for the nn-X implementation described here. The host runs the Linux operating system. The processor is responsible for parsing a network, compiling it into configuration instructions for the coprocessor[corresponds to tensor engines] and processing operations that are not implemented on the coprocessor.”) and their respective components in the DLP via a DLP interface, (pg. 10 last paragraph “The nn-X coprocessor is implemented on the embedded programmable logic and interfaces with the host via the AXI bus.”)
wherein the instructions are stored in the instruction RAM of the tensor engines. (Pg. 11 section 3.1 “Two ARM Cortex-A9 CPUs function as the host processor for the nn-X implementation described here. The host runs the Linux operating system. The processor is responsible for parsing a network, compiling it into configuration instructions for the coprocessor and processing operations that are not implemented on the coprocessor. The compiled instructions are stored in memory.”)

Regarding claim 5
Gokhale in view of Liu with Mahale and Glorot teaches claim 3. 
Gokhale further teaches wherein: the DLP is also configured to provide the deep learning processing results by the DLP back to the host via the DLP interface. (Pg. 8 “the architecture could be designed as a coprocessor which does not interpret the results but simply sends them out to a host processor which is responsible for interpreting results and taking appropriate actions”)

Regarding claim 6
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Mahale further teaches wherein: the configuration of the neural network is dynamically adjusted based on current deep learning application of the DLP. (Abstract “For real-time on-line learning, update[corresponds to adjusted] of synaptic weights is done using an existing Incremental Pseudo Inverse (IPI) algorithm in the place of compute intensive pseudo inverse algorithm.”)
Gokhale, Liu, Glorot and Mahale are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Glorot to incorporate the teaching of Mahale to include vector operations in floating point. 
One of ordinary skill in the art would have been motivated to make this modification in order to reduce computation complexity during update of synaptic weights as disclosed by Mahale (pg. 396 left col).

Regarding claim 7
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Gokhale further teaches wherein: the neural network includes a plurality of layers each having a plurality of neurons connecting to neurons on a neighboring layer, (abstract “ConvNets consist of multiple layers that contain groups of artificial neurons, which are mathematical approximations of biological neurons. A ConvNet can consist of millions of neurons and require billions of computations to produce one output.” Also see pg. 2 section 1.1 “A ConvNet comprises several convolution layers. These are followed by a classifier that classifies the outputs of the convolutional layers as one of the multiple objects it is trained on. A typical ConvNet is shown in Figure 1.1. Each convolution layer is followed by a pooling operation and a non-linearity.”)
wherein data processed progresses from one layer to the next in sequence along a processing pipeline. (pg. 13 section 3.2.1 “Output data can then be routed to other operators in the same collection to perform cascaded pipelined sequences of operations. It can also be sent to a neighboring collection to be combined with the output of a convolution performed there.”)

Regarding claim 23
	Claim 23 recites analogous limitations to claim 7 and therefore is rejected on the same ground as claim 7. 


Regarding claim 9
Gokhale in view of Liu with Mahale and Glorot teaches claim 7. 
Gokhale further teaches wherein: the neural network utilized for convolution operations has three types of layers: (Fig. 1.1 shows convolutional neural network of 3 layers) one or more convolutional layers, (pg. 2 section 1.1 “A ConvNet comprises several convolution layers. These are followed by a classifier that classifies the outputs of the convolutional layers as one of the multiple objects it is trained on. A typical ConvNet is shown in Figure 1.1”) each of which is configured to apply one or more local filters and/or a non-linear activation function to data from the input layer, (Fig. 1.1 “Architecture of a typical convolutional neural network for object recognition: a convolutional feature extractor followed by a classifier (like a multi-layer perceptron) for generic multi-class object recognition.”)
one or more sub-sampling layers, each of which is configured to aggregate information amongst a set of neighbors of a neuron of the layer; (Examiner notes that each layer includes sub-sampling in the form of convolution and max-pool layers; see Fig. 1.1 “Architecture of a typical convolutional neural network for object recognition: a convolutional feature extractor followed by a classifier (like a multi-layer perceptron) for generic multi-class object recognition.”)
one or more classification layers, (pg. 2 section 1.1 “A ConvNet comprises several convolution layers. These are followed by a classifier that classifies the outputs of the convolutional layers as one of the multiple objects it is trained on. A typical ConvNet is shown in Figure 1.1”)
(Fig. 1.1 “Architecture of a typical convolutional neural network for object recognition: a convolutional feature extractor followed by a classifier (like a multi-layer perceptron) for generic multi-class object recognition.”)
and apply a non-linear activation function to output from the neuron. (Pg. 3 section 1.1.2 “The non-linearity is used as an activation function to convert the input space into linearly separable output spaces. This operator acts to model the rate of firing of the action potential in a biological neuron.”)

Regarding claim 25
	Claim 25 recites analogous limitations to claim 9 and therefore is rejected on the same ground as claim 9. 

Regarding claim 10 
Gokhale in view of Liu with Mahale and Glorot teaches claim 9. 
Gokhale further teaches wherein: one or more kernels are applied to source pixels in an image for image classification, (pg. 5 second col “For a layer with 9x9 kernels, 3 input feature maps, 96 output feature maps and an input image of 224x224, 2.17 billion operations are required to produce the 96 outputs.”)
wherein a center element of each kernel is placed over a source pixel to replace the source pixel with a weighted sum of the source pixel and its neighboring pixels. (Pg. 13 section 3.2.1 “The convolution engine is implemented as fully pipelined logic and uses memory to cache incoming data. This cache is needed for pipelined implementation of the convolution operation [14]. For a row of width W and a k x k convolution filter, the size of this cache is W x k x 2 bytes. The convolution engine can also perform pooling operations. The kernel can be used to implement a smooth pooling function (for example, Gaussian) or perform a running average of pixels or data words (with a uniform kernel).”)
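The kernel operation mapped in claim 10 can be sketched as follows. This is a minimal illustration and not Gokhale's pipelined hardware: it assumes no padding and a stride of 1, applies the kernel without flipping it (which makes no difference for the uniform kernel used here), and counts a multiply and an add as two separate operations. The function names are hypothetical.

```python
def convolve_valid(image, kernel):
    """Slide a k x k kernel over the image. Each output value is a weighted
    sum of the source pixel under the kernel center and its neighbors."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out

def layer_operations(size, k, in_maps, out_maps):
    """Operation count for one layer (assumes no padding, stride 1,
    and two operations per multiply-accumulate)."""
    out_size = size - k + 1
    macs = out_size * out_size * k * k * in_maps * out_maps
    return 2 * macs

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
uniform = [[1 / 9] * 3 for _ in range(3)]  # running-average (uniform) kernel
averaged = convolve_valid(image, uniform)
ops = layer_operations(224, 9, 3, 96)
```

Under these assumptions, layer_operations(224, 9, 3, 96) evaluates to 2,176,782,336, which is in line with the roughly 2.17 billion operations quoted from Gokhale for a layer with 9x9 kernels, 3 input feature maps, 96 output feature maps and a 224x224 input.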

Regarding claim 12
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Gokhale further teaches wherein: the DLP (pg. 11 section 3.1 “Two ARM Cortex-A9 CPUs function as the host processor for the nn-X implementation described here.”) is configured to partition each operation for pattern classification among the plurality of tensor engines, (pg. 18 “ConvNets process the red, green and blue channels of an image separately. Most images from cameras or those stored on disk are compressed using the JPEG standard. Decompressed JPEG images have 24 bits per pixel, 8 bits each for the red, green and blue (RGB) channels. This allows us to decompress the image into red, green and blue image streams and store each channel as a separate array in memory. Each channel is then sent in to the accelerator independently. This is shown in Figure 4.1. The figure shows a pictorial representation of the state the memory is in before a DMA transaction. The JPEG frame is stored in system memory. It is then separated into its three channels and stored in the DMA buffers.”)
(Fig. 3.1, “The coprocessor has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity” since the convolution operations are two-dimensional array/matrix operation this corresponds to 2D tensor see section 3.2.1 “The convolution engine is implemented as fully pipelined logic and uses memory to cache incoming data. This cache is needed for pipelined implementation of the convolution operation [14]. For a row of width W and a k x k convolution filter, the size of this cache is W x k x 2 bytes”) is configured to perform a sub-task of the operation in parallel. (Examiner notes that a sub-task corresponds to each task at each layer in the convolution; see pg. 34 “One advantage is nn-X’s large parallelism; eight convolutional engines of 10x10 can deliver up to 227 G-ops/s while running at 142MHz.” also see pg. 13 “Convolution is inherently parallel and can be accelerated on data parallel architectures. The operation was introduced in Chapter 1, Section 1.1”)
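The parallelism figure quoted from Gokhale (pg. 34) can be checked with a short calculation. This is one plausible reading of the figure, not a statement of how Gokhale derived it: it assumes each engine completes one multiply-accumulate per kernel element per clock cycle and that each multiply-accumulate counts as two operations. The function name peak_gops is hypothetical.

```python
def peak_gops(engines, kernel_side, clock_hz, ops_per_mac=2):
    """Peak throughput in G-ops/s (assumes one MAC per kernel element
    per cycle in every engine, and ops_per_mac operations per MAC)."""
    macs_per_cycle = engines * kernel_side * kernel_side
    return macs_per_cycle * ops_per_mac * clock_hz / 1e9

# Eight 10x10 convolution engines clocked at 142 MHz.
peak = peak_gops(engines=8, kernel_side=10, clock_hz=142e6)
```

Under these assumptions the result is about 227.2 G-ops/s, consistent with the "up to 227 G-ops/s" quoted above.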
Regarding claim 27
	Claim 27 recites analogous limitations to claim 12 and therefore is rejected on the same ground as claim 12. 
Regarding claim 13
Gokhale in view of Liu with Mahale and Glorot teaches claim 12.
Gokhale further teaches wherein: the DLP is configured to replicate a sub-task among multiple tensor engines (Examiner notes that a sub-task corresponds to each task at each layer and the tasks are being replicated see pg. 2 section 1.1 “A ConvNet comprises several convolution layers. These are followed by a classifier that classifies the outputs of the convolutional layers as one of the multiple objects it is trained on. A typical ConvNet is shown in Figure 1.1. Each convolution layer is followed by a pooling operation and a non-linearity. In this work, we consider a layer of a ConvNet to comprise all three of the above mentioned operators while a convolution layer consists of only the convolution operator. Inputs of a layer are typically images (or frames from a video) but can be any kind of locally correlated data, like audio signals. The outputs of one layer act as inputs to the next. The inputs (and by extension, the outputs) are called feature maps.” Also see Pg. 34 “One advantage is nn-X’s large parallelism; eight convolutional engines of 10x10 can deliver up to 227 G-ops/s while running at 142MHz.”) or move a sub-task from one tensor engine to another for efficient use of compute resources.  

Regarding claim 28
 	Claim 28 recites analogous limitations to claim 13 and therefore is rejected on the same ground as claim 13. 

Regarding claim 14
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Mahale further teaches wherein: each of the vector floating point units is a simplified arithmetic-logic unit (ALU) that handles vector operations only and does not handle loops, branches, and branch predictions. (Examiner notes that there is no mention of the coprocessor handling loops, branches and branch prediction; therefore the floating point unit from Mahale is sufficient, see pg. 395 first paragraph “As we target VOP to support large set of applications involving vector operations, we introduce support for floating point data. We replace the fixed point units with pipe-lined floating point units.”)
Gokhale, Liu, Glorot and Mahale are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Glorot with Liu to incorporate the teaching of Mahale to include vector operations in floating point. 
One of ordinary skill in the art would have been motivated to make this modification in order to reduce computation complexity during update of synaptic weights as disclosed by Mahale (pg. 396 left col).

Regarding claim 15
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Liu teaches wherein: each of the matrix multiplier engines is configured to perform one or more of: multiplication between a dense vector or matrix and a dense matrix, (pg. 809 section 4.1 “We propose an efficient, sparse-dense matrix multiplication algorithm for executing the sparse convolutional kernels” also see pg. 810 section 4.3 “We focus on the sparse-dense matrix multiplication problem C = A × B. A ∈ R m×k is a dense matrix and B ∈ R k×n is a fixed sparse matrix.”)
(pg. 809 section 4.1 “We propose an efficient, sparse-dense matrix multiplication algorithm for executing the sparse convolutional kernels” also see pg. 810 section 4.3 “We focus on the sparse-dense matrix multiplication problem C = A × B. A ∈ R m×k is a dense matrix and B ∈ R k×n is a fixed sparse matrix.”)
and multiplication between a sparse vector and a sparse matrix, (Examiner notes that matrix is a vector pg. 809 section 4.1 “We propose an efficient, sparse-dense matrix multiplication algorithm for executing the sparse convolutional kernels” also see pg. 810 section 4.3 “We focus on the sparse-dense matrix multiplication problem C = A × B. A ∈ R m×k is a dense matrix and B ∈ R k×n is a fixed sparse matrix.”)
wherein a sparse vector or matrix has more zero elements than nonzero elements, while a dense vector or matrix has more nonzero elements than zero elements. (Examiner notes that it is well understood in the art that a sparse vector/matrix means most of the elements are zero and a dense matrix means most of the elements are nonzero; see FIG. 3 “An example sparse matrix B. The shadowed squares represent non-zero elements and the blank squares represent zero elements. Figure 3 an example that illustrates how our algorithm generates code for multiplying a dense matrix and a sparse matrix”)
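The sparse-dense product C = A × B cited from Liu, and the sparse/dense distinction recited in claim 15, can be sketched as below. This is a simplified illustration that stores only the nonzero entries of B and skips multiplications by zero; it is not Liu's code-generation algorithm, and the function names are hypothetical.

```python
def is_sparse(matrix):
    """Sparse in the sense recited above: more zero than nonzero elements."""
    values = [v for row in matrix for v in row]
    zeros = sum(1 for v in values if v == 0)
    return zeros > len(values) - zeros

def to_nonzero_lists(b):
    """Per-row list of (column, value) pairs for the nonzero entries of B."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in b]

def dense_times_sparse(a, b):
    """C = A x B with A dense (m x k) and B sparse (k x n).
    Only nonzero entries of B take part in a multiplication."""
    m, n = len(a), len(b[0])
    b_nonzero = to_nonzero_lists(b)
    c = [[0] * n for _ in range(m)]
    multiplies = 0
    for i in range(m):
        for p, entries in enumerate(b_nonzero):
            a_ip = a[i][p]
            for j, value in entries:
                c[i][j] += a_ip * value
                multiplies += 1
    return c, multiplies

a = [[1, 2, 3], [4, 5, 6]]                      # dense, 2 x 3
b = [[0, 0, 7, 0], [0, 8, 0, 0], [0, 0, 0, 9]]  # sparse, 3 x 4
c, multiplies = dense_times_sparse(a, b)
```

In this small example the product needs 6 multiplications instead of the 24 a dense routine would perform, which illustrates the reduction in computations on zero values that the rejection attributes to Liu.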
Regarding claim 29
Claim 29 recites analogous limitations to claim 15 and therefore is rejected on the same ground as claim 15.

Regarding claim 16 
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Gokhale further teaches wherein: each of the matrix multiplier engines (processors 104 see FIG. 1) is configured to reduce the number of times the input data and a weight matrix need to be read at each layer of the neural network (pg. 4 second paragraph “A second purpose of the pooling operator is to reduce the size of the feature maps for subsequent layers. The number of feature maps increases as we go deeper into the hidden layers. Smaller feature maps reduce the computational complexity.”)
 and wherein the number of times the output matrix needs to be written at each layer of the neural network is once. (Pg. 2 section 1.1 “A ConvNet comprises several convolution layers. These are followed by a classifier that classifies the outputs of the convolutional layers as one of the multiple objects it is trained on. A typical ConvNet is shown in Figure 1.1. Each convolution layer is followed by a pooling operation and a non-linearity. In this work, we consider a layer of a ConvNet to comprise all three of the above mentioned operators while a convolution layer consists of only the convolution operator.” Also see Pg. 5 first paragraph “The kernels can be of any size but are typically between 7 x 7 and 11 x 11 for the first layer. The subsequent layers have smaller kernels but this is not necessary. It is typically done to reduce both training and processing times.”)



Regarding claim 30
Claim 30 recites analogous limitations to claim 16 and therefore is rejected on the same ground as claim 16.

Regarding claim 18
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Gokhale further teaches wherein: the tensor engine (Fig. 3.1, “The coprocessor[corresponds to tensor engine] has three main components: processing elements called collections, a system bus called the memory router and a configuration bus to control flow of data. Collections perform the most typical operations in ConvNets: convolutions, pooling and non-linearity”) is configured to reuse data in the memory across one or more of the convolutional network engines efficiently to reduce data movement (pg. 4 section 1.1.3 “A second purpose of the pooling operator is to reduce the size of the feature maps for subsequent layers. The number of feature maps increases as we go deeper into the hidden layers. Smaller feature maps reduce the computational complexity.”)
 for read and/or write to memory during the convolution operations. (Pg. 36 “The N = 1 case, as demonstrated in figure 6.1(a), occurs when the network has one input stream and M filters. This is typically found in the first layer of a ConvNet when using a greyscale input image. Such a layer would produce M outputs and no intermediate results. In nn-X, one input stream is routed to multiple collections by the memory router. It is then processed by the collection’s operators before being routed back to the memory router as one output of the current layer. The memory router then sends this output to memory where it awaits its turn to be sent back in as an input to the next layer. This process is repeated with different kernels in different collections to produce multiple outputs in parallel as the input is routed to all eight collections at the same time.”)

Regarding claim 32
 Claim 32 recites analogous limitations to claim 18 and therefore is rejected on the same ground as claim 18.

Regarding claim 20
Gokhale in view of Liu with Mahale and Glorot teaches claim 18. 
Gokhale further teaches wherein: each of the convolutional network engines (Fig. 3.1, “A block diagram of the nn-X system. nn-X is composed of a coprocessor, a host processor and external memory” pg. 10 section 3 “In this implementation, the coprocessor is on-chip”) is configured to apply different kernels to the same portion of the input data at each layer of the neural network, (Pg. 36 “The N = 1 case, as demonstrated in figure 6.1(a), occurs when the network has one input stream and M filters. This is typically found in the first layer of a ConvNet when using a greyscale input image. Such a layer would produce M outputs and no intermediate results. In nn-X, one input stream is routed to multiple collections by the memory router. It is then processed by the collection’s operators before being routed back to the memory router as one output of the current layer. The memory router then sends this output to memory where it awaits its turn to be sent back in as an input to the next layer. This process is repeated with different kernels in different collections to produce multiple outputs in parallel as the input is routed to all eight collections at the same time.”)
wherein that specific portion of the input data has already been loaded into the memory and does not need to be reloaded again during the convolution operations. (Pg. 11 section 3.1 “The processor is responsible for parsing a network, compiling it into configuration instructions for the coprocessor and processing operations that are not implemented on the coprocessor. The compiled instructions are stored in memory. These instructions are created at compile time and do not change throughout the program.”)
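For illustration only, the data-reuse pattern recited in claim 20 (several different kernels applied to one portion of the input that is loaded a single time) might be sketched as follows. This sketch is not taken from Gokhale or from the application; the names `load_patch`, `dot2d` and `apply_kernels`, and the load counter, are hypothetical and are used only to make the number of loads visible.

```python
# Illustrative sketch only; every name below is hypothetical.

def load_patch(image, top, left, k, counter):
    """Copy a k x k patch out of `image` and record that one load occurred."""
    counter["loads"] += 1
    return [row[left:left + k] for row in image[top:top + k]]


def dot2d(patch, kernel):
    """Weighted sum of a patch with a kernel of the same size."""
    return sum(p * w
               for prow, krow in zip(patch, kernel)
               for p, w in zip(prow, krow))


def apply_kernels(image, kernels, top, left):
    """Load one patch once, then apply every kernel to that same patch."""
    counter = {"loads": 0}
    k = len(kernels[0])
    patch = load_patch(image, top, left, k, counter)
    outputs = [dot2d(patch, kern) for kern in kernels]
    return outputs, counter["loads"]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of the patch
    [[0, 1], [1, 0]],  # sums the anti-diagonal of the patch
    [[1, 1], [1, 1]],  # sums the whole patch
]
outputs, loads = apply_kernels(image, kernels, top=0, left=0)
print(outputs, loads)
```

In this sketch the patch is fetched once and three outputs are produced from it, which is the behavior the limitation describes in software terms; it makes no statement about how any cited hardware is built.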

Regarding claim 34
 Claim 34 recites analogous limitations to claim 20 and therefore is rejected on the same ground as claim 20.

Regarding claim 38
Gokhale in view of Liu with Mahale and Glorot teaches claim 16.
Gokhale further teaches wherein the input data is a portion of an image, and wherein a size of the input data matches an input width of each hardware engine. (Pg. 15 second paragraph “The advantage of this implementation is that it requires a very small amount of memory to compute the maximum over a 2D region. In fact, the total memory required is equal to W, the maximum width of the input image.” Examiner interprets the memory router as a hardware engine since the memory router is a physical hardware device.)

Claims 8 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Gokhale (“Nn-X- a hardware accelerator for convolutional neural networks”) in view of Liu et al. in view of Mahale et al. in view of Glorot et al. and further in view of Han et al. (“Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman coding” hereinafter: Han).
Regarding claim 8
Gokhale in view of Liu with Mahale and Glorot teaches claim 7. 
Liu further teaches and/or the matrices to be multiplied by the matrix multiplier engines (abstract “We also propose an efficient sparse matrix multiplication algorithm on CPU for Sparse Convolutional Neural Networks (SCNN) models.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Mahale with Glorot to incorporate the teaching of Liu to include an efficient sparse matrix multiplication algorithm. 
One of ordinary skill in the art would have been motivated to make this modification in order to reduce the amount of computation required to process images, by sparse decompositions of the convolutional kernels as disclosed by Liu (pg. 807 left col first paragraph).
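Gokhale in view of Liu with Mahale and Glorot does not teach wherein: the DLP is configured to trim the neural network by pruning the neurons at each layer of the neural network as well as edges connecting the neurons of different layers to create a compact neural network while maintaining accuracy of the neural network to reduce size of the vectors.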
Han teaches wherein: the DLP is configured to trim the neural network by pruning the neurons at each layer of the neural network (pg. 2 section 2 “Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9× and 13× for AlexNet and VGG-16 model.”) as well as edges connecting the neurons of different layers to create a compact neural network while maintaining accuracy of the neural network to reduce size of the vectors. (pg. 5 section 5 “We pruned, quantized, and Huffman encoded four networks: two on MNIST and two on ImageNet data-sets. The network parameters and accuracy before and after pruning are shown in Table 1. The compression pipeline saves network storage by 35× to 49× across different networks without loss of accuracy.”)
Gokhale, Liu, Mahale, Glorot and Han are analogous art because they are all directed to convolutional neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Han to include a neural network compression system that reduces storage while sustaining accuracy by using pruning and trained quantization to compress the network, as disclosed by Han (pg. 2).
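For illustration only, pruning of the kind discussed above can be sketched as removing connections whose weights are small in magnitude. The sketch assumes a simple fixed magnitude threshold and omits the retraining step mentioned in the quoted passage; the function names and the threshold value are hypothetical and are not taken from Han or from the application.

```python
# Illustrative sketch only; assumes a fixed magnitude threshold and no retraining.

def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]


def count_nonzero(weights):
    """Count the connections that survive pruning."""
    return sum(1 for row in weights for w in row if w != 0.0)


# A hypothetical 2 x 3 weight matrix for one layer.
weights = [[0.9, -0.05, 0.3],
           [0.01, -0.7, 0.02]]

pruned = prune_by_magnitude(weights, threshold=0.1)
remaining = count_nonzero(pruned)
print(pruned, remaining)
```

Here three of the six connections fall below the threshold and are removed, leaving a sparser layer with fewer parameters to store.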
Regarding claim 24
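Claim 24 recites analogous limitations to claim 8 and therefore is rejected on the same ground as claim 8.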

Claims 11 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Gokhale (“Nn-X- a hardware accelerator for convolutional neural networks”) in view of Liu et al. in view of Mahale et al. in view of Glorot et al. and further in view of Ross et al. (US Pat. No. 10438117 A1).
Regarding claim 11
Gokhale in view of Liu with Mahale and Glorot teaches claim 10. 
Gokhale further teaches wherein: each kernel (pg. 27 Fig. 5.1 “Performance of the nn-X coprocessor with increasing collections. The input was a 500x500 image to a 8 collections, each featuring 10x10 disparate convolution kernels.”) is a multi-dimensional matrix having its own values for elements in the matrix, (pg. 40 section 7.2 “The architecture comprises the three main mathematical operators used in ConvNets - the two dimensional convolution operator, the maxpooling operator and the non-linear operator.”)
Gokhale in view of Liu with Mahale and Glorot does not teach wherein the dimensions represent (x, y, time) coordinates as well as depth of the elements of the kernel.  
Ross teaches wherein the dimensions represent (x, y, time) coordinates (Col 8 lines 35-40 “the spatial dimensions correspond to a space or position of a set of activation inputs. For example, if the neural network is processing an image, which has two dimensions, the matrix structures can have two spatial dimensions, which correspond to spatial coordinates, i.e., XY coordinates, of the image.”) as well as depth of the elements of the kernel. (Col 4 lines 26-30 “Each kernel includes a set of weight inputs, which when applied to activation inputs of the layer, can cause activation values to be generated, which can be used to generate an output for the layer.”)
Gokhale, Liu, Mahale, Glorot and Ross are analogous art because they are all directed to neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Ross to include a neural network processor that can flatten convolutions, which enables multiple convolution calculations to be computed in fewer clock cycles, as disclosed by Ross (col 3 lines 1-10).

Regarding claim 26
Gokhale in view of Liu with Mahale and Glorot teaches claim 25.
Gokhale further teaches the method further comprising: applying one or more kernels to source pixels in an image for image classification, (pg. 5 second col “For a layer with 9x9 kernels, 3 input feature maps, 96 output feature maps and an input image of 224x224, 2.17 billion operations are required to produce the 96 outputs.”)
wherein the center element of each kernel is placed over a source pixel to replace the source pixel with a weighted sum of itself and its neighboring pixels (Pg. 13 section 3.2.1 “The convolution engine is implemented as fully pipelined logic and uses memory to cache incoming data. This cache is needed for pipelined implementation of the convolution operation [14]. For a row of width W and a k x k convolution filter, the size of this cache is W x k x 2 bytes. The convolution engine can also perform pooling operations. The kernel can be used to implement a smooth pooling function (for example, Gaussian) or perform a running average of pixels or data words (with a uniform kernel).”) and each kernel is a multi-dimensional matrix having its own values for elements in the matrix. (Pg. 40 section 7.2 “The architecture comprises the three main mathematical operators used in ConvNets - the two dimensional convolution operator, the maxpooling operator and the non-linear operator.”)
Gokhale in view of Liu with Mahale and Glorot does not teach wherein the dimensions represent (x, y, time) coordinates as well as depth of the elements of the kernel.  
Ross teaches wherein the dimensions represent (x, y, time) coordinates (Col 8 lines 35-40 “the spatial dimensions correspond to a space or position of a set of activation inputs. For example, if the neural network is processing an image, which has two dimensions, the matrix structures can have two spatial dimensions, which correspond to spatial coordinates, i.e., XY coordinates, of the image.”) as well as depth of the elements of the kernel. (Col 4 lines 26-30 “Each kernel includes a set of weight inputs, which when applied to activation inputs of the layer, can cause activation values to be generated, which can be used to generate an output for the layer.”)
Gokhale, Liu, Mahale, Glorot and Ross are analogous art because they are all directed to neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Ross to include a neural network processor that can flatten convolutions, which enables multiple convolution calculations to be computed in fewer clock cycles, as disclosed by Ross (col 3 lines 1-10).
Claims 17, 19, 21, 31, 33 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Gokhale (“Nn-X- a hardware accelerator for convolutional neural networks”, hereinafter: Gokhale) in view of Liu et al. (“Sparse Convolutional Neural Networks”, hereinafter: Liu) in view of Mahale et al. (“VOP: Architecture of a Processor for Vector Operations in On-line Learning of Neural Networks”, hereinafter: Mahale) in view of Glorot et al. and further in view of Shoaib et al. (US 2017/0132496 A1).
Regarding claim 17 (Currently Amended)
Gokhale in view of Liu with Mahale and Glorot teaches claim 1. 
Gokhale in view of Liu with Mahale and Glorot does not teach wherein: each of the matrix multiplier engines is configured to reduce data movement associated with multiplication involving a sparse vector, wherein only data that corresponds to non-zero values in the sparse vector is loaded into the memory of the tensor engine upon request.
Shoaib teaches wherein: each of the matrix multiplier engines (processors 104 see FIG. 1) is configured to reduce data movement associated with multiplication involving a sparse vector, (para [0071] “In process block 606, a sparse, frequency domain representation of a convolutional weighting kernel is determined. The sparse, frequency domain representation comprises one or more sparse matrices and a dense matrix.” Also see para [0035] “The sparse matrix or matrices multiplied by the dense matrix is equal to the initial data matrix. Determining a sparse representation is also referred to as sparse matrix decomposition.”)
wherein only data that corresponds to non-zero values in the sparse vector is loaded into the memory of the tensor engine upon request. (Para [0045] “A second memory 218 is configured to store coefficients 220 for fully connected layers and/or the dense matrix 222[corresponds to non-zero values] of the sparse, frequency domain representation.”) 
Gokhale, Liu, Mahale, Glorot and Shoaib are analogous art because they are all directed to neural networks.  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Shoaib to include a plurality of convolution operations by exploring sparsity of vectors and/or matrices in order to reduce memory and computational requirements by representing the convolutional weighting kernel as a sparse, frequency domain representation (e.g., one or more sparse matrices and a dense matrix), as disclosed by Shoaib (para [0018]). 
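For illustration only, the limitation of claim 17 (loading only the data that corresponds to non-zero values of a sparse vector) can be sketched as a vector-matrix product that fetches a matrix row only when the matching vector entry is non-zero. This is a generic sketch of the claimed behavior, not the frequency domain decomposition described by Shoaib; the names `sparse_vector_matrix` and `fetch_row`, and the `fetched` log, are hypothetical.

```python
# Illustrative sketch only; every name below is hypothetical.

def sparse_vector_matrix(sparse_vec, fetch_row, n_cols):
    """Compute v @ W, fetching row i of W only if v[i] is non-zero.

    sparse_vec: dict mapping index -> non-zero value of the vector v
    fetch_row:  callable returning row i of the matrix W
    n_cols:     number of columns of W
    """
    result = [0.0] * n_cols
    for i, v in sparse_vec.items():
        row = fetch_row(i)  # the only place matrix data is loaded
        for j in range(n_cols):
            result[j] += v * row[j]
    return result


W = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0],
     [7.0, 8.0]]

fetched = []  # records which rows were actually loaded


def fetch_row(i):
    fetched.append(i)
    return W[i]


v = {1: 2.0, 3: -1.0}  # dense form would be [0, 2, 0, -1]
y = sparse_vector_matrix(v, fetch_row, n_cols=2)
print(y, fetched)
```

Only rows 1 and 3 of the matrix are requested, so the two rows paired with zero entries of the vector are never moved.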

Regarding claim 31
 Claim 31 recites analogous limitations to claim 17 and therefore is rejected on the same ground as claim 17.
Regarding claim 19
Gokhale in view of Liu with Mahale and Glorot teaches claim 18. 
Gokhale in view of Liu with Mahale and Glorot does not teach wherein: each of the convolutional network engines is configured to keep and repeatedly apply a same kernel on different parts of the input data at each layer of the neural network wherein the kernel is loaded into the memory only once during the convolution operations.
Shoaib teaches wherein: each of the convolutional network engines (processors 104 see FIG. 1) is configured to keep and repeatedly apply a same kernel on different parts of the input data at each layer of the neural network wherein the kernel is loaded into the memory only once during the convolution operations. (Examiner notes that the same kernel is repeatedly used at each of the different layers; see para [0071] “In process block 606, a sparse, frequency domain representation of a convolutional weighting kernel is determined. The sparse, frequency domain representation comprises one or more sparse matrices and a dense matrix. In a plurality of convolutional layers of a deep convolutional neural network, in process block 608, the input image is processed based on the frequency domain representation of the input image, the one or more sparse matrices, and a frequency domain nonlinear function. In a plurality of fully connected layers of the deep convolutional neural network, in process block 610, the input image is processed based on an output of the plurality of convolutional layers.”)  
Gokhale, Liu, Mahale, Glorot and Shoaib are analogous art because they are all directed to neural networks.  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Shoaib to include a plurality of convolution operations by exploring sparsity of vectors and/or matrices in order to reduce memory and computational requirements by representing the convolutional weighting kernel as a sparse, frequency domain representation (e.g., one or more sparse matrices and a dense matrix), as disclosed by Shoaib (para [0018]). 
Regarding claim 33
 Claim 33 recites analogous limitations to claim 19 and therefore is rejected on the same ground as claim 19.


Regarding claim 21
Gokhale in view of Liu with Mahale and Glorot teaches claim 18. 
Gokhale in view of Liu with Mahale and Glorot does not teach wherein: each of the convolutional network engines is configured to save and reuse convolution output by a kernel on an overlapping part of two portions of the input data without calculating the output again at each convolution layer when the kernel is applied to the two portions of the data in stride and the two data portions overlap.
Shoaib teaches wherein: each of the convolutional network engines (processors 104 see FIG. 1) is configured to save and reuse convolution output by a kernel on an overlapping part of two portions of the input data without calculating the output again at each convolution layer when the kernel is applied to the two portions of the data in stride and the two data portions overlap. (para [0028] “During a 3D convolution weighting stage, a 3D input volume of pixels of dimensionality NxNxD are convolved with H kernels of dimension kxkxD and a stride S (linear step offset). Each 3D kernel is shifted in a sliding-window-like fashion with a stride across the input Volume. During each shift, every weight belonging to the 3D kernel can be multiplied and added with every pair-wise input element from the overlapping region of the 3D input volume.”)
Gokhale, Liu, Mahale, Glorot and Shoaib are analogous art because they are all directed to neural networks.  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale and Glorot to incorporate the teaching of Shoaib to include a plurality of convolution operations by exploring sparsity of vectors and/or matrices in order to reduce memory and computational requirements by representing the convolutional weighting kernel as a sparse, frequency domain representation (e.g., one or more sparse matrices and a dense matrix), as disclosed by Shoaib (para [0018]). 

Regarding claim 35
 Claim 35 recites analogous limitations to claim 21 and therefore is rejected on the same ground as claim 21.

Claim 36 is rejected under 35 U.S.C. 103 as being unpatentable over Gokhale (“Nn-X- a hardware accelerator for convolutional neural networks”) in view of Liu et al. in view of Mahale et al. in view of Glorot et al. (“Deep Sparse Rectifier Neural Networks”, hereinafter: Glorot) and further in view of Tanomoto et al. (“A CGRA-based Approach for Accelerating Convolutional Neural Networks”, hereinafter: Tanomoto).
Regarding claim 36 (Currently Amended)
Gokhale in view of Liu with Mahale and Glorot teaches claim 16.
Gokhale in view of Liu with Mahale and Glorot does not teach wherein in a vector-matrix multiplication operation the vector is read only once by reusing data. 
Tanomoto teaches wherein in a vector-matrix multiplication operation the vector is read only once by reusing data. (pg. 74 right col “Fig.2 and Fig.3 show the structures of forward propagation in convolution layer and full-connected layer, respectively. The computation pattern of full-connected layer is very simple: multiple vector-matrix multiplications of an a data vector and a weight matrix… The convolution is calculated by an inner product of two vectors: a vector of data sub-region and a vector of weight sub-region. Therefore, by unrolling multiple vector-vector inner products, the entire computation forms a matrix multiplication with certain data reuses” also see pg. 78 left col “Calculation of the weight error in Fig.10 has a some memory access pattern for input data (In) as the forward propagation. Therefore the input data can be effectively reused.”)
Gokhale, Liu, Mahale, Glorot and Tanomoto are analogous art because they are all directed to neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gokhale in view of Liu with Mahale with Glorot to incorporate the teaching of Tanomoto to include CNN computations 
One of ordinary skill in the art would have been motivated to make this modification in order to improve “CGRA with distributed scratchpad memory blocks for efficient temporal blocking to reduce memory bandwidth pressure” and to reduce large computation hardware requirements, as disclosed by Tanomoto (Abstract).
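For illustration only, the unrolling described in the quoted Tanomoto passage (each convolution output computed as an inner product of a data sub-region vector and a weight vector, so that the whole layer forms a matrix multiplication) can be sketched as follows. The sketch assumes stride 1, no padding, and no kernel flip; the function names `im2col` and `conv_as_matmul` are hypothetical labels for this example and are not taken from Tanomoto or from the application.

```python
# Illustrative sketch only; assumes stride 1, no padding, no kernel flip.

def im2col(image, k):
    """Unroll every k x k sub-region of `image` into one row vector."""
    h, w = len(image), len(image[0])
    rows = []
    for top in range(h - k + 1):
        for left in range(w - k + 1):
            rows.append([image[top + i][left + j]
                         for i in range(k) for j in range(k)])
    return rows


def conv_as_matmul(image, kernels):
    """Compute all kernel responses as (patch matrix) x (flattened kernels)."""
    k = len(kernels[0])
    patches = im2col(image, k)  # each data vector is built once
    flat = [[kern[i][j] for i in range(k) for j in range(k)]
            for kern in kernels]
    # Each unrolled data vector is reused against every weight vector.
    return [[sum(p * w for p, w in zip(patch, fk)) for fk in flat]
            for patch in patches]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [
    [[1, 0], [0, 1]],  # main-diagonal sum
    [[1, 1], [1, 1]],  # full patch sum
]
out = conv_as_matmul(image, kernels)
print(out)
```

Each row of `out` holds the responses of both kernels at one window position, so each unrolled data vector is read once and reused across all weight vectors.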


Allowable Subject Matter
Claim 37 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-38 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Rahman et al. (“Efficient FPGA Acceleration of Convolutional Neural Networks Using Logical-3D Compute Array”) teaches a CNN that reuses an input network that is a simple 2D mesh-like array of registers. 
Zhang et al. (“Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks”) teaches a CNN design that quantitatively analyzes its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. 

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN C MANG whose telephone number is (571)270-7598.  The examiner can normally be reached on Mon - Fri 8:00-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on 571-272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.







/ANN J LO/Supervisory Patent Examiner, Art Unit 2126