DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 13-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 13 recites the following limitations:
a subtract circuit to compute the difference between a sum(*) and a first input operator ("OpC") to yield a first intermediate operator ("Op1"); 
a pass-through second ("OpE") and third ("OpD") input operators to yield second (“Op3”) and third (“Op4”) intermediate operators respectively; [Patent Application, Attorney Docket No.: P118216-US, Inner Product Convolutional Neural Network Accelerator]
a multiply circuit to compute a product of a fourth ("OpA") and fifth ("OpB") input operator to yield a fourth intermediate input operator (“Op2”); and 
a reduction circuit to compute a reduction of Op1, Op3, Op4, and Op2.

It is unclear how one of ordinary skill in the art can compute a difference between two “operators”. For example, operators might be “+”, “-”, “=”, etc., while operands can be denoted as “1”, “2”, “a”, “b”, etc. A sample equation could be “1 + a = b”, where “1”, “a”, and “b” are operands and “+” and “=” are operators.
	For examination purposes, Examiner will interpret the operators in claim 13 as operands, an interpretation supported by Applicant’s specification in paragraphs [0113]-[0114].
	Claims 14 and 16-18 also appear to use the term “operator” incorrectly and are rejected for the same reasons. Claim 15 is a dependent claim that does not cure the deficiencies and is also rejected for the same reasons.
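For illustration only, the following minimal sketch shows how the limitations of claim 13 read once the recited "operators" are treated as operands (numeric values). It rests on assumptions not found in the claim: the name `reduce_operands` and the `ReductionInputs` container are hypothetical, the "sum(*)" is taken as an already-computed input value, and the "reduction" is assumed to be a plain addition of the four intermediate values.

```python
from dataclasses import dataclass


@dataclass
class ReductionInputs:
    """Hypothetical container for the operands recited in claim 13."""
    partial_sum: int  # stands in for the claim's "sum(*)"; its source is assumed
    op_a: int
    op_b: int
    op_c: int
    op_d: int
    op_e: int


def reduce_operands(x: ReductionInputs) -> int:
    op1 = x.partial_sum - x.op_c  # subtract circuit: sum(*) minus OpC
    op3 = x.op_e                  # pass-through of OpE
    op4 = x.op_d                  # pass-through of OpD
    op2 = x.op_a * x.op_b         # multiply circuit: OpA times OpB
    # Reduction of Op1, Op3, Op4, and Op2 (assumed here to be a sum).
    return op1 + op3 + op4 + op2
```

Under this reading every step is ordinary arithmetic on values, which is why the operand interpretation is workable while a "difference between operators" is not.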

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 19-21, and 24-25 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1) in view of Narayanaswami et al. (US-9836691-B1).
Regarding Claim 1,
Shalev (US 10489479 B1) teaches a convolutional neural network (CNN) accelerator, comprising:
a CNN circuit for performing a multiple-layer CNN computation, wherein the multiple layers are to receive an input feature according to an input feature map (IFM) and a weight matrix per output feature, wherein an output of a first layer provides an input for a next layer (Col. 1 lines 19-45; In the convolutional layers of the CNN, a three-dimensional (3D) array of input data (commonly referred to as a 3D matrix or tensor) of dimensions M×N×D is convolved with a four-dimensional tensor made up of L kernels of dimensions j×k×D and stride S. Here M and N are the dimensions of the sampling space (also referred to as the X- and Y-dimensions), for example pixels of an image, while D (also referred to herein as the Z-dimension) is the number of input feature values given for each sample. Each 3D kernel is shifted in strides of size S across the input volume. Following each shift, every weight belonging to the 3D kernel is multiplied by each corresponding input element from the overlapping region of the 3D input array, and the products are summed to create an element of a 3D output array.); and
a mapping circuit to access a three-dimensional input matrix stored as a Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively... Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72.); 
wherein the CNN circuit is to perform an inner-product direct convolution on the Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively. Data access logic 31 "lowers" these sub-tensors into 2D matrices 68 and 72, whose elements will then be broadcast by data manipulation unit 40 to processing elements 24 for multiplication and accumulation. Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72. In this manner, execution unit 22 computes the convolution of matrices 68 and 72 to generate an output matrix 74.).
Shalev does not explicitly disclose
wherein the CNN circuit is to perform an inner-product direct convolution …wherein the direct convolution lacks a lowering operation.
However, Narayanaswami (US 9836691 B1) teaches
wherein the CNN circuit is to perform an inner-product direct convolution …wherein the direct convolution lacks a lowering operation (Col. 11 lines 45-53; FIG. 4A illustrates an example input activation tensor 404, example weight tensors 406, and an example output tensor 408. FIG. 4B illustrates an example deep loop nest 402 that can be executed by processing unit 102 to perform tensor computations relating to dot product computations or matrix multiplication. In FIG. 4A, computations can include multiplication of activation tensor 404 with parameter/weight tensor 406 on one or more computation cycles to produce outputs/results in the form of output tensor 408.).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network accelerator of Shalev with the tensor convolution computations of Narayanaswami.
	Doing so would allow for reducing the number of instructions required to complete the tensor computation. Reducing the number of instructions for tensor computations increases computation bandwidth which would improve the system by reducing its overall bandwidth requirement (Col. 3 lines 3-15; Computation bandwidth of the processing unit is increased by reducing the number of instructions that the processor is required to execute when traversing a tensor to perform one or more computations. Instructions for performing tensor computations for a given neural network layer can be encoded and distributed amongst one or more computing systems of an example hardware computing system. Distribution of the encoded instructions to the various compute systems allows for increased computation bandwidth within a single system. Instruction quantity in a compute system is reduced because a single system is responsible only for a subset of the total computations needed for a given tensor.)
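For illustration only, the distinction between a convolution that uses a lowering operation and a direct convolution that lacks one can be sketched as follows. This is not code from Shalev, Narayanaswami, or the application. The function names are hypothetical, the tensor shapes follow the M×N×D input and L kernels of j×k×D described in the cited Shalev passage, and the layout is assumed to keep the Z (depth) values of each sample contiguous so that each step is an inner product of two length-D vectors.

```python
import numpy as np


def direct_conv(ifm, kernels, stride=1):
    """Direct convolution: ifm is (M, N, D), kernels is (L, j, k, D).

    No intermediate "lowered" matrix is built; each output element is
    accumulated from inner products of Z-direction vectors.
    """
    M, N, _ = ifm.shape
    L, j, k, _ = kernels.shape
    out_m = (M - j) // stride + 1
    out_n = (N - k) // stride + 1
    out = np.zeros((out_m, out_n, L), dtype=ifm.dtype)
    for y in range(out_m):
        for x in range(out_n):
            for l in range(L):
                acc = 0
                for dy in range(j):
                    for dx in range(k):
                        # Vector-by-vector inner product along Z.
                        acc += np.dot(ifm[y * stride + dy, x * stride + dx, :],
                                      kernels[l, dy, dx, :])
                out[y, x, l] = acc
    return out


def lowered_conv(ifm, kernels, stride=1):
    """Same result, but via lowering: windows are copied into a 2D matrix."""
    M, N, _ = ifm.shape
    L, j, k, _ = kernels.shape
    out_m = (M - j) // stride + 1
    out_n = (N - k) // stride + 1
    # Lowering step: each sliding window becomes one row (data is duplicated).
    rows = [ifm[y * stride:y * stride + j, x * stride:x * stride + k, :].reshape(-1)
            for y in range(out_m) for x in range(out_n)]
    lowered = np.stack(rows)            # shape (out_m * out_n, j * k * D)
    weights = kernels.reshape(L, -1).T  # shape (j * k * D, L)
    return (lowered @ weights).reshape(out_m, out_n, L)
```

Both functions compute the same output tensor; they differ only in whether the overlapping input windows are first copied into a 2D matrix before multiplication.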
Regarding Claim 2,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. Shalev further teaches wherein the inner-product convolution is a vector-by-vector operation (Col. 4 lines 36-40; In other words, the vectors of feature values in each set are effectively stacked together by the data access logic so as to make up a matrix, which is multiplied with the corresponding matrix from the other set, for example as part of a convolution computation.).
Regarding Claim 3,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. Shalev further teaches wherein the mapping circuit is to access the three-dimensional input as an X-major matrix and to rotate the matrix on three axes to generate the Z-major matrix (Fig. 6; Col. 9 lines 30-37; For this purpose, data access logic 31 extracts a sub-tensor 106 from tensor 100 comprising h vectors 108 of length n, and then lowers this sub-tensor to create a matrix 110 comprising two sets of these vectors, having dimensions 2n×h. Similarly, data access logic 31 extracts two columns 114, 116 from tensor 102, each comprising n vectors 118 of length w, and lowers this sub-tensor to create a matrix 120 of dimensions w×2n. The 3D input (i.e., 100) is accessed as an X-major matrix (i.e., 106 and 108) and rotated to generate a Z-major matrix (i.e., 120).).
Regarding Claim 19,
Shalev (US 10489479 B1) teaches one or more tangible, non-transitory computer-readable storage mediums having stored thereon instructions for providing a convolutional neural network (CNN) accelerator, comprising instructions to:
provision a CNN circuit for performing a multiple-layer CNN computation, wherein the multiple layers are to receive an input feature according to an input feature map (IFM) and a weight matrix per output feature, wherein an output of a first layer provides an input for a next layer (Col. 1 lines 19-45; In the convolutional layers of the CNN, a three-dimensional (3D) array of input data (commonly referred to as a 3D matrix or tensor) of dimensions M×N×D is convolved with a four-dimensional tensor made up of L kernels of dimensions j×k×D and stride S. Here M and N are the dimensions of the sampling space (also referred to as the X- and Y-dimensions), for example pixels of an image, while D (also referred to herein as the Z-dimension) is the number of input feature values given for each sample. Each 3D kernel is shifted in strides of size S across the input volume. Following each shift, every weight belonging to the 3D kernel is multiplied by each corresponding input element from the overlapping region of the 3D input array, and the products are summed to create an element of a 3D output array.); and
provision a mapping circuit to access a three-dimensional input matrix stored as a Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively... Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72.);
wherein the CNN circuit is to perform an inner-product direct convolution on the Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively. Data access logic 31 "lowers" these sub-tensors into 2D matrices 68 and 72, whose elements will then be broadcast by data manipulation unit 40 to processing elements 24 for multiplication and accumulation. Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72. In this manner, execution unit 22 computes the convolution of matrices 68 and 72 to generate an output matrix 74.).
Shalev does not explicitly disclose
wherein the CNN circuit is to perform an inner-product direct convolution… wherein the direct convolution lacks a lowering operation.
However, Narayanaswami (US 9836691 B1) teaches
wherein the CNN circuit is to perform an inner-product direct convolution… wherein the direct convolution lacks a lowering operation (Col. 11 lines 45-53; FIG. 4A illustrates an example input activation tensor 404, example weight tensors 406, and an example output tensor 408. FIG. 4B illustrates an example deep loop nest 402 that can be executed by processing unit 102 to perform tensor computations relating to dot product computations or matrix multiplication. In FIG. 4A, computations can include multiplication of activation tensor 404 with parameter/weight tensor 406 on one or more computation cycles to produce outputs/results in the form of output tensor 408.).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the computer-readable mediums of Shalev with the tensor convolution computations of Narayanaswami.

	Doing so would allow for reducing the number of instructions required to complete the tensor computation. Reducing the number of instructions for tensor computations increases computation bandwidth which would improve the system by reducing its overall bandwidth requirement (Col. 3 lines 3-15; Computation bandwidth of the processing unit is increased by reducing the number of instructions that the processor is required to execute when traversing a tensor to perform one or more computations. Instructions for performing tensor computations for a given neural network layer can be encoded and distributed amongst one or more computing systems of an example hardware computing system. Distribution of the encoded instructions to the various compute systems allows for increased computation bandwidth within a single system. Instruction quantity in a compute system is reduced because a single system is responsible only for a subset of the total computations needed for a given tensor.)
Regarding Claim 20, 
Claim 20 is the computer-readable medium claim corresponding to the system of claim 3. Claim 20 is substantially similar to claim 3 and is rejected on the same grounds.
Regarding Claim 21, 
Claim 21 is the computer-readable medium claim corresponding to the system of claim 2. Claim 21 is substantially similar to claim 2 and is rejected on the same grounds.
Regarding Claim 24,
Shalev (US 10489479 B1) teaches a method comprising:

provisioning a low-precision CNN circuit for performing a multiple-layer CNN computation, wherein the multiple layers are to receive an input feature according to an input feature map (IFM) and a weight matrix per output feature, wherein an output of a first layer provides an input for a next layer (Col. 1 lines 19-45; In the convolutional layers of the CNN, a three-dimensional (3D) array of input data (commonly referred to as a 3D matrix or tensor) of dimensions M×N×D is convolved with a four-dimensional tensor made up of L kernels of dimensions j×k×D and stride S. Here M and N are the dimensions of the sampling space (also referred to as the X- and Y-dimensions), for example pixels of an image, while D (also referred to herein as the Z-dimension) is the number of input feature values given for each sample. Each 3D kernel is shifted in strides of size S across the input volume. Following each shift, every weight belonging to the 3D kernel is multiplied by each corresponding input element from the overlapping region of the 3D input array, and the products are summed to create an element of a 3D output array.); and
 provisioning a mapping circuit to access a three-dimension input matrix stored as a Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively... Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72.); 
wherein the CNN circuit is to perform a vector-by-vector inner-product direct convolution on the Z-major matrix (Fig. 3A; Col. 7 lines 1-12; To multiply tensors 60 and 62 together, data access logic 31 in engine 20, extracts sub-tensors comprising sets of vectors 66 and 70 of input values, extending in the Z-direction, from tensors 60 and 62, respectively. Data access logic 31 "lowers" these sub-tensors into 2D matrices 68 and 72, whose elements will then be broadcast by data manipulation unit 40 to processing elements 24 for multiplication and accumulation. Prior to distribution of the input values to processing elements 24, transposition engine 38 transposes matrix 72. Thus, vectors 66 in this example are arranged as rows of matrix 68, while vectors 70 are arranged as columns of matrix 72. In this manner, execution unit 22 computes the convolution of matrices 68 and 72 to generate an output matrix 74.).
Shalev does not explicitly disclose
wherein the CNN circuit is to perform a vector-by-vector inner-product direct convolution…, wherein the direct convolution lacks a lowering operation,
However, Narayanaswami (US 9836691 B1) teaches
wherein the CNN circuit is to perform a vector-by-vector inner-product direct convolution…, wherein the direct convolution lacks a lowering operation (Col. 11 lines 45-53; FIG. 4A illustrates an example input activation tensor 404, example weight tensors 406, and an example output tensor 408. FIG. 4B illustrates an example deep loop nest 402 that can be executed by processing unit 102 to perform tensor computations relating to dot product computations or matrix multiplication. In FIG. 4A, computations can include multiplication of activation tensor 404 with parameter/weight tensor 406 on one or more computation cycles to produce outputs/results in the form of output tensor 408.).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network accelerator of Shalev with the tensor convolution computations of Narayanaswami.
	Doing so would allow for reducing the number of instructions required to complete the tensor computation. Reducing the number of instructions for tensor computations increases computation bandwidth which would improve the system by reducing its overall bandwidth requirement (Col. 3 lines 3-15; Computation bandwidth of the processing unit is increased by reducing the number of instructions that the processor is required to execute when traversing a tensor to perform one or more computations. Instructions for performing tensor computations for a given neural network layer can be encoded and distributed amongst one or more computing systems of an example hardware computing system. Distribution of the encoded instructions to the various compute systems allows for increased computation bandwidth within a single system. Instruction quantity in a compute system is reduced because a single system is responsible only for a subset of the total computations needed for a given tensor.)
Regarding Claim 25,
Shalev and Narayanaswami teach the method of claim 24. Shalev further teaches wherein the mapping circuit is to access the three-dimensional input as an X-major matrix and to rotate the matrix on three axes to generate the Z-major matrix (Fig. 6; Col. 9 lines 30-37; For this purpose, data access logic 31 extracts a sub-tensor 106 from tensor 100 comprising h vectors 108 of length n, and then lowers this sub-tensor to create a matrix 110 comprising two sets of these vectors, having dimensions 2n×h. Similarly, data access logic 31 extracts two columns 114, 116 from tensor 102, each comprising n vectors 118 of length w, and lowers this sub-tensor to create a matrix 120 of dimensions w×2n. The 3D input (i.e., 100) is accessed as an X-major matrix (i.e., 106 and 108) and rotated to generate a Z-major matrix (i.e., 120).).

Claims 4, 5, 7, 8, 12, 22, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1) and Narayanaswami et al. (US-9836691-B1), as applied above, and further in view of Tang et al. (“Binary Convolutional Neural Network on RRAM”).
Regarding Claim 4,
Shalev and Narayanaswami teach the CNN accelerator of claim 1.
	Shalev and Narayanaswami do not explicitly disclose
wherein the CNN circuit is a low-precision CNN.
However, Tang teaches
wherein the CNN circuit is a low-precision CNN (pg. 783, section D; Considering that BCNN provides the potential for low-precision crossbar interfaces, a BCNN-specific low-precision splitting structure is in demand.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the low bit weights and neurons of Tang.

Doing so would allow for increased speed for read and write operations while improving the energy efficiency. This would not only save time for computations of the CNN allowing for scalability but also save a significant amount of energy (Abs. RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before.)
Regarding Claim 5,
Shalev, Narayanaswami, and Tang teach the CNN accelerator of claim 4. Tang further teaches wherein the low-precision CNN is a 4-bit CNN (pg. 784; Based on this observation, we reduce the ADC precision into 4 bit, which can save large amount of overhead especially when the splitting amount is large.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Shalev and Narayanaswami with the teachings of Tang for at least the same reasons as discussed above in claim 4.
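For illustration only, the notion of reducing values to 4-bit precision can be sketched with a generic symmetric uniform quantizer. This is an assumed, textbook-style scheme and not the ADC design described in Tang; the function name `quantize_uniform` and the choice of a single per-tensor scale are hypothetical.

```python
import numpy as np


def quantize_uniform(x, bits=4):
    """Symmetric uniform quantization to signed integer codes of `bits` bits.

    Returns the integer codes and the scale needed to map them back
    to approximate real values (value ~= code * scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale
```

With 4 bits the codes fall in the range -7 to 7, so each value is stored in far fewer bits at the cost of a rounding error of at most half a quantization step.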
Regarding Claim 7,
Shalev, Narayanaswami, and Tang teach the CNN accelerator of claim 4. Tang further teaches wherein the low-precision CNN is a binary CNN (Pg. 782, Section 1; Recently, researchers in the field of machine learning have demonstrated that Binary CNNs (BCNNs) achieve satisfying recognition accuracy on ImageNet dataset [11], [12]. BCNNs use binary weights and data when processing the forward propagation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Shalev and Narayanaswami with the teachings of Tang for at least the same reasons as discussed above in claim 4.

Regarding Claim 8,
Shalev, Narayanaswami, and Tang teach the CNN accelerator of claim 4. Tang further teaches wherein the low-precision CNN comprises a high-precision input feature map (pg. 784; The intermediate data before adder tree still use high precision. However, since the cascaded digital functions, i.e. non-linear function and BN, are monotone increasing functions, the 1-bit quantization can be merged with these functions by changing the threshold and output data range.) with a low-precision weight, wherein the weight precision is selected from 1-bit, 2-bit, or 4-bit (Pg. 786; Table 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Shalev and Narayanaswami with the teachings of Tang for at least the same reasons as discussed above in claim 4.
Regarding Claim 12,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. 
	Shalev and Narayanaswami do not explicitly disclose
further comprising reduction means to:
receive a plurality of high-precision input operators; and 
reduce the plurality of high-precision input operators to a low-precision output operator.
However, Tang teaches

receive a plurality of high-precision input operators (pg. 785; The multi-bit model is achieved by dynamically quantizing [4] the well-trained floating-point model into 8 bits;); and 
reduce the plurality of high-precision input operators to a low-precision output operator (pg. 785; The multi-bit model is achieved by dynamically quantizing [4] the well-trained floating-point model into 8 bits;).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the low bit weights and neurons of Tang.
Doing so would allow for increased speed for read and write operations while improving the energy efficiency. This would not only save time for computations of the CNN allowing for scalability but also save a significant amount of energy (Abs. RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before.)
Regarding Claim 22, 
Shalev and Narayanaswami teach the one or more tangible, non-transitory computer-readable mediums of claim 19. 
	Shalev and Narayanaswami do not explicitly disclose
wherein the CNN circuit is a low-precision CNN.
However, Tang teaches
wherein the CNN circuit is a low-precision CNN (pg. 783, section D; Considering that BCNN provides the potential for low-precision crossbar interfaces, a BCNN-specific low-precision splitting structure is in demand.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the low bit weights and neurons of Tang.
Doing so would allow for increased speed for read and write operations while improving the energy efficiency. This would not only save time for computations of the CNN allowing for scalability but also save a significant amount of energy (Abs. RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before.)
Regarding Claim 23, 
Shalev, Narayanaswami, and Tang teach the one or more tangible, non-transitory computer-readable mediums of claim 22. Tang further teaches wherein the low-precision CNN is a ternary CNN or binary CNN (Pg. 782, Section 1; Recently, researchers in the field of machine learning have demonstrated that Binary CNNs (BCNNs) achieve satisfying recognition accuracy on ImageNet dataset [11], [12]. BCNNs use binary weights and data when processing the forward propagation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the low bit weights and neurons of Tang.
Doing so would allow for increased speed for read and write operations while improving the energy efficiency. This would not only save time for computations of the CNN allowing for scalability but also save a significant amount of energy (Abs. RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before.)

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1), Narayanaswami et al. (US-9836691-B1), Tang et al. (“Binary Convolutional Neural Network on RRAM”), as applied above, and further in view of Alemdar et al. (Ternary Neural Networks for Resource-Efficient AI Applications).
Regarding Claim 6,
Shalev, Narayanaswami, and Tang teach the CNN accelerator of claim 4. 
	Shalev, Narayanaswami, and Tang do not explicitly disclose
wherein the low-precision CNN is a ternary CNN.
However, Alemdar (Ternary Neural Networks for Resource-Efficient AI Applications) teaches
wherein the low-precision CNN is a ternary CNN (Pg. 1, Abs.; Using only ternary weights and activations, the student ternary network learns to mimic the behavior of its teacher network without using any multiplication. And In Section V, we describe our purpose-built hardware that is able to handle both fully connected multi-layer perceptrons (MLPs) and convolutional NNs (CNNs) with a high throughput and a low-energy budget.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev, Narayanaswami, and Tang with the ternary network of Alemdar.
Doing so would yield a sparser and thus more energy-efficient network (Abs.; Thanks to our two-stage training procedure, the teacher network is still able to use state-of-the-art methods such as dropout and batch normalization to increase accuracy and reduce training time. Using only ternary weights and activations, the student ternary network learns to mimic the behavior of its teacher network without using any multiplication. Unlike its {-1,1} binary counterparts, a ternary neural network inherently prunes the smaller weights by setting them to zero during training. This makes them sparser and thus more energy-efficient.).
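For illustration only, the two properties quoted from Alemdar (small weights set to zero, and inner products computed without multiplication) can be sketched as follows. The fixed-threshold rule and the names `ternarize` and `ternary_dot` are assumptions made for this sketch; they do not reproduce Alemdar's two-stage teacher/student training procedure.

```python
import numpy as np


def ternarize(w, threshold):
    """Map real weights to {-1, 0, +1}; small magnitudes are pruned to 0."""
    t = np.zeros(w.shape, dtype=np.int8)
    t[w > threshold] = 1
    t[w < -threshold] = -1
    return t


def ternary_dot(activations, ternary_weights):
    """Inner product with ternary weights using only additions and subtractions."""
    plus = activations[ternary_weights == 1].sum()
    minus = activations[ternary_weights == -1].sum()
    return plus - minus
```

Zero-valued weights drop out of the computation entirely, which is the sparsity that the cited passage ties to energy efficiency.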

Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1) and Narayanaswami et al. (US-9836691-B1), as applied above, and further in view of Nurvitadhi et al. (Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?).
Regarding Claim 9,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. 
Shalev and Narayanaswami do not explicitly disclose
wherein the CNN is a high-precision CNN.
However, Nurvitadhi (Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?) teaches
wherein the CNN is a high-precision CNN (pg. 12, section 5.1; Neurons are still represented using full precision (FP32).).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the teachings of Nurvitadhi.

Doing so would allow for better performance compared to GPUs (Pg. 5, Abs; Our results show that Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec) than Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively.). The improved performance allows for better energy efficiency thereby increasing the bandwidth and improving the throughput (pg. 5, section 1; While FPGAs have provided superior energy efficiency (Performance/Watt) than GPUs for DNNs, they have not been known for offering top performance. However, FPGA technologies are advancing rapidly. The upcoming Intel Stratix 10 FPGA [17] will offer more than 5000 hardened floating-point units (DSPs), over 28MB of on-chip RAMs (M20Ks), integration with high-bandwidth memories (up to 4x250GB/s/stack or 1TB/s), and improved frequency from the new HyperFlex technology, thereby leading to a peak 9.2 TFLOP/s in FP32 throughput.)
Regarding Claim 10,
Shalev, Narayanaswami, and Nurvitadhi teach the CNN accelerator of claim 9. Nurvitadhi further teaches wherein the high-precision CNN is selected from 8-bit integer and 16-bit floating point (pg. 7, section 2.2; As an evidence of this, the latest GPUs are providing native support for FP16 and Int8 data types.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Shalev and Narayanaswami with the teachings of Nurvitadhi for at least the same reasons as discussed above in claim 9.

Doing so would allow for better performance compared to GPUs (Pg. 5, Abs; Our results show that Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec) than Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively.). The improved performance allows for better energy efficiency thereby increasing the bandwidth and improving the throughput (pg. 5, section 1; While FPGAs have provided superior energy efficiency (Performance/Watt) than GPUs for DNNs, they have not been known for offering top performance. However, FPGA technologies are advancing rapidly. The upcoming Intel Stratix 10 FPGA [17] will offer more than 5000 hardened floating-point units (DSPs), over 28MB of on-chip RAMs (M20Ks), integration with high-bandwidth memories (up to 4x250GB/s/stack or 1TB/s), and improved frequency from the new HyperFlex technology, thereby leading to a peak 9.2 TFLOP/s in FP32 throughput.)

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1) and Narayanaswami et al. (US-9836691-B1), as applied above, and further in view of Aydonat et al. (US-20170103299-A1).
Regarding Claim 11,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. 
	Shalev and Narayanaswami do not explicitly disclose
further comprising at least one accumulator, wherein the results of multiple inner-product convolutions are accumulated in a single accumulator.

However, Aydonat (US 20170103299 A1) teaches
further comprising at least one accumulator, wherein the results of multiple inner-product convolutions are accumulated in a single accumulator (para [0065] The dot product unit 1020 may be implemented using one or more DSP blocks on the target. The processing element 1000 includes an accumulator unit 1030. The accumulator unit 1030 accumulates dot product results as partial sums until an entire computation is completed. The accumulator unit 1030 may be implemented using a logic array block.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the accumulator of Aydonat.
Doing so would reduce the amount of wiring needed to implement the accelerator. Reducing the resources needed to manufacture the accelerator could lower its manufacturing cost (para [0046] According to an embodiment of the present invention, the goal of routability optimization is to reduce the amount of wiring used to connect components in the placed logic design. Routability optimization may include performing fanout splitting, logic duplication, logical rewiring, or other procedures.).
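For illustration only, the accumulation recited in claim 11 and described in Aydonat para [0065] (dot product results accumulated as partial sums until the computation completes) can be sketched as follows. This is a minimal software sketch, not the hardware of any cited reference; the function names `dot` and `accumulate_inner_products` and the tile layout are hypothetical.

```python
def dot(xs, ws):
    """Inner product of two equal-length sequences."""
    if len(xs) != len(ws):
        raise ValueError("length mismatch between inputs and weights")
    return sum(x * w for x, w in zip(xs, ws))


def accumulate_inner_products(input_tiles, weight_tiles):
    """Accumulate several inner products into one running accumulator.

    Assumption: each (input tile, weight tile) pair yields one partial sum,
    and all partial sums are added into a single accumulator value.
    """
    acc = 0  # the single accumulator
    for xs, ws in zip(input_tiles, weight_tiles):
        acc += dot(xs, ws)  # partial sum added to the running total
    return acc


# Two partial inner products: (1*5 + 2*6) + (3*7 + 4*8) = 17 + 53
result = accumulate_inner_products([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Here `result` holds the total of both partial sums, mirroring a single accumulator that is updated once per inner product.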

Claims 13, 15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1) and Narayanaswami et al. (US-9836691-B1), as applied above, and further in view of Navarrete et al. (US-20190014320-A1) and Danjo et al. (US-20170368682-A1).
Regarding Claim 13,
Shalev and Narayanaswami teach the CNN accelerator of claim 1. 
Shalev and Narayanaswami do not explicitly disclose
further comprising a reduction module, comprising:
a subtract circuit to compute the difference between a sum(*) and a first input operator ("OpC") to yield a first intermediate operator ("Op1"); 
a pass-through second ("OpE") and third ("OpD") input operators to yield second (“Op3”) and third (“Op4”) intermediate operators respectively;
a multiply circuit to compute a product of a fourth ("OpA") and fifth ("OpB") input operator to yield a fourth intermediate input operator (“Op2”); and 
a reduction circuit to compute a reduction of Op1, Op3, Op4, and Op2.
However, Navarrete (US 20190014320 A1) teaches 
a pass-through second ("OpE") and third ("OpD") input operators to yield second (“Op3”) and third (“Op4”) intermediate operators respectively (para [0086] D.sub.d and D.sub.v output from the output interface 411, perform quantization process and inverse quantization process on the superposed image A and the difference features D.sub.n, D.sub.d and D.sub.v, to generate quantization superposed images and quantization difference features.);
a multiply circuit to compute a product of a fourth ("OpA") and fifth ("OpB") input operator to yield a fourth intermediate input operator (“Op2”) (para [0064] In an embodiment of the present disclosure, the image superposing circuit 408 is configured to multiply the first image UL by a first weight parameter a to obtain a first product, multiply the update features U by a second weight parameter b to obtain a second product, and superpose the first product and the second product to generate the superposed image A.); and 
a reduction circuit to compute a reduction of Op1, Op3, Op4, and Op2 (fig. 6; para [0086] The quantization apparatus 60 is connected with the image encoding apparatus 40 and configured to receive the superposed image A and the difference features D.sub.n, D.sub.d and D.sub.v output from the output interface 411, perform quantization process and inverse quantization process on the superposed image A and the difference features D.sub.n, D.sub.d and D.sub.v, to generate quantization superposed images and quantization difference features.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the quantization apparatus of Navarrete.
Doing so would allow for a higher compression rate to reduce the computational complexity of the system (para [0079] the first convolutional neural network circuit 407 and the second convolutional neural network circuit 409 have optimized filter parameters, so that the image encoding apparatus has a higher compression rate, without artificially setting corresponding filter parameters, which reduces complexity in setting the filter parameters.). Reduced complexity is important in order to meet bandwidth constraints while processing high quality data (para [0003] It is expected that even though bandwidth is increasing steadily, a dramatic increase in media data traffic is hard to be satisfied. Therefore, it needs to seek for better solutions for media data compression to satisfy the requirement for high quality media data under the existing traffic bandwidth.)
Danjo (US 20170368682 A1) teaches 
a subtract circuit to compute the difference between a sum(*) and a first input operator ("OpC") to yield a first intermediate operator ("Op1") (para [0053] In FIG. 2, a primary .DELTA..SIGMA.-AD converter is illustrated as an example. The .DELTA..SIGMA.-AD converter illustrated in FIG. 2 includes an adder (subtracter) 210, an integrator 220, a comparator (quantizer) 230, a delay circuit 240, and a digital analog converter (DAC) 250. The adder (subtracter) 210 subtracts the output from the DA converter 250 from an analog signal x inputted into the .DELTA..SIGMA.-AD converter, and outputs a result as a signal u.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the quantization of Danjo.
Doing so would reduce the quantization noise introduced when converting analog signals to digital signals. This would not only permit analog inputs to be converted to digital inputs but also decrease the amount of inaccurate data, thereby improving accuracy (para [0013] Since the .DELTA..SIGMA.-AD converter has high pass characteristics as explained above, increasing the sampling frequency with respect to the frequency of the input signal makes it possible to decrease the quantization noise existing in a frequency band of the input signal by noise shaping so as to increase a signal to noise ratio (SNR), thereby improving accuracy.).
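For illustration only, the data flow recited in the reduction module of claim 13 (with the "operators" read as operands, as interpreted above) can be sketched as follows. This is a minimal sketch under stated assumptions: the claim does not specify the reduction itself, so it is passed in as a hypothetical `reduce_fn` parameter, and `partial_sum` stands in for the claimed "sum(*)". None of these names come from the application or the cited references.

```python
def reduction_module(partial_sum, op_a, op_b, op_c, op_d, op_e, reduce_fn):
    """Sketch of the claim 13 data flow.

    Assumptions (not taken from the application): `partial_sum` models the
    claimed "sum(*)", and `reduce_fn` is any function that reduces the four
    intermediate operands to one output.
    """
    op1 = partial_sum - op_c  # subtract circuit: sum(*) minus OpC yields Op1
    op2 = op_a * op_b         # multiply circuit: OpA times OpB yields Op2
    op3 = op_e                # pass-through: OpE yields Op3
    op4 = op_d                # pass-through: OpD yields Op4
    return reduce_fn(op1, op3, op4, op2)


# Example with two hypothetical reductions: a maximum and a plain sum.
largest = reduction_module(10, 2, 3, 4, 1, 5, max)
total = reduction_module(10, 2, 3, 4, 1, 5, lambda *ops: sum(ops))
```

With these inputs, Op1 = 6, Op2 = 6, Op3 = 5, and Op4 = 1, so the maximum is 6 and the sum is 18.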
Regarding Claim 15,
Shalev, Narayanaswami, Navarrete, and Danjo teach the CNN accelerator of claim 13. Navarrete further teaches wherein OpC, OpA, and OpB are configurable operands (para [0066] An image difference acquisition circuit 410 is connected with the plurality of second image input terminals 404, 405, 406, the second convolutional neural network circuit 409 and the output interface 411 and configured to determine difference features D.sub.n, D.sub.d and D.sub.v of each second image of the plurality of second images UR, BR, BL and corresponding prediction images, and output the difference features D.sub.n, D.sub.d and D.sub.v through the output interface 411.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the quantization apparatus of Navarrete.
Doing so would allow for a higher compression rate to reduce the computational complexity of the system (para [0079] the first convolutional neural network circuit 407 and the second convolutional neural network circuit 409 have optimized filter parameters, so that the image encoding apparatus has a higher compression rate, without artificially setting corresponding filter parameters, which reduces complexity in setting the filter parameters.). Reduced complexity is important in order to meet bandwidth constraints while processing high quality data (para [0003] It is expected that even though bandwidth is increasing steadily, a dramatic increase in media data traffic is hard to be satisfied. Therefore, it needs to seek for better solutions for media data compression to satisfy the requirement for high quality media data under the existing traffic bandwidth.)
Regarding Claim 18,
Shalev, Narayanaswami, Navarrete, and Danjo teach the CNN accelerator of claim 13. 
	Navarrete further teaches 
wherein the reduction circuit comprises a quantization operator, wherein the output operator is selected based on a comparison of two or more of the intermediate operators (para [0086] D.sub.d and D.sub.v output from the output interface 411, perform quantization process and inverse quantization process on the superposed image A and the difference features D.sub.n, D.sub.d and D.sub.v, to generate quantization superposed images and quantization difference features.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the CNN accelerator of Shalev and Narayanaswami with the quantization apparatus of Navarrete.
Doing so would allow for a higher compression rate to reduce the computational complexity of the system (para [0079] the first convolutional neural network circuit 407 and the second convolutional neural network circuit 409 have optimized filter parameters, so that the image encoding apparatus has a higher compression rate, without artificially setting corresponding filter parameters, which reduces complexity in setting the filter parameters.). Reduced complexity is important in order to meet bandwidth constraints while processing high quality data (para [0003] It is expected that even though bandwidth is increasing steadily, a dramatic increase in media data traffic is hard to be satisfied. Therefore, it needs to seek for better solutions for media data compression to satisfy the requirement for high quality media data under the existing traffic bandwidth.)

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1), Narayanaswami et al. (US-9836691-B1), Navarrete et al. (US-20190014320-A1), Danjo et al. (US-20170368682-A1) as applied above, and further in view of El-Yaniv et al. (US-20170286830-A1).
Regarding Claim 14,
Shalev, Narayanaswami, Navarrete, and Danjo teach the CNN accelerator of claim 13. 
Shalev, Narayanaswami, Navarrete, and Danjo do not explicitly disclose
wherein the input operators are half-precision floating point operators, and the output operator is a 1-bit or 2-bit integer operator.
However, El-Yaniv (US 20170286830 A1) teaches 
wherein the input operators are half-precision floating point operators, and the output operator is a 1-bit or 2-bit integer operator (para [0048] In such case the QNN may be referred to as a binary neural network (BNN). Quantized activation functions may be used for binarization of floating point connection weight and floating point activation values (e.g. outputs of activation functions) or for reduction of the floating point connection weight and floating point activation values to more than 1-bit per weight value and 1-bit per activation value, for instance any of 2-16 bits for a prediction accuracy comparable to 32 bits counterparts.)

Doing so would allow for reduced computational consumption. Reduced computational consumption will allow for a faster training time in addition to less memory consumption and better energy efficiency (para [0036] Below are presented examples, implemented on Torch7, which show that it is possible to train the neural networks which are described herein on MNIST, CIFAR-10 and SVHN datasets and achieve near state-of-the-art results with improved computation (reduced computational consumption). Moreover, an example on the ImageNet dataset indicates that during the forward pass (both at run-time and train-time), the DNNs trained as described herein drastically reduce memory consumption (e.g. size and number of memory accesses), and replace arithmetic operations, optionally most of them, with bit-wise operations, which lead to an increase in power-efficiency (see Section 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. 3).)
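For illustration only, the reduction of floating-point values to 1-bit or 2-bit integers discussed in El-Yaniv para [0048] can be sketched as follows. This is a minimal sketch, not the method of the cited reference: the sign-based binarization rule, the uniform 2-bit levels, and the clipping range `lo`/`hi` are all assumptions, and the inputs are ordinary Python floats (half-precision storage is not modeled).

```python
def binarize(values):
    """Map each floating-point value to one bit.

    Assumption: a deterministic sign rule, 1 for non-negative and 0 for
    negative values (a bit encoding of +1 / -1).
    """
    return [1 if v >= 0.0 else 0 for v in values]


def quantize_2bit(values, lo=-1.0, hi=1.0):
    """Map each value to a 2-bit code in 0..3.

    Assumption: values are clipped to [lo, hi] and mapped uniformly onto
    four levels.
    """
    max_code = 3  # largest code representable in 2 bits
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)
        codes.append(round((clipped - lo) / (hi - lo) * max_code))
    return codes


activations = [-1.0, -0.2, 0.3, 1.0]
bits = binarize(activations)        # one bit per value
codes = quantize_2bit(activations)  # two bits per value
```

Each output fits in 1 or 2 bits, so downstream arithmetic could in principle be replaced by bit-wise operations, which is the efficiency argument made in the quoted passage.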

Claims 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev et al. (US-10489479-B1), Narayanaswami et al. (US-9836691-B1), Navarrete et al. (US-20190014320-A1), and Danjo et al. (US-20170368682-A1) as applied above, and further in view of Ishida et al. (US 4945507 A).
Regarding Claim 16,
Shalev, Narayanaswami, Navarrete, and Danjo teach the CNN accelerator of claim 13. 
Shalev, Narayanaswami, Navarrete, and Danjo do not explicitly disclose
wherein the reduction circuit comprises a bit-select operator, wherein the output operator is selected from one or two digits of an accumulator.
However, Ishida (US 4945507 A) teaches 
wherein the reduction circuit comprises a bit-select operator, wherein the output operator is selected from one or two digits of an accumulator (Col. 5 lines 25-31; Furthermore, the most significant bit D23 of the content of the accumulator 46, which is the sign bit indicative of the positive or negative of the numerical data held in the accumulator 46, is also inputted through a single bit line 60 to the selector 48. The selector 44 has an output of 20 bits connected to a 20-bit internal data bus 12.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the accumulator of Shalev, Narayanaswami, Navarrete, and Danjo with the accumulator of Ishida.
Doing so would allow for mitigating the overflow of data. Mitigating the overflow of data ensures that the data is accurate which helps improve the accuracy of the arithmetic operations of the system. (Col. 1 lines 10-15; In brief, when the result of the arithmetic operation is outputted to the internal bus, the result of the arithmetic operation is stored in an accumulator having an overflow margin in comparison with a data length of the internal bus, so that even if an overflow occurs in the course of the arithmetic operation, the overflowed data is retained or returned to an arithmetic operation unit without the intermediary of the internal bus. With this arrangement, the arithmetic operation can be ensured to have the accuracy corresponding to a data width of a bus within the arithmetic operation unit.)
Regarding Claim 17,
Shalev, Narayanaswami, Navarrete, Danjo, and Ishida teach the CNN accelerator of claim 16. Ishida further teaches wherein the output operator is selected from a sign digit of the accumulator (Col. 5 lines 25-31; Furthermore, the most significant bit D23 of the content of the accumulator 46, which is the sign bit indicative of the positive or negative of the numerical data held in the accumulator 46, is also inputted through a single bit line 60 to the selector 48. The selector 44 has an output of 20 bits connected to a 20-bit internal data bus 12.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the accumulator of Shalev, Narayanaswami, Navarrete, and Danjo with the accumulator of Ishida.
Doing so would allow for mitigating the overflow of data. Mitigating the overflow of data ensures that the data is accurate which helps improve the accuracy of the arithmetic operations of the system. (Col. 1 lines 10-15; In brief, when the result of the arithmetic operation is outputted to the internal bus, the result of the arithmetic operation is stored in an accumulator having an overflow margin in comparison with a data length of the internal bus, so that even if an overflow occurs in the course of the arithmetic operation, the overflowed data is retained or returned to an arithmetic operation unit without the intermediary of the internal bus. With this arrangement, the arithmetic operation can be ensured to have the accuracy corresponding to a data width of a bus within the arithmetic operation unit.)
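For illustration only, selecting the sign digit of an accumulator, as recited in claims 16 and 17 and described in Ishida Col. 5 lines 25-31, can be sketched as follows. This is a minimal sketch under stated assumptions: the 24-bit width is inferred from the quoted passage naming D23 as the most significant bit, two's-complement encoding is assumed, and the helper names are hypothetical.

```python
ACC_BITS = 24  # assumption: D23 is the most significant bit, so bits D0..D23


def to_twos_complement(value, bits=ACC_BITS):
    """Encode a signed integer as an unsigned two's-complement word."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the accumulator width")
    return value & ((1 << bits) - 1)


def sign_bit(acc_word, bits=ACC_BITS):
    """Select the most significant (sign) bit of the accumulator word."""
    return (acc_word >> (bits - 1)) & 1


negative = sign_bit(to_twos_complement(-5))  # sign bit set
positive = sign_bit(to_twos_complement(5))   # sign bit clear
```

The single selected bit indicates whether the accumulated value is negative, which corresponds to a 1-bit output taken from one digit of the accumulator.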
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ji et al. (20180197081) – discloses an integrated convolutional neural network chip with quantization.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 




/H.N./Examiner, Art Unit 2121
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145