DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action is in response to the application submitted on 5/4/2018.
Claims 1-20 are presented for examination.
Oath/Declaration
For the record, the Examiner acknowledges that the Oaths/Declarations submitted on 5/4/2018 have been received.
Information Disclosure Statement
The information disclosure statement submitted on 5/4/2018 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is considered by the examiner.
Drawings
The drawings are acceptable for the purposes of examination.
Specification
The disclosure is objected to because of the following informalities:
In [0001], line 1, “convolutional neural network” should be “a convolutional neural network”.
In [0002], line 4, “convolutional neural network” should be “convolutional neural networks”.
In [0005], line 7, “coefficient” should be “coefficients”.
In [0005], line 8, “coefficient” should be “coefficients”.
In [0006], line 2, there should be a period at the end of the sentence.
In [0050], line 1, “be” should be replaced with “are”.
In [0084], line 9, “a” should be replaced with “the”.
The specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors the applicant may become aware of in the specification.
Claim Objections
Claim 1 recites “kernel decompression circuit” in line 9.  There is a lack of antecedent basis for this limitation in the claim.  For the purposes of prior art examination, the Examiner is interpreting this limitation as “kernel extract circuit”.
Claim 8, in line 3, “coefficient” should be “coefficients”.
Claim 8, in line 4, “coefficient” should be “coefficients”.
Claim 9, in line 6, “an” should be “and”.
Claim 12 recites “the kernel multiplication circuit” in line 1.  There is a lack of antecedent basis for this limitation in the claim.  For the purposes of prior art examination, the Examiner is interpreting this limitation as “multiply-add circuit”.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 9-13, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Yao et al (US2018/0046903 A1, herein Yao) and Chen et al (Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, herein Chen).
Regarding claim 1,
	Yao teaches a neural processor circuit, comprising: (Yao, Paragraph [0002], Line 2 “In particular, the present invention relates to how to implement and optimize a convolutional neural network based on an embedded FPGA.” In other words, a convolutional neural network based on an embedded FPGA is a neural processor circuit.)
	a kernel access circuit coupled to memory external to the neural processor circuit, the kernel access circuit configured to read compressed kernel data from the memory; and (Yao, Paragraph [0015], Line 1 “It proposes a deep processing unit (DPU) for implementing an Artificial Neural Network (ANN), comprising:  A CPU, configured for scheduling a programmable logic module; an external memory, configured for storing weights and instructions of the ANN and input data to be processed by said ANN; a direct memory access (DMA), connected to the external memory, directly configured by the CPU for communication between the external memory and the programmable logic module; a programmable logic module, comprising: a controller, configured for getting instructions from the external memory and scheduling operations of a computing complex on the basis of the instructions; a computing complex, including a plurality of processing elements (PEs), configured for performing operations on the basis of the instructions, weights, and data; an input buffer, configured for preparing the input data, weights and instructions for the computing complex; an output buffer, configured for […]” In other words, direct memory access (DMA) is kernel access circuit, external memory is external memory, deep processing unit (DPU) is neural processor circuit, and configured by the CPU for communication between the external memory and the programmable logic module is configured to read compressed kernel data from the memory.)
	a plurality of neural engine circuits configured to receive compressed kernel data from the kernel access circuit, each of the neural engine circuits comprising: (Yao, Figure 8A, and  Paragraph [0015], Line 13 “...including a plurality of processing elements (PEs), configured for performing operations on the basis of the instructions...” In other words, plurality of processing elements is plurality of neural engine circuits, and configured for performing instructions is configured to receive kernel data.)
	[a kernel extract circuit configured to extract uncompressed kernel data from the compressed kernel data]; and
	a multiply-add (MAD) circuit coupled to the kernel decompression circuit to receive the uncompressed kernel data (Yao, Paragraph [0015], Line 9 “… a programmable logic module, comprising: a controller, configured for getting instructions from the external memory and scheduling operations of a computing complex on the basis of the instructions;...” In other words, computing complex is multiply-add circuit configured to receive uncompressed kernel data.)
	the MAD circuit further configured to perform neural network operations on a portion of input data using the uncompressed kernel data.  (Yao, Paragraph [0015], Line 13 “... a computing complex, including a plurality of processing elements (PEs), configured for […]” In other words, computing complex is MAD circuit, and configured for performing operation on the basis of instructions, is configured to perform neural network operations.)


	Thus far, Yao does not explicitly teach a kernel extract circuit configured to extract uncompressed kernel data from the compressed kernel data.
	Chen teaches a kernel extract circuit configured to extract uncompressed kernel data from the compressed kernel data. (Chen, Fig. 2, and Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder, and writes it into the GLB.” And, Line 13 “The DRAM access could be further reduced if RLC was applied to filter weights.” In other words, accelerator is kernel extract circuit, filter is kernel, and reads encoded ifmaps from DRAM, decompresses it is extract uncompressed kernel data from the compressed kernel data.)

	One of ordinary skill in the art would be motivated to combine the teachings of Yao and Chen because compressing kernel data saves both storage space and the read/write bandwidth needed to access external memory, improving the speed, memory footprint, and energy consumption of large convolutional neural networks.
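For illustration only, and not as a reproduction of the encoder or decoder disclosed in Chen, the bandwidth saving described above can be sketched with a generic zero run-length code. The function names `rlc_encode` and `rlc_decode` and the (run, value) pair format are assumptions made for this sketch; they are not taken from either reference.

```python
def rlc_encode(values):
    """Encode a dense list as (zeros_before, nonzero_value) pairs.

    Trailing zeros are not stored; the decoder restores them from the
    known output length. This pair format is an assumption of the sketch.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs


def rlc_decode(pairs, length):
    """Expand (zeros_before, nonzero_value) pairs back to `length` entries."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)   # restore the skipped zeros
        out.append(value)       # then the stored non-zero value
    out.extend([0] * (length - len(out)))  # restore trailing zeros
    return out


dense = [0, 0, 5, 0, 3, 0, 0]
encoded = rlc_encode(dense)
restored = rlc_decode(encoded, len(dense))
```

In this sketch a sparse seven-entry vector is carried as two pairs, which is the kind of reduction in external memory traffic the motivation statement relies on.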


Regarding claim 2,
	The combination of Yao and Chen teaches the neural processor circuit of claim 1,
	wherein the kernel extract circuit is configured to extract the uncompressed kernel data by: using a mask indicating locations in the uncompressed kernel data where kernel coefficients are zero or non-zero.  (Chen, Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the […]” In other words, accelerator is kernel extract circuit, filters is kernel, RLC compressed form is mask indicating locations in the uncompressed kernel data where kernel coefficients are zero or non-zero, and accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder is configured to extract the uncompressed kernel data.)
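For illustration only, the claimed mask-based extraction can be sketched as follows. This is a minimal sketch of the claim language as the Examiner reads it, not the applicant's circuit and not Chen's RLC decoder; the function name `extract_kernel` and the one-bit-per-coefficient mask layout are assumptions.

```python
def extract_kernel(mask, nonzero_coeffs):
    """Rebuild dense kernel data from a zero/non-zero mask.

    mask           -- one bit (0 or 1) per coefficient position (assumed layout)
    nonzero_coeffs -- the non-zero coefficients, in position order
    """
    if sum(mask) != len(nonzero_coeffs):
        raise ValueError("mask must have one set bit per non-zero coefficient")
    coeffs = iter(nonzero_coeffs)
    # A set bit consumes the next stored coefficient; a clear bit emits zero.
    return [next(coeffs) if bit else 0 for bit in mask]


kernel = extract_kernel([1, 0, 0, 1, 0, 1, 0, 0], [7, -2, 4])
```

Under these assumptions, eight coefficient positions are carried as an eight-bit mask plus three stored values.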
Regarding claim 3,
	The combination of Yao and Chen teaches the neural processor circuit of claim 1,
	wherein the compressed kernel data includes a plurality of kernel groups, each kernel group comprising (Chen, Page 130, Column 1, Paragraph 1, Line 1 “2-D Convolution PE Set: A 2-D convolution is composed of many 1-D convolution primitives, and its computation 1) shares the same row of filter or ifmap across primitives and 2) accumulates the psums from multiple primitives together. Therefore, a PE Set, as shown in Fig. 4, is grouped to run a 2-D convolution and exploit the interprimitive convolutional reuse and psum accumulation, which avoids data accesses from GLB and DRAM.” In other words, filter is kernel, and PE Set is kernel group.)
	a mask indicating locations in the uncompressed kernel data where the kernel coefficients are zero or non-zero for a number of kernel coefficients, and up to a same number of the kernel coefficients that are non-zero.  (Chen, Fig. 8, and Page 132, Column 2, Paragraph 5, Line 1, “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder, and writes it into the GLB.” In other words, RLC compressed form is mask, and decompresses it with the RLC decoder is indicating locations where the coefficients are zero or non-zero.)


Regarding claim 4,
	The combination of Yao and Chen teaches the neural processor circuit of claim 1,
	wherein the MAD circuit is configured to perform the neural network operations on the portion of input data using the uncompressed kernel data by: responsive to a kernel coefficient being non-zero, multiplying a kernel coefficient with a corresponding value of the input data.  (Yao, Paragraph [0015], Line 13 “... a computing complex, including a plurality of processing elements (PEs), configured for performing operation on the basis of the instructions, weights, and data;…” In other words, computing complex is MAD circuit, and configured for performing operation on the basis of instructions, is configured to perform neural network operations on the uncompressed kernel data.)
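For illustration only, the limitation “responsive to a kernel coefficient being non-zero, multiplying” can be sketched as a multiply-accumulate loop that skips zero coefficients. The function name `mad_skip_zeros` and the returned multiply count are assumptions added for this sketch; neither Yao nor Chen is represented as disclosing this code.

```python
def mad_skip_zeros(kernel, inputs):
    """Multiply-accumulate that multiplies only for non-zero coefficients.

    Returns the accumulated sum and the number of multiplies performed,
    so the saving from skipped zeros is visible.
    """
    acc = 0
    multiplies = 0
    for k, x in zip(kernel, inputs):
        if k != 0:          # responsive to the coefficient being non-zero
            acc += k * x
            multiplies += 1
    return acc, multiplies


result, count = mad_skip_zeros([7, 0, 0, -2], [1, 2, 3, 4])
```

The result equals the ordinary dot product; only the number of multiplications changes.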
Regarding claim 5,
	The combination of Yao and Chen teaches the neural processor circuit of claim 1,
	wherein the kernel extract circuit is further configured to extract at least one of a MAD parameter or a post-processor parameter from the compressed kernel data (Chen, Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded […]” In other words, accelerator is kernel extract circuit, ofmaps are at least one of a MAD parameter from the compressed data.)
	 the MAD parameter sent to the MAD circuit of each of the neural engine circuits to configure operations of the MAD circuit, the post-processor parameter sent to a post-processor of each of the neural engine circuits to configure operations of the post-processor.  (Chen, Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder, and writes it into the GLB.” And, Line 5, “The computed ofmaps are read from the GLB, processed by the ReLU module optionally, compressed by the RLC encoder, and transmitted to the DRAM.” And, page 132, Column 2, Paragraph 6, line 1 “ The Eyeriss accelerator has a GLB of 108 kB that can communicate with DRAM through the asynchronous interface and with the PE array through the NoC.” In other words, ofmaps are at least one of a MAD parameter, the PE array through the NoC is parameter sent to the MAD circuit of each of the neural engine circuits, and transmitted to the DRAM is sent to a post-processor. Examiner notes that in the instant application the post-processor receives information from the accumulator and then forwards to Output, whereas in Chen, ofmaps are forwarded to the DRAM which then forwards to Output.)
Regarding claim 6,
	The combination of Yao and Chen teaches the neural processor circuit of claim 1,
wherein the kernel extract circuit comprises: a reconstruction circuit configured to reconstruct uncompressed kernel data corresponding to each of the neural engine circuits, and (Chen, Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder, and writes it into the GLB.” And, Line 5, “The computed ofmaps are read from the GLB, processed by the ReLU module optionally, compressed by the RLC encoder, and transmitted to the DRAM.” And Column 2, Paragraph 6, Line 1, “The Eyeriss accelerator has a GLB of 108 kB that can communicate with DRAM through the asynchronous interface and with the PE array through the NoC.   The GLB stores all the three types of data: ifmaps, filters, and psums/ofmaps.”  In other words, the RLC decoder is the reconstruction circuit, decompresses is reconstruct uncompressed data (examiner is interpreting “configured to reconstruct uncompressed kernel data” as making the recently decompressed data into data that can be used for convolutions), GLB of 108 kB that can communicate with DRAM through the asynchronous interface and with the PE array is corresponding to each of the neural engine circuits.)
	a kernel look-ahead buffer configured to store the uncompressed kernel data. (Chen, Page 132, Column 2, Paragraph 5, Line 1 “Except for the input data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the DRAM.  The accelerator reads the encoded ifmaps from DRAM, decompresses it with the RLC decoder, and writes it into the GLB.” And, Line 5, “The computed ofmaps are read from the GLB, processed by the ReLU module optionally, compressed by the RLC encoder, and transmitted to the DRAM.” And, Column 2, Paragraph 6, Line 1 “The Eyeriss accelerator has a GLB of 108 kB that can communicate with […]” In other words, the GLB is the kernel look-ahead buffer.)
Claim 9 is the method claim corresponding to the neural processor circuit of claim 1.   
	In addition to the neural processor circuit, Yao teaches the corresponding method (Yao, Paragraph [0013], Line 1 “It proposes a method for optimizing an Artificial Neural Network (ANN), said ANN at least comprises convolutional layers CONV 1, CONV2, … CONV n, and fully connected layers FC 1, FC 2, …, FC m, wherein n and m are positive integers, said ANN can receive a data set as input and process said data set by said CONV 1, … CONV n, FC 1, … FC m in sequence and provide a corresponding feature map set as each layer’s output, said method comprising: compressing step for compressing weights of said convolutional layers CONV 1, CONV 2, … CONV n, and fully connected layers FC 1, FC 2, … FC m of said ANN; fix-point quantization step for converting floating-point numbers into fixed-point numbers, including: weight quantization step, for converting weights of said convolutional layers CONV 1, CONV 2,… CONV n, and fully connected layers FC 1, FC 2, …, FC m of the compressed ANN from floating-point numbers into fixed-point numbers, wherein the numerical range of quantization is dynamically chosen for different layers while remains static in one layer; data quantization step, for converting data of feature map sets j from floating-point numbers into fixed-point numbers, wherein the numerical range of quantization is dynamically chosen for different feature map sets while remains static in one feature map set, wherein said feature map sets j are output by said CONV layers and FC layers of said ANN; compiling step, for compiling said compressed ANN to generate instructions to be executed by an ANN accelerator, so as to implement said ANN on said ANN accelerator; wherein the compiling step is conducted on the basis of the quantized […]”)
Claims 10, 11, 12, and 13 are method claims corresponding to processor claims 2, 4, 5, and 6 respectively.  Accordingly, claims 10, 11, 12, and 13 are rejected for the same reasons as claims 2, 4, 5, and 6.
Claim 15 is an integrated circuit system claim corresponding to neural processor circuit claim 1. In addition to a neural processor circuit, Yao teaches an integrated circuit system. (Yao, Paragraph [0002], Line 2 “In particular, the present invention relates to how to implement and optimize a convolutional neural network based on an embedded FPGA.” An FPGA is a processor circuit and an integrated circuit system.)  Therefore, claim 15 is rejected for the same reasons as claim 1.
Claims 16, 17, 18, and 19 are the integrated circuit system claims corresponding to the neural processor circuit claims 2, 4, 5, and 6. Claims 16, 17, 18, and 19 are rejected for the same reasons as claims 2, 4, 5, and 6.
Claims 7, 8, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yao et al (US2018/0046903 A1, herein Yao), Chen et al (Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, herein Chen), and Bagherinezhad et al (LCNN: Lookup-based Convolutional Neural Network, herein Bagherinezhad).
Regarding claim 7,
	The combination of Yao and Chen teaches the neural processor circuit of claim 6,
	[wherein the kernel extract circuit further comprises a look-up table storage configured to store look-up tables storing representative kernel coefficients of kernels identified by index values], and
	the representative kernel coefficients are determined during a compilation process based on kernel coefficients of original kernel data before compression. (Chen, Page 133, Column 1, Paragraph 1, Line 4 “Even though it is not required by the dataflow, the remaining 8 kB (two banks of 512-b X 64-b SRAMS) of the GLB is allocated for filter weights to compensate for insufficient off-chip traffic bandwidth.  While the PE array is working on a processing pass, the GLB preloads the filters used by the next processing pass.” In other words, the filter weights are the kernel coefficients.)
	Thus far, the combination of Yao and Chen does not explicitly teach wherein the kernel extract circuit further comprises a look-up table storage configured to store look-up tables storing representative kernel coefficients of kernels identified by index values.
	Bagherinezhad teaches wherein the kernel extract circuit further comprises a look-up table storage configured to store look-up tables storing representative kernel coefficients of kernels identified by index values, (Bagherinezhad, Figure 1, and Page 1, Column 2, Paragraph 2, Line 1 “Recent work on efficient deep learning have focused on model compression and reducing the computational precision of operations in neural networks [3, 15, 35].  CNNs suffer from over-parameterization [7] and often encode highly correlated parameters [22], resulting in inefficient computation and memory usage [7].  Our key insight is to leverage the correlation between the parameters and representing the space of parameters by a compact set of weight vectors, called dictionary.  In this paper, we introduce LCNN, a lookup-based convolutional […]” In other words, the dictionary is a look-up table, coefficients are coefficients, and indices are index values.)
	Bagherinezhad and the combination of Yao and Chen are both directed to speeding up inference in Convolutional Neural Networks.  In view of the teaching of Bagherinezhad, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Bagherinezhad into the combination of Yao and Chen. This would result in using look-up tables to store representative kernel coefficients of kernels identified by index values.
	One of ordinary skill in the art would be motivated to do so because using look up tables can improve the speed and space constraints of inference during operation of a convolutional neural network.
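For illustration only, the look-up table limitation can be sketched with a plain nearest-value table. This is not the dictionary method of Bagherinezhad, which learns linear combinations of dictionary vectors; it is a simplified stand-in. The function names `build_indices` and `reconstruct_from_lut` and the example table values are assumptions made for this sketch.

```python
def build_indices(original_coeffs, lut):
    """Compile-time step: map each original coefficient to the index of
    its nearest representative coefficient in the look-up table."""
    return [
        min(range(len(lut)), key=lambda i: abs(lut[i] - c))
        for c in original_coeffs
    ]


def reconstruct_from_lut(lut, indices):
    """Run-time step: recover representative coefficients from index values."""
    return [lut[i] for i in indices]


lut = [-0.5, 0.0, 0.25, 1.0]           # representative coefficients (assumed)
original = [0.9, 0.02, -0.4, 0.3]      # original kernel data before compression
indices = build_indices(original, lut)
approx = reconstruct_from_lut(lut, indices)
```

In this sketch only the small index values need to be stored per coefficient, with the table itself stored once, which is the space saving the motivation statement refers to.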



Regarding claim 8,
	The combination of Yao, Chen and Bagherinezhad teaches the neural processor circuit of claim 6,
	wherein the kernel look-ahead buffer is configured to send information on locations of kernel coefficients that are zero in the uncompressed kernel data to the MAD circuit before sending kernel coefficient that are non-zero so that the MAD circuit skips operation associated with the kernel coefficient that are zero.  (Chen, Fig. 2, and Page 133, Column 1, Paragraph 1, Line 4 “Even though it is not required by the dataflow, the remaining 8 kB (two banks of 512-b X 64-b SRAMS) of the GLB is allocated for filter weights to compensate for insufficient off-chip traffic bandwidth.  While the PE array is working on a processing pass, the GLB preloads the filters used by the next processing pass.” In other words, the GLB is the kernel look-ahead buffer which preloads the filters to be used by the next processing pass, and filter weights are kernel coefficients.)  
Regarding claim 14,
	The combination of Yao, Chen and Bagherinezhad teaches the method of claim 13, further comprising:
	storing representative kernel coefficients of kernels identified by index values into look-up table storage; (Bagherinezhad, Figure 1, and Page 1, Column 2, Paragraph 2, Line 1 “Recent work on efficient deep learning have focused on model compression and reducing the computational precision of operations in neural networks [3, 15, 35].  CNNs suffer from over-parameterization [7] and often encode highly correlated parameters [22], resulting in inefficient […]” In other words, the dictionary is a look-up table, coefficients are coefficients, and indices are index values.)
	sending, by the kernel look-ahead buffer, information on locations of kernel coefficients that are zero in the uncompressed kernel data to the MAD circuit before sending kernel coefficient that are non-zero so that the MAD circuit skips operation associated with the kernel coefficient that are zero. (Chen, Fig. 2, and Page 133, Column 1, Paragraph 1, Line 4 “Even though it is not required by the dataflow, the remaining 8 kB (two banks of 512-b X 64-b SRAMS) of the GLB is allocated for filter weights to compensate for insufficient off-chip traffic bandwidth.  While the PE array is working on a processing pass, the GLB preloads the filters used by the next processing pass.” In other words, the GLB is the kernel look-ahead buffer which preloads the filters to be used by the next processing pass, and filter weights are kernel coefficients.)
Regarding claim 20,
	The combination of Yao, Chen and Bagherinezhad teaches the IC system of claim 19,
	wherein the kernel extract circuit further comprises look-up table storage configured to store look-up tables storing representative kernel coefficients of kernels identified by index values, and (Bagherinezhad, Figure 1, and Page 1, Column 2, Paragraph 2, Line 1 “Recent work on efficient deep learning have focused on model compression and reducing the computational precision of operations in neural networks [3, 15, 35].  CNNs suffer from over-parameterization [7] and often encode highly correlated parameters [22], resulting in inefficient computation and memory usage [7].  Our key insight is to leverage the correlation between the parameters and representing the space of parameters by a compact set of weight vectors, called dictionary.  In this paper, we introduce LCNN, a lookup-based convolutional neural network that encodes convolutions by few lookups to a dictionary that is trained to cover the space of weight in CNNs.  Training LCNN involves jointly learning a dictionary and a small set of linear combinations.  The size of the dictionary naturally traces a spectrum of trade-offs between efficiency and accuracy.” And from Figure 1 “lookup indices and there coefficients are stored in tensors I and C.” In other words, the dictionary is a look-up table, coefficients are coefficients, and indices are index values.)
	the kernel look-ahead buffer is configured to send information on locations of kernel coefficients that are zero in the uncompressed kernel data to the MAD circuit before sending kernel coefficient that are non-zero so that the MAD circuit skips operation associated with the kernel coefficient that are zero. (Chen, Fig. 2, and Page 133, Column 1, Paragraph 1, Line 4 “Even though it is not required by the dataflow, the remaining 8 kB (two banks of 512-b X 64-b SRAMS) of the GLB is allocated for filter weights to compensate for insufficient off-chip traffic bandwidth.  While the PE array is working on a processing pass, the GLB preloads the filters used by the next processing pass.” In other words, the GLB is the kernel look-ahead buffer which preloads the filters to be used by the next processing pass, and filter weights are kernel coefficients.)
Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359.  The examiner can normally be reached on Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at 571-270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/Vincent Gonzales/Primary Examiner, Art Unit 2124