DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Examiner notes the entry of the following papers:
Amended claims filed 1/25/2022.
Applicant arguments/remarks made in amendment filed 1/25/2022.

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/25/2022 has been entered.
 
Claims 1, 11, 12, and 20 are amended.
Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments filed 1/25/2022, that the prior art of record does not disclose the amended limitations, are moot in view of a new ground of rejection.  Please see the detailed rejection below.
Applicant argues that in the prior art “there is no mention of determining a plurality of layers based on a received network model, as in the present claims.” (Remarks, page 12, 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 12-13, and 20 are rejected under Sharma et al (From High-Level Deep Network Models to FPGAs, herein Sharma), Wei et al (Automated Systolic Array Architecture 
Regarding claim 1,
	Sharma teaches a computer-implemented method for improving deep neural network performance in a field-programmable gate array, the method comprising: (Sharma, Page 1, Column 1, Paragraph 1, Line 10 “This work tackles these challenges by devising DNNWEAVER, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe[1].”  In other words, DNNWEAVER is a computer-implemented method, generates a synthesizable accelerator is improving performance, DNN is deep neural network, and FPGA is field-programmable gate array.)
	in response to receiving a network model describing a deep neural network, determining a plurality of layers associated with the deep neural network; (Sharma, Page 2, Column 2, Paragraph 1, Line 12 “The input to DNNWEAVER is a high-level specification of the DNN in Berkeley Caffe format [1]. Caffe is a widely used open-source deep learning framework that takes the DNN specification as input and computes the given model on CPUs and GPUs. The code snippet in Figure 2. shows how two DNN layers, convolution and pooling, are described and connected in Caffe.” In other words, high-level specification is network model describing a deep neural network, and computes the given model on CPUs and GPUs is determining a plurality of layers associated with the deep neural network.)

[Image: media_image1.png]

	[with respect to a layer in the plurality of layers, determining a parallelism factor for processing operations associated with the layer simultaneously by processing elements in a field-programmable gate array (FPGA) based on a workload associated with the layer and a configuration of the FPGA,]
including a relationship between an amount of operations associated with the layer, a bandwidth of a memory in the FPGA, and a total bandwidth needed for the processing operations. (Sharma, Algorithm 1, and page 7, column 1, paragraph 1, line 1 “The algorithm takes in as input the DNN macro dataflow graph (D) and the constraints of the FPGA platform (F). The FPGA constraints (F) provide the maximum number of PEs and the capacity of the BRAM in each PE. The algorithm finally outputs the nPEperPU and the sliceSizel by taking the following steps.”

[Image: media_image2.png]

And page 6, column 2, line 49 “[Image: media_image3.png]”
The algorithm iterates through each layer calculating the number of PEs (processing elements) per PU (processing unit) and the slice size on a per layer basis, for the purpose of optimizing the number of execution cycles.  EEC is the workload and represents the number of execution cycles required, which is calculated by layer.  In other words, eec is the amount of operations per layer which is associated with F.BRAM which is the memory in the FPGA, and min executionCycles is the total bandwidth needed for processing the operations.)
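For illustration only, the per-layer search described above can be sketched as follows. This is not Sharma's Algorithm 1, which appears only in the image that is not reproduced here. The function name `pick_layer_config`, the rule that the PE count per PU must divide the PE budget evenly, and the toy cost function are all hypothetical; the cycle estimate is passed in as a parameter so that no particular cost model is implied.

```python
from typing import Callable, Sequence, Tuple


def pick_layer_config(
    max_pes: int,
    bram_capacity: int,
    slice_sizes: Sequence[int],
    estimate_cycles: Callable[[int, int], float],
) -> Tuple[int, int, float]:
    """Return (n_pe_per_pu, slice_size, cycles) with the fewest estimated cycles."""
    best = None
    for n_pe in range(1, max_pes + 1):
        if max_pes % n_pe:
            continue  # assumption: PU size must divide the PE budget evenly
        for size in slice_sizes:
            if size > bram_capacity:
                continue  # the slice must fit in the per-PE BRAM
            cycles = estimate_cycles(n_pe, size)
            if best is None or cycles < best[2]:
                best = (n_pe, size, cycles)
    if best is None:
        raise ValueError("no feasible configuration")
    return best


def toy_cost(n_pe: int, size: int) -> float:
    # Hypothetical cost: compute shrinks with more PEs, overhead grows with them,
    # and larger slices reduce the number of memory round trips.
    return 1024 / n_pe + 16 * n_pe + 4096 / size


best = pick_layer_config(16, 256, [64, 128, 256, 512], toy_cost)
print(best)
```

With these made-up inputs the 512-entry slice is rejected because it exceeds the BRAM capacity, and the search settles on the PE count that balances the two competing terms of the toy cost.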
[determining an amount of bits for one operation associated with the layer based on weight and feature map data loaded into the FPGA, the weights being parameters associated with the layer for training the DNN; and]
with respect to a second layer in the plurality of layers that follows the layer, processing in the FPGA a portion of an input feature map for the layer to obtain an output feature map, the output feature map being an input feature map for the second layer. (Sharma, page 3, column 1, paragraph 3, line 4 “As follows, a typical DNN consists of several back-to-back layers that represent increasingly abstract representations of the input….A convolution operation generates its output by sliding a window of parameters referred to as filters or kernels, over its inputs.  A convolution layer is a set of these convolution operations that combine multiple input features and kernels to generate a single or multiple output feature maps.  The initial layers of DNN are generally these convolution layers.” In other words, back-to-back layers is second layer that follows the layer where the output of one layer is the input to the following layer, input features is input feature map, output feature map is output feature map, and output feature map of one layer is the input feature map of the following layer.)
Thus far, Sharma does not explicitly teach with respect to a layer in the plurality of layers, determining a parallelism factor for processing operations associated with the layer simultaneously by processing elements in a field-programmable gate array (FPGA) based on a workload associated with the layer and a configuration of the FPGA.
	Wei teaches with respect to a layer in the plurality of layers, determining a parallelism factor for processing operations associated with the layer simultaneously by processing elements in a field-programmable gate array (FPGA) based on a workload associated with the layer and a configuration of the FPGA (Wei, Page 2, Column 1, Paragraph 7, Line 1 “We present a novel 2-D systolic array architecture for CNN on FPGA in Fig. 1.  As shown in this figure, each PE shifts the data of W and IN horizontally and vertically to the neighboring PEs at each cycle.  This 2-D topology matches the 2-D structure in the FPGA layout so that it can achieve timing constraints easily because of low routing complexity.  In addition, there is a SIMD vector accumulation inside each PE.  The parallelization factor of the SIMD factor is usually power of two due to the dedicated inter-DSP accumulation interconnect in modern FPGAs.” In other words, parallelization factor is parallelism factor, FPGA is FPGA, and due to the dedicated inter-DSP accumulation interconnect is workload associated with the layer and the configuration of the FPGA.)

[Image: media_image4.png]

	It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Wei into the teaching of Sharma.  This would result in being able to determine a parallelism factor for a layer in a deep neural network (DNN).
	Wei and Sharma are both directed to frameworks that optimize DNN design for implementation on a field-programmable gate array (FPGA).  One of ordinary skill in the art 
	Thus far, the combination of Sharma and Wei does not specifically teach determining an amount of bits for one operation associated with the layer based on weight and feature map data loaded into the FPGA, the weights being parameters associated with the layer for training the DNN.
Park teaches determining an amount of bits for one operation associated with the layer based on weight and feature map data loaded into the FPGA, the weights being parameters associated with the layer for training the DNN; and (Park, page 5457, column 1,  paragraph 3, line 1 “We propose a new multi-bit quantization method for both weights and activations.  Unlike binary quantization approaches, our scheme is able to produce quantization results for any number of bits per weight/activation, thereby realizing much more flexibility for exploiting accuracy-performance trade-off.” And, page 5456, column 1, paragraph 1, line 14 “Moreover, our scheme provides an automated quantization flow based on conventional training algorithms, which greatly reduces the design-time effort to quantize the network.” And, page 5462, column 2, paragraph 3, line 6 “In this study, we evaluate the following four styles of per-layer bitwidth assignment based on AlexNet: monotonically decreasing (DEC), monotonically increasing (INC), concave (Concave), and convex (Convex).  All four schemes are designed to have the same number of bitwidth in total.  For example, DEC assigns 6 bits to each weight/activation in the first convolution layer, while it uses only 2 bits for weights/activation in the last fully connected layer.” In other words, any number of bits per weight/activation is determining an amount of bits,  weight is weight, activation is feature map data, and weights are parameters associated with the layer for training the DNN.)
	Both Park and the combination of Sharma and Wei are directed to optimizing inference in neural network models, among other things.  The combination of Sharma and Wei teaches automatically generating a synthesizable accelerator for a given (DNN, FPGA) pair and implementing a DNN on an FPGA using a systolic array architecture, which can achieve high clock frequency, but does not explicitly teach quantization of weights to speed up inference.  Park teaches quantization of weights and activations to optimize performance.  In view of the teaching of the combination of Sharma and Wei, it would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Park into the combination of Sharma and Wei.  This would result in automatically generating a synthesizable accelerator for a given (DNN, FPGA) pair and optimizing the quantization of weights and activations.
	One of ordinary skill in the art would be motivated to do this because the cost of inference is high, particularly for platforms with tight resource constraints.  (Park, page 1, paragraph 1, line 1 “Quantization is considered as one of the most effective methods to optimize the inference cost of neural network models for their deployment to mobile and embedded systems, which have tight resource constraints.”)
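As a minimal sketch of what a per-layer bit assignment means in practice, the following uses plain uniform quantization, which is a stand-in and not Park's own quantization scheme. The helper `quantize_uniform`, the layer names, the weight values, and the value range are hypothetical; only the 6-bit and 2-bit figures come from the DEC example quoted above.

```python
def quantize_uniform(values, bits, lo=-1.0, hi=1.0):
    """Map each value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    steps = 2 ** bits - 1            # number of intervals between levels
    step = (hi - lo) / steps
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        out.append(lo + round((clipped - lo) / step) * step)
    return out


# Hypothetical layers: 6 bits for the first convolution layer,
# 2 bits for the last fully connected layer (as in the quoted DEC example).
bits_per_layer = {"conv_first": 6, "fc_last": 2}
weights = [-0.9, -0.2, 0.1, 0.75]

for name, bits in bits_per_layer.items():
    quantized = quantize_uniform(weights, bits)
    print(name, bits, quantized)
```

Fewer bits leave fewer distinct levels, so the 2-bit layer can represent at most four different weight values while the 6-bit layer can represent up to sixty-four.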
Regarding claim 2,
	the combination of Sharma, Wei and Park teaches the computer-implemented method of Claim 1, wherein: the workload associated with the layer comprises the amount of operations associated with the layer; and the configuration of the FPGA comprises the total bandwidth needed for the processing operations associated with the plurality of layers in the FPGA and the bandwidth of the memory in the FPGA. (Sharma, Page 2, Column 2, Paragraph 2, Line 12 “We choose this abstraction to provide a unified hardware-software interface and enable layer-specific optimization in the accelerator microarchitecture without exposing them to the software.” And Page 2, Column 2, Paragraph 3, Line 12 “Our Template Resource Optimization algorithm aims to strike a balance between parallel operations and data reuse by slicing computations and configuring the accelerator to best match the constraints of the FPGA (on-chip memory and external memory bandwidth).” In other words, layer-specific optimization is workload associated with the layer, and constraints of the FPGA (on-chip memory and external memory bandwidth) is configuration of the FPGA comprises a total bandwidth required for processing operations.)
Claims 12 and 13 are computer system claims corresponding to method claims 1 and 2 respectively.  Outside of that, they are the same.  It is implicit that a computer-implemented method will be implemented on a computer system with at least one computer processor and at least one computer-readable memory unit.  Therefore, claims 12 and 13 are rejected for the same reasons as claims 1 and 2 respectively.
Claim 20 is a computer program product comprising a computer-readable storage medium that corresponds to method claim 1.  Outside of that, they are the same.  It is implicit that a computer-implemented method would have at least one computer-readable storage medium.  Therefore, claim 20 is rejected for the same reasons as claim 1.
Allowable Subject Matter
Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Claim 14 is a computer system claim corresponding to method claim 3. Otherwise, they are the same.  Therefore, claim 14 is allowable over the prior art for the same reasons as claim 3.  Claims 4-11 and 15-19, respectively, are also allowable over the prior art.
The following is a statement of reasons for the indication of allowable subject matter: Claim 3 requires, among other things, a method for determining a parallelism factor [Equation image: media_image5.png], wherein the parallelism factor for a given layer i (PFi) is calculated as the amount of operations per layer (Nopsi) multiplied by the bandwidth (ABW) of the memory in the FPGA, the product of which is divided by the total bandwidth (NTBW) of the FPGA. NTBW is calculated by [Equation image: media_image6.png], where BPOi is the amount of bits to be loaded into the FPGA for one operation for layer i.  BPOi is calculated by [Equation image: media_image7.png], where DWi is the bit width of the weights, Hi is the height of an output feature map, and Ri is the reuse factor.
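As an illustration only, the relationship stated in words above (operations per layer times memory bandwidth, divided by total bandwidth) can be sketched in Python. The function and parameter names and the sample numbers are hypothetical. NTBW is taken as a given input because its formula, like that of BPOi, appears only in images that are not reproduced here.

```python
def parallelism_factor(n_ops: float, abw: float, ntbw: float) -> float:
    """PF_i = (Nops_i * ABW) / NTBW, following the prose description of claim 3."""
    if ntbw <= 0:
        raise ValueError("total bandwidth (NTBW) must be positive")
    return n_ops * abw / ntbw


# Hypothetical values chosen only to exercise the arithmetic.
layer_ops = [2000.0, 1000.0, 500.0]   # Nops_i for three layers
factors = [parallelism_factor(n, abw=64.0, ntbw=128000.0) for n in layer_ops]
print(factors)
```

Under this reading, a layer with twice the operations receives twice the parallelism factor when ABW and NTBW are held fixed.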
	Sharma teaches a method for implementing and accelerating a deep neural network (DNN) in a field-programmable gate array (FPGA) but does not teach calculating a parallelism factor based on amount of operations associated with the layer and the bandwidth of memory in the FPGA the product of which being divided by the total bandwidth as described in the claimed invention. Other references also teach accelerating DNNs in an FPGA, see (Wei) and Zhang et al (Optimizing FPGA-based Accelerator Design for Deep Convolutional neural 
Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359. The examiner can normally be reached Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/B.I.R./Examiner, Art Unit 2124
/BRIAN M SMITH/Primary Examiner, Art Unit 2122