Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
	
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given by Applicant’s Representative, Adam Langley, in a telephone communication with the Examiner on 07/26/2022.
Claims 9 and 11 have been amended as set forth in the e-mail correspondence (see attached) received from Applicant’s Representative on 07/26/2022.

Reasons for Allowance
The following is an examiner’s statement of reasons for allowance: Claims 1-20 are considered allowable because, when the claims are read in light of the specification, none of the references of record, either alone or in combination, fairly discloses or suggests the combination of limitations specified in the independent claims, including at least:

From independent claims 1 and 13:
detecting, by analyzing the latency model, a computational resource latency bottleneck in a first layer of the neural architecture, the computational resource latency bottleneck indicating that the number of computational resources of the accelerator are fully utilized during the performance of inference of the first layer, and a communication bandwidth latency bottleneck in a second layer of the neural architecture, the communication bandwidth latency bottleneck indicating that utilization of the number of computational resources of the accelerator is limited by the communication bandwidth in transferring the input values, the weight values, and the output values during the inference process of the second layer; 
expanding the latency model for the second layer of the neural architecture by increasing a range of an assigned hyper-parameter among the second plurality of neural architecture hyper-parameters constrained by the neural architecture in response to detecting the communication bandwidth latency bottleneck in the second layer; and 
determining a value of the assigned hyper-parameter that reduces the overall latency of the inference process or increases the accuracy in the inference process.
	
From independent claim 18:
a detecting section configured to detect, by analyzing the latency model, a computational resource latency bottleneck in a first layer of the neural architecture, the computational resource latency bottleneck indicating that the number of computational resources of the accelerator are fully utilized during the performance of inference of the first layer, and a communication bandwidth latency bottleneck in a second layer of the neural architecture, the communication bandwidth latency bottleneck indicating that utilization of the number of computational resources of the accelerator is limited by the communication bandwidth in transferring the input values, the weight values, and the output values during the inference process of the second layer; 
an expanding section configured to expand the latency model for the second layer of the neural architecture by increasing a range of an assigned hyper-parameter among the second plurality of neural architecture hyper-parameters constrained by the neural architecture in response to detecting the communication bandwidth latency bottleneck in the second layer; and 
wherein the determining section is further configured to determine a value of the assigned hyper-parameter that reduces the overall latency of the inference process or increases the accuracy in the inference process.
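For illustration only, the detection, expansion, and determination steps recited above can be approximated by the minimal Python sketch below. It is not the applicant's disclosed method and rests on stated assumptions. The latency model is assumed to be roofline-style, so a layer's latency is the larger of its compute time and its data-transfer time. Kernel size serves as a hypothetical "assigned hyper-parameter". The latency of the compute-bound layer is treated as a hypothetical budget that the bandwidth-bound layer may grow into. All names (Layer, estimate, bottleneck, expand_and_choose) and all numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kernel: int                  # hypothetical assigned hyper-parameter: kernel size k
    macs_per_tap: float          # MACs per kernel tap (k * k taps in total)
    weight_bytes_per_tap: float  # weight bytes per kernel tap
    act_bytes: float             # input + output activation bytes


def estimate(layer, k, num_pes, bandwidth):
    """Assumed roofline-style latency model (cycles) for one layer at kernel size k."""
    compute = layer.macs_per_tap * k * k / num_pes  # idealized: every PE busy each cycle
    comm = (layer.act_bytes + layer.weight_bytes_per_tap * k * k) / bandwidth
    return max(compute, comm), compute, comm


def bottleneck(layer, num_pes, bandwidth):
    """'compute' if the PEs limit the layer, 'bandwidth' if data transfer does."""
    _, compute, comm = estimate(layer, layer.kernel, num_pes, bandwidth)
    return "compute" if compute >= comm else "bandwidth"


def expand_and_choose(layer, candidates, num_pes, bandwidth, budget):
    """Widen the kernel-size range of a bandwidth-bound layer and return the
    largest kernel whose estimated latency stays within `budget`."""
    if bottleneck(layer, num_pes, bandwidth) != "bandwidth":
        return layer.kernel  # range is left unchanged for non-bandwidth-bound layers
    expanded = sorted({layer.kernel, *candidates})
    feasible = [k for k in expanded
                if estimate(layer, k, num_pes, bandwidth)[0] <= budget]
    return max(feasible, default=layer.kernel)


# Hypothetical two-layer example: 100 PEs, 10 bytes/cycle of off-chip bandwidth.
PES, BW = 100, 10
conv1 = Layer("conv1", 3, 10_000, 100, 1_000)
conv2 = Layer("conv2", 3, 1_000, 10, 5_000)
print(bottleneck(conv1, PES, BW), bottleneck(conv2, PES, BW))  # compute bandwidth
budget = estimate(conv1, conv1.kernel, PES, BW)[0]              # 900.0 cycles
print(expand_and_choose(conv2, (5, 7, 9, 11), PES, BW, budget))  # 9
```

Under these assumptions, conv2 leaves most of its processing elements idle while it waits on data transfer. Widening its kernel range therefore lets the search choose k = 9, the largest candidate whose estimated latency (810 cycles) stays within conv1's 900 cycles. Choosing the largest feasible kernel is only a stand-in for "increases the accuracy".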

The closest prior art of record, Jiang et al. (Hardware/Software Co-Exploration of Neural Architectures), discloses jointly exploring the neural architecture search space and the hardware design space of an FPGA, thereby significantly pushing forward the design Pareto frontier of the tradeoff between accuracy and hardware efficiency.

Umuroglu et al. (FINN: A Framework for Fast, Scalable Binarized Neural Network Inference) teaches using separate compute engines, each dedicated to one layer and connected by on-chip data streams, which avoids most accesses to off-chip memory. It further teaches minimizing latency (the time to finish classifying one image) by overlapping computation and communication.

Shen et al. (Maximizing CNN Accelerator Efficiency Through Resource Partitioning) teaches transferring input, output, and weight data between on-chip buffers and off-chip memory. It also teaches double-buffering, in which each memory is provisioned with twice the required capacity, so that data transfer overlaps with computation.

Tan et al. (MnasNet: Platform-Aware Neural Architecture Search for Mobile) discloses an automated mobile neural architecture search (MNAS) approach that explicitly incorporates model latency into the main objective, so that the search can identify a model that achieves a good trade-off between accuracy and latency. In addition, it uses a factorized hierarchical search space that allows layers to be architecturally different while still striking the right balance between flexibility and search space size.

However, none of the references discloses in detail the limitations of independent claims 1, 13, and 18 quoted above, as recited in the claims, for the purpose of selecting, from among the plurality of neural architectures, a neural architecture based on the overall latency and the accuracy with hardware and neural architecture co-search.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Claims 1-20 are allowed.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Fri 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/S.K./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129