Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 01/17/2022 have been fully considered but they are not persuasive.
In Remarks, pp. 12-13, Applicant contends: 
“None of the references of record, alone or in combination, disclose or suggest the subject matter recited in the pending claims. For example, Jiang, Chen, and Yang all fail to disclose, as recited in claim 1, "determining a latency model of a plurality of hyper-parameters to execute the performance of inference of the neural architecture by the accelerator while interfacing with an external memory storing activation data, the plurality of hyper-parameters include a bandwidth allocation during performance of inference of the neural architecture by the accelerator, the bandwidth allocation representing an amount of bandwidth allocated between the accelerator and the external memory."”

Examiner’s response:
The relevant claim limitation appears to be 
“determining a latency model of a plurality of hyper-parameters to execute the performance of inference of the neural architecture by the accelerator while interfacing with an external memory storing activation data, 
the plurality of hyper-parameters include a bandwidth allocation during performance of inference of the neural architecture by the accelerator, 
the bandwidth allocation representing an amount of bandwidth allocated between the accelerator and the external memory.”

Regarding the “determining …” limitation, 
as noted in the rejections, Jiang teaches 
[figs 1-3] “Hyperparameters of child network”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.” [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’”; 

In other words, Jiang teaches a hardware and software co-exploration framework for neural architecture search (NAS). The framework determines an overall latency based on hyper-parameters (i.e. “determining a latency model of a plurality of hyper-parameters”, cf. “The latency of a pipeline stage under an assignment function can be easily captured”, “Hyperparameters” and “HW/SW co-exploration framework” of fig 3) and finds a neural architecture (i.e. “inference of the neural architecture”, cf. “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process”) with multiple FPGAs (i.e. “accelerator”, cf. “three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory”) which have an on-chip memory and communication between them (i.e. “external memory”, cf. “such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs”). Note that Yang also teaches “external memory”.
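For illustration only, the relationship quoted from Jiang sec II.C (utilization of FPGA fi equal to Lati × TS, with a high average utilization desired) can be sketched as follows. This is a minimal sketch, not Jiang's implementation: the function names, the units (seconds per input, inputs per second) and the numeric values are assumptions introduced here.

```python
from statistics import mean


def fpga_utilization(latency_s: float, throughput_spec: float) -> float:
    """Utilization of one FPGA as Lat_i x TS, per the quoted sec II.C.

    Assumed units: latency in seconds per input, TS in inputs per second.
    """
    return latency_s * throughput_spec


def average_utilization(latencies_s, throughput_spec: float) -> float:
    """Average utilization over all FPGAs in the pipeline."""
    return mean(fpga_utilization(lat, throughput_spec) for lat in latencies_s)


# Hypothetical per-stage latencies for a three-FPGA pipeline (not from Jiang).
latencies = [0.002, 0.0015, 0.001]
ts = 400.0  # hypothetical throughput specification
avg_u = average_utilization(latencies, ts)
print(f"average utilization: {avg_u:.2f}")
```

Under these assumed units, a stage whose utilization approaches 1 is busy for nearly the whole period allowed by the throughput specification.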

Regarding the “the plurality of hyper-parameters …” limitation, 
as noted in the rejections, Jiang teaches 
[figs 1-2]; [fig 3] “Hyperparameters of child network”, “(2) Iteratively train the controller to maximize utilization of each FPGA” and “the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”

and, Shen teaches
[figs 1 and 6] [listing 3] “OptimizeMultiCLP (cnn,Ndsp,Nbram,bw)” [sec 1] “We develop an optimization algorithm that, given CNN layer dimensions and a resource budget, computes a partitioning of the FPGA resources into multiple CLPs for an efficient high-performance design. Our algorithm runs in minutes and produces a set of CLP dimensions. We then use these dimensions to parameterize a CLP design specified using high-level synthesis (HLS), combining the resulting CLPs to form a complete CNN implementation.” [sec 4.1] “Because the intermediate data are typically too large to hold on chip, all CLPs read their inputs from and write their outputs to off-chip memory.” [sec 4.2] “Modeling Bandwidth Usage. We are primarily focused on the peak bandwidth use of a CLP, to estimate how much bandwidth is needed to support the maximum computation speed. When the peak bandwidth is unavailable on the target platform, the model must be able to estimate the throughput of the accelerator, taking into consideration how compute may be blocked by data transfer. This allows design space exploration to find the best-performing design under a bandwidth limitation.”

In other words, Jiang teaches a hardware and software co-exploration framework for neural architecture search (NAS). The framework uses hyper-parameters (i.e. “the plurality of hyper-parameters”, cf. “Hyperparameters”), and a bandwidth is allocated when a neural architecture is selected by the multiple FPGAs (i.e. “bandwidth allocation during performance of inference of the neural architecture by the accelerator”, cf. “Iteratively train the controller to maximize utilization of each FPGA” and “up to 16.8Gbps of bandwidth”). In addition, Shen teaches modeling bandwidth usage in optimizing multiple CLPs for an efficient high-performance design (i.e. “hyper-parameters include a bandwidth allocation”, cf. “OptimizeMultiCLP (cnn,Ndsp,Nbram,bw)”).
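For illustration only, the idea quoted from Shen sec 4.2, that compute may be blocked by data transfer when peak bandwidth is unavailable, can be sketched with a simple roofline-style estimate. This is an assumed simplification and not Shen's model: the function name, its parameters and the numeric values are hypothetical, and the sketch assumes compute and transfer fully overlap.

```python
def estimated_latency_s(ops: float, peak_ops_per_s: float,
                        bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline-style latency estimate (an assumption, not Shen's model).

    The layer takes as long as the slower of computation and off-chip
    data transfer, assuming the two overlap perfectly.
    """
    compute_time = ops / peak_ops_per_s
    transfer_time = bytes_moved / bandwidth_bytes_per_s
    return max(compute_time, transfer_time)


# Hypothetical layer: 1e9 operations, 1e8 bytes of off-chip traffic.
ample_bw = estimated_latency_s(1e9, 1e11, 1e8, 1e11)    # compute-bound
limited_bw = estimated_latency_s(1e9, 1e11, 1e8, 1e9)   # bandwidth-bound
print(ample_bw, limited_bw)
```

In this sketch the bandwidth value acts as a tunable input of the latency estimate, which is the sense in which a bandwidth budget such as the `bw` argument of “OptimizeMultiCLP (cnn,Ndsp,Nbram,bw)” constrains design space exploration.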

Regarding the “the bandwidth allocation …” limitation, 
as noted in the rejections, Jiang teaches 
[figs 1-2]; [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”

In other words, Jiang teaches a hardware and software co-exploration framework for neural architecture search (NAS). The framework uses a bandwidth which is allocated among multiple FPGAs (i.e. “amount of bandwidth allocated between the accelerator and the external memory”, cf. “such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs”). Note that Yang also teaches “external memory”.

Therefore, the applicant’s arguments are not convincing.

In Remarks, p. 14, Applicant contends: 
“Furthermore, claims 2 and 14 have been amended to clarify the concept of communication bandwidth in more detail: "wherein the bandwidth allocation represents an amount of bandwidth between the accelerator and the external memory for each of input values, weight values, and output values." Even assuming, arguendo, that the references disclose the raw concept of communication bandwidth as a hyper-parameter in a latency model, there is nothing in Jiang, Chen, Yang, or Jiang2019 that teaches or fairly describes these three aspects of bandwidth.”

Examiner’s response:
The relevant claim limitation appears to be 
“the bandwidth allocation represents an amount of bandwidth between the accelerator and the external memory for each of input values, weight values, and output values.”

As noted in the rejections, Jiang teaches 
[figs 1-2] [figs 3-5] “Reward(A,U)” [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network.” [sec III.A-C] “Figure 3 shows the HW/SW co-exploration framework. The framework contains a RNN based controller and two levels of explorations. … Finally, we compute the reward to update the controller using the following formula. Reward(A, U) = β × A + (1 − β) × U (2) where β is an adjustment parameter, which reflects the bias on test accuracy and hardware utilization. The value of β ranges from 0 to 1. We will discuss how to scale β in Section V. After that, we update the controller using the reward by applying the policy gradient reinforcement learning, which is the same as that in FE level. As shown in Figure 5, all RNN cells share the same weights and states in this level, since we have only one reward.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”

In other words, Jiang teaches a hardware and software co-exploration framework for neural architecture search (NAS). The framework uses a bandwidth which is allocated among multiple FPGAs (i.e. “amount of bandwidth between the accelerator and the external memory”, cf. “such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs”). In addition, the HW/SW co-exploration framework uses different input, weight and output values as shown in figs 3-5 (i.e. “for each of input values, weight values, and output values”, cf. figs 3-5).
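For illustration only, the reward quoted from Jiang sec III, Reward(A, U) = β × A + (1 − β) × U with β ranging from 0 to 1, can be sketched as follows. The function name and the sample values are assumptions introduced here, not Jiang's code.

```python
def reward(accuracy: float, utilization: float, beta: float) -> float:
    """Reward(A, U) = beta * A + (1 - beta) * U, per the quoted formula (2).

    beta reflects the bias between test accuracy and hardware utilization
    and is restricted to the range [0, 1].
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * accuracy + (1.0 - beta) * utilization


# Hypothetical child network: 90% accuracy, 60% average utilization.
balanced = reward(0.9, 0.6, beta=0.5)
accuracy_only = reward(0.9, 0.6, beta=1.0)
print(balanced, accuracy_only)
```

With β = 1 the reward depends on accuracy alone, and with β = 0 it depends on hardware utilization alone.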

Therefore, the applicant’s arguments are not convincing.

In Remarks, p. 15, Applicant contends: 
“claims 3 and 15 have been amended to clarify the concept of latency bottleneck detection in more detail: "detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors" Oh discloses a "Bottleneck Part" of a neural architecture, which is very different from the concept of a latency bottleneck in a latency model of a layer of a neural architecture. The Office Action cites portions of Umuroglu in an effort to show the individual actions that are recited in claims 3 and 15 as latency factors. However, at best this is just a disclosure of the actions themselves. None of the references disclose analyzing such individual actions as latency factors to determine which action causes the highest latency.”

Examiner’s response:
First of all, it does not appear that “analyzing such individual actions as latency factors to determine which action causes the highest latency” is clearly and effectively reflected in the claims.

The relevant claim limitation appears to be 
“detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors”

As noted in the rejections, Oh teaches 
[figs 2-3]; [fig 4] “Layer-wise latency results on NVIDIA TX2”; [sec III.A.1)] “To gain an intuition regarding the value of the latency and the factors controlling it, we measured the latency of the bottleneck structure of MobileNetv2 [19] layer by layer. The results are indicated in green in Figure 4. As can be seen from the Figure, depthwise convolution has approximately twice the latency of the pointwise convolution, even though the computation is only about one-seventh that of the pointwise convolution before it. According to our detailed analysis of this phenomenon, depthwise convolution is not supported by the GPU-optimized library cuDNN; therefore, it is not possible to perform high-speed computation in this layer as it is in the others. Instead, a standard group convolution function is used to process depthwise convolution. However, this function has not been sufficiently optimized at the GPU level in most DNN frameworks, such as TensorFlow. In addition, we also observed the same latency issue in experiments with the recent CUDA 10 and cuDNN 7. Figure 4 shows the reason for reducing the use of depthwise convolution as the basic strategy of the RR block. In this example, the expansion value CE and reduction value CR are interchanged in the corresponding bottleneck block of MobileNetv2 [19] (from pointwise(CE)–depthwise(CE)– pointwise(CR) to pointwise(CR)–depthwise(CR)– pointwise(CE)). In order to match the amount of computation, the input channel of the first pointwise(CR) convolution of the RR block was increased from CR to CE (64 to 384). The measured latency is shown in orange in Figure 4. This figure shows that the unnecessary latency of the depthwise convolution can be reduced.”

In other words, Oh teaches analyzing latency layer by layer (i.e. “analyzing the latency model for each layer”, cf. “Layer-wise latency” and fig 4) to detect a bottleneck structure (i.e. “detecting … a latency bottleneck among latency factors”, cf. “we measured the latency of the bottleneck structure”).
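For illustration only, a layer-by-layer latency analysis of the kind described in the quoted passage of Oh can be sketched as picking the layer with the largest measured latency. This is a minimal sketch under stated assumptions: the function name and the latency values are hypothetical and are not Oh's measurements; the values only mirror the quoted observation that depthwise convolution has approximately twice the latency of pointwise convolution.

```python
def find_latency_bottleneck(layer_latencies_ms: dict) -> tuple:
    """Return (layer name, latency) for the layer with the highest latency."""
    name = max(layer_latencies_ms, key=layer_latencies_ms.get)
    return name, layer_latencies_ms[name]


# Hypothetical per-layer latencies in milliseconds (illustrative values only).
latencies_ms = {
    "pointwise_expand": 1.0,
    "depthwise": 2.0,
    "pointwise_project": 0.9,
}
bottleneck = find_latency_bottleneck(latencies_ms)
print(bottleneck)
```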

Therefore, the applicant’s arguments are not convincing.

In Remarks, pp. 15-16, Applicant contends: 


Examiner’s response:
The relevant claim limitation appears to be 
“each value constrained by 
the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter, or 
the plurality of hardware design parameters and the neural architecture applicable to the corresponding hyper-parameter where no accuracy-increasing technique or a latency-decreasing technique is assigned to the corresponding hyper-parameter.”

As noted in the rejections, Jiang teaches 
[fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec Abstract] “Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process.” [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired. … Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: • para: parameters of all layers in the child network; • P: the partition of layer set L in the child network; • α: the assignment of pipeline stages to set F; such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized.”

In other words, Jiang teaches a hardware and software co-exploration framework for neural architecture search (NAS). The hyper-parameters are fine-tuned (i.e. “accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter”, cf. “the fast exploration can effectively fine-tune hyperparameters”, and “accuracy” of fig 3) while the accuracy is being maximized (i.e. “the range of an accuracy-increasing technique”, cf. “the accuracy of child network C is maximized”) based on the hyper-parameters under a constraint (i.e. “each value constrained by”, cf. “such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”). 

Therefore, the applicant’s arguments are not convincing.

Claim Objections
Claim 2 is objected to because of the following informalities: “the bandwidth allocation represents an amount of bandwidth between the accelerator and the external memory” may need to read “the bandwidth allocation represents an amount of bandwidth allocated between the accelerator and the external memory” for consistency with the last limitation of claim 1. Appropriate correction is required. Claim 14 is objected to for the same reason.
 
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: 
claim 18, “an obtaining section configured to obtain a specification of a function and a plurality of hardware design parameters, the hardware design parameters including a memory capacity, a number of computational resources, a communication bandwidth, and a template configuration for performing neural architecture inference” (Note that par 126 and fig 13 of the present application describe a sufficient structure for performing the claimed function. In addition, Jiang teaches obtaining function specification and hardware design parameters using multiple FPGAs.)
claim 18, “a determining section configured to determine, for each neural architecture among a plurality of neural architectures, each neural architecture having been trained to perform the function with an accuracy, an overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters” (Note that par 126 and fig 13 of the present application describe a sufficient structure for performing the claimed function. In addition, Jiang teaches determining an overall latency using multiple FPGAs.)
claim 18, “a selecting section configured to select, from among the plurality of neural architectures, a neural architecture based on the overall latency and the accuracy” (Note that par 126 and fig 13 of the present application describe a sufficient structure for performing the claimed function. In addition, Jiang teaches selecting a neural architecture using multiple FPGAs.)

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
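The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.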


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 12-13 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (Hardware/Software Co-Exploration of Neural Architectures) in view of Chen et al. (DetNAS: Neural Architecture Search on Object Detection), further in view of Yang et al. (FPNet: Customized Convolutional Neural Network for FPGA Platforms), and further in view of Shen et al. (Maximizing CNN Accelerator Efficiency Through Resource Partitioning).

Regarding claim 1, 
Jiang teaches

A computer-readable medium including instructions recorded thereon that are executable by a computer to cause the computer to perform operations comprising: 

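obtaining a specification of a function and a plurality of hardware design parameters, the hardware design parameters including a memory capacity, a number of computational resources, a communication bandwidth, and a template configuration for performing neural architecture inference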
(Jiang, [figs 1-2]; [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network. …  Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: • para: parameters of all layers in the child network; • P: the partition of layer set L in the child network; • α: the assignment of pipeline stages to set F; such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized.”; [sec IV] “Datasets: We use CIFAR-10 and ImageNet datasets to study the efficacy of our approach and compare it with the state-of-the-art. During the exploration of child networks, we only use the training images in these datasets, while the test images are used to test the accuracy of the resultant architectures. To evaluate the accuracy in the search process, we randomly select 10% of the samples from the training set as a validation set. All the images undergo the data preprocessing and augmentation procedure, including whitening, upsampling, random cropping, and random horizontal flip, which are common among the related work. … Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”; e.g., “accuracy”, “latency”, “throughput specification” with “CIFAR-10 and ImageNet datasets” may read on “a specification of a function” since any task based on “CIFAR-10 and ImageNet datasets” may read on “function”. 
In addition, e.g., “pool of FPGAs” may read on “template configuration”.);

(Note: Hereinafter, if a limitation has brackets (i.e. [ ]) around claim language, the bracketed claim language has not yet been taught by the current prior art reference but will be taught by another prior art reference afterwards.)

determining, for each neural architecture among a plurality of neural architectures, each neural architecture [having been] trained to perform the function with an accuracy, an overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters 
(Jiang, [figs 1-2]; [fig 3] “the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’”; e.g., “The utilization of FPGA fi is equal to Lati × TS” and “average utilization of all FPGAs” may read on “overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters”.); and 

selecting, from among the plurality of neural architectures, a neural architecture based on the overall latency and the accuracy


(Jiang, [fig 3] “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; e.g., “fast exploration level prunes child networks with inferior hardware utilization” may read on “selecting … a neural architecture”.);

wherein the determining the overall latency further includes determining a latency model of a plurality of hyper-parameters to execute the performance of inference of the neural architecture by the accelerator while interfacing with an external memory storing activation data,
(Jiang, [figs 1-3] “Hyperparameters of child network”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.” [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’”; e.g., “The latency of a pipeline stage under an assignment function can be easily captured with a performance model” may read on “latency model” since the latency may be captured with a performance model. In addition, e.g., data for inference may read on “activation data”. 
Furthermore, e.g., “three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory” along with “such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs” may read on “external memory”.)
the plurality of hyper-parameters [include] a bandwidth allocation during performance of inference of the neural architecture by the accelerator,
(Jiang, [figs 1-2]; [fig 3] “Hyperparameters of child network”, “(2) Iteratively train the controller to maximize utilization of each FPGA” and “the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”; e.g., “The utilization of FPGA fi is equal to Lati × TS” and “average utilization of all FPGAs” along with “Iteratively train the controller to maximize utilization of each FPGA” may read on “during performance of inference of the neural architecture by the accelerator”. 
In addition, e.g., “One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs” may read on “bandwidth allocation”.); and 
the bandwidth allocation representing an amount of bandwidth allocated between the accelerator and the external memory.
(Jiang, [figs 1-2]; [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”);

However, Jiang does not appear to distinctly disclose
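determining, for each neural architecture among a plurality of neural architectures, each neural architecture having been trained to perform the function with an accuracy, an overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters, 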
the plurality of hyper-parameters include a bandwidth allocation during performance of inference of the neural architecture by the accelerator,

(Note: Hereinafter, where a limitation contains underlined claim language, the underlined language has not yet been taught, while the non-underlined language has already been taught.)

Chen teaches
determining, for each neural architecture among a plurality of neural architectures, each neural architecture having been trained to perform the function with an accuracy, an overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters ([fig 2] “The training step includes both ImageNet pre-training and target task finetuning.”; [table 1]; [sec 3.2] “For each individual architecture, only training on ImageNet costs several days on 8 GPUs, in the pretrain-finetune scheme. Training from scratch is a substitute, but it needs much more computation on the target data to compensate.”; Note that Jiang teaches “determining, for each neural architecture among a plurality of neural architectures, each neural architecture [having been] trained to perform the function with an accuracy, an overall latency of performance of inference of the neural architecture by an accelerator within the hardware design parameters”.);

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang with the pre-trained neural architectures of Chen. 
(Chen, sec 4.3 and table 4).

In the alternative, Yang can also be interpreted to teach the following limitation:
Yang teaches
wherein the determining the overall latency further includes determining a latency model of a plurality of hyper-parameters to execute the performance of inference of the neural architecture by the accelerator while interfacing with an external memory storing activation data, … the bandwidth allocation representing an amount of bandwidth allocated between the accelerator and the external memory.
(Yang, [figs 1 and 4]; [sec III] “Inspired by NAS, we adopt a neural architecture search with reinforcement learning to generate CNN models. We define the optimization goal as: 
    [Equation (1) image: media_image1.png]
 (1) α and β are the hyper parameters for NAS. The values of α and β decide the influence of accuracy, latency, and efficiency on the CNN model generation.”; [sec III.C-D] “For each sample model x, we train it on the target task to get its accuracy ACC(x), and run it on the FPGA accelerator to get its inference latency LAT(x) and running energy efficiency EFF(x). We then calculate the reward value R(x) using equation 1. At the end of each step, the parameters θ of the controller are updated by maximizing the expected reward defined by equation 2. The sample-eval-update loop is repeated until it reaches the maximum number of steps or the parameters θ converge. … In [17], a roofline model is developed to relate system performance to off-chip memory traffic and the peak performance provided by the hardware platform [18]. … While the computation resources of configuration are under-utilized because of the inefficient off-chip communication(Bandwidth and memory resources). Configuration 2 performs better than configuration 1.”; Note that Jiang and Chen teach “wherein the determining the overall latency further includes determining a latency model of a plurality of hyper-parameters to execute the performance of inference of the neural architecture by the accelerator while interfacing with an [external] memory storing activation data, … the bandwidth allocation representing an amount of bandwidth allocated between the accelerator and the [external] memory”.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen with the external memory of Yang. 
Doing so would lead to providing an automated neural network architecture search approach utilizing reinforcement learning to design customized neural network model for FPGA platforms (Yang, sec I).
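
For illustration only, the roofline relationship referenced in the Yang quotation above can be sketched as follows. The function and parameter names are hypothetical and do not appear in Yang; the sketch assumes the commonly used form of the roofline model, in which attainable performance is the lesser of the peak compute rate and the product of off-chip bandwidth and operations performed per byte transferred.

```python
def attainable_performance(peak_ops_per_s: float,
                           bandwidth_bytes_per_s: float,
                           ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof.

    All three parameters are illustrative assumptions, not values from Yang.
    """
    bandwidth_roof = bandwidth_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, bandwidth_roof)


if __name__ == "__main__":
    # Low operational intensity: limited by off-chip bandwidth.
    print(attainable_performance(100e9, 4e9, 10))
    # High operational intensity: limited by peak compute.
    print(attainable_performance(100e9, 4e9, 50))
```

Under these assumptions, a configuration whose bandwidth roof lies below its compute roof leaves computation resources under-utilized, which is consistent with the quoted comparison of the two configurations.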

However, the combination of Jiang, Chen, Yang does not appear to distinctly disclose
the plurality of hyper-parameters include a bandwidth allocation during performance of inference of the neural architecture by the accelerator,

Shen teaches
the plurality of hyper-parameters include a bandwidth allocation during performance of inference of the neural architecture by the accelerator,
(Shen, [figs 1 and 6] [listing 3] “OptimizeMultiCLP (cnn,Ndsp,Nbram,bw)” [sec 1] “We develop an optimization algorithm that, given CNN layer dimensions and a resource budget, computes a partitioning of the FPGA resources into multiple CLPs for an efficient high-performance design. Our algorithm runs in minutes and produces a set of CLP dimensions. We then use these dimensions to parameterize a CLP design specified using high-level synthesis (HLS), combining the resulting CLPs to form a complete CNN implementation.” [sec 4.1] “Because the intermediate data are typically too large to hold on chip, all CLPs read their inputs from and write their outputs to off-chip memory.” [sec 4.2] “Modeling Bandwidth Usage. We are primarily focused on the peak bandwidth use of a CLP, to estimate how much bandwidth is needed to support the maximum computation speed. When the peak bandwidth is unavailable on the target platform, the model must be able to estimate the throughput of the accelerator, taking into consideration how compute may be blocked by data transfer. This allows design space exploration to find the best-performing design under a bandwidth limitation.”;)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang with the bandwidth allocation of Shen. 
Doing so would lead to providing an optimization algorithm that, given CNN layer dimensions and a resource budget, computes a partitioning of the FPGA resources into multiple CLPs for an efficient high-performance design.
(Shen, [sec 1], “We develop an optimization algorithm that, given CNN layer dimensions and a resource budget, computes a partitioning of the FPGA resources into multiple CLPs for an efficient high-performance design.”)

Regarding claim 12
The combination of Jiang, Chen, Yang, Shen teaches claim 1.

Jiang further teaches 
selecting, from among neural architectures within the plurality of neural architectures that have a latency of performance of inference that is lower than a threshold latency value, a neural architecture trained to perform the function with a greatest accuracy 
(Jiang, [fig 2]; [fig 3] “RNN Controller” and “Reward” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. … The system will continuously obtain inputs from the dataset with a fixed rate (frame per second), and generate output data from the last pipeline stage. The input rate of the system reflects the throughput specification TS, which implies that the latency of each pipeline stage should be no more than 1/TS. The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). … Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: … such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”; [sec III.B] “After we form the RNNs, we apply reinforcement learning to update the parameters in those N RNNs, and use these RNNs to predict the hyperparameters of child networks.”; e.g., “the latency of each pipeline stage should be no more than 1/TS” and “the pipeline FPGA system can meet the required throughput TS” may read on “lower than a threshold latency value”.).
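
For illustration only, the selecting step recited in claim 12 can be sketched as a filter followed by a maximum. The `Candidate` type, its field names, and the millisecond unit are hypothetical; the sketch is not taken from Jiang and assumes that each candidate's accuracy and latency are already known.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical record for one trained neural architecture."""
    name: str
    accuracy: float    # accuracy on the target function, in [0, 1]
    latency_ms: float  # latency of performance of inference


def select_architecture(candidates: Sequence[Candidate],
                        threshold_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose latency is below the threshold."""
    feasible = [c for c in candidates if c.latency_ms < threshold_ms]
    if not feasible:
        return None  # no candidate meets the latency threshold
    return max(feasible, key=lambda c: c.accuracy)


if __name__ == "__main__":
    pool = [Candidate("A", 0.90, 12.0),
            Candidate("B", 0.93, 25.0),
            Candidate("C", 0.91, 9.0)]
    print(select_architecture(pool, threshold_ms=20.0))
```

In Jiang's terms, the threshold would correspond to 1/TS, the per-stage latency bound implied by the throughput specification.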

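Regarding claim 13, 
Claim 13 is a method claim corresponding to claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. 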

Regarding claim 18
Claim 18 is an apparatus claim corresponding to claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. 

Claims 2 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (Hardware/Software Co-Exploration of Neural Architectures) in view of Chen et al. (DetNAS: Neural Architecture Search on Object Detection), further in view of Yang et al. (FPNet: Customized Convolutional Neural Network for FPGA Platforms), further in view of Shen et al. (Maximizing CNN Accelerator Efficiency Through Resource Partitioning) further in view of Jiang et al. (“Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search”, hereinafter Jiang2019).

Regarding claim 2
The combination of Jiang, Chen, Yang, Shen teaches claim 1.

Jiang further teaches 
the plurality of hyper-parameters further include a [tiling] design during performance of inference of the neural architecture by the accelerator, and 
(Jiang, [figs 1-2]; [fig 3] “Hyperparameters of child network”, “(2) Iteratively train the controller to maximize utilization of each FPGA” and “the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; [sec V] “First, many inferior architectures can be pruned, such that the number of architectures needing to be trained is significantly reduced, as shown in column ‘Arch for Training’” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”; e.g., “The utilization of FPGA fi is equal to Lati × TS” and “average utilization of all FPGAs” along with “Iteratively train the controller to maximize utilization of each FPGA” may read on “during performance of inference of the neural architecture by the accelerator”.)

wherein the bandwidth allocation represents an amount of bandwidth between the accelerator and the external memory for each of input values, weight values, and output values. 
(Jiang, [figs 1-2] [figs 3-5] “Reward(A,U)” [sec II.C] “Each FPGA, fi, has a set of attributes, including memory memi, DSP slices dspi, etc. These attributes will be utilized to model the timing performance for a child network.” [sec III.A-C] “Figure 3 shows the HW/SW co-exploration framework. The framework contains a RNN based controller and two levels of explorations. … Finally, we compute the reward to update the controller using the following formula. Reward(A, U) = β × A + (1 − β) × U (2) where β is an adjustment parameter, which reflects the bias on test accuracy and hardware utilization. The value of β ranges from 0 to 1. We will discuss how to scale β in Section V. After that, we update the controller using the reward by applying the policy gradient reinforcement learning, which is the same as that in FE level. As shown in Figure 5, all RNN cells share the same weights and states in this level, since we have only one reward.” [sec IV] “Hardware Design Space: The hardware design space is composed of up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9Mb on-chip memory, and 150 DSP Slices. One reason for our selection is that such an FPGA provides high speed serial communication (up to 16.8Gbps of bandwidth), so that a high speed hardware pipeline can be formed by multiple FPGAs.”; e.g., input, weight and output values which are related to the HW/SW co-exploration framework may read on “input values, weight values, and output values” since the bandwidth is allocated for the HW/SW co-exploration framework. Note that Yang teaches “external memory” as well.); 
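
For illustration only, the two quantities quoted from Jiang above, the per-FPGA utilization Lati × TS and Reward(A, U) = β × A + (1 − β) × U, can be computed as follows. The function names and the choice of seconds and frames per second as units are assumptions made for this sketch.

```python
from typing import Sequence


def fpga_utilization(latency_s: float, throughput_spec: float) -> float:
    """Utilization of one FPGA: Lat_i x TS (TS in frames per second)."""
    return latency_s * throughput_spec


def average_utilization(latencies_s: Sequence[float],
                        throughput_spec: float) -> float:
    """Average utilization U over all FPGAs in the pipeline."""
    values = [fpga_utilization(lat, throughput_spec) for lat in latencies_s]
    return sum(values) / len(values)


def reward(accuracy: float, utilization: float, beta: float) -> float:
    """Reward(A, U) = beta * A + (1 - beta) * U, with beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    return beta * accuracy + (1.0 - beta) * utilization


if __name__ == "__main__":
    u = average_utilization([0.02, 0.01], throughput_spec=40.0)
    print(u, reward(accuracy=0.8, utilization=u, beta=0.5))
```

A stage latency equal to 1/TS yields a utilization of 1, which matches the quoted requirement that the latency of each pipeline stage be no more than 1/TS.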

However, the combination of Jiang, Chen, Yang, Shen does not appear to distinctly disclose
the plurality of hyper-parameters further include a tiling design during performance of inference of the neural architecture by the accelerator.

Jiang2019 teaches
the plurality of hyper-parameters further include a tiling design during performance of inference of the neural architecture by the accelerator.
(Jiang2019, [figs 1-3] “Hyperparameters”; [secs 3.3-3.4] “Take one convolutional operation as an example, it involves four parameters <Tm,Tn,Tr,Tc>, related to the input/output feature maps (IFM/OFM). Here, the number of IFM channel is N. The size of corresponding tiles is Tn (channels). IFM is then partitioned into ⌈ N/Tn ⌉ tiles, as shown in Figure 3(a). Similarly, OFM with M channels is partitioned to ⌈ M/Tm ⌉ tiles. In addition, the numbers of row/column of OFM are R and C, respectively. They are tiled according to Tr and Tc as shown in Figure 3(b). … then the best parameters <Tm,Tn,Tr,Tc> can be obtained according to [8, 13].”)
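
For illustration only, the tile counts described in the Jiang2019 quotation above can be computed as follows. The function name and dictionary keys are hypothetical. The channel tile counts ⌈N/Tn⌉ and ⌈M/Tm⌉ come from the quoted passage; applying the same ceiling division to rows and columns is an assumption made by analogy.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def tile_counts(N: int, M: int, R: int, C: int,
                Tn: int, Tm: int, Tr: int, Tc: int) -> dict:
    """Number of tiles along each dimension for tiling <Tm, Tn, Tr, Tc>."""
    return {
        "ifm_channel_tiles": ceil_div(N, Tn),  # ceil(N / Tn), per the quotation
        "ofm_channel_tiles": ceil_div(M, Tm),  # ceil(M / Tm), per the quotation
        "row_tiles": ceil_div(R, Tr),          # assumed by analogy
        "col_tiles": ceil_div(C, Tc),          # assumed by analogy
    }


if __name__ == "__main__":
    print(tile_counts(N=64, M=128, R=56, C=56, Tn=24, Tm=32, Tr=14, Tc=16))
```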

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen with the tiling design of Jiang2019. 
Doing so would lead to providing a neural architecture search framework which can generate optimal DNN architectures with guaranteed latency on target FPGAs (Jiang2019, sec I).

Regarding claim 14, 
Claim 14 is a method claim corresponding to claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 2. 

Claims 3-5 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (Hardware/Software Co-Exploration of Neural Architectures) in view of Chen et al. (DetNAS: Neural Architecture Search on Object Detection), further in view of Yang et al. (FPNet: Customized Convolutional Neural Network for FPGA Platforms), further in view of Shen et al. (Maximizing CNN Accelerator Efficiency Through Resource Partitioning) further in view of Jiang et al. (“Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search”, hereinafter Jiang2019) further in view of Oh et al. (RRNet: Repetition-Reduction Network for Energy Efficient Depth Estimation), further in view of Umuroglu et al. (FINN: A Framework for Fast, Scalable Binarized Neural Network Inference).

Regarding claim 3
The combination of Jiang, Chen, Yang, Shen, Jiang2019 teaches claim 2.

However, the combination of Jiang, Chen, Yang, Shen, Jiang2019 does not appear to distinctly disclose
detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors of: 
copying activation data from an external memory to an internal memory of the accelerator, 
copying weight values from the external memory to the internal memory, 
performing computations on the activation data, and 
copying the activation data from the internal memory to the external memory.

Oh teaches
detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors ([figs 2-3]; [fig 4] “Layer-wise latency results on NVIDIA TX2”; [sec III.A.1)] “To gain an intuition regarding the value of the latency and the factors controlling it, we measured the latency of the bottleneck structure of MobileNetv2 [19] layer by layer. The results are indicated in green in Figure 4. As can be seen from the Figure, depthwise convolution has approximately twice the latency of the pointwise convolution, even though the computation is only about one-seventh that of the pointwise convolution before it. According to our detailed analysis of this phenomenon, depthwise convolution is not supported by the GPU-optimized library cuDNN; therefore, it is not possible to perform high-speed computation in this layer as it is in the others. Instead, a standard group convolution function is used to process depthwise convolution. However, this function has not been sufficiently optimized at the GPU level in most DNN frameworks, such as TensorFlow. In addition, we also observed the same latency issue in experiments with the recent CUDA 10 and cuDNN 7. Figure 4 shows the reason for reducing the use of depthwise convolution as the basic strategy of the RR block. In this example, the expansion value CE and reduction value CR are interchanged in the corresponding bottleneck block of MobileNetv2 [19] (from pointwise(CE)–depthwise(CE)– pointwise(CR) to pointwise(CR)–depthwise(CR)– pointwise(CE)). In order to match the amount of computation, the input channel of the first pointwise(CR) convolution of the RR block was increased from CR to CE (64 to 384). The measured latency is shown in orange in Figure 4. This figure shows that the unnecessary latency of the depthwise convolution can be reduced.”; e.g., factors that may affect latency may read on “latency factors”. Note that the combination of Jiang, Chen, Yang, Jiang2019 teaches “latency model” as well.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019 with the latency bottleneck detection of Oh. 
Doing so would lead to decreasing unnecessary latency of a latency factor among different kinds of latency factors (Oh, sec III.A.1)).

However, the combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh does not appear to distinctly disclose
detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors of: 
copying activation data from an external memory to an internal memory of the accelerator, 
copying weight values from the external memory to the internal memory, 
performing computations on the activation data, and 
copying the activation data from the internal memory to the external memory.

Umuroglu teaches
detecting, by analyzing the latency model for each layer, a latency bottleneck among latency factors of: 
copying activation data from an external memory to an internal memory of the accelerator, 
copying weight values from the external memory to the internal memory, 
performing computations on the activation data, and 
copying the activation data from the internal memory to the external memory 
([figs 1-2]; [sec 4.1-4.2] “We adopted a heterogeneous streaming architecture as shown in Figure 2 for this work. We build a custom architecture for a given topology rather than scheduling a operations on top of a fixed architecture. Separate compute engines are dedicated to each layer, which communicate via on-chip data streams. Each engine starts to compute as soon as the previous engine starts to produce output. Additionally, owing to the compact model size of BNNs, all neural network parameters are kept in on-chip memory. This avoids most accesses to off-chip memory, minimizes the latency (the time to finish classifying one image) by overlapping computation and communication, and minimizes the initiation interval: a new image can enter the accelerator as soon as the first compute array is finished with the previous image.”; e.g., “images” may read on “weight values”. In addition, e.g., any data for inference may read on “activation data”. Note that Jiang, Chen, Yang and Oh, in combination, teach “activation data” as well, and teach “detecting, for each layer, a latency bottleneck among latency factors”.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019, Oh with the latency factors of Umuroglu. 
Doing so would lead to avoiding most accesses to off-chip memory, and minimizing the latency by overlapping computation and communication (Umuroglu, sec 4.1-4.2).
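
For illustration only, a per-layer comparison of the four latency factors recited in claim 3 can be sketched as follows. The factor keys, the dictionary layout, and the rule that the largest factor is treated as the bottleneck are assumptions made for this sketch; they are not drawn from Oh or Umuroglu and are not a construction of the claim.

```python
from typing import Dict, List

# Hypothetical keys for the four latency factors recited in claim 3.
FACTORS = ("load_activations", "load_weights", "compute", "store_activations")


def detect_bottleneck(layer_latency: Dict[str, float]) -> str:
    """Return the factor with the largest latency for one layer (assumed rule)."""
    return max(FACTORS, key=lambda factor: layer_latency[factor])


def detect_bottlenecks(per_layer: List[Dict[str, float]]) -> List[str]:
    """Apply the per-layer comparison to every layer of an architecture."""
    return [detect_bottleneck(layer) for layer in per_layer]


if __name__ == "__main__":
    layers = [
        {"load_activations": 1.0, "load_weights": 3.5,
         "compute": 2.0, "store_activations": 0.8},
        {"load_activations": 0.5, "load_weights": 0.4,
         "compute": 4.2, "store_activations": 0.6},
    ]
    print(detect_bottlenecks(layers))
```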

Regarding claim 4
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu teaches claim 3.

Jiang further teaches 
assigning, for at least one layer of the selected neural architecture, a ... technique corresponding to the latency bottleneck, each … technique associated with a hyper-parameter among the plurality of hyper-parameters and a range ([fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; e.g., “The latency of a pipeline stage under an assignment function can be easily captured with a performance model” may read on “latency model” since the latency may be captured with a performance model. In addition, e.g., “fast exploration level prunes child networks with inferior hardware utilization” may read on “selected neural architecture”.).

Oh further teaches 
assigning, for at least one layer of the selected neural architecture, a latency-decreasing technique corresponding to the latency bottleneck, each latency-decreasing technique associated with a hyper-parameter among the plurality of hyper-parameters and a range ([figs 2-3]; [fig 4] “Layer-wise latency results on NVIDIA TX2”; [sec III.A.1)] “To gain an intuition regarding the value of the latency and the factors controlling it, we measured the latency of the bottleneck structure of MobileNetv2 [19] layer by layer. The results are indicated in green in Figure 4. As can be seen from the Figure, depthwise convolution has approximately twice the latency of the pointwise convolution, even though the computation is only about one-seventh that of the pointwise convolution before it. … Figure 4 shows the reason for reducing the use of depthwise convolution as the basic strategy of the RR block. In this example, the expansion value CE and reduction value CR are interchanged in the corresponding bottleneck block of MobileNetv2 [19] (from pointwise(CE)–depthwise(CE)– pointwise(CR) to pointwise(CR)–depthwise(CR)– pointwise(CE)). In order to match the amount of computation, the input channel of the first pointwise(CR) convolution of the RR block was increased from CR to CE (64 to 384). The measured latency is shown in orange in Figure 4. This figure shows that the unnecessary latency of the depthwise convolution can be reduced.”; [sec IV.A] “Our loss function and hyperparameters also matched those in [27], [45].”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu with the latency-decreasing technique of Oh. 
Doing so would lead to decreasing unnecessary latency of a latency factor among different kinds of latency factors (Oh, sec III.A.1)).

Regarding claim 5, 
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu teaches claim 4.

Jiang further teaches 
assigning, for at least one layer of the selected neural architecture, an accuracy-increasing technique corresponding to any among the latency factors other than the latency bottleneck, each accuracy-increasing technique associated with a hyper-parameter among the plurality of hyper-parameters and a range 
(Jiang, [fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired. … Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: • para: parameters of all layers in the child network; • P: the partition of layer set L in the child network; • α: the assignment of pipeline stages to set F; such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized.”; e.g., “The latency of a pipeline stage under an assignment function can be easily captured with a performance model” may read on “latency model” since the latency may be captured with a performance model. 
In addition, e.g., “fast exploration level prunes child networks with inferior hardware utilization” may read on “selected neural architecture”. Furthermore, e.g., factors that may affect latency may read on “latency factors”.)

Regarding claim 15, 
Claim 15 is a method claim corresponding to claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 3. 

Regarding claim 16, 
Claim 16 is a method claim corresponding to claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 4. 

Regarding claim 17, 
Claim 17 is a method claim corresponding to claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 5.

Claims 6-11 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (Hardware/Software Co-Exploration of Neural Architectures) in view of Chen et al. (DetNAS: Neural Architecture Search on Object Detection), further in view of Yang et al. (FPNet: Customized Convolutional Neural Network for FPGA Platforms), further in view of Shen et al. (Maximizing CNN Accelerator Efficiency Through Resource Partitioning) further in view of Jiang et al. (“Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search”, hereinafter Jiang2019) further in view of Oh et al. (RRNet: Repetition-Reduction Network for Energy Efficient Depth Estimation), further in view of Umuroglu et al. (FINN: A Framework for Fast, Scalable Binarized Neural Network Inference), further in view of Liu et al. (A GRADIENT-BASED ARCHITECTURE HYPERPARAMETER OPTIMIZATION APPROACH).

Regarding claim 6
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu teaches claim 5.

Jiang further teaches 
[generating a plurality of unique combinations of values of the hyper-parameters in the latency model], each value constrained by 
the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter, or 
the plurality of hardware design parameters and the neural architecture applicable to the corresponding hyper-parameter where no accuracy-increasing technique or a latency-decreasing technique is assigned to the corresponding hyper-parameter.
(Jiang, [fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec Abstract] “Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process.” [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired. 
… Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: • para: parameters of all layers in the child network; • P: the partition of layer set L in the child network; • α: the assignment of pipeline stages to set F; such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized.”; e.g., “accuracy of child network C is maximized” may read on “range of an accuracy-increasing technique”. In addition, e.g., “effectively fine-tune hyperparameters” may read on “each value constrained by the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter” since hyperparameters are fine-tuned while the accuracy is being maximized based on the hyperparameters under a constraint of “such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”. Note that “the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter” is elected for examination.)

However, the combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu does not appear to distinctly disclose
generating a plurality of unique combinations of values of the hyper-parameters in the latency model.

Liu teaches
generating a plurality of unique combinations of values of the hyper-parameters in the latency model.
(Liu, [table 2] “Proposed Method”, “Channel + Input image”, “Channel + Input image + Depth” and “Latency”; [sec 3, p. 3] “We formulate the architecture hyperparameter optimization problem as 
    [equation image media_image2.png, not reproduced]
 (1) where we jointly optimize the hyperparameters H of the channel, spatial and depth dimension for a backbone architecture with the goal of minimizing the loss L when the weights W corresponding to H are trained.” ; [sec 4.2] “(3) For example, when targeting at the CPU latency, our method learns to generate network with thinner and deeper structure while it chooses wide and shallow structure for a GPU device. This is interpretable, because the GPU is highly paralleled and can execute condensed operations faster than computing fragmented pieces. Thus the accuracy gain is significant when we optimize with all three dimensions (channel + spatial + depth) (17th row) than we only optimize two dimensions (channel + spatial) (16th row) for a network to be deployed on the GPU device, as shown in Table 2, GPU latency constraint part”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu with the hyper-parameter combinations of Liu. 
Doing so would lead to providing a gradient-based approach to optimize hyper-parameters in an efficient and unified manner for Neural Architecture Search (NAS) (Liu, sec 3, p. 3).
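
For illustration only, the recited step of generating a plurality of unique combinations of hyper-parameter values, each value constrained by a range, can be sketched as a Cartesian product over per-hyper-parameter candidate lists. The hyper-parameter names and ranges below are hypothetical; they are not taken from Jiang or Liu and do not characterize either reference's implementation.

```python
from itertools import product

# Hypothetical hyper-parameters and their allowed ranges (assumptions for illustration).
ranges = {
    "num_filters": [24, 36, 48],
    "filter_size": [3, 5],
    "bandwidth_share": [0.25, 0.5],
}

def unique_combinations(ranges):
    """Return every unique combination, taking one value per hyper-parameter."""
    names = list(ranges)
    value_lists = [ranges[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

combos = unique_combinations(ranges)
print(len(combos))  # 3 * 2 * 2 = 12 combinations
```

Because each candidate list holds distinct values, every combination produced this way is unique and stays inside the range assigned to its hyper-parameter.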

Regarding claim 7
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu teaches claim 6.

Liu further teaches 
calculating, for each of the plurality of unique combinations of values of the hyper-parameters, a resultant latency ([table 2] “Proposed Method”, “Channel + Input image”, “Channel + Input image + Depth” and “Latency”; [sec 3, p. 3] “We formulate the architecture hyperparameter optimization problem as 
    [equation image media_image2.png, not reproduced]
 (1) where we jointly optimize the hyperparameters H of the channel, spatial and depth dimension for a backbone architecture with the goal of minimizing the loss L when the weights W corresponding to H are trained.” ; [sec 4.2] “(3) For example, when targeting at the CPU latency, our method learns to generate network with thinner and deeper structure while it chooses wide and shallow structure for a GPU device. This is interpretable, because the GPU is highly paralleled and can execute condensed operations faster than computing fragmented pieces. Thus the accuracy gain is significant when we optimize with all three dimensions (channel + spatial + depth) (17th row) than we only optimize two dimensions (channel + spatial) (16th row) for a network to be deployed on the GPU device, as shown in Table 2, GPU latency constraint part”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu with the latency calculation of Liu. 
Doing so would lead to providing a gradient-based approach to optimize hyper-parameters in an efficient and unified manner with latency constraints for Neural Architecture Search (NAS) (Liu, sec 3, p. 3 and sec 4.2).
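
For illustration only, calculating a resultant latency for each unique combination of hyper-parameter values can be sketched as evaluating a latency function once per combination. The function `toy_latency_ms`, its coefficients, and the candidate values are hypothetical assumptions; they are not the latency model of Liu, Jiang, or the claimed invention.

```python
from itertools import product

def toy_latency_ms(channels, depth, bandwidth_gbps):
    """Hypothetical latency model: compute time plus data-transfer time (assumption)."""
    compute_ms = 0.01 * channels * depth           # grows with network size
    transfer_ms = 0.5 * channels / bandwidth_gbps  # shrinks as bandwidth increases
    return compute_ms + transfer_ms

# Hypothetical candidate values for each hyper-parameter.
channel_options = [16, 32]
depth_options = [4, 8]
bandwidth_options = [1.0, 2.0]

# One resultant latency per unique combination of values.
latencies = {
    combo: toy_latency_ms(*combo)
    for combo in product(channel_options, depth_options, bandwidth_options)
}

# The combination with the lowest resultant latency under this toy model.
best = min(latencies, key=latencies.get)
```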

Regarding claim 8


Jiang further teaches 
[determining values of the hyper-parameters of the latency model], each value constrained by
the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter, or
the plurality of hardware design parameters and the neural architecture applicable to the corresponding hyper-parameter where no accuracy-increasing technique or a latency-decreasing technique is assigned to the corresponding hyper-parameter.
(Jiang, [fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec Abstract] “Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process.” [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired. 
… Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: • para: parameters of all layers in the child network; • P: the partition of layer set L in the child network; • α: the assignment of pipeline stages to set F; such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized.”; e.g., “accuracy of child network C is maximized” may read on “range of an accuracy-increasing technique”. In addition, e.g., “effectively fine-tune hyperparameters” may read on “each value constrained by the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter” since hyperparameters are fine-tuned while the accuracy is being maximized based on the hyperparameters under a constraint of “such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”. Note that “the range of an accuracy-increasing technique or a latency-decreasing technique assigned to the corresponding hyper-parameter” is elected for examination.)

However, the combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu does not appear to distinctly disclose
determining values of the hyper-parameters of the latency model.

Liu teaches
determining values of the hyper-parameters of the latency model ([table 2] “Proposed Method”, “Channel + Input image”, “Channel + Input image + Depth” and “Latency”; [sec 3, p. 3] “We formulate the architecture hyperparameter optimization problem as 
    [equation image media_image2.png, not reproduced]
 (1) where we jointly optimize the hyperparameters H of the channel, spatial and depth dimension for a backbone architecture with the goal of minimizing the loss L when the weights W corresponding to H are trained.” ; [sec 4.2] “(3) For example, when targeting at the CPU latency, our method learns to generate network with thinner and deeper structure while it chooses wide and shallow structure for a GPU device. This is interpretable, because the GPU is highly paralleled and can execute condensed operations faster than computing fragmented pieces. Thus the accuracy gain is significant when we optimize with all three dimensions (channel + spatial + depth) (17th row) than we only optimize two dimensions (channel + spatial) (16th row) for a network to be deployed on the GPU device, as shown in Table 2, GPU latency constraint part”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the hardware and software co-exploration system of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu with the hyper-parameters of Liu. 
Doing so would lead to providing a gradient-based approach to optimize hyper-parameters in an efficient and unified manner with latency constraints for Neural Architecture Search (NAS) (Liu, sec 3, p. 3 and sec 4.2).

In the alternative, Jiang can also be interpreted to teach the following limitation:
determining values of the hyper-parameters of the latency model ([fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … The controller contains multiple reconfigurable RNN cells and predicts the hyperparameters in a child network; the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; e.g., “The latency of a pipeline stage under an assignment function can be easily captured with a performance model” may read on “latency model” since the latency may be captured with a performance model.).

Regarding claim 9
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu teaches claim 8.

Jiang further teaches 
applying a function approximator to the latency model 
([fig 2]; [fig 3] “RNN Controller” and “Reward”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). … Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: … such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”; [sec III.B] “After we form the RNNs, we apply reinforcement learning to update the parameters in those N RNNs, and use these RNNs to predict the hyperparameters of child networks.”).

Regarding claim 10
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu teaches claim 9.

Jiang further teaches 
the function approximator is a recurrent neural network with reinforcement learning using a reward including a latency component and an accuracy component ([fig 2]; [fig 3] “Hyperparameters of child network” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.”; [sec III.B] “After we form the RNNs, we apply reinforcement learning to update the parameters in those N RNNs, and use these RNNs to predict the hyperparameters of child networks. … For each child network predicted by the controller, we can obtain the utilization of the ith pipeline stage (corresponding to one FPGA) using BLAST, denoted as Ui. Then, for RNN i, we utilize Ui to generate a reward Ri to update its parameters θi.”).
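
For illustration only, the relation quoted from Jiang, that the utilization of FPGA fi equals Lati × TS, and a reward having both an accuracy component and a latency-derived component, can be sketched as follows. The weighted-sum form of `toy_reward`, the weight, the units, and all numeric values are hypothetical assumptions; the reward formula actually used by Jiang is not reproduced here.

```python
def utilization(latency_s, throughput_spec):
    """Utilization of one pipeline stage, per the quoted relation Lat_i x TS.

    Assumes latency in seconds per frame and TS in frames per second.
    """
    return latency_s * throughput_spec

def toy_reward(accuracy, utilizations, weight=0.5):
    """Hypothetical reward: weighted sum of accuracy and average utilization."""
    avg_util = sum(utilizations) / len(utilizations)
    return weight * accuracy + (1.0 - weight) * avg_util

stage_latencies = [0.008, 0.010]  # hypothetical per-FPGA latencies, in seconds
ts = 100.0                        # hypothetical throughput specification, frames/s

utils = [utilization(lat, ts) for lat in stage_latencies]
reward = toy_reward(accuracy=0.9, utilizations=utils)
```

Under these assumed numbers, the two stages have utilizations of about 0.8 and 1.0, so the latency-derived component rewards pipelines whose stages have little idle time.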

Regarding claim 11
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu teaches claim 10.

Jiang further teaches 
determining the accuracy component by training the neural architecture using a hold-out training data set 
([fig 2]; [fig 3]; [sec II. C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. The number of layers in the child network is the size of set L, i.e., |L|. For the ith layer li ∈ L, set parai contains the predictable parameters, such as the number of filters, filter size, etc. The accuracy of the child network is acc, which can be obtained by training C on a held-out dataset.”).
 
Regarding claim 19
Claim 19 is a method claim corresponding to claim 8 and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 8.

Regarding claim 20
The combination of Jiang, Chen, Yang, Shen, Jiang2019, Oh, Umuroglu, Liu teaches claim 19.

the determining the values of the hyperparameters of the latency model includes (see the rejections of claim 19) 

Jiang further teaches 
applying, to the latency model, a recurrent neural network with reinforcement learning using a reward including a latency component and an accuracy component.
(Jiang, [fig 2]; [fig 3] “Hyperparameters of child network”, “RNN Controller” and “Level 2: Slow Exploration (SE) (1) Train the child network from Level 1 to obtain its accuracy (2) Generate Reward in terms of accuracy and utilization. … the fast exploration level prunes child networks with inferior hardware utilization; the slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks”; [sec I] “The architectures with inferior hardware efficiency will be quickly pruned, which significantly accelerates the search process.”; [sec II.C] “➀ Child Network. A child network is defined as C = <L, para, acc>. It consists of a set of layers L. … The latency of a pipeline stage under an assignment function can be easily captured with a performance model [28]. For FPGA fi, its latency is denoted as Lati. After obtaining the latency of each FPGA, we introduce pipeline efficiency, which is composed of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA fi is equal to Lati × TS. Higher utilization of an FPGA indicates the less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired. … Given a dataset, a pool of FPGAs F, and a throughput specification TS, we are going to co-explore architecture search space and hardware design space to find a child network C: … such that the accuracy of child network C is maximized, the pipeline FPGA system can meet the required throughput TS, and the average utilization of all FPGAs is maximized”; [sec III.B] “After we form the RNNs, we apply reinforcement learning to update the parameters in those N RNNs, and use these RNNs to predict the hyperparameters of child networks. … For each child network predicted by the controller, we can obtain the utilization of the ith pipeline stage (corresponding to one FPGA) using BLAST, denoted as Ui. 
Then, for RNN i, we utilize Ui to generate a reward Ri to update its parameters θi.”).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zhang et al. (Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks) teaches a roofline performance model.
Denolf et al. (US 2020/0104715 A1) teaches hyperparameters and an optimization with a bandwidth.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409.  The examiner can normally be reached on Mon - Thu 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.K./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129