DETAILED ACTION
This Office Action is in response to the remarks entered on 08/02/2021. Claims 8 and 12 were amended. No claims were added. No claims were cancelled.
Response to Argument
In reference to Applicant’s arguments about: Rejections under 35 U.S.C. §112
Applicant’s Argument: Applicant argues against the rejection of claims 15, 16 and 21 under 112(b).
Examiner’s Response: The rejection of claims 15, 16 and 21 is withdrawn in view of the amendments filed on 08/02/2021.
In reference to Applicant’s arguments about: Rejections under 35 U.S.C. §103:  claims 1-8, 10-12, 18-19 were rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Chai et al. (Pub. No.: US20210081806- hereinafter, Chai). 
Independent claims 1, 22, and 23 each recite that "the reward function includes a quality term that measures the estimated quality of a candidate architecture and a latency term that is based on an absolute value of a term that compares the estimated latency of the candidate architecture and the target latency." 
The Action acknowledges that Vahdat does not teach this feature and instead cites Chai for this teaching. In particular, the Action cites [0053] of Chai, which reads as follows: 
In some embodiments, the reward function balances two or more of the following objectives: maximizing classification accuracy of the TNN; minimizing computational operations performed while executing the TNN; minimizing power consumption of a 
However, Applicant respectfully submits that the present Application has an effective filing date of March 23, 2020. Chai was filed September 10, 2020, which is after the effective filing date of the present Application, and claims the benefit of Provisional application No. 62/900,311, filed on Sep. 13, 2019, and provisional application No. 63/018,236, filed on Apr. 30, 2020. 
Thus, only Provisional application No. 62/900,311 was filed before the effective filing date of the present Application. 
Applicant respectfully submits that the cited portion of Chai does not appear in 
Provisional application No. 62/900,311 and therefore cannot properly be used as prior art to reject the claims. Moreover, Applicant respectfully submits that Provisional application No. 62/900,311 does not describe that "the reward function balances two or more ... objectives," as is cited in the Action. 
Additionally, Applicant respectfully submits that even if the cited portion of Chai could be cited as prior art, the cited portion describes only that a reward function "balances" "objectives" that include "minimizing latency involved in executing the TNN to produce an output." 
The independent claims, on the other hand, recite a reward function that "includes a quality term that measures the estimated quality of a candidate architecture and a latency term that is based on an absolute value of a term that compares the estimated latency of the candidate architecture and [a] target latency [for performing the particular machine learning task]." The cited portion of Chai makes no mention of a term that 
Therefore, Applicant respectfully requests that the Section 103 rejections to the independent claims be withdrawn.
Claim 8 
Claim 8 is dependent from claim 1 and recites that "the term in the reward function that compares the estimated latency of the candidate architecture and the target latency is a difference between (i) a ratio between the estimated latency of the candidate architecture and the target latency and (ii) one." 
Applicant respectfully submits that the Section 103 rejection for claim 8 be withdrawn for at least the reasons given above for the independent claims. 
Moreover, the Action cites a portion of Vahdat as teaching the features of claim 8. The cited portion of Vahdat describes that "In at least one embodiment, using CIFAR-10 data, techniques described herein produce a relationship between latency and accuracy as shown in a first graph 502. In at least one embodiment, using CIFAR-10 data, techniques described herein produce a relationship between predicted latency and true accuracy as shown in a second graph 504." [0099]. 
The Action states that "the comparing (relationship) of the predicted latency and true latency is showed in the second graph 504 is correspond to the compare the different of the target latency and measured latency." (Page 22). 
As an initial matter, Applicant notes that claim 8 is directed to a term in reward function rather than a graph and that Vahdat has no teaching of using the relationship between predicted latency and true latency in any kind of reward function. 

Moreover, Applicant submits that claim 8 does not merely recite a difference between two latencies. Instead, claim 8 specifies that "the term in the reward function t... is a difference between (i) a ratio between the estimated latency of the candidate architecture and the target latency and (ii) one." Thus, claim 8 specifies that the term is a difference between (i) a ratio between two latencies and (ii) a constant value, i.e., one. The cited portion of Vahdat has no such teaching.
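For illustration only, the term recited in claim 8 can be written out numerically. The sketch below is not taken from the Application, from Vahdat, or from any other cited reference; the function names, the weight beta, and its default value are hypothetical. It assumes, consistent with the language of claims 1 and 7, that the reward is the quality term plus a latency term based on the absolute value of the claim 8 term.

```python
def latency_term(estimated_latency: float, target_latency: float) -> float:
    """Claim 8 term: (estimated latency / target latency) - 1."""
    if target_latency <= 0:
        raise ValueError("target_latency must be positive")
    return estimated_latency / target_latency - 1.0


def reward(quality: float, estimated_latency: float, target_latency: float,
           beta: float = -0.1) -> float:
    """Quality term plus a latency term based on |ratio - 1|.

    beta is a hypothetical weight (an assumption of this sketch); a negative
    value penalizes any deviation from the target, above or below it.
    """
    return quality + beta * abs(latency_term(estimated_latency, target_latency))
```

Under these assumptions, a candidate that is 20% slower than the target and a candidate that is 20% faster receive the same penalty, and a candidate that exactly meets the target receives no penalty.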
Examiner’s Response: Applicant’s arguments, see pages 1-3, filed 08/02/2021, with respect to the rejection(s) of claim(s) 1, 22 and 23 regarding the prior art date of Chai et al. have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Vahdat further in view of Xu.
Furthermore, Applicant’s arguments with respect to the newly amended limitations are deemed moot because the arguments are directed to limitations that have not previously been examined.
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claims 1-23 are presented for examination.
Information Disclosure Statement
The information disclosure statement (IDS) filed 04/13/2021 is in compliance with the provisions of 37 CFR 1.97 and 1.98. Accordingly, the information disclosure 
Priority
The following claimed benefit is acknowledged: the instant application 17210391, filed 03/23/2021, claims priority from provisional application 62993573, filed 03/23/2020.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-7, 10-12, 18-19, 22, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu).   
Regarding claim 1, Vahdat teaches a method performed by one or more computers, the method comprising: receiving training data and validation data for a particular machine learning task (Vahdat, [Par.0125, lines 1-5], “CIFAR-10 dataset consists of 50,000 training images and 10,000 test images. In at least one embodiment, during search, 45,000 images from original training set are used as a training set and remaining are used as a validation set.” Examiner’s note: the training data includes the training set and the validation set);
receiving data specifying a target latency for performing the particular machine learning task (Vahdat, [Par.0064, lines 1-3], “In at least one embodiment, neural architecture search aims to discover network architectures with desired properties such as high accuracy, low latency, or both.” Examiner’s note: the low latency is considered to be the target latency);
and selecting, from a space of candidate architectures and using the training data and the validation data (Vahdat, [Par.0080, lines 1-7], “In at least one embodiment, to overfitting, an objective function is based on generalization error of an architecture. In at least one embodiment, rationale behind this is that a selected architecture not only should perform well on training set but also should generalize equally well to examples in ,
an architecture for a neural network to be deployed for performing the machine learning task (Vahdat, [Par.0063, lines 1-5], “In at least one embodiment, techniques described herein can be used for hardware-aware architecture search. In at least one embodiment, an objective of these techniques is to design a neural architecture for a specific task such that it achieves an improved accuracy while running sufficient fast”), 
wherein each candidate architecture in the space has a different subset of a shared set of model parameters that is defined by a corresponding set of decision values that includes a respective decision value for each of a plurality of categorical decisions (Vahdat, [Par.0130-0131], “In at least one embodiment, for final evaluation, a 14-layer network is trained for 250 epochs with an initial channel count such that multiply-adds of network is <600M. For results presented in FIG. 8, as well as FIG. 9, networks are trained using SGD with momentum of 0.9, base learning rate of 0.1, weight decay of 3×10^-5, with a batch size of 128 per GPU. In at least one embodiment, a model is trained for 250 epochs in line with prior work, annealing a learning rate to 0 at end of training using a cosine learning rate decay…FIG. 12 shows results obtained when only best models are trained 5 times, with different random seeds. In at least one embodiment, lowest errors obtained over 5 runs were 24.74% and 7.63% for top-1 and top-5 errors respectively, using a cell searched on ImageNet.” Examiner’s note: the number of epochs, the weight, the learning rate and the batch size are considered to be the set (subset) of the model parameters. The error values of 24.74% and 7.63% (decision values) for top-1 and top-5 are considered to be the categorical decisions; as mentioned in the specification, the decision value corresponds to the operation performance of the architecture.),
and wherein the selecting comprises: jointly updating (i) a set of controller policy parameters that define, for each of the plurality of categorical decisions, a respective probability distribution over decision values for the categorical decision and (ii) the shared set of parameters, wherein (Vahdat, [Par.0077], “In at least one embodiment, bi-level training of architecture parameters and network parameters is proposed. In at least one embodiment, in architecture update, either training loss, or validation loss given current network parameters w , are used to update architecture parameters using.


[Equation image: media_image1.png]
”
Examiner’s note: the architecture parameters are updated to minimize the loss function or error value; the change in the error values corresponds to the decision values for the categorical decisions because a decision value defines the operation performance of the neural network. The architecture parameters are considered to be the controller policy parameters. As mentioned in Par. 0130, the error value is generated based on the machine learning architecture to generate the selected shared parameters such as batch size, number of layers, and weight):
updating the set of controller policy parameters comprises updating the set of controller policy parameters through reinforcement learning to maximize a reward function that measures an estimated quality and an estimated latency of candidate architectures defined by sets of decision values sampled from probability distributions generated using the controller policy parameters (Vahdat, [Abstract, lines 1-4], “In at least one embodiment, differentiable neural architecture search and reinforcement learning are combined under one framework to discover network architectures with desired properties such as high accuracy, low latency” 
[Par.0070],  “In at least one embodiment, a generic approach for
optimizing expected loss is REINFORCE gradient estimator

[Equation image: media_image2.png]
which can be applied even to a loss function L(z) that is not differentiable with respect to z. In at least one embodiment, however, this estimator may suffer from high variance and, therefore, a large number of trained architecture samples may be necessary to reduce its variance, making it extremely compute intensive. In at least one embodiment, REINFORCE estimator in Eq. 1 can be rewritten as 

[Equation image: media_image3.png]
”
Furthermore, see [Par.0077], “In at least one embodiment, bi-level training of architecture parameters and network parameters is proposed.
In at least one embodiment, in architecture update, either training loss, or validation loss given current network parameters w , are used to update architecture parameters using.


[Equation image: media_image1.png]
”
The parameters z and w are updated to minimize a loss value; that is, the parameters are updated to minimize latency while maintaining reasonable accuracy. Fig. 5 shows the improvement in latency and accuracy; see [Par.0101, lines 1-6], “FIG. 5. In comparison to REINFORCE, improved latency is achieved while maintaining similar accuracy”),
[…]
and updating the shared set of model parameters comprises updating the shared set of model parameters to optimize an objective function that measures a performance on the particular machine learning task of the candidate architectures defined by the sets of decision values sampled from the probability distributions generated using the controller policy parameters (Vahdat, [Par.0080-0081], “In at least one embodiment, to overfitting, an objective function is based on generalization error of an architecture. In at least one embodiment, rationale behind this is that a selected architecture not only should perform well on training set but also should generalize equally well to examples in validation set, even when network parameters are not yet optimized. In at least one embodiment, this prevents search from discovering architectures that do not generalize well. In at least one embodiment, we define

[Equation image: media_image4.png]
where λ is a scalar balancing training loss and generalization error. In at least one embodiment, λ=0.5 often works well in experiments. In at least one embodiment, for training, we iterate between updating φ using Eq. 7 and updating w using Eq. 6. In at least one embodiment, in each parameter update, a simple gradient descent update is performed. [0081] In at least one embodiment, a goal is to find an architecture that has a low latency as well as high accuracy.
In at least one embodiment, latency of network specified by z can be measured in each parameter update.” Examiner’s note: an error value (objective function) is optimized by using machine learning to generate updated parameters.);
and after the joint updating, selecting as the architecture for the neural network, a candidate architecture that is defined by respective particular decision values for each of the plurality of categorical decisions (Vahdat, [Page 57, claim 16], “A machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors to at least determine an architecture for a neural .
Vahdat discloses reinforcement learning to optimize quality and latency defined by the decision values and parameters but does not specifically disclose using a reward function.
That is, Vahdat does not teach that the reward function includes a quality term that measures the estimated quality of a candidate architecture and a latency term that is based on an absolute value of a term that compares the estimated latency of the candidate architecture and the target latency.
On the other hand, Xu teaches the reward function includes a quality term that measures the estimated quality of a candidate architecture and a latency term that is based on an absolute value of a term that compares the estimated latency of the candidate architecture and the target latency (Xu, [Par.0027], “A "candidate student model," as used herein, refers to a student model that is being examined to determine if is a better student model (better at predicting the observed target) than the current student model. A reward is then generated by comparing the current student model with the candidate student model using training and testing data to determine which is better at predicting an observed target. A "reward," as used herein, refers to a value generated by a function (reward function) ,
Vahdat and Xu are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality. 
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu by having a reward function to maximize the loss error by measuring the difference between the estimated latency and the target latency.  The modification would have been obvious because one of ordinary skill in the art would have been motivated to maximize the classification accuracy during training. (Xu, [Par.0012], “In one embodiment of the present invention, a computer-implemented method for improving prediction accuracy in machine learning techniques comprises constructing a teacher model, where the teacher model generates a weight for each data case. The method further comprises training a current student model using training data and weights generated by the teacher model.”).
Regarding claim 22, the claim is rejected for the same reasons as claim 1.  
In addition, Vahdat further teaches a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising (Vahdat, [Par.0270, lines 5-13], “In at least one embodiment, MCH 2116 may provide a high bandwidth memory path 2118 to memory 2120 for instruction and data storage and for storage of graphics commands, data and textures. In at least one embodiment, MCH 2116 may direct data signals between processor 2102, memory 2120, and other component in computer system 2100 and to :
Regarding claim 23, the claim is rejected for the same reasons as claim 1. 
In addition, Vahdat further teaches one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising (Vahdat, [Par.0512, lines 1-10], “Operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. In at least one embodiment, a process such as those processes described herein ( or variations and/or combinations thereof) is performed under control of one or more computer systems configured with executable instructions and is implemented as code (e.g., executable instructions, one or more computer programs or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. In at least one embodiment, code is stored on a computer-readable storage medium, for example, in form of a computer program comprising a plurality of instructions executable by one or more processors.”):
Regarding claim 2, Vahdat teaches the method of claim 1, wherein the joint updating comprises repeatedly performing operations comprising: generating a respective probability distribution for each of the plurality of categorical decisions in accordance with current values of the controller policy parameters, selecting a respective decision value for each of the plurality of categorical decisions using the respective probability distributions, determining, using the validation data (Vahdat, [Par.0077-0079], “In at least one embodiment, bi-level training of architecture parameters and network parameters is proposed. In at least 

[Equation image: media_image5.png]
 [0078] In at least one embodiment, network parameters w are updated given samples from architecture by minimizing

[Equation image: media_image6.png]
[Par.0079], “In at least one embodiment, parameters φ and w are updated iteratively by taking a single gradient step in Eq. 5 and Eq. 6. It has been shown that by sharing network parameters among all architecture instances, we gain several orders of magnitude speedup in search. In at least one embodiment, however, this comes with cost of updating architecture parameters at suboptimal w.”  Examiner’s note: the parameters are updated iteratively based on the results of equations (5) and (6).),
an estimated quality on the particular machine learning task of a neural network having a candidate architecture that has a subset of the shared set of model parameters that is defined by the selected decision values for the categorical decisions, wherein the quality is estimated in accordance with current values of the subset of the shared set of model parameters that is defined by the selected decision values for the categorical decisions, determining, using the validation data (Vahdat, [Par.0130-0131], “In at least one embodiment, for final evaluation, a 14-layer network is trained for 250 epochs with an initial channel count such that multiply-adds of network is <600M. For results presented in FIG. 8, as well as FIG. 9, networks are trained using SGD with momentum of 0.9, base learning rate of 0.1, weight decay of 3×10^-5, with a batch size of 128 per GPU. In at least one embodiment, a model , 
an estimated latency when performing the particular machine learning task of the neural network having the candidate architecture that has the subset of the shared set of model parameters that is defined by the selected decision values for the categorical decisions (Vahdat, [Abstract, lines 1-4], “In at least one embodiment, differentiable neural architecture search and reinforcement learning are combined under one framework to discover network architectures with desired properties such as high accuracy, low latency”),
determining, through reinforcement learning, an update to the controller policy parameters that improves the reward function based on the estimated quality and the estimated latency, and determining, using the training data (Vahdat, [Par.0070], “In at least one embodiment, a generic approach for optimizing expected loss is REINFORCE gradient estimator
                                                                                  
[Equation image: media_image2.png]
which can be applied even to a loss function L(z) that is not differentiable with respect to z. In at least one embodiment, however, this estimator may suffer from high variance and, therefore, a 

[Equation image: media_image3.png]
”
Furthermore, see [Par.0077], “In at least one embodiment, bi-level training of architecture parameters and network parameters is proposed.
In at least one embodiment, in architecture update, either training loss, or validation loss given current network parameters w , are used to update architecture parameters using.


[Equation image: media_image1.png]
” The parameters z and w are updated to minimize a loss value; that is, the parameters are updated to minimize latency while maintaining reasonable accuracy. Fig. 5 shows the improvement in latency and accuracy; see [Par.0101, lines 1-6], “FIG. 5. In comparison to REINFORCE, improved latency is achieved while maintaining similar accuracy”),
an update to the current values of the subset of the shared set of model parameters that is defined by the selected decisions for the categorical decisions shared set of parameters by optimizing an objective function for the particular machine learning task (Vahdat, [Par.0080-0081], “In at least one embodiment, to overfitting, an objective function is based on generalization error of an architecture. In at least one embodiment, rationale behind this is that a selected architecture not only should perform well on training set but also should generalize equally well to examples in validation set, even when network parameters are not yet 


[Equation image: media_image4.png]
where λ is a scalar balancing training loss and generalization error. In at least one embodiment, λ=0.5 often works well in experiments. In at least one embodiment, for training, we iterate between updating φ using Eq. 7 and updating w using Eq. 6. In at least one embodiment, in each parameter update, a simple gradient descent update is performed.
[0081] In at least one embodiment, a goal is to find an architecture that has a low latency as well as high accuracy. In at least one embodiment, latency of network specified by z can be measured in each parameter update.” Examiner’s note: an error value (objective function) is optimized by using machine learning to generate updated parameters.).
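For illustration only, the policy-update portion of the operations recited in claim 2 (generate a distribution per categorical decision, sample decision values, then update the controller policy parameters through reinforcement learning) can be sketched as a plain REINFORCE step on softmax logits. This sketch is not taken from the Application, Vahdat, or Xu; the function names, the baseline, and the learning rate are hypothetical, and the update of the shared model parameters is not shown.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_decisions(policy_logits, rng):
    """Sample one decision value (an index) per categorical decision."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0]
            for row in policy_logits]


def reinforce_step(policy_logits, decisions, reward, baseline, lr=0.1):
    """One gradient-ascent step on the logits for the sampled decisions.

    For a softmax policy, d log p(k) / d logit_j = 1[j == k] - p(j).
    baseline and lr are assumptions of this sketch.
    """
    advantage = reward - baseline
    updated = []
    for row, k in zip(policy_logits, decisions):
        probs = softmax(row)
        updated.append([
            x + lr * advantage * ((1.0 if j == k else 0.0) - probs[j])
            for j, x in enumerate(row)
        ])
    return updated
```

In this sketch, a sampled architecture whose reward exceeds the baseline has the probabilities of its decision values increased, and one whose reward falls below the baseline has them decreased.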
Regarding to claim 3, Vadhat teaches the method of claim 2, wherein determining the update to the current values of the subset of the shared set of model parameters comprises computing a gradient update to the current values on a batch of training examples from the training data (Vadhat, [Par.0083], “In at least one embodiment, Architecture Sample Size is set to a Batch Size. In at least one embodiment, this option corresponds to examining many architectures at a same time with shared w for updating cp. In at least one embodiment, this has a lower gradient variance for updating ¢, as it uses many z samples for estimating a gradient. In at least one embodiment, however, this approach is compute and memory intensive at it .
Regrading to the claim 4, Vadhat teaches the method of claim 2, wherein determining, using the validation data, an estimated latency comprises determining latencies of the neural network having the candidate architecture for each validation example in a batch of validation examples from the validation data (Vadhat, [Par.0080, lines 3-6], “In at least one embodiment, rationale behind this is that a selected architecture not only should perform well on training set but also should generalize equally well to examples in validation set” Furthermore, [Par.0082], “In at least one embodiment, both objectives in Eq. 7 and Eq. 6 involve expectations with respect to pq,(z). In at least one embodiment, for gradient estimation, as shown in Sec. 3.1, Monte Carlo estimate is computed by drawing samples from pq,(z). In at least one embodiment, since we compute training/validation loss in an objective function using a mini-batch of data, we can choose to set a number of architecture samples to a value between one and a number of samples in mini-batch (such as batch size). In this section, we review an effect of choosing a number of architecture samples on variance and efficiency of search,”).
Regrading to claim 5, Vadhat teaches the method of claim 4, wherein the target latency is a target latency for the neural network when deployed on a particular set of one or more computing devices (Vadhat, [Par.0075, lines 4-9], “In at least one embodiment, an embodiment trains g by minimizing lg(z)-.£ nCz)I during architecture search. In at least one embodiment, in case of latency, this corresponds to ,
and wherein determining latencies comprises determining latencies for each validation example in the batch of validation examples when the neural network having the candidate architecture is deployed on the particular set of one or more computing devices (Vadhat, [Par.0082], “In at least one embodiment, both objectives in Eq. 7 and Eq. 6 involve expectations with respect to pq,(z). In at least one embodiment, for gradient estimation, as shown in Sec. 3.1, Monte Carlo estimate is computed by drawing samples from pq,(z). In at least one embodiment, since we compute training/validation loss in an objective function using a mini-batch of data, we can choose to set a number of architecture samples to a value between one and a number of samples in mini-batch (such as batch size). In this section, we review an effect of choosing a number of architecture samples on variance and efficiency of search, in an embodiment.” Examiner’s note, training the number of the sample architectures to train on the mini batch size validation data.).
Regarding to claim 6, Vadhat teaches the method of claim 2, wherein determining, using the validation data, an estimated quality on the particular machine learning task of the neural network having the candidate architecture comprises determining a quality of the neural network having the candidate architecture on a batch of validation examples from the validation data (Vadhat, [Par.0082], “In at least one embodiment, both objectives in Eq. 7 and Eq. 6 involve .
Regarding to claim 7, Vadhat in view of Xu teaches the method of claim 1, wherein the reward function is a sum of the quality term and the latency term (Xu,  [Par.0070], “These input-weight products are summed and then the sum is passed through a node's so-called activation function, to determine whether and to what extent that signal should progress further through the network to affect the.” And [Par.0062], “Case level reward 502 refers to the reward based on correctly classifying the data case by student models 302, 303 using training data 103. If student model 302, 303 correctly classified the data case, then a positive reward 502 is returned by reward generator 304. Conversely, a negative reward 502 is returned by reward generator 304 if student model 302, 303 did not correctly classify the data case.  Ultimate outcome (e.g., an act of classification).”).
Vahdat and Xu are analogous in arts because they have the same filed of endeavor of selecting a machine learning architecture based on the latency and quality. 
Accordingly, it would have been prima facie obvious to one of the ordinary skills in the art before the effective filing date of the claimed invention to have modified Vadhat’s method in view of Xu by having a reward function sum of the quality term and 
Regarding to claim 10, Vadhat teaches the method of claim 1, wherein after the joint updating, selecting as the architecture for the neural network, a candidate architecture that is defined by respective particular decision values for each of the plurality of categorical decisions comprises: for each of the categorical decisions, selecting as the particular decision value the decision value having a highest probability in the probability distribution for the categorical decision (Vadhat, [Par.0131], “FIG. 12 illustrates an example of ImageNet performance of best models, averaged over five evaluation runs, in accordance with an embodiment. In at least one embodiment, FIG. 12 shows results obtained when only best models are trained 5 times, with different random seeds. In at least one embodiment, lowest errors obtained over 5 runs were 24.74% and 7.63% for top-1 and top-5 errors respectively, using a cell searched on ImageNet.” Examiner’s note, the goal of this training to search for the architecture with low latency and high accurate, therefore, over the five times runs, the lowest error (highest probability) is selected.).
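For illustration only, the selection step as worded in claim 10 (for each categorical decision, take the decision value with the highest probability) amounts to a per-decision argmax. The sketch below illustrates the claim language alone and does not characterize what Vahdat discloses; the function name and the sample probabilities in the usage note are hypothetical.

```python
def select_architecture(decision_probs):
    """Return, per categorical decision, the index of the most probable value.

    decision_probs is a list of probability distributions, one per
    categorical decision (a hypothetical representation for this sketch).
    """
    return [max(range(len(p)), key=p.__getitem__) for p in decision_probs]
```

For example, with distributions [0.1, 0.7, 0.2] and [0.6, 0.4] for two categorical decisions, the selected decision values are the second option for the first decision and the first option for the second.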
Regarding claim 11, Vahdat teaches the method of claim 1, wherein the selecting further comprises: prior to the joint updating, updating the shared set of parameters without updating the controller policy parameters by repeatedly performing operations comprising: selecting a candidate architecture from the space; and determining, using the training data, an update to the subset of the shared set of model parameters that are in the selected candidate architecture by optimizing the objective function for the particular machine learning task (Vahdat, [Par.0080-0081], “In at least one embodiment, to overfitting, an objective function is based on generalization error of an architecture. In at least one embodiment, rationale behind is that a selected architecture not only should perform well on training set but also should generalize equally well to examples in validation set, even when network parameters are not yet optimized. In at least one embodiment, this prevents search from discovering architectures that do not generalize well. In at least one embodiment, we define

[Equation image: media_image4.png]

where A is a scalar balancing training loss and generalization error. In at least one embodiment, A=0.5 often works well in experiments. In at least one embodiment, for training, we iterate between updating cp using Eq. 7 and updating w using Eq. 6. In at least one embodiment, in each parameter update, a simple gradient descent update is performed. [0081] In at least one embodiment, a goal is to find an architecture that has a low latency as well as high accuracy. In at least one embodiment, latency of network specified by z can be measured in each parameter update.” Examiner’s note: the error value (objective function) is optimized using machine learning to generate updated parameters.).
Regarding claim 12, Vahdat teaches the method of claim 11, wherein selecting the candidate architecture comprises, for each of one or more of the categorical decisions: with probability p, including operations represented by all of the respective decision values for the categorical decision in the candidate architecture, and with probability 1 -p, sampling a decision value from a fixed initial probability distribution for the categorical decision and including only the sampled decision value in the candidate architecture (Vahdat, [Par.0109-0110], “In at least one embodiment, distribution over architecture parameters are represented using a factorial distribution:

[Equation image: media_image7.png]

[Par.0110], “where q is a probability of using temperature T. In at least one embodiment, a Gumbel-Softmax sample drawn with Te = 0 is, in fact, a discrete sample since a Gumbel-Softmax distribution becomes a categorical distribution in limit of Te = 0.” Examiner’s note: therefore, when Te = 0 the probability is 1-q, and when Te = T the probability is q.).
Regarding claim 18, Vahdat teaches the method of claim 1, wherein the controller policy parameters include for each of the categorical decisions, a respective parameter for each decision value for the categorical decision (Vahdat, [Par.0087, lines 18-21], “

[Equation image: media_image8.png]

Represent a distribution over architecture parameters”).
Regarding claim 19, Vahdat teaches the method of claim 18, wherein, for each of the categorical decisions, the probability distribution that is defined by the controller policy parameters is generated by applying a softmax to the respective parameters for the decision values for the categorical decision (Vahdat, [Par.0068], “

[Equation image: media_image9.png]”).
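For illustration only, the softmax operation recited in claim 19 (mapping the per-decision-value parameters of one categorical decision to a probability distribution) can be sketched as follows. This sketch is not reproduced from Vahdat’s equation, which is available here only as an image; the name `softmax` and the example parameter values are assumptions.

```python
import math


def softmax(params):
    """Convert one categorical decision's parameters into probabilities."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(params)
    exps = [math.exp(value - peak) for value in params]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters for a decision with three decision values.
example_distribution = softmax([1.0, 2.0, 3.0])
```

The resulting probabilities are positive, sum to one, and preserve the ordering of the underlying parameters.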
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Ai-Fattah et al. (Pub.No :20100211536 –hereinafter, Ai-Fattah).
Regarding claim 8, Vahdat in view of Xu teaches the method of claim 1, including the term in the reward function that compares the estimated latency of the candidate architecture and the target latency (Xu, [Par.0027], “A "candidate student model," as used herein, refers to a student model that is being examined to determine if is a better student model (better at predicting the observed target) than the current student model. A reward is then generated by comparing the current student model with the candidate student model using training and testing data to determine which is better 
Vahdat and Xu are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu by having a reward function that compares the estimated latency of the candidate architecture and the target latency. The modification would have been obvious because one of ordinary skill in the art would be motivated to maximize the classification accuracy during training. (Xu, [Par.0012], “In one embodiment of the present invention, a computer-implemented method for improving prediction accuracy in machine learning techniques comprises constructing a teacher model, where the teacher model generates a weight for each data case. The method further comprises training a current student model using training data and weights generated by the teacher model.”).
However, neither Vahdat nor Xu teaches that the term is a difference between (i) a ratio between the estimated latency of the candidate architecture and the target latency and (ii) one.

Vahdat, Xu and Ai-Fattah are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Vahdat and Xu in view of Ai-Fattah’s difference ratio of the target latency and candidate latency. The modification would have been obvious because one of ordinary skill in the art would be motivated to maximize the classification accuracy during training (Ai-Fattah, [Par.0045], “The most significant parameter is the standard deviation (SD) ratio that measures the performance of the neural network It is the best indicator of the goodness, e.g., accuracy, of a regression model and it is defined as the ratio of the prediction error SD to the data SD. One minus this regression ratio is sometimes referred to as the "explained variance" of the model. It will be understood that the .
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Shirouzu et al. (Patent No.: 5185772-hereinafter, Shirouzu).   
Regarding claim 9, Vahdat and Xu do not teach the method of claim 1, wherein the latency term is a product of the absolute value and a negative scalar value.
On the other hand, Shirouzu teaches the method of claim 1, wherein the latency term is a product of the absolute value and a negative scalar value (Shirouzu, [Column 12, lines 59-65], “A latent image (potential distribution) on the imaging plate 126 is generally at a negative potential. The amount of attached toner is proportional to the absolute potential. Accordingly, it should be understood that the potential scale in FIG. 12 represents absolute values, and that individual potentials are actually negative values.”).
Vahdat, Xu and Shirouzu are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu and Shirouzu such that the latency term is a product of the absolute value and a negative scalar value. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the generation accuracy. (Shirouzu, [Column 12, lines 59-65], “A latent image (potential distribution) on the imaging plate 126 is generally at a negative potential. The amount of attached toner is proportional to the absolute potential. Accordingly, it should be understood that the potential scale in FIG. 12 represents absolute values, and that individual potentials are actually negative values.”).
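For reference, the reward function as recited across claims 1 and 7-9 (a quality term summed with a latency term, where the latency term is a negative scalar multiplied by the absolute value of the latency-to-target ratio minus one) can be sketched as follows. This is a minimal sketch of the claim language only and is not drawn from any cited reference; the function name `reward`, the parameter name `beta`, and the default value -0.1 are assumptions.

```python
def reward(quality, latency, target_latency, beta=-0.1):
    """Sum of a quality term and an absolute-value latency term.

    beta is a hypothetical negative scalar; the claims only require that the
    scalar be negative, not any particular value.
    """
    if target_latency <= 0:
        raise ValueError("target_latency must be positive")
    if beta >= 0:
        raise ValueError("beta must be negative")
    # Claim 8: difference between (i) the latency ratio and (ii) one.
    comparison = latency / target_latency - 1.0
    # Claim 9: product of the absolute value and a negative scalar.
    latency_term = beta * abs(comparison)
    # Claim 7: the reward is the sum of the quality term and the latency term.
    return quality + latency_term
```

Because of the absolute value, a candidate that is 20% slower than the target and one that is 20% faster receive the same penalty, and the latency term is zero only when the estimated latency equals the target.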
Claims 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Ovtcharov et al. (Pub No.: 20200302271 -hereinafter, Ovtcharov).
Regarding claim 14, Vahdat in view of Xu teaches the method of claim 12, wherein when all of the operations represented by all of the respective decision values are included in the candidate architecture (Vahdat, [Par.0131], “FIG. 12 illustrates an example of ImageNet performance of best models, averaged over five evaluation runs, in accordance with an embodiment. In at least one embodiment, FIG. 12 shows results obtained when only best models are trained 5 times, with different random seeds. In at least one embodiment, lowest errors obtained over 5 runs were 24.74% and 7.63% for top-1 and top-5 errors respectively, using a cell searched on ImageNet.” Examiner’s note: the output value of each run is considered the decision value; therefore, the evaluation of the operation performance corresponds to the decision value that includes the candidate architecture. For further clarification, see Claim 8, on page 57, “one or more processors to be configured to determine a network ,
[…]
applying each of the operations to each input to a point in the neural network represented by the categorical decision (Vahdat, [Par.0112], “In at least one embodiment, a same idea can be applied to a paired input cell structure by applying a different temperature to each input and operation selector. In FIG.11, memory and time required for searching for an architecture are shown with different values of q ranging in { 0.2, 0.4, 0.6, 0.8, 1.0} using REBAR. Interestingly, searches found an architecture with similar test error in a range 3.0±0.05%. However, GPU memory and time can be reduced significantly by using smaller q”),
computing, for each input, an average of the outputs of the operations for the input as an output of the point in the neural network represented by the categorical decision (Vahdat, [Par.3502, lines 3-9], “In at least one embodiment, inputs may be averaged, or any other suitable transfer function may be used. Furthermore, in at least one embodiment, neurons 3502 may include, without limitation, comparator circuits or logic that generate an output spike at neuron output 3506 when result of applying a transfer function to neuron input 3504 exceeds a threshold.”),
and storing only the inputs to the categorical decision and the outputs of the categorical decision for use in a backward pass through the neural network (Vahdat, [Par.0139], “In at least one embodiment, inference and/or training logic 1715 may include, without limitation, a code and/or data storage 1705 to store backward and/or output weight and/or input/output data corresponding to neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, code ; 
However, Vahdat and Xu do not teach determining the update for the selected architecture comprises: during a forward pass through the neural network for a batch of training examples: and during the backward pass through the neural network for the batch of training examples, recomputing the outputs of the operations by again applying each of the operations to the stored inputs to the categorical decision.
On the other hand, Ovtcharov teaches determining the update for the selected architecture comprises: during a forward pass through the neural network for a batch of training examples (Ovtcharov, [Par.0063, lines 1-11], “Once training of the child neural network 308 has completed, the routine 500 proceeds from operation 510 to operation 512, where metrics 310 for the trained child neural network 308 can be obtained and recorded such as, but not limited to, accuracy, inference time, or inference cost. The routine 500 then proceeds from operation 512 to operation 514, where a determination is made as to whether process described above is to continue. For example, the process described above can be repeated for a specified number of iterations or until hyper parameters 122 can be generated defining an ANN architecture that satisfies constraints on accuracy or inference time.” Examiner’s note: the parameters of the selected architecture continue to be generated until the time and accuracy constraints are satisfied.):
and during the backward pass through the neural network for the batch of training examples, recomputing the outputs of the operations by again applying each of the operations to the stored inputs to the categorical decision (Ovtcharov, [Par.0032], “In a "backward pass" (which might also be referred to herein as a "backward training pass") of the ANN, each layer of the ANN computes the error for the previous layer and the gradients, or updates, to the weights 110 of the layer that move the ANN's prediction toward the desired .
Vahdat, Xu and Ovtcharov are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu and Ovtcharov by having forward and backward training passes on a batch of training examples. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Ovtcharov, [Par.0024, lines 1-5], “The following detailed description is directed to technologies for quantization-aware neural architecture search. In addition to other technical benefits, the disclosed technologies can improve the accuracy or inference time of ANNs that use quantized-precision floating-point formats.”).

Regarding claim 20, Vahdat as modified in view of Xu and Ovtcharov teaches the method of claim 1, wherein the selecting comprises: when updating the shared set of parameters: storing only a proper subset of intermediate outputs generated by any given neural network having any given candidate architecture during a forward pass through the given neural network (Ovtcharov, [Par.0030-0031], “ANNs are typically trained across multiple "epochs." In each epoch, an ANN training module 106, or another component, trains an ANN over the training data in a training data set 108 in multiple steps. In each step, the ANN first makes a prediction for an instance of the training data (which might also be referred to herein as a "sample"). , 
and recomputing intermediate outputs that are not in the proper subset during a backward pass through the neural network to compute a gradient of the objective function (Ovtcharov, [Par.0032], “In a "backward pass" (which might also be referred to herein as a "backward training pass") of the ANN, each layer of the ANN computes the error for the previous layer and the gradients, or updates, to the weights 110 of the layer that move the ANN's prediction toward the desired output. The result of training an ANN is a set of weights 110 that represent a transform function that can be applied to an input with the result being a prediction 116. A modelling framework such as those described below can be used to train an ANN in this manner.”).
Vahdat, Xu and Ovtcharov are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu and Ovtcharov by having a forward and backward  The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Ovtcharov, [Par.0024, lines 1-5], “The following detailed description is directed to technologies for quantization-aware neural architecture search. In addition to other technical benefits, the disclosed technologies can improve the accuracy or inference time of ANNs that use quantized-precision floating-point formats.”).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Bowen et al. (NPL: DESIGNING NEURAL NETWORK ARCHITECTURES USING REINFORCEMENT LEARNING, Massachusetts Institute of Technology-hereinafter, Bowen).
Regarding claim 13, Vahdat and Xu do not teach the method of claim 12, wherein prior to the joint updating, updating the shared set of parameters without updating the controller policy parameters comprises: linearly decreasing p from 1 to 0 during the updating the shared set of parameters without updating the controller policy parameters.
On the other hand, Bowen teaches the method of claim 12, wherein prior to the joint updating, updating the shared set of parameters without updating the controller policy parameters comprises: linearly decreasing p from 1 to 0 during the updating the shared set of parameters without updating the controller policy parameters (Bowen, [Page.5, section 4.3], “For the iterative Q-learning updates (Equation 3), we set the Q-learning rate (α) to 0.01. In addition, we set the discount .
Vahdat, Xu and Bowen are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu and Bowen by updating the shared parameters while linearly decreasing p. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Bowen, [section 4.3], “we want to identify several well-performing model topologies, which can then be ensemble to improve prediction performance.”).
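For illustration only, the warm-up sampling recited in claims 12 and 13 (with probability p include the operations for all decision values, otherwise sample one decision value from a fixed initial distribution, with p decreased linearly from 1 to 0) can be sketched as follows. This is a sketch of the claim language and is not taken from Vahdat, Xu, or Bowen; the names `linear_p` and `sample_decision` and the step counts are assumptions.

```python
import random


def linear_p(step, total_steps):
    """Decrease p linearly from 1 at step 0 to 0 at the final step."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    return 1.0 - step / (total_steps - 1)


def sample_decision(num_values, p, initial_probs, rng):
    """Return the decision-value indices included for one categorical decision.

    With probability p every decision value is included; otherwise a single
    value is drawn from the fixed initial distribution initial_probs.
    """
    if rng.random() < p:
        return list(range(num_values))
    return rng.choices(range(num_values), weights=initial_probs, k=1)


# Hypothetical warm-up of 5 steps over a decision with three values.
rng = random.Random(0)
uniform = [1 / 3, 1 / 3, 1 / 3]
schedule = [linear_p(step, 5) for step in range(5)]
choices = [sample_decision(3, p, uniform, rng) for p in schedule]
```

At the first step (p = 1) all three decision values are always included, and at the last step (p = 0) exactly one sampled value is included.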

Claims 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Ovtcharov et al. (Pub No.: 20200302271 -hereinafter, Ovtcharov) and further in view of Gabriel et al. (NPL: Understanding and Simplifying One-Shot Architecture Search-hereinafter, Gabriel Bender).
Regarding claim 15, Vahdat, as modified in view of Xu and further in view of Ovtcharov, teaches the method of claim 1, wherein the space of candidate architectures is a space of architectures for a convolutional neural network (Ovtcharov, [Par.0014, lines 1-3], “It should be noted that applications of the QNAS ,  
wherein a particular one of the categorical decisions represents a number of output filters of a convolutional neural network layer in the convolutional neural network, wherein the decision values for the particular categorical decision correspond to different numbers of output filters ranging from a first number to a second number (Ovtcharov, [Par.0056, lines 1-10], “FIG. 4 is a neural network architecture diagram that illustrates aspects of the various processes described above for QNAS with reference to a simplified topology of an example ANN 400. In this example, the QNAS process described above was performed with a search space that includes model topology parameters 122A and quantization parameters 122B. In this example, the search space for the model topology parameters 122A was limited to three groups of layers 402A-402F, 16, 18, or 20 layers 402A-402F per group, and 32, 64, or 128 filters per group.”),
However, Vahdat, Xu and Ovtcharov do not teach wherein a candidate architecture defined by a set of decision values having a given decision value for the particular categorical decision that represents a given number of output filters for the convolutional neural network layer includes: the convolutional neural network layer with the second number of output filters but with a third number of output filters zeroed out, wherein the third number of output filters is equal to a difference between the second number and the given number.
On the other hand, Gabriel teaches wherein a candidate architecture defined by a set of decision values having a given decision value for the particular categorical decision includes the convolutional neural network layer with the maximum number of output filters but with a number of output filters equal to a difference between the maximum number and the number corresponding to the given decision value zeroed out (Gabriel, [Page 3, the left column], “The one-shot model then applies several different operations to the output of the 1x1 convolution and adds the results together. At evaluation time, we zero out or remove some of these operations from the network. In our running example, we have four possible operations: a pair of 3x3 convolutions, a pair of 5x5 convolutions, a max pooling layer, or an identity operation. However, only the 5x5 convolutions’ outputs are used when the architecture is evaluated.” Therefore, some of the pairs of convolution layers are zeroed out when the decision value reaches the maximum accuracy, which corresponds to the layers that are equal to the difference between the maximum number and the decision value. And [page 3, the right column, the last paragraph], “When training the one-shot model, we randomly zero out a subset of the ops for each batch of examples. We achieved good results by disabling path dropout at the beginning of training and gradually increasing the rate of dropout over time using a linear schedule … The higher the fan-in, the more likely each possible input is to be dropped out. However, the probability of dropping out all inputs to a node is kept constant regardless of its fan-in. Suppose r = 0.05. If a node has k = 2 inputs then each one will independently be dropped out with probability 0.05^(1/2) ≈ 0.22 and will be retained with probability 0.78.”).
Vahdat, Xu, Ovtcharov and Gabriel are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat, Xu and  The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Gabriel, [section 41, first paragraph], “detailed in the previous section, we begin by training a one-shot model on CIFAR-10. Each one-shot model was trained for 5,000 - 10,000 steps (113 - 225 epochs) on a cluster of 16 P100 GPUs. Each worker used a batch size of 64, which was divided into two ghost batches of size 32. We used a global learning rate of 0.1 and Nesterov momentum 0.9. Increasing the number of training steps improved the correlations between one-shot and stand-alone model accuracies in our experiments, but only slightly. We therefore used a shorter training period for our initial hyperparameter tuning experiments and a longer period for the model that was used in our large-scale architecture search.”).
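For illustration only, the filter zeroing recited in claim 15 (a layer built with the second, i.e. maximum, number of output filters, of which a number equal to the difference between the second number and the given number are zeroed out) can be sketched as follows. This sketch reflects the claim language and is not taken from Gabriel or Ovtcharov; the names `filter_mask` and `apply_mask`, and the choice to zero the trailing filters, are assumptions.

```python
def filter_mask(given, max_filters):
    """Build a 0/1 mask keeping `given` filters out of `max_filters`.

    The number of zeroed filters equals max_filters - given. Zeroing the
    trailing filters is an arbitrary choice made for this sketch.
    """
    if not 0 <= given <= max_filters:
        raise ValueError("given must lie between 0 and max_filters")
    return [1.0] * given + [0.0] * (max_filters - given)


def apply_mask(filter_outputs, mask):
    """Zero out the per-filter outputs that the mask disables."""
    return [output * keep for output, keep in zip(filter_outputs, mask)]


# Hypothetical layer with 4 output filters where the decision value selects 2.
masked = apply_mask([3.0, 4.0, 5.0, 6.0], filter_mask(2, 4))
```

Under this arrangement every candidate shares one layer of maximum width, and a decision value only changes which filter outputs are suppressed.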
Regarding claim 16, Vahdat teaches the method of claim 15, wherein selecting the candidate architecture comprises, for the particular categorical decision: with probability q, configuring the convolutional neural network layer to have the second number of output filters with none of the output filters zeroed out, and with probability 1 – q (Vahdat, [Par.0109-0110], “In at least one embodiment, distribution over architecture parameters are represented using a factorial distribution: 

[Equation image: media_image7.png]
”
),
sampling a decision value from a fixed initial probability distribution for the particular categorical decision and configuring the convolutional neural network layer to have the second number of output filters but with a fourth number of output filters zeroed out, wherein the fourth number is equal to a difference between the second number and the number corresponding to the sampled decision value (Gabriel, [Page 3, the left column, the third paragraph], “The one-shot model then applies several different operations to the output of the 1x1 convolution and adds the results together. At evaluation time, we zero out or remove some of these operations from the network. In our running example, we have four possible operations: a pair of 3x3 convolutions, a pair of 5x5 convolutions, a max pooling layer, or an identity operation. However, only the 5x5 convolutions’ outputs are used when the architecture is evaluated.” Therefore, some of the pairs of convolution layers are zeroed out when the decision value reaches the maximum accuracy, which corresponds to the layers that are equal to the difference between the maximum number and the decision value. And [page 3, the right column, the last paragraph], “When training the one-shot model, we randomly zero out a subset of the ops for each batch of examples. We achieved good results by disabling path dropout at the beginning of training and gradually increasing the rate of dropout over time using a linear schedule … The higher the fan-in, the more likely each possible input is to be dropped out. However, the probability of dropping out all inputs to a node is kept constant regardless of its fan-in. suppose r = 0.05. If a node has k = 2 inputs then each one will independently be dropped out with probability 0.05^(1/2) ≈ 0.22 and will be retained with probability 0.78.”).
Vahdat, Xu, Ovtcharov and Gabriel are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
 The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Gabriel, [section 41, first paragraph], “detailed in the previous section, we begin by training a one-shot model on CIFAR-10. Each one-shot model was trained for 5,000 - 10,000 steps (113 - 225 epochs) on a cluster of 16 P100 GPUs. Each worker used a batch size of 64, which was divided into two ghost batches of size 32. We used a global learning rate of 0.1 and Nesterov momentum 0.9. Increasing the number of training steps improved the correlations between one-shot and stand-alone model accuracies in our experiments, but only slightly. We therefore used a shorter training period for our initial hyperparameter tuning experiments and a longer period for the model that was used in our large-scale architecture search.”).
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Ovtcharov et al. (Pub No.: 20200302271 -hereinafter, Ovtcharov) and further in view of Gabriel et al. (NPL: Understanding and Simplifying One-Shot Architecture Search-hereinafter, Gabriel Bender) and further in view of Bowen et al. (NPL: DESIGNING NEURAL NETWORK ARCHITECTURES USING REINFORCEMENT LEARNING, Massachusetts Institute of Technology-hereinafter, Bowen).
Regarding claim 17, Vahdat as modified in view of Xu and further in view of Ovtcharov, Gabriel and Bowen teaches the method of claim 16, wherein prior to the joint updating, updating the shared set of parameters without updating the controller policy parameters comprises: linearly decreasing q from 1 to 0 during the updating the shared set of parameters without updating the controller policy parameters (Bowen, [Page 7, the first paragraph], “Figure 3, we plot the rolling mean of prediction accuracy over 100 models and the mean accuracy of models sampled at different ε values, for the CIFAR-10 and SVHN experiments. The plots show that, while the prediction accuracy remains flat during the exploration phase (ε = 1) as expected, the agent consistently improves in its ability to pick better-performing models as ε reduces from 1 to 0.1. For example, the mean accuracy of models in the SVHN experiment increases from 52.25% at ε = 1 to 88.02% at ε = 0.1. Furthermore, we demonstrate the stability of the Q-learning procedure with 10 independent runs on a subset of the SVHN dataset in Section D.1 of the Appendix. Additional analysis of Q-learning results can be found in Section D.2.”).
Vahdat, Xu, Ovtcharov, Gabriel and Bowen are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Vahdat’s method in view of Xu, Ovtcharov, Gabriel and Bowen by updating the shared parameters while linearly decreasing q. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality. (Bowen, [section 4.3], “we want to identify several well-performing model topologies, which can then be ensemble to improve prediction performance.”).
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Vahdat et al. (Pub. No: US20210073612-hereinafter, Vahdat) and further in view of Xu et al. (Pub. No.: 20210150407 – hereinafter, Xu) and further in view of Leslie et al. (NPL: A DISCIPLINED APPROACH TO NEURAL NETWORK HYPER-PARAMETERS: PART 1 – LEARNING RATE, BATCH SIZE, MOMENTUM, AND WEIGHT DECAY, Washington, DC, USA-hereinafter, Leslie).
Regarding claim 21, Vahdat and Xu do not teach the method of claim 1, wherein the joint updating comprises: exponentially increasing a learning rate of the learning updates to the controller policy parameters during the updating.
On the other hand, Leslie teaches the method of claim 1, wherein the joint updating comprises: exponentially increasing a learning rate of the learning updates to the controller policy parameters during the updating (Leslie, [Page 9], “
[Figure image: media_image10.png]

Figure 7c shows a LR range test for a shallow, 3-layer architecture on Cifar-10 with the learning rate increasing from 0.002 to 0.02. The constant momentum case is shown as the blue curve. The red curve combines the increasing learning rate with a linearly increasing momentum in the range of 0.8 to 1.0.”).
Vahdat, Xu and Leslie are analogous art because they are in the same field of endeavor of selecting a machine learning architecture based on latency and quality.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified  The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the training latency and quality.
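For illustration only, an exponentially increasing learning rate of the kind recited in claim 21 can be sketched as follows. This is a sketch of the claim language and does not reproduce the schedule shown in Leslie's Figure 7c; the function name `exponential_lr` and the start and end values are assumptions.

```python
def exponential_lr(step, total_steps, lr_start, lr_end):
    """Grow the learning rate geometrically from lr_start to lr_end.

    Each step multiplies the rate by the same constant factor, so the rate
    is an exponential function of the step index.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if lr_start <= 0 or lr_end <= lr_start:
        raise ValueError("require 0 < lr_start < lr_end")
    fraction = step / (total_steps - 1)
    return lr_start * (lr_end / lr_start) ** fraction


# Hypothetical schedule: 11 steps from 0.001 up to 0.1.
rates = [exponential_lr(step, 11, 0.001, 0.1) for step in range(11)]
```

With these example values the midpoint rate is the geometric mean of the endpoints, which distinguishes an exponential schedule from a linear one.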

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EM N TRIEU whose telephone number is (571)272-5747.  The examiner can normally be reached on 7:30 - 5:00 M-TH.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on 571 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/E.T./

/MICHAEL J HUNTLEY/Primary Examiner, Art Unit 2116