DETAILED ACTION
This action is in response to the claims filed 01/15/2019 for application 16/248,670 filed 01/15/2019. Claims 1-20 are currently pending. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 06/16/2020 and 04/19/2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 3-11, 13-16, and 18-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, 
Step 1 Analysis: Claim 1 is directed to a process, which falls within one of the four statutory categories. 
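Step 2A Prong 1 Analysis: Claim 1 recites, in part, adjusting a plurality of hyperparameters…, asynchronously measuring one or more performance metrics, and ceasing the adjusting of the plurality of hyperparameters. The limitations of adjusting a plurality of hyperparameters…, asynchronously measuring one or more performance metrics, and ceasing the adjusting of the plurality of hyperparameters, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitation in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.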
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element “plurality of neural networks”, which is only generally linked to the judicial exception. Additionally, the claim recites a “plurality of computer systems”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a plurality of neural networks to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the plurality of computer systems amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. 

Regarding claim 3, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein adjusting the plurality of hyperparameters comprises selecting a set of hyperparameter values for each neural network in the plurality of neural networks. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 4, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein asynchronously measuring the one or more performance metrics comprises collecting the one or more performance metrics at an end of a training phase used to train a first neural network after a second neural network has previously completed the training phase. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 5, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein ceasing the adjusting of the plurality of hyperparameters comprises increasing a proportion of the plurality of neural networks for which the adjusting of the plurality of hyperparameters is ceased with successive training phases used to train the plurality of neural networks. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 6, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of neural networks based on the one or more performance metrics associated with training the plurality of neural networks. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 7, the rejection of claim 6 is further incorporated, and further, the claim recites: wherein selecting the one or more of the plurality of neural networks comprises: selecting a first neural network that completes a training phase at a first time for continued training; and selecting a second neural network with a performance metric that is lower than the threshold and that completes the training phase at a second time that is later than the first time for inclusion in the one or more of the plurality of neural networks. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 8, the rejection of claim 6 is further incorporated, and further, the claim recites: wherein selecting the one or more of the plurality of neural networks further comprises adjusting at least one of the first time and the second time based on a number of computational resources used to train at least one of the first neural network and the second neural network. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 9, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the threshold comprises a quantile associated with the one or more performance metrics. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above. 
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 10, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the plurality of hyperparameters comprise at least one of a learning rate, a regularization parameter, a convergence parameter, a model topology, a number of training samples, a parameter-optimization technique, and a model type. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above. 
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 11, 
Step 1 Analysis: Claim 11 is directed to a non-transitory computer readable medium, i.e., a manufacture, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 11 recites, in part, adjust a plurality of hyperparameters…, asynchronously measure one or more performance metrics, and ceasing the adjusting of the plurality of hyperparameters. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitation in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element “plurality of neural networks”, which is only generally linked to the judicial exception. Additionally, the claim recites a “non-transitory computer readable medium”, a “processor”, and a “plurality of computer systems”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a plurality of neural networks to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the non-transitory computer readable medium, processor, and plurality of computer systems amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. 

Regarding claim 13, the rejection of claim 11 is further incorporated, and further, the claim recites: wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of machine learning models based on the one or more performance metrics and an eviction rate associated with training of the plurality of machine learning models. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 11, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 14, the rejection of claim 11 is further incorporated, and further, the claim recites: wherein asynchronously measuring the one or more performance metrics associated with the plurality of machine learning models being trained comprises collecting the one or more performance metrics up to a maximum number of training phases used to asynchronously train the plurality of machine learning models. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 11, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 15, the rejection of claim 11 is further incorporated, and further, the claim recites: wherein adjusting the plurality of hyperparameters comprises selecting a set of hyperparameter values for each machine learning model in the plurality of machine learning models. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 11, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 16, 
Step 1 Analysis: Claim 16 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 16 recites, in part, adjust a plurality of hyperparameters…, asynchronously measure one or more performance metrics, and ceasing the adjusting of the plurality of hyperparameters. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitation in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element “plurality of neural networks”, which is only generally linked to the judicial exception. Additionally, the claim recites a “memory”, a “processor”, and a “plurality of computer systems”. These elements are recited at a high level of generality (i.e., as a generic processor performing a generic computer function of generating an index) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of utilizing a plurality of neural networks to perform the steps of the claimed process amount to no more than generally linking the elements to the judicial exception. Additionally, the memory, processor, and plurality of computer systems amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. 

Regarding claim 18, the rejection of claim 16 is further incorporated, and further, the claim recites: wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of machine learning models based on the one or more performance metrics, an eviction rate associated with training the plurality of machine learning models, and training speeds associated with training the plurality of machine learning models. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 16, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 19, the rejection of claim 16 is further incorporated, and further, the claim recites: wherein the plurality of hyperparameters comprise at least one of a learning rate, a regularization parameter, a convergence parameter, a model topology, a number of training samples, a parameter-optimization technique, and a model type. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 16 above. 
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 20, the rejection of claim 16 is further incorporated, and further, the claim recites: wherein the plurality of machine learning models comprise a neural network. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 16 above. 
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Claims 2, 12, and 17 recite additional elements or steps that amount to a practical application of the abstract idea or significantly more than the exception and would be eligible if incorporated into the respective parent independent claim.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-8, 10-12, 14-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. ("Population Based Training of Neural Networks", cited by Applicant in the IDS filed 06/16/2020, hereinafter "Jaderberg") in view of Li et al. ("Massively Parallel Hyperparameter Tuning", hereinafter "Li").


Regarding claim 1, Jaderberg teaches A method, comprising: 
adjusting a plurality of hyperparameters corresponding to a plurality of neural networks trained asynchronously relative to each other using a plurality of computer systems (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 caption; Jaderberg discloses workers in population based training, thus examiner is interpreting workers to imply a plurality of computer systems [pg. 9, § 4.2.1 PBT for Machine Translation]]); 
asynchronously measuring one or more performance metrics associated with the plurality of neural networks being trained (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued” [pg. 3, Figure 1 caption; See further: “More importantly, the actual performance metric Q that we truly care to optimise is often different to Q̂, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation. The main purpose of PBT is to provide a way to optimise both the parameters θ and the hyperparameters h jointly on the actual metric Q that we care about.” [pg. 4, § 3 Population Based Training, ¶1]]); and 
Although Jaderberg discloses performance metrics in population based training, the reference does not explicitly disclose ceasing the adjusting of the plurality of hyperparameters.
Li teaches ceasing the adjusting of the plurality of hyperparameters corresponding to one or more of the plurality of neural networks if the one or more performance metrics associated with the one or more of the plurality of neural networks are below a threshold (“SHA requires the number of configurations n, a minimum resource r, a maximum resource R, a reduction factor η ≥ 2, and a minimum early-stopping rate s. Additionally, the get_hyperparameter_configuration(n) subroutine returns n configurations sampled randomly from a given search space; and the run_then_return_val_loss(θ, r) subroutine returns the validation loss after training the model with the hyperparameter setting θ and for r resources. For a given early-stopping rate s, a minimum resource of r_0 = rη^s will be allocated to each configuration. Hence, lower s corresponds to more aggressive early-stopping, with s = 0 prescribing a minimum resource of r” [pg. 3, § 3.1 Successive Halving (SHA), ¶2]).
Jaderberg and Li are both in the same field of endeavor, namely asynchronous hyperparameter tuning in distributed environments, and thus are analogous art. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to implement Li’s early stopping algorithm. One would have been motivated to make this combination to find a suitable hyperparameter configuration to train new models. [¶1, § 1 Introduction, Li]
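
For illustration only, the successive halving passage quoted from Li above can be sketched as follows. This is a minimal sketch under stated assumptions, not Li's implementation: the function names sha_schedule and successive_halving and the toy val_loss are hypothetical, and the per-rung counts (n divided by η at each rung, resource r·η^(i+s)) are an assumed reading of the quoted description.

```python
def sha_schedule(n, r, R, eta, s=0):
    """List (number of configurations, resource per configuration) for each rung.

    Assumed reading of the quoted passage: rung i keeps n // eta**i
    configurations and trains each for r * eta**(i + s) resources,
    so the minimum resource is r_0 = r * eta**s.
    """
    # Count how many times r can be multiplied by eta without exceeding R.
    s_max, budget = 0, r
    while budget * eta <= R:
        budget *= eta
        s_max += 1
    return [(n // eta ** i, r * eta ** (i + s)) for i in range(s_max - s + 1)]


def successive_halving(configs, val_loss, r, R, eta, s=0):
    """Train survivors rung by rung, keeping the best 1/eta fraction each time."""
    survivors = list(configs)
    for _, r_i in sha_schedule(len(survivors), r, R, eta, s):
        ranked = sorted(survivors, key=lambda c: val_loss(c, r_i))
        # Configurations outside the kept fraction are stopped early.
        survivors = ranked[: max(1, len(ranked) // eta)]
    return survivors


# Hypothetical toy loss: configurations closer to 0.3 score better.
toy_loss = lambda c, resource: abs(c - 0.3) / resource
configs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
winner = successive_halving(configs, toy_loss, r=1, R=9, eta=3)
```

With n = 9, r = 1, R = 9, and η = 3, the schedule has three rungs, and a larger s skips the cheapest rungs, which corresponds to the less aggressive early stopping described in the quotation.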

Regarding claim 2, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches, upon ceasing the adjusting of the plurality of hyperparameters corresponding to the one or more of the plurality of neural networks, asynchronously initiating training of one or more additional neural networks on a subset of the plurality of computer systems previously used to train the one or more of the plurality of neural networks (“Population based training starts like parallel search, randomly sampling hyperparameters and weight initialisations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 Caption]).

Regarding claim 3, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches wherein adjusting the plurality of hyperparameters comprises selecting a set of hyperparameter values for each neural network in the plurality of neural networks (“Hyperparameters: Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training. When training UNREAL on DeepMind Lab we find that the strength of PBT is in allowing the hyperparameters to be adaptive, not merely in finding a good prior on the space of hyperparameters.” [pg. 9, Figure 5, Caption; See Abstract: “In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance”]).

Regarding claim 4, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches wherein asynchronously measuring the one or more performance metrics comprises collecting the one or more performance metrics at an end of a training phase used to train a first neural network after a second neural network has previously completed the training phase (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued.” [pg. 3, Figure 1 Caption]).

Regarding claim 5, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches wherein ceasing the adjusting of the plurality of hyperparameters comprises increasing a proportion of the plurality of neural networks for which the adjusting of the plurality of hyperparameters is ceased with successive training phases used to train the plurality of neural networks (“In Fig. 5 (a) we demonstrate the effect of population size on the performance of PBT when training FuN on Atari. In general, we find that if the population size is too small (10 or below) we tend to encounter higher variance and can suffer from poorer results – this is to be expected as PBT is a greedy algorithm and so can get stuck in local optima if there is not sufficient population to maintain diversity and scope for exploration. However, these problems rapidly disappear as population size increases and we see improved results as the population size grows. In our experiments, we observe that a population size of between 20 and 40 is sufficient to see strong and consistent improvements; larger populations tend to fare even better, although we see diminishing returns for the cost of additional population members.” [pg. 12-13, § 4.4 Analysis, Population Size; Jaderberg discloses increasing population sizes with PBT.]).

Regarding claim 6, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of neural networks based on the one or more performance metrics associated with training the plurality of neural networks (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise. After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; See further: “The improvements we show empirically are the result of (a) automatic selection of hyperparameters during training, (b) online model selection to maximise the use of computation spent on promising models, and (c) the ability for online adaptation of hyperparameters to enable non-stationary training regimes and the discovery of complex hyperparameter schedules” [pg. 2, 3]]).
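
For illustration only, the step, eval, ready, exploit, and explore cycle quoted from Jaderberg above can be sketched as follows. This is a minimal sketch under stated assumptions, not Jaderberg's implementation: it runs the population sequentially rather than asynchronously, and the toy objective, the "bottom half copies the best member" rule, the perturbation factors 0.8 and 1.2, and the cap on the learning rate are all hypothetical choices.

```python
import random


def pbt(population, step, evaluate, ready, exploit, explore, n_rounds):
    """Sequential sketch of the quoted train / exploit / explore cycle."""
    for t in range(n_rounds):
        for member in population:
            # Local training followed by a performance measurement.
            member["theta"] = step(member["theta"], member["h"])
            member["score"] = evaluate(member["theta"])
            if ready(member, t):
                donor = exploit(member, population)
                if donor is not member:
                    # Copy the better weights, then perturb its hyperparameter.
                    member["theta"] = donor["theta"]
                    member["h"] = explore(donor["h"])
                    member["score"] = donor["score"]
    return max(population, key=lambda m: m["score"])


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def evaluate(theta):
    return -(theta - 3.0) ** 2  # toy metric, maximised at theta = 3


def step(theta, h):
    return theta + h * 2.0 * (3.0 - theta)  # one gradient-ascent step


def ready(member, t):
    return (t + 1) % 5 == 0  # eligible for exploit/explore every 5 rounds


def exploit(member, population):
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    bottom_half = ranked[len(ranked) // 2:]
    # Under-performing members copy the current best member.
    return ranked[0] if any(m is member for m in bottom_half) else member


def explore(h):
    return min(h * rng.choice([0.8, 1.2]), 0.5)  # capped to keep steps stable


population = [{"theta": 0.0, "h": h, "score": evaluate(0.0)}
              for h in (0.001, 0.01, 0.05, 0.1)]
best = pbt(population, step, evaluate, ready, exploit, explore, n_rounds=30)
```

In this sketch the selection of which members are overwritten depends only on the measured scores, which is the aspect of the quoted passage relied upon for the performance-based selection.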

Regarding claim 7, Jaderberg/Li teaches The method of claim 6, where Jaderberg further teaches wherein selecting the one or more of the plurality of neural networks comprises: 
selecting a first neural network that completes a training phase at a first time for continued training (“In this work we focus on optimising neural networks for reinforcement learning, supervised learning, and generative modelling with PBT (Sect. 4). In these cases, step is a step of gradient descent (with e.g. SGD or RMSProp (Tieleman & Hinton, 2012)), eval is the mean episodic return or validation set performance of the metric we aim to optimise, exploit selects another member of the population to copy the weights and hyperparameters from, and explore creates new hyperparameters for the next steps of gradient-based learning by either perturbing the copied hyperparameters or resampling hyperparameters from the originally defined prior distribution” [pg. 5, ¶4; exploit implies a first neural network was trained in order to copy the weights/hyperparameters.]); and 
selecting a second neural network with a performance metric that is lower than the threshold and that completes the training phase at a second time that is later than the first time for inclusion in the one or more of the plurality of neural networks (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore” [pg. 5, ¶3]).

Regarding claim 8, Jaderberg/Li teaches The method of claim 6, where Li further teaches wherein selecting the one or more of the plurality of neural networks further comprises adjusting at least one of the first time and the second time based on a number of computational resources used to train at least one of the first neural network and the second neural network (“ASHA is well-suited for the large-scale regime, where wall-clock time is constrained to a small multiple of the time needed to train a single model. For ease of comparison with SHA, assume training time scales linearly with the resource. Consider the example of Bracket 0 shown in Figure 1, and assume we can run ASHA with 9 machines. Then ASHA returns a fully trained configuration in 13/9 × time(R), since 9 machines are sufficient to promote configurations to the next rung in the same time it takes to train a single configuration in the rung. Hence, the training time for a configuration in rung 0 is 1/9 × time(R), for rung 1 it is 1/3 × time(R), and for rung 2 it is time(R).” [pg. 4, 3.2 Asynchronous SHA (ASHA), ¶2]).
Jaderberg and Li are both in the same field of endeavor of asynchronous hyperparameter tuning in distributed environments, and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to implement Li’s early stopping algorithm. One would have been motivated to make this combination to find a suitable hyperparameter configuration to train new models. [¶1, § 1 Introduction, Li]
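For illustration only, the 13/9 × time(R) figure in the quoted Li passage can be checked with a short calculation. The sketch assumes, as the passage does, that training time scales linearly with the resource and that enough machines are available to fill each rung in parallel; the reduction factor of 3 and the three-rung bracket are assumptions inferred from the 1/9, 1/3, and 1 fractions in the quote, and the function name is hypothetical.

```python
from fractions import Fraction


def asha_time_to_first_full_config(eta: int, num_rungs: int) -> Fraction:
    """Wall-clock time, in units of time(R), until one configuration completes the top rung.

    Rung k trains with resource R / eta**(top - k), so under linear scaling it
    costs that same fraction of time(R).
    """
    top = num_rungs - 1
    return sum(Fraction(1, eta ** (top - k)) for k in range(num_rungs))


# Per-rung training times for the assumed 3-rung bracket with eta = 3.
rung_times = [Fraction(1, 3 ** (2 - k)) for k in range(3)]
total = asha_time_to_first_full_config(eta=3, num_rungs=3)
print(rung_times, total)
```

The per-rung times sum to the total stated in the quote, which is why the time at which a given configuration completes a rung depends on the number of machines available.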

Regarding claim 10, Jaderberg/Li teaches The method of claim 1, where Jaderberg further teaches wherein the plurality of hyperparameters comprise at least one of a learning rate, a regularization parameter, a convergence parameter, a model topology, a number of training samples, a parameter-optimization technique, and a model type (“Correctly selecting hyperparameters requires strong prior knowledge on h to exist or to be found (most often through multiple optimisation processes with different h). Furthermore, due to the dependence of h on iteration step, the number of possible values grows exponentially with time. Consequently, it is common practise to either make all h_t equal to each other (e.g. constant learning rate through entire training, or constant regularisation strength) or to predefine a simple schedule (e.g. learning rate annealing). In both cases one needs to search over multiple possible values of h” [pg. 4, 3 Population Based Training, ¶5; note BRI of the claims requires “at least one of”, examiner has provided a citation corresponding to a learning rate/regularization parameter.]).

Regarding claim 11, Jaderberg teaches A non-transitory computer readable medium storing instructions that, when executed by a processor, cause the processor to at least (“We used the open-sourced implementation of the Transformer framework with the provided transformer base single gpu architecture settings. This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 10^7), but sees, in each training step, 1/16 of the number of tokens (2048 vs. 8×4096) as it uses a smaller batch size.” [pg. 21, A.4 Machine Translation, ¶1; implies use of memory and processors]):
adjust a plurality of hyperparameters corresponding to a plurality of neural networks trained asynchronously relative to each other using a plurality of computer systems (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 caption; Jaderberg discloses workers in population based training, thus examiner is interpreting workers to imply a plurality of computer systems [pg. 9, § 4.2.1 PBT for Machine Translation]]); 
asynchronously measure one or more performance metrics associated with the plurality of neural networks being trained (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued” [pg. 3, Figure 1 caption; See further: “More importantly, the actual performance metric Q that we truly care to optimise is often different to Q̂, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation. The main purpose of PBT is to provide a way to optimise both the parameters θ and the hyperparameters h jointly on the actual metric Q that we care about.” [pg. 4, § 3 Population Based Training, ¶1]]); and 
Although Jaderberg discloses performance metrics in population based training, the reference does not explicitly teach ceasing the adjusting of the plurality of hyperparameters.
Li teaches cease the adjusting of the plurality of hyperparameters corresponding to one or more of the plurality of neural networks if the one or more performance metrics associated with the one or more of the plurality of neural networks are below a threshold (“SHA requires the number of configurations n, a minimum resource r, a maximum resource R, a reduction factor η ≥ 2, and a minimum early-stopping rate s. Additionally, the get_hyperparameter_configuration(n) subroutine returns n configurations sampled randomly from a given search space; and the run_then_return_val_loss(θ, r) subroutine returns the validation loss after training the model with the hyperparameter setting θ and for r resources. For a given early-stopping rate s, a minimum resource of r_0 = rη^s will be allocated to each configuration. Hence, lower s corresponds to more aggressive early-stopping, with s = 0 prescribing a minimum resource of r” [pg. 3, § 3.1 Successive Halving (SHA), ¶2]).
Jaderberg and Li are both in the same field of endeavor of asynchronous hyperparameter tuning in distributed environments, and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to implement Li’s early stopping algorithm. One would have been motivated to make this combination to find a suitable hyperparameter configuration to train new models. [¶1, § 1 Introduction, Li]
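For illustration only, the early-stopping behavior in the quoted SHA passage can be sketched as below. This is a simplified synchronous sketch under stated assumptions, not Li's algorithm as published: the function `successive_halving`, the toy `toy_val_loss`, and the list of learning rates are hypothetical, and the only quantities taken from the quote are the inputs r, R, η, s and the minimum resource r_0 = rη^s.

```python
import math


def successive_halving(configs, val_loss, r, R, eta=3, s=0):
    """Train all configs at r_0 = r * eta**s, then repeatedly keep the best 1/eta.

    Configurations that are not promoted stop receiving resources, i.e. they
    are early-stopped. Returns the best surviving config and a per-rung
    history of (resource, number of configs trained).
    """
    survivors = list(configs)
    resource = r * eta ** s  # minimum resource r_0 from the quoted passage
    history = []
    while True:
        scored = sorted(survivors, key=lambda c: val_loss(c, resource))
        history.append((resource, len(scored)))
        if resource * eta > R or len(scored) <= 1:
            return scored[0], history
        survivors = scored[: max(1, len(scored) // eta)]  # promote the top 1/eta
        resource *= eta


def toy_val_loss(lr, resource):
    # Hypothetical loss: best near lr = 0.01, improving with more resource.
    return abs(math.log10(lr) + 2) + 1.0 / resource


learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]
best_s0, history_s0 = successive_halving(learning_rates, toy_val_loss, r=1, R=9, eta=3, s=0)
best_s1, history_s1 = successive_halving(learning_rates, toy_val_loss, r=1, R=9, eta=3, s=1)
```

With s = 0 every configuration starts at the smallest resource and two rounds of elimination occur; with s = 1 each configuration receives three times as much resource before the first elimination, which is the sense in which a lower s is the more aggressive early-stopping setting.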

Regarding claim 12, Jaderberg/Li teaches The non-transitory computer-readable medium of claim 11, further storing instructions that, when executed by the processor, cause the processor to at least asynchronously initiate training of one or more additional machine learning models on a subset of the plurality of computer systems previously used to train the one or more of the plurality of machine learning models (“Population based training starts like parallel search, randomly sampling hyperparameters and weight initialisations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 Caption]).

Regarding claim 14, Jaderberg/Li teaches The non-transitory computer-readable medium of claim 11, where Jaderberg further teaches wherein asynchronously measuring the one or more performance metrics associated with the plurality of machine learning models being trained comprises collecting the one or more performance metrics up to a maximum number of training phases used to asynchronously train the plurality of machine learning models (“However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise. After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model” [pg. 5, ¶3]).

Regarding claim 15, Jaderberg/Li teaches The non-transitory computer-readable medium of claim 11, where Jaderberg teaches wherein adjusting the plurality of hyperparameters comprises selecting a set of hyperparameter values for each machine learning model in the plurality of machine learning models (“Hyperparameters: Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training. When training UNREAL on DeepMind Lab we find that the strength of PBT is in allowing the hyperparameters to be adaptive, not merely in finding a good prior on the space of hyperparameters.” [pg. 9, Figure 5, Caption; See Abstract: “In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance”]).

Regarding claim 16, Jaderberg teaches A system, comprising:
a memory storing one or more instructions; and 
a processor that executes the instructions to at least (“We used the open-sourced implementation of the Transformer framework with the provided transformer base single gpu architecture settings. This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 10^7), but sees, in each training step, 1/16 of the number of tokens (2048 vs. 8×4096) as it uses a smaller batch size.” [pg. 21, A.4 Machine Translation, ¶1; implies use of memory and processors]):
adjust a plurality of hyperparameters corresponding to a plurality of neural networks trained asynchronously relative to each other using a plurality of computer systems (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 caption; Jaderberg discloses workers in population based training, thus examiner is interpreting workers to imply a plurality of computer systems [pg. 9, § 4.2.1 PBT for Machine Translation]]); 
asynchronously measure one or more performance metrics associated with the plurality of neural networks being trained (“However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued” [pg. 3, Figure 1 caption; See further: “More importantly, the actual performance metric Q that we truly care to optimise is often different to Q̂, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation. The main purpose of PBT is to provide a way to optimise both the parameters θ and the hyperparameters h jointly on the actual metric Q that we care about.” [pg. 4, § 3 Population Based Training, ¶1]]); and 
Although Jaderberg discloses performance metrics in population based training, the reference does not explicitly teach ceasing the adjusting of the plurality of hyperparameters.
Li teaches cease the adjusting of the plurality of hyperparameters corresponding to one or more of the plurality of neural networks if the one or more performance metrics associated with the one or more of the plurality of neural networks are below a threshold (“SHA requires the number of configurations n, a minimum resource r, a maximum resource R, a reduction factor η ≥ 2, and a minimum early-stopping rate s. Additionally, the get_hyperparameter_configuration(n) subroutine returns n configurations sampled randomly from a given search space; and the run_then_return_val_loss(θ, r) subroutine returns the validation loss after training the model with the hyperparameter setting θ and for r resources. For a given early-stopping rate s, a minimum resource of r_0 = rη^s will be allocated to each configuration. Hence, lower s corresponds to more aggressive early-stopping, with s = 0 prescribing a minimum resource of r” [pg. 3, § 3.1 Successive Halving (SHA), ¶2]).
Jaderberg and Li are both in the same field of endeavor of asynchronous hyperparameter tuning in distributed environments, and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to implement Li’s early stopping algorithm. One would have been motivated to make this combination to find a suitable hyperparameter configuration to train new models. [¶1, § 1 Introduction, Li]

Regarding claim 17, Jaderberg/Li teaches The system of claim 16, where Jaderberg further teaches wherein the processor further executes the instructions to at least asynchronously initiate training of one or more additional machine learning models on a subset of the plurality of computer systems previously used to train the one or more of the plurality of machine learning models (“Population based training starts like parallel search, randomly sampling hyperparameters and weight initialisations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 Caption]).

Regarding claim 19, Jaderberg/Li teaches The system of claim 16, where Jaderberg further teaches wherein the plurality of hyperparameters comprise at least one of a learning rate, a regularization parameter, a convergence parameter, a model topology, a number of training samples, a parameter-optimization technique, and a model type (“Correctly selecting hyperparameters requires strong prior knowledge on h to exist or to be found (most often through multiple optimisation processes with different h). Furthermore, due to the dependence of h on iteration step, the number of possible values grows exponentially with time. Consequently, it is common practise to either make all h_t equal to each other (e.g. constant learning rate through entire training, or constant regularisation strength) or to predefine a simple schedule (e.g. learning rate annealing). In both cases one needs to search over multiple possible values of h” [pg. 4, 3 Population Based Training, ¶5; note BRI of the claims requires “at least one of”, examiner has provided a citation corresponding to a learning rate/regularization parameter.]).

Regarding claim 20, Jaderberg/Li teaches The system of claim 16, where Jaderberg further teaches wherein the plurality of machine learning models comprise a neural network (“When our model is a neural network, we generally optimise the weights θ in an iterative manner, e.g. by using stochastic gradient descent on the objective function Q.” [pg. 4, § 3 Population Based Training, ¶4]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg in view of Li and further in view of Thornton et al. ("Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms", hereinafter "Thornton").

Regarding claim 9, Jaderberg/Li teaches The method of claim 1, however fails to explicitly teach wherein the threshold comprises a quantile associated with the one or more performance metrics.
Thornton teaches wherein the threshold comprises a quantile associated with the one or more performance metrics (“Here, c∗ is chosen as the γ-quantile of the losses TPE obtained so far (where γ is an algorithm parameter with a default value of γ = 0.15), ℓ(·) is a density estimate learned from all previous hyperparameters λ with corresponding loss smaller than c∗, and g(·) is a density estimate learned from all previous hyperparameters λ with corresponding loss greater than or equal to c∗. Intuitively, this creates a probabilistic density estimator ℓ(·) for hyperparameters that appear to do ‘well’, and a different density estimator g(·) for hyperparameters that appear ‘poor’ with respect to the threshold.” [pg. 849, 3.2. Tree-structured Parzen Estimator (TPE), ¶2]).
Jaderberg, Li, and Thornton are all in the same field of endeavor of hyperparameter optimization and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. Thornton teaches a method for hyperparameter optimization of classification algorithms. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s/Li’s teachings to implement a threshold comprising a quantile as taught by Thornton. One would have been motivated to make this combination in order to determine whether a hyperparameter configuration is performing well. [pg. 849, 3.2. Tree-structured Parzen Estimator (TPE), ¶2, Thornton]
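For illustration only, a quantile-based threshold of the kind described in the quoted Thornton passage can be sketched as follows. The sketch covers only the thresholding step (choosing c∗ and splitting past configurations into the two groups from which the density estimates would be learned), not the density estimation itself. The nearest-rank quantile definition, the function names, and the synthetic trial data are assumptions made for the example; only the default γ = 0.15 and the "smaller than" / "greater than or equal to" split come from the quote.

```python
import math


def gamma_quantile(losses, gamma=0.15):
    """Nearest-rank gamma-quantile of the observed losses (an assumed definition)."""
    ordered = sorted(losses)
    # round() guards against floating-point noise in gamma * n before taking the ceiling
    rank = max(1, math.ceil(round(gamma * len(ordered), 9)))
    return ordered[rank - 1]


def split_by_threshold(trials, gamma=0.15):
    """Split (config, loss) pairs into those below c* and those at or above c*."""
    c_star = gamma_quantile([loss for _, loss in trials], gamma)
    good = [cfg for cfg, loss in trials if loss < c_star]    # data for the 'well' estimator
    poor = [cfg for cfg, loss in trials if loss >= c_star]   # data for the 'poor' estimator
    return c_star, good, poor


# Synthetic history: 20 hypothetical configurations with losses 0.05, 0.10, ..., 1.00.
trials = [(f"cfg{i}", i / 20) for i in range(1, 21)]
c_star, good, poor = split_by_threshold(trials)
```

Because the threshold is a quantile of the observed losses rather than a fixed value, it moves as more configurations are evaluated.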

Claims 13 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg in view of Li and further in view of Li et al. ("Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits", cited by the Applicant in the IDS filed 04/19/2021, hereinafter "Li2").

Regarding claim 13, Jaderberg/Li teaches The non-transitory computer-readable medium of claim 11, where Jaderberg further teaches wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of machine learning models based on the one or more performance metrics (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore” [pg. 5, ¶3]) 
However, Jaderberg/Li fails to explicitly teach an eviction rate associated with training of the plurality of machine learning models.
Li2 teaches an eviction rate associated with training of the plurality of machine learning models (“First, we require user-defined inputs to specify the horizon, namely upper and lower bounds on the budget that can be allocated to an arm, e.g., an upper (lower) bound may be the full dataset size (minimum required sample) or desired maximum iteration (minimum required iteration). Second, we allow a variable rate of elimination, η ≥ 2 instead of halving at each step (η = 2).” [pg. 4, §3.1. SuccessiveHalving, ¶1; Examiner interpreting a rate of elimination to be equivalent to eviction rate.]).
Jaderberg, Li, and Li2 are all in the same field of endeavor of hyperparameter optimization and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. Li2 teaches an efficient hyperparameter optimization method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s/Li’s teachings to implement a variable elimination rate as taught by Li2. One would have been motivated to make this modification in order to speed up hyperparameter configuration evaluation by eliminating weaker ones. [pg. 2, top left col, ¶1, Li2]

Regarding claim 18, Jaderberg/Li teaches The system of claim 16, where Jaderberg teaches wherein ceasing the adjusting of the plurality of hyperparameters comprises selecting the one or more of the plurality of machine learning models based on the one or more performance metrics (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore” [pg. 5, ¶3]) 
and training speeds associated with training the plurality of machine learning models (“This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.” [pg. 3, Figure 1 Caption]).
However, Jaderberg/Li fails to explicitly teach an eviction rate associated with training the plurality of machine learning models.
Li2 teaches an eviction rate associated with training the plurality of machine learning models (“First, we require user-defined inputs to specify the horizon, namely upper and lower bounds on the budget that can be allocated to an arm, e.g., an upper (lower) bound may be the full dataset size (minimum required sample) or desired maximum iteration (minimum required iteration). Second, we allow a variable rate of elimination, η ≥ 2 instead of halving at each step (η = 2).” [pg. 4, §3.1. SuccessiveHalving, ¶1; Examiner interpreting a rate of elimination to be equivalent to eviction rate.]).
Jaderberg, Li, and Li2 are all in the same field of endeavor of hyperparameter optimization and thus are analogous. Jaderberg teaches population based training. Li teaches parallel hyperparameter tuning. Li2 teaches an efficient hyperparameter optimization method. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s/Li’s teachings to implement a variable elimination rate as taught by Li2. One would have been motivated to make this modification in order to speed up hyperparameter configuration evaluation by eliminating weaker ones. [pg. 2, top left col, ¶1, Li2]


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Koch et al. (Autotune: A Derivative-free Optimization Framework for.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./Examiner, Art Unit 2122  

/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122