DETAILED ACTION
This action is in response to claims filed 10/01/2021 for application 16/422380 filed 05/24/2019. Claims 1-5, 9, 10, 12, 14-16, and 18-20 are amended, claims 8 and 17 are canceled, and claims 21 and 22 are new. Claims 1-7, 9-16, and 18-22 are currently pending.
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3-7, 9, 10, 12-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. ("Population Based Training of Neural Networks", hereinafter "Jaderberg") in view of Sayadi et al. ("Ensemble Learning for Effective Run-Time Hardware-Based Malware Detection: A Comprehensive Analysis and Classification", hereinafter "Sayadi") and further in view of Seema ("Classification of Evolving Stream Data using Improved Ensemble Classifier", hereinafter "Seema").

Regarding claim 1, Jaderberg teaches An artificial intelligence system for improving machine learning model adaptability, the artificial intelligence system comprising: 
a population of machine learning models (“In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance.” [pg. 1, Abstract]) configured to monitor a real-time data stream (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]) and
a controller configured for evaluating and reconfiguring the population of the machine learning models in response to changes in the data stream (“At the end of training, we report tokenized BLEU score on newstest2014 as computed by multi-bleu.pl script. We also evaluated the original hyperparameter configuration trained for the same number of steps and obtained the BLEU score of 21.23, which is lower than both our baselines and PBT results.” [pg. 21, § A.4 Detailed Results: Machine Translation; Reconfiguring the population would be equivalent to changing hyperparameters during the training process. See further pg. 2, ¶1]), the controller comprising at least one memory device with computer-readable program code stored thereon, at least one communication device connected to a network (“a practitioner need only add the ability for population members to read and write to a shared data-store (e.g. a key-value store, or a simple file-system).” [pg. 18, § A.1 Practical implementations, ¶1; communication device connected to a network is inherent in order to perform a read/write.]), and at least one processing device, wherein the at least one processing device is configured to execute the computer-readable program code to (“We used the open-sourced implementation of the Transformer framework with the provided transformer base single gpu architecture settings. This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 10^7), but sees, in each training step, 1/16 of the number of tokens (2048 vs. 8×4096) as it uses a smaller batch size” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses using GPUs which implies the use of memory and program code required to run the algorithms.]):
continuously monitor the population of the machine learning models, wherein continuously monitoring the population comprises collecting performance metrics for each of the machine learning models (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members… Population members periodically checkpoint themselves, and when they do so they write their performance to the shared data-store” [pg. 18, § A.1 Practical Implementations; Examiner is interpreting periodically checkpoint themselves to be equivalent to continuously monitoring.]), wherein the performance metrics comprise accuracy (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), resource efficiency (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]);
analyze the performance metrics for each of the machine learning models by comparing the performance metrics to threshold values (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore.” [pg. 5, ¶3; “Analyzing” would be equivalent to measuring a member’s performance and comparing to a performance threshold.]), wherein the threshold values comprise accuracy thresholds (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), computer resource use efficiency settings (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability thresholds (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability thresholds (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability thresholds (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]); and
based on analyzing the performance metrics, reconfigure the population of the machine learning models (“However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; Updating weights and hyperparameters would be equivalent to reconfiguring the population of models.]), wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on historical data (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members.” [pg. 18, § A.1 Practical implementation; The citation corresponds to historical data since data is being stored in a data store and is used to retrain the model by querying the recent performances of other models.]), real-time data (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]), adversarial data (“Finally, the various metrics used by the community to evaluate the quality of samples produced by GAN generators are necessarily distinct from those used for the adversarial optimisation itself. We explore whether we can improve the performance of generators under these metrics by directly targeting them as the PBT meta-optimisation evaluation criteria.” [pg. 11, § 4.3 Generative Adversarial Networks, ¶4]).
However, Jaderberg fails to explicitly teach analyze the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings.
Sayadi teaches analyze the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings (“HPCs are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. Performance counters data have been extensively used to predict the power, performance, and energy efficiency of computing systems” [pg. 1, §1 Introduction, ¶2; See further “For this purpose, eight robust machine learning models and two well-known ensemble learning classifiers applied on all studied ML models (sixteen in total) are implemented for malware detection and precisely compared and characterized in terms of detection accuracy, robustness, performance (accuracy×robustness), and hardware overheads” [Abstract]]).
Jaderberg and Sayadi are in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to include energy efficiency settings as one of the performance metrics as taught by Sayadi. One would have been motivated to make this modification in order to analyze the energy efficiency of each machine learning model to improve its performance. [Abstract, Sayadi]
Jaderberg/Sayadi fails to explicitly teach wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data.
Seema teaches wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data (“We evaluate our ensemble on synthetic as well as real time data, compute the precision and represent it graphically using both majority voting as well as new proposed weighted averaging and compare its performance against individual classifiers.” [Abstract; See further pg. 5-6, § 6.1 Synthetic Data]).
Jaderberg, Sayadi, and Seema are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the teachings of Jaderberg/Sayadi to include retraining the machine learning models on synthetic data as taught by Seema. One would have been motivated to make this modification in order to model specific factors that each model should be able to handle and thus improve the resulting classifications. [pg. 5, § 6.1 Synthetic Data, Seema]
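For illustration only, the train, evaluate, exploit, and explore cycle quoted from Jaderberg above can be sketched as follows. This is a minimal sketch under stated assumptions and is not Jaderberg's implementation: the toy objective, the dictionary layout of a population member, the names run_pbt, ready_every, and threshold, and the use of the performance threshold to decide which members are reconfigured are all hypothetical.

```python
import random


def step(member):
    # Hypothetical training step: move theta toward a toy optimum at 1.0.
    member["theta"] += member["lr"] * (1.0 - member["theta"])


def evaluate(member):
    # Hypothetical performance metric (higher is better, best value is 0.0).
    return -abs(1.0 - member["theta"])


def exploit(member, population):
    # Copy weights and hyperparameters from the best recorded performer.
    best = max(population, key=lambda m: m["perf"])
    if best is not member:
        member["theta"] = best["theta"]
        member["lr"] = best["lr"]


def explore(member, rng):
    # Randomly perturb the hyperparameter, capped so the toy step stays stable.
    member["lr"] = min(1.0, member["lr"] * rng.choice([0.8, 1.2]))


def run_pbt(population, rounds, ready_every, threshold, rng):
    for t in range(1, rounds + 1):
        for member in population:
            step(member)
            member["perf"] = evaluate(member)
        if t % ready_every == 0:
            # Assumption: members below the performance threshold are reconfigured.
            for member in population:
                if member["perf"] < threshold:
                    exploit(member, population)
                    explore(member, rng)
    return population


population = [{"theta": 0.0, "lr": lr, "perf": -1.0} for lr in (0.01, 0.1, 0.5)]
run_pbt(population, rounds=20, ready_every=5, threshold=-0.05, rng=random.Random(0))
```

In this sketch the two slow learners fall below the threshold at the first readiness check, copy the weights and learning rate of the best member, and then continue training with a perturbed learning rate.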

Regarding claim 3, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 1, where Jaderberg further teaches wherein the at least one processing device is configured to execute the computer-readable program code to, when reconfiguring the population of the machine learning models, change architectural parameters of the population (“its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; updating, replacing, and perturbing weights and hyperparameters would all be equivalent to reconfiguring architectural parameters.]) by at least one of adding a new model to the population, removing a current model from the population, and reweighting a current model from the population (“If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued.” [pg. 3, Figure 1; a model replacing itself with a better performing model would be equivalent to adding a new model and removing a current model. See further “For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population”, pg. 5, ¶3; this citation corresponds to reweighting a current model. Note: Under BRI, the claim limitation only requires a teaching of “at least one” of the alternatives; however, Jaderberg teaches adding, removing, and reweighting.]).
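For illustration only, the reading applied above (a model replacing itself with a better performing model being equivalent to removing one model and adding another) can be sketched at the level of the population list. This is a hypothetical sketch: the function name replace_underperformer, the member dictionary layout, and the multiplicative noise factor are assumptions and do not appear in Jaderberg or in the claims.

```python
import copy


def replace_underperformer(population, noise):
    # Rank members by recorded performance (higher is better).
    ranked = sorted(population, key=lambda m: m["perf"])
    worst, best = ranked[0], ranked[-1]

    # "Add": a copy of the best member with perturbed hyperparameters.
    clone = copy.deepcopy(best)
    clone["name"] = worst["name"] + "-replaced"
    clone["hyper"] = {k: v * noise for k, v in best["hyper"].items()}

    # "Remove": drop the worst member, then append the new one.
    return [m for m in population if m is not worst] + [clone]


pop = [
    {"name": "a", "perf": 0.2, "hyper": {"lr": 0.1}},
    {"name": "b", "perf": 0.9, "hyper": {"lr": 0.01}},
    {"name": "c", "perf": 0.5, "hyper": {"lr": 0.05}},
]
new_pop = replace_underperformer(pop, noise=1.2)
```

The population size is unchanged, the lowest-scoring member is gone, and the appended member carries the best member's state with a perturbed learning rate.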

Regarding claim 4, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 1, wherein the at least one processing device is configured to execute the computer-readable program code to, when analyzing the performance metrics for each of the machine learning models, evaluate an output diversity of the machine learning models (“In Fig. 5 (a) we demonstrate the effect of population size on the performance of PBT when training FuN on Atari. In general, we find that if the population size is too small (10 or below) we tend to encounter higher variance and can suffer from poorer results – this is to be expected as PBT is a greedy algorithm and so can get stuck in local optima if there is not sufficient population to maintain diversity and scope for exploration. However, these problems rapidly disappear as population size increases and we see improved results as the population size grows. In our experiments, we observe that a population size of between 20 and 40 is sufficient to see strong and consistent improvements; larger populations tend to fare even better, although we see diminishing returns for the cost of additional population members.” [pg. 12-13, § Population Size; Population size would be a form of evaluating output diversity.]).
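For illustration only, one way an "output diversity" of a population could be quantified is the mean pairwise disagreement rate between the models' predictions on the same inputs. This metric is an assumption chosen for the sketch; it is not recited in the claims and is not the population-size analysis Jaderberg describes. The function name output_diversity is hypothetical.

```python
from itertools import combinations


def output_diversity(predictions):
    # predictions: one equal-length list of outputs per model.
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        # Fraction of inputs on which the two models disagree.
        total += sum(x != y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


identical = output_diversity([[0, 1, 1], [0, 1, 1]])
opposite = output_diversity([[0, 0], [1, 1]])
mixed = output_diversity([[0, 1, 0, 1], [0, 1, 1, 1], [0, 1, 0, 1]])
```

A value of 0.0 means every model produced the same outputs, and a value of 1.0 means every pair of models disagreed on every input.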

Regarding claim 5, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 4, where Jaderberg further teaches wherein the at least one processing device is configured to execute the computer-readable program code to, when evaluating the output diversity of the machine learning models, determine a shared convergent output from a number of the machine learning models, and in response to determining the shared convergent output reconfigure the population of the machine learning models (“its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise. After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; Jaderberg discloses training until a convergence of the model is reached which would be equivalent to determining a shared convergent output. Additionally, weights and parameters are updated during this iterative process which would correspond to reconfiguring the population in response to determining if a convergence has been reached or not.]).

Regarding claim 6, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 1, where Jaderberg further teaches wherein the at least one processing device is further configured to execute the computer-readable program code to: 
identify at least one of a convergent output (“This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3]) and a divergent output of the machine learning models to evaluate diversity of the population (“Incorrectly chosen hyperparameters can lead to bad solutions or even a failure of the optimisation of θ to converge” [pg. 4, § 3 Population Based Training, ¶4; Incorrect hyperparameters would imply that θ would be diverging. Additionally, Jaderberg discloses “Like RL agent training, GAN training can be remarkably brittle and unstable in the face of suboptimal hyperparameter selection and even unlucky random initialisation, with generators often collapsing to a single mode or diverging entirely.” [pg. 10, § 4.3 Generative Adversarial Networks, ¶2; Hyperparameters are used to reconfigure the population which would inherently evaluate diversity of the population.]]); and
reconfigure the population of the machine learning models in response to identifying the at least one of the convergent output and the divergent output (“Incorrectly chosen hyperparameters can lead to bad solutions or even a failure of the optimisation of θ to converge. Correctly selecting hyperparameters requires strong prior knowledge on h to exist or to be found (most often through multiple optimisation processes with different h). Furthermore, due to the dependence of h on iteration step, the number of possible values grows exponentially with time. Consequently, it is common practise to either make all h_t equal to each other (e.g. constant learning rate through entire training, or constant regularisation strength) or to predefine a simple schedule (e.g. learning rate annealing). In both cases one needs to search over multiple possible values of h.” [pg. 4, § 3 Population Based Training, ¶4; Jaderberg discloses searching for values of h (hyperparameters) based on cases where θ converges and fails to converge (i.e. diverges). Using this knowledge, tuning is done to the training process in order to find the correct hyperparameters for convergence which would be equivalent to reconfiguring the population.]).
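For illustration only, the distinction drawn above between optimisation that converges and optimisation that fails to converge can be sketched as a classification of a model's recorded loss history. This is a hypothetical sketch following the reading applied in this rejection (training convergence versus divergence); the function name classify_trajectory, the tolerance tol, and the window length are assumptions and are not taken from Jaderberg.

```python
import math


def classify_trajectory(losses, tol=1e-3, window=3):
    # Non-finite losses are treated as a failure to converge.
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "divergent"
    if len(losses) < window + 1:
        return "undetermined"
    recent = losses[-(window + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    # Converged: the last `window` changes are all smaller than tol.
    if all(abs(d) < tol for d in deltas):
        return "convergent"
    # Diverging: the loss increased on each of the last `window` steps.
    if all(d > 0 for d in deltas):
        return "divergent"
    return "undetermined"


settled = classify_trajectory([1.0, 0.5, 0.2501, 0.25, 0.2502, 0.2501])
blown_up = classify_trajectory([1.0, 2.0, 4.0, 8.0])
noisy = classify_trajectory([1.0, 0.5, 0.8, 0.3])
```

A controller could use such a label as the trigger for reconfiguring a member, for example by perturbing the hyperparameters of a member labelled divergent.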

Regarding claim 7, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 6, where Jaderberg further teaches wherein the at least one processing device is further configured to execute the computer-readable program code to inject at least one of the convergent output and the divergent output back into the data stream ([image: media_image1.png] [pg. 4, § 3 Population Based Training, ¶3; Sequence of updates would imply training using (i.e. injecting) the previous output (i.e. a converging/diverging output) until the output reaches convergence.]), wherein the at least one of the convergent output and the divergent output are used to incrementally train the machine learning models (“After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; incrementally training would be equivalent to updating weights and perturbing the weights at each time step until convergence is reached.]).

Regarding claim 9, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 1, wherein the at least one processing device is further configured to execute the computer-readable program code to, when retraining the machine learning models, retrain the machine learning models incrementally over a predetermined period of time (“This iterative optimisation process can be computationally expensive, due to the number of steps T required to find θ∗ as well as the computational cost of each individual step, often resulting in the optimisation of θ taking days, weeks, or even months.” [pg. 4, § 3 Population Based Training, ¶4; Each individual step is equivalent to training incrementally.]).
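For illustration only, incremental retraining over a predetermined period of time can be sketched as repeated individual training steps bounded by a deadline. This is a minimal sketch under stated assumptions: the function name retrain_incrementally, the step_fn callback, and the injectable clock are hypothetical and are not taken from Jaderberg or the claims.

```python
import time


def retrain_incrementally(theta, step_fn, period_s, clock=time.monotonic):
    # The period is fixed before training starts (the "predetermined" part).
    deadline = clock() + period_s
    steps = 0
    while clock() < deadline:
        theta = step_fn(theta)  # one incremental update per iteration
        steps += 1
    return theta, steps


# Usage with a fake clock that advances one unit per call, so the result
# does not depend on wall-clock time.
ticks = iter(range(100))
theta, steps = retrain_incrementally(
    0.0,
    lambda th: th + 0.5 * (1.0 - th),
    period_s=3,
    clock=lambda: next(ticks),
)
```

Passing the clock as a parameter is a design choice for the sketch only; it makes the period testable without waiting in real time.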

Regarding claim 10, Jaderberg teaches A computer-implemented method for improving machine learning model adaptability within an artificial intelligence system, the computer-implemented method comprising:
providing a population of machine learning models (“In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance.” [pg. 1, Abstract]) configured to monitor a real-time data stream (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]) and
providing a controller configured for evaluating and reconfiguring the population of the machine learning models in response to changes in the data stream (“At the end of training, we report tokenized BLEU score on newstest2014 as computed by multi-bleu.pl script. We also evaluated the original hyperparameter configuration trained for the same number of steps and obtained the BLEU score of 21.23, which is lower than both our baselines and PBT results.” [pg. 21, § A.4 Detailed Results: Machine Translation; Reconfiguring the population would be equivalent to changing hyperparameters during the training process. See further pg. 2, ¶1]), the controller comprising at least one memory device with computer-readable program code stored thereon, at least one communication device connected to a network (“a practitioner need only add the ability for population members to read and write to a shared data-store (e.g. a key-value store, or a simple file-system).” [pg. 18, § A.1 Practical implementations, ¶1; communication device connected to a network is inherent in order to perform a read/write.]), and at least one processing device; (“We used the open-sourced implementation of the Transformer framework with the provided transformer base single gpu architecture settings. This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 10^7), but sees, in each training step, 1/16 of the number of tokens (2048 vs. 8×4096) as it uses a smaller batch size” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses using GPUs which implies the use of memory and program code required to run the algorithms.]):
continuously monitoring, with the controller, the population of the machine learning models, wherein continuously monitoring the population comprises collecting performance metrics for each of the machine learning models (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members… Population members periodically checkpoint themselves, and when they do so they write their performance to the shared data-store” [pg. 18, § A.1 Practical Implementations; Examiner is interpreting periodically checkpoint themselves to be equivalent to continuously monitoring.]), wherein the performance metrics comprise accuracy (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), resource efficiency (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]);
analyzing, with the controller, the performance metrics for each of the machine learning models by comparing the performance metrics to threshold values (“Each member of the population is trained in parallel, with iterative calls to step to update the member’s weights and eval to measure the member’s current performance. However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore.” [pg. 5, ¶3; “Analyzing” would be equivalent to measuring a member’s performance and comparing to a performance threshold.]), wherein the threshold values comprise accuracy thresholds (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), computer resource use efficiency settings (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability thresholds (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability thresholds (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability thresholds (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]); and
based on analyzing the performance metrics, reconfiguring, with the controller, the population of the machine learning models (“However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; Updating weights and hyperparameters would be equivalent to reconfiguring the population of models.]), wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on historical data (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members.” [pg. 18, § A.1 Practical implementation; The citation corresponds to historical data since data is being stored in a data store and is used to retrain the model by querying the recent performances of other models.]), real-time data (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]), adversarial data (“Finally, the various metrics used by the community to evaluate the quality of samples produced by GAN generators are necessarily distinct from those used for the adversarial optimisation itself. We explore whether we can improve the performance of generators under these metrics by directly targeting them as the PBT meta-optimisation evaluation criteria.” [pg. 11, § 4.3 Generative Adversarial Networks, ¶4]).
However, Jaderberg fails to explicitly teach analyzing the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings.
Sayadi teaches analyzing the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings (“HPCs are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. Performance counters data have been extensively used to predict the power, performance, and energy efficiency of computing systems” [pg. 1, §1 Introduction, ¶2; See further “For this purpose, eight robust machine learning models and two well-known ensemble learning classifiers applied on all studied ML models (sixteen in total) are implemented for malware detection and precisely compared and characterized in terms of detection accuracy, robustness, performance (accuracy×robustness), and hardware overheads” [Abstract]])
Jaderberg and Sayadi are in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s teachings to include energy efficiency settings as one of the performance metrics as taught by Sayadi. One would have been motivated to make this modification in order to analyze the energy efficiency of each machine learning model to improve its performance. [Abstract, Sayadi]
Jaderberg/Sayadi fails to explicitly teach wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data.
Seema teaches wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data (“We evaluate our ensemble on synthetic as well as real time data, compute the precision and represent it graphically using both majority voting as well as new proposed weighted averaging and compare its performance against individual classifiers.” [Abstract; See further pg. 5-6, § 6.1 Synthetic Data]).
Jaderberg, Sayadi, and Seema are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the teachings of Jaderberg/Sayadi to include retraining the machine learning models on synthetic data as taught by Seema. One would have been motivated to make this modification in order to model specific factors that each model should be able to handle and thus improve the resulting classifications. [pg. 5, § 6.1 Synthetic Data, Seema]

 Regarding claim 12, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 10, where Jaderberg further teaches wherein reconfiguring the population of the machine learning models comprises changing architectural parameters of the population (“its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; updating, replacing, perturbing weights and hyperparameters would all be equivalent to reconfiguring architectural parameters.]) 
by at least one of adding a new model to the population, removing a current model from the population, and reweighting a current model from the population (“If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued.” [pg. 3, Figure 1, a model replacing itself with a better performing model would be equivalent to adding a new model and removing a current model. See further “For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population”, pg. 5, ¶3; This citation corresponds to reweighting a current model. note: The claim limitation under BRI only requires the examiner to teach “at least one”, however Jaderberg teaches adding, removing, and reweighting.]).
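
For illustration only, the exploit and explore updates quoted from Jaderberg above can be sketched as follows. This is a minimal sketch under stated assumptions and is not Jaderberg's implementation: the names Member, exploit, explore, and exploit_and_explore, the multiplicative perturbation, and the 0.2 scale are hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Member:
    """One population member (hypothetical structure for this sketch)."""
    weights: list
    hyperparams: dict
    performance: float = 0.0


def exploit(member, population):
    """Copy weights and hyperparameters from the best member if it outperforms this one."""
    best = max(population, key=lambda m: m.performance)
    if best.performance > member.performance:
        member.weights = list(best.weights)          # copy, do not alias
        member.hyperparams = dict(best.hyperparams)
        return True
    return False


def explore(member, rng, scale=0.2):
    """Perturb each hyperparameter by a random factor of (1 - scale) or (1 + scale)."""
    for name, value in member.hyperparams.items():
        member.hyperparams[name] = value * rng.choice([1.0 - scale, 1.0 + scale])


def exploit_and_explore(population, rng):
    """Members that copied a better member then explore; the best member is left unchanged."""
    for member in population:
        if exploit(member, population):
            explore(member, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    population = [
        Member([1.0], {"lr": 0.1}, performance=0.9),
        Member([2.0], {"lr": 0.01}, performance=0.5),
    ]
    exploit_and_explore(population, rng)
    print(population)
```

In this sketch the under-performing member ends up with the better member's weights and a perturbed copy of its hyperparameters, which corresponds to the replacement and reweighting behavior relied upon above.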

Regarding claim 13, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 10, where Jaderberg further teaches wherein analyzing the performance metrics for each of the machine learning models further comprises evaluating an output diversity of the machine learning models (“In Fig. 5 (a) we demonstrate the effect of population size on the performance of PBT when training FuN on Atari. In general, we find that if the population size is too small (10 or below) we tend to encounter higher variance and can suffer from poorer results – this is to be expected as PBT is a greedy algorithm and so can get stuck in local optima if there is not sufficient population to maintain diversity and scope for exploration. However, these problems rapidly disappear as population size increases and we see improved results as the population size grows. In our experiments, we observe that a population size of between 20 and 40 is sufficient to see strong and consistent improvements; larger populations tend to fare even better, although we see diminishing returns for the cost of additional population members.” [pg. 12-13, § Population Size; Population size would be a form of evaluating output diversity.]).

Regarding claim 14, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 13, where Jaderberg further teaches wherein evaluating the output diversity of the machine learning models further comprises 
determining a shared convergent output from a number of the machine learning models, and in response to determining the shared convergent output reconfiguring the population of the machine learning models (“its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise. After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; Jaderberg discloses training until a convergence of the model is reached which would be equivalent to determining a shared convergent output. Additionally, weights and parameters are updated during this iterative process which would correspond to reconfiguring the population in response to determining if a convergence has been reached or not.]).

Regarding claim 15, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 10, where Jaderberg further teaches comprising: 
identifying, with the controller, at least one of a convergent output (“This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3]) and a divergent output of the machine learning models to evaluate diversity of the population (“Incorrectly chosen hyperparameters can lead to bad solutions or even a failure of the optimisation of θ to converge” [pg. 4, 3 Population Based Training, ¶4; Incorrect hyperparameters would imply that θ would be diverging. Additionally, Jaderberg discloses “Like RL agent training, GAN training can be remarkably brittle and unstable in the face of suboptimal hyperparameter selection and even unlucky random initialisation, with generators often collapsing to a single mode or diverging entirely.” [pg. 10, § 4.3 Generative Adversarial Networks, ¶2; Hyperparameters are used to reconfigure the population which would inherently evaluate diversity of the population.]]); and 
reconfiguring, with the controller, the population of the machine learning models in response to identifying the at least one of the convergent output and the divergent output (“Incorrectly chosen hyperparameters can lead to bad solutions or even a failure of the optimisation of θ to converge. Correctly selecting hyperparameters requires strong prior knowledge on h to exist or to be found (most often through multiple optimisation processes with different h). Furthermore, due to the dependence of h on iteration step, the number of possible values grows exponentially with time. Consequently, it is common practise to either make all ht equal to each other (e.g. constant learning rate through entire training, or constant regularisation strength) or to predefine a simple schedule (e.g. learning rate annealing). In both cases one needs to search over multiple possible values of h.” [pg. 4, § 3 Population Based Training, ¶4; Jaderberg discloses searching for values of h (hyperparameters) based off cases where θ converges and fails to converge (i.e. diverge). Using this knowledge, tuning is done to the training process in order to find the correct hyperparameters for convergence which would be equivalent to reconfiguring the population.]).

Regarding claim 16, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 15, where Jaderberg further teaches comprising injecting at least one of the convergent output and the divergent output back into the data stream (“[equation image: media_image1.png]” [pg. 4, § 3 Population Based Training, ¶3; Sequence of updates would imply training using (i.e. injecting) the previous output (i.e. a converging/diverging output) until the output reaches convergence.]), wherein the at least one of the convergent output and the divergent output are used to incrementally train the machine learning models (“After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; incrementally training would be equivalent to updating weights and perturbing the weights at each time step until convergence is reached.]).

Regarding claim 18, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 10, where Jaderberg further teaches wherein reconfiguring the population of the machine learning models comprises retraining the machine learning models incrementally over a predetermined period of time (“This iterative optimisation process can be computationally expensive, due to the number of steps T required to find θ∗ as well as the computational cost of each individual step, often resulting in the optimisation of θ taking days, weeks, or even months.” [pg. 4, § 3 Population Based Training, ¶4; Each individual step is equivalent to training incrementally.]).
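
For illustration only, the mapping of "each individual step" to incremental training can be sketched as a loop that applies one update per entry of a hyperparameter schedule. This is a minimal sketch under stated assumptions and is not taken from Jaderberg: the toy objective f(theta) = theta**2, the use of a learning rate as the only hyperparameter, and the names step and optimise as defined here are hypothetical.

```python
def step(theta, lr):
    """One gradient-descent update on the toy objective f(theta) = theta**2."""
    grad = 2.0 * theta
    return theta - lr * grad


def optimise(theta, schedule):
    """Apply step once per hyperparameter in the schedule, recording each intermediate value."""
    history = [theta]
    for lr in schedule:
        theta = step(theta, lr)
        history.append(theta)
    return theta, history


if __name__ == "__main__":
    final_theta, history = optimise(1.0, [0.25, 0.25])
    print(final_theta, history)
```

Each pass through the loop consumes the output of the previous pass, so training proceeds incrementally over the T entries of the schedule.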

Claims 2, 11, and 19-22 are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg in view of Sayadi and Seema and further in view of Signorelli et al. ("Model-based clustering for populations of networks", hereinafter "Signorelli").

Regarding claim 2, Jaderberg/Sayadi/Seema teaches The artificial intelligence system of claim 1, however fails to explicitly teach wherein the population of the machine learning models are clustered into a plurality of sub-populations, and wherein the at least one processing device is configured to execute the computer-readable program code to, when analyzing the performance metrics for each of the machine learning models, hierarchically evaluate at least a portion of the sub-populations.
Signorelli teaches wherein the population of the machine learning models are clustered into a plurality of sub-populations (“In this paper we consider the existence of clusters of graphs with similar f (Y|θk): if any such cluster exists, we would like to borrow information among graphs within that cluster, so as to estimate a joint model within the cluster rather than many separate network models. As a result, we assume that the sequence S arises from M ≤ K subpopulations of graph models” [pg. 4, § 2.1 Specification of the mixture model, ¶2]), and wherein the at least one processing device is configured to execute the computer-readable program code to, when analyzing the performance metrics for each of the machine learning models, hierarchically evaluate at least a portion of the sub-populations (“To illustrate the proposed methodology, we cluster the 10 daily networks into two subpopulations and describe differences in the pattern of interactions between departments in these subpopulations (given the small number of graphs, we do not consider more than two clusters). We initialize the EM algorithm with 10 different starting points, and select the solution with the highest maximized likelihood.” [pg. 16, bottom para, Analyzing the differences between networks in these subpopulations would be equivalent to evaluating a portion of sub-populations. Additionally, “hierarchically evaluating” would correspond to selecting the “highest maximized likelihood” when comparing networks.]).
Jaderberg, Sayadi, Seema, and Signorelli are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. Signorelli discloses a model-based clustering method for populations of networks. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the population of models disclosed by Jaderberg/Sayadi/Seema by clustering the population into sub-populations as taught by Signorelli. One would have been motivated to make this modification in order to improve computing time by identifying subpopulations and selecting the best model among the cluster. [pg. 7, § Simulations, Signorelli]

Regarding claim 11, Jaderberg/Sayadi/Seema teaches The computer-implemented method of claim 10, however fails to explicitly teach wherein the population of the machine learning models are clustered into a plurality of sub-populations, and wherein analyzing the performance metrics for each of the machine learning models further comprises hierarchically evaluating at least a portion of the sub-populations.
Signorelli teaches wherein the population of the machine learning models are clustered into a plurality of sub-populations (“In this paper we consider the existence of clusters of graphs with similar f (Y|θk): if any such cluster exists, we would like to borrow information among graphs within that cluster, so as to estimate a joint model within the cluster rather than many separate network models. As a result, we assume that the sequence S arises from M ≤ K subpopulations of graph models” [pg. 4, § 2.1 Specification of the mixture model, ¶2]), and wherein analyzing the performance metrics for each of the machine learning models further comprises hierarchically evaluating at least a portion of the sub-populations (“To illustrate the proposed methodology, we cluster the 10 daily networks into two subpopulations and describe differences in the pattern of interactions between departments in these subpopulations (given the small number of graphs, we do not consider more than two clusters). We initialize the EM algorithm with 10 different starting points, and select the solution with the highest maximized likelihood.” [pg. 16, bottom para, Analyzing the differences between networks in these subpopulations would be equivalent to evaluating a portion of sub-populations. Additionally, “hierarchically evaluating” would correspond to selecting the “highest maximized likelihood” when comparing networks.]).
Jaderberg, Sayadi, Seema, and Signorelli are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. Signorelli discloses a model-based clustering method for populations of networks. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the population of models disclosed by Jaderberg/Sayadi/Seema by clustering the population into sub-populations as taught by Signorelli. One would have been motivated to make this modification in order to improve computing time by identifying subpopulations and selecting the best model among the cluster. [pg. 7, § Simulations, Signorelli]

Regarding claim 19, Jaderberg teaches An artificial intelligence system for improving machine learning model adaptability, the artificial intelligence system comprising: 
the population (“In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance.” [pg. 1, Abstract]) being configured to collaboratively monitor a real-time data stream (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]); and 
a controller configured for evaluating and reconfiguring the population of the machine learning models in response to changes in the data stream (“At the end of training, we report tokenized BLEU score on newstest2014 as computed by multi-bleu.pl script. We also evaluated the original hyperparameter configuration trained for the same number of steps and obtained the BLEU score of 21.23, which is lower than both our baselines and PBT results.” [pg. 21, § A.4 Detailed Results: Machine Translation; Reconfiguring the population would be equivalent to changing hyperparameters during the training process. See further pg. 2, ¶1]), the controller comprising at least one memory device with computer-readable program code stored thereon, at least one communication device connected to a network (“a practitioner need only add the ability for population members to read and write to a shared data-store (e.g. a key-value store, or a simple file-system).” [pg. 18, § A.1 Practical implementations, ¶1; communication device connected to a network is inherent in order to perform a read/write.]), and at least one processing device, wherein the at least one processing device is configured to execute the computer-readable program code to (“We used the open-sourced implementation of the Transformer framework with the provided transformer base single gpu architecture settings. This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 10^7), but sees, in each training step, 1/16 of the number of tokens (2048 vs. 8×4096) as it uses a smaller batch size” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses using GPUs which implies the use of memory and program code required to run the algorithms.]): 
continuously monitor the population of the machine learning models, wherein continuously monitoring the population comprises collecting performance metrics for the population (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members… Population members periodically checkpoint themselves, and when they do so they write their performance to the shared data-store” [pg. 18, § A.1 Practical Implementations; Examiner is interpreting periodically checkpoint themselves to be equivalent to continuously monitoring.]), wherein the performance metrics comprise accuracy (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), resource efficiency (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]); 
analyze the performance metrics for the population by comparing the performance metrics to threshold values (“However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; Updating weights and hyperparameters would be equivalent to reconfiguring the population of models.]), wherein the threshold values comprise accuracy thresholds (“More importantly, the actual performance metric Q that we truly care to optimise is often different to $\hat{Q}$, for example Q could be accuracy on a validation set, or BLEU score as used in machine translation” [pg. 4, § 3. Population Based Training, ¶1]), computer resource use efficiency settings (“In addition, the model selection and propagation process ensures that intermediate good models are given more computational resources, and are used as a basis of further optimisation and hyperparameter search.” [pg. 2, § Introduction, ¶3; See further Figure 1 caption]), reliability thresholds (“With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models.” [Abstract, reliability is implied by training of the models]), stability thresholds (“We have shown consistent improvements in accuracy, training time and stability across a wide range of domains by being able to optimise over weights and hyperparameters jointly.” [pg. 13, § 5 Conclusions, ¶1]), and adaptability thresholds (“In Fig. 5 (d) we provide another demonstration that the benefits of PBT go beyond simply finding a single good fixed hyperparameter combination. Since PBT allows online adaptation of hyperparameters during training, we evaluate how important the adaptation is by comparing full PBT performance compared to using the set of hyperparameters that PBT found by the end of training” [pg. 13, § Hyperparameter Adaptivity]), and
wherein analyzing the performance metrics comprises evaluating an output diversity of the population (“In Fig. 5 (a) we demonstrate the effect of population size on the performance of PBT when training FuN on Atari. In general, we find that if the population size is too small (10 or below) we tend to encounter higher variance and can suffer from poorer results – this is to be expected as PBT is a greedy algorithm and so can get stuck in local optima if there is not sufficient population to maintain diversity and scope for exploration. However, these problems rapidly disappear as population size increases and we see improved results as the population size grows. In our experiments, we observe that a population size of between 20 and 40 is sufficient to see strong and consistent improvements; larger populations tend to fare even better, although we see diminishing returns for the cost of additional population members.” [pg. 12-13, § Population Size; Population size would be a form of evaluating output diversity.]); and 
based on analyzing the performance metrics, reconfigure the population (“However, when a member of the population is deemed ready (for example, by having been optimised for a minimum number of steps or having reached a certain performance threshold), its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise.” [pg. 5, ¶3; Updating weights and hyperparameters would be equivalent to reconfiguring the population of models.]), wherein reconfiguring comprises incrementally retraining the population (“This iterative optimisation process can be computationally expensive, due to the number of steps T required to find θ∗ as well as the computational cost of each individual step, often resulting in the optimisation of θ taking days, weeks, or even months.” [pg. 4, § 3 Population Based Training, ¶4; Each individual step is equivalent to training incrementally.]) based on historical data (“Members of the population interact with this data-store to update their current performance and can also use it to query the recent performance of other population members.” [pg. 18, § A.1 Practical implementation; The citation corresponds to historical data since data is being stored in a data store and is used to retrain the model by querying the recent performances of other models.]), real-time data (“We used newstest2012 and newstest2013 respectively as the evaluation set used by PBT, and as the test set for monitoring.” [pg. 21, § A.4 Detailed Results: Machine Translation; Jaderberg discloses data-store to be a key-value store or simple file-system which would imply the use of real-time data. [pg. 18, § A.1 Practical implementations, ¶1]]), adversarial data (“Finally, the various metrics used by the community to evaluate the quality of samples produced by GAN generators are necessarily distinct from those used for the adversarial optimisation itself. We explore whether we can improve the performance of generators under these metrics by directly targeting them as the PBT meta-optimisation evaluation criteria.” [pg. 11, § 4.3 Generative Adversarial Networks, ¶4]).
However, Jaderberg fails to explicitly teach a population of machine learning models clustered into a plurality of hierarchical sub-populations;
for the hierarchical sub-populations; and
the portion of the hierarchical sub-populations.
Signorelli teaches a population of machine learning models clustered into a plurality of hierarchical sub-populations (“In this paper we consider the existence of clusters of graphs with similar f (Y|θk): if any such cluster exists, we would like to borrow information among graphs within that cluster, so as to estimate a joint model within the cluster rather than many separate network models. As a result, we assume that the sequence S arises from M ≤ K subpopulations of graph models” [pg. 4, § 2.1 Specification of the mixture model, ¶2])
for the hierarchical sub-populations (“In this Section we evaluate the performance of the proposed clustering method with respect to network size (represented by the number of nodes v), the number of networks K and the number of subpopulations M.” [pg. 7, 4 Simulations, ¶1; See further pg. 16, bottom para discloses comparing networks and selecting the highest maximized likelihood which corresponds to “hierarchical”])
the portion of the hierarchical sub-populations (“To illustrate the proposed methodology, we cluster the 10 daily networks into two subpopulations and describe differences in the pattern of interactions between departments in these subpopulations (given the small number of graphs, we do not consider more than two clusters). We initialize the EM algorithm with 10 different starting points, and select the solution with the highest maximized likelihood.” [pg. 16, bottom para, Analyzing the differences between networks in these subpopulations would be equivalent to evaluating a portion of sub-populations. Additionally, “hierarchical” would correspond to selecting the highest maximized likelihood when comparing networks.]).
Jaderberg and Signorelli are both in the same field of endeavor of population based training of neural networks. Jaderberg discloses a population based training method. Signorelli discloses a model-based clustering method for populations of networks. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s population based training method by performing the training steps on the sub-populations disclosed by Signorelli. One would have been motivated to make this modification in order to improve computing time by identifying subpopulations and selecting the best model among the cluster. [pg. 7, § Simulations, Signorelli]
However, Jaderberg/Signorelli fails to explicitly teach analyze the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings.
Sayadi teaches analyze the performance metrics for each of the machine learning models by comparing the performance metrics to energy efficiency settings (“HPCs are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. Performance counters data have been extensively used to predict the power, performance, and energy efficiency of computing systems” [pg. 1, §1 Introduction, ¶2; See further “For this purpose, eight robust machine learning models and two well-known ensemble learning classifiers applied on all studied ML models (sixteen in total) are implemented for malware detection and precisely compared and characterized in terms of detection accuracy, robustness, performance (accuracy×robustness), and hardware overheads” [Abstract]])
Jaderberg, Signorelli and Sayadi are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Signorelli discloses a model-based clustering method for populations of networks. Sayadi discloses training an ensemble of machine learning models based on hardware performance. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jaderberg’s/Signorelli’s teachings to include energy efficiency settings as one of the performance metrics as taught by Sayadi. One would have been motivated to make this modification in order to analyze the energy efficiency of each machine learning model to improve its performance. [Abstract, Sayadi]
Jaderberg/Signorelli/Sayadi fails to explicitly teach and reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data.
Seema teaches and reconfiguring the population of the machine learning models comprises retraining the machine learning models based on synthetically generated data (“We evaluate our ensemble on synthetic as well as real time data, compute the precision and represent it graphically using both majority voting as well as new proposed weighted averaging and compare its performance against individual classifiers.” [Abstract; See further pg. 5-6, § 6.1 Synthetic Data]).
Jaderberg, Signorelli, Sayadi, and Seema are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Signorelli discloses a model-based clustering method for populations of networks. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the teachings of Jaderberg/Signorelli/Sayadi to include retraining the machine learning models on synthetic data as taught by Seema. One would have been motivated to make this modification in order to model specific factors that each model should be able to handle and thus improve the resulting classifications. [pg. 5, § 6.1 Synthetic Data, Seema]

Regarding claim 20, Jaderberg/Signorelli/Sayadi/Seema teaches The artificial intelligence system of claim 19, where Jaderberg further teaches wherein reconfiguring the population further comprises changing architectural parameters comprising at least one of adding a new model, removing a current model, and reweighting a current model (“If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued.” [pg. 3, Figure 1, a model replacing itself with a better performing model would be equivalent to adding a new model and removing a current model. See further “For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population”, pg. 5, ¶3; This citation corresponds to reweighting a current model]).
Signorelli further teaches the at least a portion of the hierarchical sub-populations (“To illustrate the proposed methodology, we cluster the 10 daily networks into two subpopulations and describe differences in the pattern of interactions between departments in these subpopulations (given the small number of graphs, we do not consider more than two clusters). We initialize the EM algorithm with 10 different starting points, and select the solution with the highest maximized likelihood.” [pg. 16, bottom para, Analyzing the differences between networks in these subpopulations would be equivalent to evaluating a portion of sub-populations. Additionally, “hierarchical” would correspond to selecting the highest maximized likelihood when comparing networks.]).
of the hierarchical sub-populations ([pg. 7, § 4 Simulations, ¶1])
Jaderberg, Sayadi, Seema, and Signorelli are all in the same field of endeavor of training multiple machine learning models in a population. Jaderberg discloses a population based training method. Sayadi discloses training an ensemble of machine learning models based on hardware performance. Seema teaches training an ensemble of classifiers on synthetic and real time data. Signorelli discloses a model-based clustering method for populations of networks. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the population of models disclosed by Jaderberg/Sayadi/Seema by clustering the population into sub-populations as taught by Signorelli. One would have been motivated to make this modification in order to improve computing time by identifying subpopulations and selecting the best model among the cluster. [pg. 7, § Simulations, Signorelli]

Regarding claim 21, Jaderberg/Signorelli/Sayadi/Seema teaches The artificial intelligence system of claim 19, where Jaderberg further teaches wherein the at least one processing device is configured to execute the computer-readable program code to, when analyzing the performance metrics for each of the machine learning models, evaluate an output diversity of the machine learning models (“In Fig. 5 (a) we demonstrate the effect of population size on the performance of PBT when training FuN on Atari. In general, we find that if the population size is too small (10 or below) we tend to encounter higher variance and can suffer from poorer results – this is to be expected as PBT is a greedy algorithm and so can get stuck in local optima if there is not sufficient population to maintain diversity and scope for exploration. However, these problems rapidly disappear as population size increases and we see improved results as the population size grows. In our experiments, we observe that a population size of between 20 and 40 is sufficient to see strong and consistent improvements; larger populations tend to fare even better, although we see diminishing returns for the cost of additional population members.” [pg. 12-13, § Population Size; Population size would be a form of evaluating output diversity.]).

Regarding claim 22, Jaderberg/Signorelli/Sayadi/Seema teaches The artificial intelligence system of claim 21, where Jaderberg further teaches wherein the at least one processing device is configured to execute the computer-readable program code to, when evaluating the output diversity of the machine learning models, determine a shared convergent output from a number of the machine learning models, and in response to determining the shared convergent output reconfigure the population of the machine learning models (“its weights and hyperparameters are updated by exploit and explore. For example, exploit could replace the current weights with the weights that have the highest recorded performance in the rest of the population, and explore could randomly perturb the hyperparameters with noise. After exploit and explore, iterative training continues using step as before. This cycle of local iterative training (with step) and exploitation and exploration using the rest of the population (with exploit and explore) is repeated until convergence of the model.” [pg. 5, ¶3; Jaderberg discloses training until a convergence of the model is reached which would be equivalent to determining a shared convergent output. Additionally, weights and parameters are updated during this iterative process which would correspond to reconfiguring the population in response to determining if a convergence has been reached or not.]).
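For illustration only, the step/exploit/explore cycle quoted from Jaderberg above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Jaderberg or from the application: the toy objective, the `TARGET` value, the perturbation factors 0.8 and 1.2, the learning-rate cap, the stopping tolerance, and all function and field names are hypothetical choices made for the example.

```python
import random

TARGET = 3.0  # hypothetical optimum of a toy one-parameter objective


def evaluate(member):
    """Performance metric: higher is better (negative squared error)."""
    return -(member["weights"] - TARGET) ** 2


def step(member):
    """One local gradient-descent update using the member's own learning rate."""
    grad = 2.0 * (member["weights"] - TARGET)
    member["weights"] -= member["lr"] * grad


def exploit_and_explore(member, best, rng):
    """Exploit: copy the best member's weights. Explore: perturb its learning rate."""
    member["weights"] = best["weights"]
    factor = rng.choice([0.8, 1.2])  # assumed perturbation factors
    member["lr"] = min(0.9, best["lr"] * factor)  # assumed cap keeps the toy update stable


def population_based_training(size=8, steps_per_round=10, max_rounds=50,
                              tol=1e-12, seed=0):
    rng = random.Random(seed)
    population = [
        {"weights": rng.uniform(-5.0, 5.0), "lr": rng.uniform(0.05, 0.3)}
        for _ in range(size)
    ]
    previous_best = float("-inf")
    best = population[0]
    for _round in range(max_rounds):
        # Local iterative training for every member of the population.
        for member in population:
            for _s in range(steps_per_round):
                step(member)
        best = max(population, key=evaluate)
        best_score = evaluate(best)
        # Stop once the best score no longer improves (a simple convergence test).
        if best_score - previous_best < tol:
            break
        previous_best = best_score
        # Every other member exploits the best member and then explores.
        for member in population:
            if member is not best:
                exploit_and_explore(member, best, rng)
    return best
```

In this sketch the population is reconfigured after each round by overwriting the weights and hyperparameters of every member other than the current best, and the loop ends when the best score stops improving. The stopping rule is one simple way to express "repeated until convergence" and is an assumption of the example.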

Response to Arguments
Regarding Claims 1-7, 9-16, and 18-22:
Applicant’s arguments on pgs. 11-17 regarding the 35 U.S.C. § 101 rejection of amended claims 1, 10, and 19 have been considered and are persuasive. Therefore, the rejection has been withdrawn.

Regarding the rejection of claims 1, 10, and 19 under 35 U.S.C. §102 and §103:

Applicant’s arguments on pgs. 17-19 with respect to amended claims 1, 10, and 19, that the cited prior art of Jaderberg and Signorelli fails to explicitly teach the amended features, have been considered but are not persuasive. Jaderberg teaches performance metrics comprising accuracy, resource, reliability, stability, and adaptability; however, Jaderberg fails to explicitly teach threshold values comprising energy efficiency settings, which are now taught by the newly presented art of Sayadi. Furthermore, Jaderberg teaches retraining the machine learning models based on historical/real-time/adversarial data but does not teach retraining based on synthetically generated data, which is now taught by the newly presented art of Seema. Please see the updated 103 rejection including the newly presented art for amended claims 1, 10, and 19.


Conclusion
Applicant's amendment necessitated the new grounds of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./Examiner, Art Unit 2122    



/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122