DETAILED ACTION
1. This Office Action is in response to the amendments and arguments filed 3 October 2022 for application 15/453342, filed on 8 March 2017. Claims 1-6, 8-15, and 17-20 are currently pending. Claims 7 and 16 were previously canceled.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-6, 8-15, and 17-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 6, 9-13, 15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Efficient Hyper-parameter Optimization for NLP Applications,” 2015, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2112-2117), hereinafter referred to as Wang, in view of Swersky et al. (“Multi-Task Bayesian Optimization,” In Proc. of NIPS, 2013, pp. 1-9), hereinafter referred to as Swersky, and in further view of Swersky et al. (“Freeze-Thaw Bayesian Optimization,” https://arxiv.org/pdf/1406.3896.pdf, arXiv:1406.3896v1 [stat.ML] 16 Jun 2014, pp. 1-12), hereinafter referred to as Swersky2.

In regards to claim 1, Wang teaches a method for providing hyperparameter tuning relating to a predictive learning model, comprising: identifying a number of rounds to perform hyperparameter tuning evaluations ([p. 2114, Section 2.3 and Algorithm 1], This multi-stage algorithm subsumes the standard Bayesian optimization algorithm as a special case when the total number of stages S=1. In our case, for datasets used at stages 1, …, S−1, we use random sampling of full training data to get subsets of data required at these initial stages, while stage S has full data., wherein a regression model (predictive model) is used to perform hyperparameter tuning/optimization over S stages/rounds.);
performing the hyperparameter tuning evaluations for the identified number of rounds ([p. 2114, Section 2.3 and Algorithm 1] wherein Algorithm 1 discloses performing hyperparameter tuning evaluations for the S stages/rounds, the number of rounds being denoted by the number of stages S.), each round of the hyperparameter tuning evaluations including the steps of:
retrieving a selected number of hyperparameter value sets ([p. 2114, Section 2.3, Algorithm 1, Table 1], During each stage s, the k best configurations (based on validation accuracy) passed from the previous stage¹ are first evaluated on the current stage’s training data $T^{s}_{\text{train}}$. … ¹A special case is the initial stage. We adopt the convention that a Sobol sequence is used to initialize the first stage. The value k for the first stage is the number of points in the Sobol sequence., wherein k configurations (sets) of hyperparameters are selected at each round/stage (wherein it is noted that Table 1 provides an example of the particular hyperparameters in each configuration/set).);
evaluating the selected hyperparameter value sets against a training data for the predictive learning model for a selected duration, the selected duration including a number …; ([p. 2113, Section 2.1, p. 2114, Section 2.3, Algorithm 1], Let $\lambda = \{\lambda_{1}, \ldots, \lambda_{m}\}$ denote the hyperparameters of a machine learning algorithm, and let $\{\Lambda_{1}, \ldots, \Lambda_{m}\}$ denote their respective domains. When trained with λ on training data $T_{\text{train}}$, the validation accuracy on $T_{\text{valid}}$ is denoted as $L(\lambda, T_{\text{train}}, T_{\text{valid}})$., During each stage s, the k best configurations (based on validation accuracy) passed from the previous stage are first evaluated on the current stage’s training data $T^{s}_{\text{train}}$., where training data $T^{s}_{\text{train}}$ for a given round/stage s is used to train the regression model at that stage/round such that the number of samples in the training data discloses a selected duration for the training.);
calculating a running … based on the evaluation of the selected hyperparameter value sets ([p. 2113, Section 2.2, p. 2114, Section 2.3, Algorithm 1], A common acquisition function is the expected improvement, EI, over best validation accuracy seen so far $L^{*}$ [a running validation accuracy]: $a(\lambda, V) = \int_{-\infty}^{\infty} \max(L - L^{*}, 0)\, p_{V}(L \mid \lambda)\, dL$ where $p_{V}(L \mid \lambda)$ denotes the probability of accuracy L given configuration λ, which is encoded by the probabilistic regression model V. The acquisition function is used to identify the next candidate (the one with the highest expected improvement over current best $L^{*}$)., the standard Bayesian Optimization are initialized with these k settings and applied for $Y_{s} - k$ iterations on $T^{s}_{\text{train}}$. … After running all S stages the algorithm terminates, and outputs the configuration with the highest validation accuracy from all hyper-parameters explored by all stages (including the initialization points explored by the first stage)., wherein a running validation accuracy is computed across successive stages/rounds for each of the hyperparameter value sets/configurations.) …
creating suggested hyperparameter value sets based on the calculated running … and a predicted … against the training data ([p. 2113, Section 2.2, p. 2114, Section 2.3, Algorithm 1] Model-based Bayesian Optimization starts with an initial set of hyperparameter settings $\lambda_{1}, \ldots, \lambda_{n}$, where each setting denotes a set of assignments to all hyperparameters. These initial settings are then evaluated on the validation data and their accuracies are recorded. The algorithm then proceeds in rounds to iteratively fit a probabilistic regression model V to the recorded accuracies. A new hyperparameter configuration is then suggested by the regression model V with the help of acquisition function. Then the accuracy of the new setting is evaluated on validation data, which leads to the next iteration., wherein the regression model (with the acquisition function) suggests, for each round/stage, a (new) set of hyperparameter configurations (value sets) based on previously observed/learned (validation accuracy) responses to predict (validation accuracy/improvement) responses to training data (where it is noted that Wang at [p. 2114, Section 2.3, Algorithm 1] discloses this process in the Bayesian Optimization step).);
evaluating the suggested hyperparameter value sets against the training data for the predictive learning model for the selected duration ([p. 2114, Section 2.3, Algorithm 1] wherein Algorithm 1 discloses evaluating the suggested hyperparameter value sets against the training data for the predictive model for the selected duration: $L_{j} = \text{Evaluate } L(\lambda_{j}, T^{s}_{\text{train}}, T_{\text{valid}})$.);
updating the running … based on the evaluation of the suggested hyperparameter value sets ([p. 2113, Section 2.2, Algorithm 1] The algorithm then proceeds in rounds to iteratively fit a probabilistic regression model V to the recorded accuracies. A new hyperparameter configuration is then suggested by the regression model V with the help of acquisition function (Brochu et al., 2010). Then the accuracy of the new setting is evaluated on validation data, which leads to the next iteration., wherein the Bayesian (Gaussian process) regression model is updated based upon the evaluation of the (suggested) hyperparameter configurations/value sets at each stage such that this is an update in the (predictive) validation accuracy (metric for selecting the hyperparameter configurations).);
increasing the number of … in the selected duration ([p. 2114, Section 2.3, Algorithm 1] The multi-stage algorithm as shown in Algorithm 1 is an extension of the standard Bayesian Optimization (Section 2.2) to enable speed on large-scale datasets. It proceeds in multiple stages of Bayesian Optimization with increasingly amounts of training data $|T^{1}_{\text{train}}| \leq \ldots \leq |T^{S}_{\text{train}}|$., wherein the increasing amounts of training data per stage discloses increasing the selected duration.); and
selecting a number of hyperparameter value sets from the selected hyperparameter value sets and the suggested hyperparameter value sets ([p. 2114, Section 2.3, Algorithm 1] the top k configurations based on validation accuracy are used to initialize the next stage’s run., wherein a number of hyperparameter value sets (k) is selected from the previously selected hyperparameter configurations/value sets and the suggested hyperparameter configurations/value sets.); and
after performing the hyperparameter tuning evaluations for the identified number of rounds, selecting the best performing hyperparameter value set as indicated by a lowest expected variance when evaluated against the training data ([p. 2114, Section 2.3, Algorithm 1] After running all S stages the algorithm terminates, and outputs the configuration with the highest validation accuracy from all hyper-parameters explored by all stages (including the initialization points explored by the first stage)., wherein, after performing the hyperparameter tuning evaluations for the identified number of rounds, the hyperparameter optimization framework selects the best performing hyperparameter configuration/value set when evaluated against the training data and wherein it is implicit that the highest validation accuracy indicates a lowest error rate, or a lowest expected variance.).
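For illustration only, the multi-stage procedure of Wang's Algorithm 1, as characterized above, can be sketched as follows. This is a simplified sketch under stated assumptions and is not Wang's implementation: `evaluate` is a hypothetical toy objective standing in for L(λ, T_train^s, T_valid), each configuration is a single number, the first stage is initialized by uniform random sampling rather than a Sobol sequence, each stage's subset is a prefix of the data rather than a random sample, and the Bayesian optimization step is replaced by random proposals.

```python
import random


def evaluate(config, train_subset):
    """Hypothetical toy objective standing in for L(lambda, T_train^s, T_valid)."""
    # Accuracy peaks at config == 0.3 and improves slightly with more training data.
    return 1.0 - (config - 0.3) ** 2 - 0.1 / len(train_subset)


def multi_stage_search(data, stage_sizes, k, iters_per_stage, seed=0):
    """Run one stage per entry of stage_sizes, carrying the top-k configs forward."""
    rng = random.Random(seed)
    # Assumption: uniform random initialization (Wang uses a Sobol sequence).
    carried = [rng.random() for _ in range(k)]
    explored = []  # (accuracy, config) pairs from every stage
    for size in stage_sizes:  # increasing amounts of training data per stage
        subset = data[:size]
        # Re-evaluate the k configurations passed from the previous stage.
        stage = [(evaluate(c, subset), c) for c in carried]
        # Assumption: random proposals stand in for the Y_s - k Bayesian
        # optimization iterations of the stage.
        for _ in range(iters_per_stage - k):
            c = rng.random()
            stage.append((evaluate(c, subset), c))
        explored.extend(stage)
        # The top-k configurations by validation accuracy seed the next stage.
        carried = [c for _, c in sorted(stage, reverse=True)[:k]]
    # Output the configuration with the highest accuracy over all stages.
    return max(explored)


best_accuracy, best_config = multi_stage_search(
    list(range(100)), stage_sizes=[10, 50, 100], k=3, iters_per_stage=8
)
```

In this sketch the growing `stage_sizes` play the role of the increasing training-set sizes that the rejection maps to the claimed selected duration.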
However, Wang does not explicitly teach … the selected duration including a number of epochs … variance … the running variance based at least on a cross-correlation between evaluated hyperparameters … variance … variance … increasing the number of epochs in the selected duration … variance. In other words, although Wang discloses a metric for predictively modelling the performance of hyperparameter configurations according to a (running) validation accuracy metric that is computed/updated at each stage/round across the (selected and suggested) hyperparameter configurations and which is used to select and suggest hyperparameter configurations for a subsequent round, Wang does not explicitly disclose that this metric is a variance and likewise does not disclose a variance based on cross-correlation metrics. Although Wang teaches the increase in the duration of training based on an increase in the size of the training set, Wang does not explicitly disclose that this increase in the duration of the training corresponds to an increase in a number of epochs.
However,  Swersky, in the analogous environment of sequential hyperparameter optimization, teaches calculating a running variance based on the evaluation of the selected hyperparameter value sets, the running variance based at least on a cross-correlation between evaluated hyperparameters, creating suggested hyperparameter value sets based on the calculated running variance and a predicted variance against the training data, … updating the running variance based on the evaluation of the suggested hyperparameter value sets ([p. 2, Section 2.1, p. 3, Section 2.3, p. 4, Section 3.2] The predictive mean and covariance under a GP can be respectively expressed as: <equations 1, 2> Here K(X, x) is the N-dimensional column vector of cross-covariances between x and the set X. The N × N matrix K(X, X) is the Gram matrix for the set X., A standard approach is to select the next point to query by finding the maximum of an acquisition function a(x ; {xn, yn}, θ) over a bounded domain in X . This is an heuristic function that uses the posterior mean and uncertainty, conditioned on the GP hyperparameters θ, in order to balance exploration and exploitation. There have been many proposals for acquisition functions, or combinations thereof [16, 2]. We will use the expected improvement criterion (EI) [15, 17], <equations 4, 5>…. Here, we formulate the entropy search problem as that of selecting the next point from a pre-specified candidate set. Given a set of C points X˜ ⊂ X , we can write the probability of a point x ∈ X˜ having the minimum function value among the points in X˜ via: <equation 6>., We wish to optimize the average performance over all k folds, but it may not be necessary to actually evaluate all of them in order to identify the quality of the hyperparameters under consideration. The predictive mean and variance of the average objective are given by: <equation 8> … We choose a (x, t) pair using a two-step heuristic. 
First we impute missing observations using the predictive means. We then use the estimated average function to pick a promising candidate x by optimizing EI. Conditioned on x, we then choose the task that yields the highest single-task expected improvement., wherein a framework for hyperparameter optimizations uses a Gaussian Process regression to represent the response surface for those hyperparameters such that hyperparameter sets are suggested for evaluation based upon the GP model predictive of (EI or entropy based performance with exploitation and explorations corresponding to sets based on previously best sets/response surface points or regions and suggested new sets/points or regions)  based upon at least the cross-variance between sets of hyperparameters (e.g., sigma computation in equation 2) that is used in the objective function for identifying/suggesting those sets (equations 4, 5, 8) such that the variance/cross-covariance is a running variance/cross-covariance that is developed/updated across successive iterations/rounds of the hyperparameter optimization framework and such that the cross-covariance is in a general sense a cross-correlation (i.e., it represents an association between different hyperparameter sets over time) but also implicitly includes, in a more narrow sense, the cross-correlation (relatable to the cross-covariance through the distribution means).) 
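For illustration only, the predictive mean and variance of a Gaussian process, which the quoted Swersky passage expresses in terms of the cross-covariance vector K(X, x) and the Gram matrix K(X, X), can be sketched as follows. The sketch assumes one-dimensional inputs, a zero prior mean, and a squared-exponential kernel; the function names `kernel`, `solve`, and `gp_predict` are hypothetical, and a small noise term is added to the diagonal for numerical stability. It is not Swersky's multi-task model.

```python
import math


def kernel(a, b, length=1.0):
    """Squared-exponential covariance (an assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def gp_predict(X, y, x_new, noise=1e-6):
    """Posterior mean and variance at x_new given observations (X, y)."""
    # Gram matrix K(X, X) with a small diagonal noise term.
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    # Cross-covariances K(X, x) between the observed points and x_new.
    k_star = [kernel(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = kernel(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, var


mean_near, var_near = gp_predict([0.0, 1.0, 2.0], [0.5, 0.8, 0.6], 1.0)
mean_far, var_far = gp_predict([0.0, 1.0, 2.0], [0.5, 0.8, 0.6], 10.0)
```

Under these assumptions the variance is close to zero at an already evaluated point and approaches the prior variance far from all observations, so the variance changes each time a new evaluation is added to `X` and `y`.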
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Swersky to calculate a running variance based on the evaluation of the selected hyperparameter value sets, the running variance based at least on a cross-correlation between evaluated hyperparameters, to create suggested hyperparameter value sets based on the calculated running variance and a predicted variance against the training data, … and to update the running variance based on the evaluation of the suggested hyperparameter value sets.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved efficiency with improved or at least comparable hyperparameter tuning performance to other methods by iteratively using GP regression in SMBO, particularly for large datasets, in which the GP model includes running variance/cross-covariance/cross-correlation values (over response surface space) to iteratively select those hyperparameter sets (Swersky, [Abstract, p. 2, Section 2.1, p. 8, Section 5, Figure 5]). 
However, Wang and Swersky do not explicitly teach …the selected duration including a number of epochs … number of epochs …. Swersky does not explicitly disclose a modification in the number of epochs such as to avoid extraneous training iterations on poorer-performing hyperparameter sets.
However, Swersky2, in the analogous environment of hyperparameter search optimization, teaches evaluating the selected hyperparameter value sets against a training data for the predictive learning model for a selected duration, the selected duration including a number of epochs; … evaluating the suggested hyperparameter value sets against the training data for the predictive learning model for the selected duration; updating the running variance based on the evaluation of the suggested hyperparameter value sets; increasing the number of epochs in the selected duration; ([pp. 7-9, Section 6, Figure 4], For each of these tasks, we allowed the method of [20] to select the number of training epochs to run, as a hyperparameter to be optimized between 1 and 100, and report at each epoch the cumulative number of epochs run and the lowest objective value observed over all epochs. … In Figure 4, we show a visualization of the progression of our Bayesian optimization procedure on the PMF problem. We observe here and throughout the empirical analysis that the method generally initially explored the hyperparameter space by running only a small number of epochs for various hyperparameter settings. However, once it found a promising curve, it would run it out for more epochs. Later in the optimization, the method would frequently revisit existing curves and extend them for a few epochs at a time, as we observe in Figure 4a., wherein a hyperparameter search method evaluates the performance of a selected set of hyperparameters for a (selected) duration corresponding to a number of epochs (in which the number of epochs is determined over the course of the iterative procedure with a number of epochs extending from 1 to 100 but in general corresponds to the cumulative training each time a hyperparameter set is (re)visited for training) such that this number of epochs is increased during the course of training for those hyperparameters that are most promising; in other words, as shown in Figure 4, the selected duration of a number of epochs for each round of evaluation in Swersky2 corresponds to not just the number of epochs in a current iteration of training but also to all of the preceding epochs used up to that point for the generation and prediction of the loss curve (i.e., the number of epochs of training in each round/iteration is increasing for the hyperparameters selected as being the most promising) and wherein it is noted that Swersky2, like Swersky, also teaches the computation of a running variance of the GP-based performance characterization of hyperparameter sets (but as a function of epoch number).)
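For illustration only, the behavior described in the quoted Swersky2 passage (a few epochs for many settings, then more epochs for a promising curve) can be sketched with a simple greedy rule. This is not Swersky2's freeze-thaw algorithm, which models training curves with a Gaussian process; `loss_after` is a hypothetical training curve, and `allocate_epochs` and its `step` parameter are assumptions made for the sketch.

```python
def loss_after(config, epochs):
    """Hypothetical training curve: loss decays toward a config-dependent floor."""
    floor = (config - 0.3) ** 2
    return floor + 1.0 / (1 + epochs)


def allocate_epochs(configs, rounds, step=5):
    """Give every configuration one epoch, then extend the best curve each round."""
    epochs = {c: 1 for c in configs}  # explore: one epoch per configuration
    for _ in range(rounds):
        # Exploit: pick the configuration whose current loss is lowest
        # and run it for `step` more epochs.
        best = min(epochs, key=lambda c: loss_after(c, epochs[c]))
        epochs[best] += step
    return epochs


allocation = allocate_epochs([0.1, 0.3, 0.9], rounds=4)
# allocation == {0.1: 1, 0.3: 21, 0.9: 1}
```

The cumulative epoch count of the most promising configuration grows from round to round, which is the sense in which the rejection reads the selected duration as an increasing number of epochs.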
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang and Swersky to incorporate the teachings of Swersky2 to evaluate the selected hyperparameter value sets against a training data for the predictive learning model for a selected duration, the selected duration including a number of epochs and to increase the number of epochs in the selected duration.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved efficiency/rapidity of hyperparameter search performance by assigning training epochs according to the observed performance of hyperparameter sets with the more promising ones being allocated more training epochs, particularly in a Bayesian optimization framework (Swersky2, [pp. 7-9, Section 6, p. 9, Section 7, Figure 3, Figure 4]). 

In regards to claim 2, the rejection of claim 1 is incorporated and Wang further teaches wherein the number of rounds is based on a number of hyperparameters to evaluate ([p. 2113, Section 1, p. 2114, Section 2.3], The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings., In practice, the number of stages S [the number of rounds] and the value of k depend on the quantity of the data and the quality of stage-wise model [which includes a number of hyperparameters to evaluate]., wherein the number of stages/rounds depends upon quality and quantity (which includes a number of hyperparameters) to evaluate).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Swersky and Swersky2 for the same reasons as pointed out for claim 1.

In regards to claim 3, the rejection of claim 1 is incorporated and Wang further teaches wherein the selected duration includes … of evaluating the selected hyperparameter value sets and the suggested hyperparameter value sets against a training data for the predictive learning model ([p. 2114, Section 2.3 and Algorithm 1], During each stage s, the k best configurations (based on validation accuracy) passed from the previous stage are first evaluated on the current stage’s training data $T^{s}_{\text{train}}$ [where the combined number of samples in $T^{s}_{\text{train}}$ and $T_{\text{valid}}$ includes at least two iterations of evaluating], and then the standard Bayesian Optimization algorithm are initialized with these k settings and applied for $Y_{s} - k$ iterations on $T^{s}_{\text{train}}$
                     (discounting the k evaluations done earlier in the stage), where Y_s is the total number of iterations for stage s. Then the top k configurations based on validation accuracy are used to initialize the next stage’s run [where the combined number of samples in                         
                            
                                
                                    T
                                
                                
                                    t
                                    r
                                    a
                                    i
                                    n
                                
                                
                                    s
                                
                            
                        
                     and                         
                            
                                
                                    T
                                
                                
                                    v
                                    a
                                    l
                                    i
                                    d
                                
                                
                            
                        
                     includes at least two iterations of evaluating]. … In our experiments, we empirically choose their values to be S=2 and k=3 which result in a good balance between accuracy and speed on the given datasets., wherein the number of rounds is S=2 (i.e., distinct training data sets are applied for each of two successive rounds).).
However, Wang and Swersky do not explicitly teach …two epochs…. Although Wang teaches the increase in the duration of training based on an increase in the size of the training set, he does not explicitly disclose that this increase in the duration of the training corresponds to an increase in a number of epochs. Swersky does not explicitly disclose a modification in the number of epochs such as to avoid extraneous training iterations on poorer-performing hyperparameter sets.
However, Swersky2, in the analogous environment of hyperparameter search optimization, teaches wherein the selected duration includes two epochs of evaluating the selected hyperparameter value sets and the suggested hyperparameter value sets against a training data for the predictive learning model ([pp. 7-9, Section 6, Figure 4], For each of these tasks, we allowed the method of [20] to select the number of training epochs to run, as a hyperparameter to be optimized between 1 and 100, and report at each epoch the cumulative number of epochs run and the lowest objective value observed over all epochs. … In Figure 4, we show a visualization of the progression of our Bayesian optimization procedure on the PMF problem. We observe here and throughout the empirical analysis that the method generally initially explored the hyperparameter space by running only a small number of epochs for various hyperparameter settings. However, once it found a promising curve, it would run it out for more epochs. Later in the optimization, the method would frequently revisit existing curves and extend them for a few epochs at a time, as we observe in Figure 4a., wherein a hyperparameter search method evaluates the performance of a selected set of hyperparameters for a (selected) duration corresponding to a number of epochs: each time the GP-based hyperparameter performance is updated/quantified (for more promising hyperparameter sets), training is run for a few epochs (interpreted as including 2 epochs); alternatively, the performance is determined over the course of the iterative procedure with a number of epochs extending from 1 to 100 (i.e., a range that includes 2)).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang and Swersky to incorporate the teachings of Swersky2 such that the selected duration includes two epochs of evaluating the selected hyperparameter value sets and the suggested hyperparameter value sets against a training data for the predictive learning model.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved efficiency/rapidity of hyperparameter search performance by assigning training epochs according to the observed performance of hyperparameter sets, with the more promising ones being allocated more training epochs, particularly in a Bayesian optimization framework (Swersky2, [pp. 7-9, Section 6, p. 9, Section 7, Figure 3, Figure 4]).
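For illustration only, the stage-wise procedure quoted from Wang above (carry the k best configurations into the next stage, evaluate them on that stage's training data, then spend the remaining Y_s−k iterations on new candidates) can be sketched as follows. This is a minimal sketch and not Wang's implementation: the names `multi_stage_search`, `evaluate`, `sample_config`, `stage_fractions`, and `iters_per_stage` are hypothetical, and plain random sampling stands in for the Bayesian Optimization proposal step.

```python
import random


def multi_stage_search(evaluate, sample_config, stage_fractions, iters_per_stage, k, seed=0):
    """Run one search stage per entry of stage_fractions.

    evaluate(config, fraction) -> validation score (higher is better) after
    training on the given fraction of the training data (hypothetical callback).
    sample_config(rng) -> a new candidate configuration (hypothetical callback;
    random sampling replaces the Bayesian Optimization proposal step here).
    """
    rng = random.Random(seed)
    carried = []   # top-k configurations passed from the previous stage
    explored = []  # (score, config) pairs seen across all stages
    for fraction, total_iters in zip(stage_fractions, iters_per_stage):
        # First re-evaluate the carried-over configurations on this stage's data.
        scored = [(evaluate(c, fraction), c) for c in carried]
        # Then spend the remaining budget (Y_s - k) on new candidates.
        for _ in range(total_iters - len(carried)):
            candidate = sample_config(rng)
            scored.append((evaluate(candidate, fraction), candidate))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        carried = [c for _, c in scored[:k]]
        explored.extend(scored)
    # Output the best configuration explored by any stage.
    return max(explored, key=lambda pair: pair[0])[1]
```

With `stage_fractions=[0.3, 1.0]`, `iters_per_stage=[10, 10]`, and `k=3`, the sketch mirrors the S=2, k=3 setting quoted above: ten evaluations on 30% of the data, then three carried-over plus seven new evaluations on the full data.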

In regards to claim 4, the rejection of claim 3 is incorporated and Wang further teaches wherein the selected duration increases linearly ([p. 2115, Section 3.1], the multi-stage method uses the same 30% training data at the initial stage, and full training data at the subsequent stage., wherein an increase from 30% to 100% of training data in one experiment discloses the selected duration increasing linearly.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Swersky and Swersky2 for the same reasons as pointed out for claims 1 and 3 respectively.

In regards to claim 6, the rejection of claim 1 is incorporated and Wang further teaches wherein the number of rounds is based on a temporal factor for identifying a finalist grouping of hyperparameters ([p. 2114, Section 2.3], The multi-stage algorithm as shown in Algorithm 1 is an extension of the standard Bayesian Optimization (Section 2.2) to enable speed on large-scale datasets. … After running all S stages the algorithm terminates, and outputs the configuration with the highest validation accuracy from all hyper-parameters explored by all stages (including the initialization points explored by the first stage). … In our experiments, we empirically choose their values to be S=2 and k=3 which result in a good balance between accuracy and speed on the given datasets., wherein the number of stages/rounds is based on speed considerations (a temporal factor) for arriving at the final set of tuned hyperparameters.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Swersky and Swersky2 for the same reasons as pointed out for claim 1.

In regards to claim 9, the rejection of claim 1 is incorporated and Wang further teaches, wherein selecting the number of hyperparameter value sets from the selected hyperparameter value sets and the suggested hyperparameter value sets further comprises selecting a specified percentage of the hyperparameter value sets ([p. 2114, Section 2.3 and Algorithm 1], the top k configurations based on validation accuracy are used to initialize the next stage’s run., wherein it is implicit that the top k configurations (the number of hyperparameter value sets from the selected and suggested hyperparameter value sets) out of the Y_s number of configurations, as shown in Algorithm 1, discloses a specified percentage of the hyperparameter value sets.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Swersky and Swersky2 for the same reasons as pointed out for claim 1.

Claim 10 is rejected because it is just a system implementation of the same subject matter of claim 1 which can be found in Wang, Swersky, and Swersky2. It is noted that Claim 10 in addition recites a data processor and memory which can also be found in Wang ([p. 2113, Sections 1 and 2] The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings. … The new multi-stage Bayesian Optimization is a generalization of the standard Bayesian Optimization for hyper-parameter learning. It is designed to scale standard Bayesian Optimization to large amounts of training data., wherein it is implicit that a processing unit and memory including computer readable instructions would be necessary to perform the disclosed method in Wang). 

Claim 11/10 is rejected because it is just a system implementation of the same subject matter of claim 2/1 which can be found in Wang, Swersky, and Swersky2.

Claim 12/10 is rejected because it is just a system implementation of the same subject matter of claim 3/1 which can be found in Wang, Swersky, and Swersky2.

Claim 13/12 is rejected because it is just a system implementation of the same subject matter of claim 4/3 which can be found in Wang, Swersky, and Swersky2.

Claim 15/10 is rejected because it is just a system implementation of the same subject matter of claim 6/1 which can be found in Wang, Swersky, and Swersky2.

Claim 18/10 is rejected because it is just a system implementation of the same subject matter of claim 9/1 which can be found in Wang, Swersky, and Swersky2.

Claim 19 is rejected because it is just a computer readable storage device implementation of the same subject matter of claim 1 which can be found in Wang, Swersky, and Swersky2. It is noted that Claim 19 in addition recites a computer readable storage device which can also be found in Wang ([p. 2113, Sections 1 and 2] The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings. … The new multi-stage Bayesian Optimization is a generalization of the standard Bayesian Optimization for hyper-parameter learning. It is designed to scale standard Bayesian Optimization to large amounts of training data., wherein it is implicit that a processing unit executing computer readable instructions in a computer readable storage device would be necessary to perform the disclosed method in Wang). 

Claim 20/19 is rejected because it is just a computer readable storage device implementation of the same subject matter of claim 3/1 which can be found in Wang, Swersky, and Swersky2.

Claims 5, 8, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Swersky, in view of Swersky2 and in further view of Luo et al. (“MLBCD: a machine learning tool for big clinical data,” 2015, Health Information Science and Systems, 3:3, pp. 1-19), hereinafter referred to as Luo.

In regards to claim 5, the rejection of claim 3 is incorporated and Wang, Swersky, and Swersky2 do not further teach, wherein the selected duration increases exponentially. The increase in the training data set size in Wang is not disclosed as being exponential. Neither Swersky nor Swersky2 discloses an increase in training set size.
However, Luo, in the analogous environment of hyperparameter optimization, teaches wherein the selected duration increases exponentially ([p. 9, The training and test samples], As shown in Fig. 3, the training sample expands from one round to the next. An effective expansion method is to increase the training size exponentially, e.g., double the training size each round., wherein the training set size is increased exponentially from round-to-round).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang, Swersky, and Swersky2 to incorporate the teachings of Luo for the selected duration to increase exponentially.  The modification would have been obvious because one of ordinary skill in the art before the effective filing date would have been motivated to improve training efficiency by increasing the selected duration of training exponentially (with this configuration of iterative training duration increase also yielding predictable results) (Luo, [p. 9, “The training and test samples”]).
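For illustration only, the exponential expansion quoted from Luo (doubling the training size each round) can be sketched as a simple schedule. The function name `training_sizes` and its parameters are hypothetical, and capping each round at the total number of available samples is an assumption not stated in the quoted passage.

```python
def training_sizes(initial, rounds, total, factor=2):
    """Return the training-sample size used in each round.

    The size is multiplied by `factor` from one round to the next
    (factor=2 doubles it) and is capped at `total`, the number of
    samples available (the cap is an assumption of this sketch).
    """
    sizes = []
    size = initial
    for _ in range(rounds):
        sizes.append(min(size, total))
        size *= factor
    return sizes


print(training_sizes(1000, 5, 20000))  # [1000, 2000, 4000, 8000, 16000]
```

By contrast, a linear schedule would add a fixed number of samples each round rather than multiply by a fixed factor.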

In regards to claim 8, the rejection of claim 1 is incorporated and Wang and Swersky do not further teach wherein the number of rounds is based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters. Neither Wang nor Swersky determines a number of stages of training based on a confidence level of the hyperparameters. Although Swersky2 discloses that the representation of the uncertainty (e.g., covariance) in the GP performance model affects the determination of whether or not to continue with training, he does not explicitly disclose that a number of stages of training is based on a confidence level of the hyperparameters.
However, Luo, in the analogous environment of hyperparameter optimization, teaches wherein the number of rounds is based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters ([p. 9, The accuracy difference threshold, pp. 10-11, Iterations of the search process], The accuracy difference threshold τ is used to eliminate unpromising machine learning algorithms and identify unpromising combinations of hyper-parameter values., We repeat the above process [see section A subsequent round that is not the final one] for a pre-determined number of rounds (e.g., 5) until the accuracy difference threshold τ reaches a pre-determined minimum value, such as 0.05. … After τ reaches the pre-determined minimum value, each pair of a remaining promising algorithm and a combination of hyper-parameter values has similar potential. The pair achieving the highest accuracy is the best one found., wherein the number of rounds is determined according to the accuracy difference (a confidence) reaching a minimum value such that the pair of hyperparameters (finalist grouping) achieving the highest accuracy is selected as the best set.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang, Swersky, and Swersky2 to incorporate the teachings of Luo for the number of rounds to be based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters.  The modification would have been obvious because one of ordinary skill in the art before the effective filing date would have been motivated to improve training efficiency by increasing the selected duration of training exponentially (with this configuration of iterative training duration increase also yielding predictable results), with the termination of this iterative hyperparameter selection across expanding training sets based upon those hyperparameters reaching a particular quality/confidence (Luo, [p. 9, “The training and test samples”, pp. 10-11, Iterations of the search process]).
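For illustration only, the threshold-driven termination quoted from Luo (rounds repeat until the accuracy difference threshold τ reaches a pre-determined minimum, after which the most accurate remaining candidate is taken as the best) can be sketched as follows. The names `prune_rounds`, `accuracy_at_round`, `tau_start`, `tau_min`, and `decay` are hypothetical, and the rule that τ shrinks by a constant factor each round is an assumption of this sketch; the quoted passage does not state how τ is reduced.

```python
def prune_rounds(candidates, accuracy_at_round, tau_start, tau_min, decay=0.5):
    """Prune candidates round by round until tau reaches tau_min.

    candidates: hashable identifiers for hyperparameter value sets.
    accuracy_at_round(candidate, round_number) -> accuracy (hypothetical callback).
    A candidate is dropped when its accuracy trails the round's best by more
    than tau. tau is multiplied by `decay` after each round (assumed rule).
    Returns (best remaining candidate, number of rounds run).
    """
    tau = tau_start
    rounds = 0
    survivors = list(candidates)
    while True:
        rounds += 1
        scores = {c: accuracy_at_round(c, rounds) for c in survivors}
        best = max(scores.values())
        survivors = [c for c in survivors if best - scores[c] <= tau]
        if tau <= tau_min:
            break  # threshold has reached its minimum: this was the final round
        tau = max(tau * decay, tau_min)
    winner = max(survivors, key=lambda c: scores[c])
    return winner, rounds
```

In this sketch the number of rounds is not fixed in advance by the caller; it follows from how quickly τ reaches its minimum value.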

Claim 14/10 is rejected because it is just a system implementation of the same subject matter of claim 5/1 which can be found in Wang, Swersky, Swersky2, and Luo.

Claim 17/10 is rejected because it is just a system implementation of the same subject matter of claim 8/1 which can be found in Wang, Swersky, Swersky2, and Luo.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Jamieson et al., (“Non-Stochastic Best Arm Identification and Hyperparameter Optimization”,  https://arxiv.org/pdf/1502.07943.pdf, arXiv:1502.07943v1 [cs.LG] 27 Feb 2015, pp. 1-13) teach the modification of a number of epochs/iterations of training in hyperparameter search according to an overall budget such that the number of iterations doubles each time the number of hyperparameter sets to be searched is halved.
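For illustration only, the budget allocation attributed to Jamieson above (the number of iterations doubles each time the number of hyperparameter sets is halved) can be sketched as a schedule of (sets remaining, iterations per set) pairs. The function name `halving_schedule` and its parameters are hypothetical, and rounding up when halving an odd count is an assumption of this sketch.

```python
def halving_schedule(num_configs, initial_iters):
    """Return a list of (configs_remaining, iterations_per_config) pairs.

    Each step halves the number of configurations (rounding up, an
    assumption of this sketch) and doubles the iterations given to each.
    """
    schedule = []
    n, r = num_configs, initial_iters
    while n >= 1:
        schedule.append((n, r))
        if n == 1:
            break
        n = (n + 1) // 2  # keep the better half
        r *= 2            # survivors train twice as long
    return schedule


print(halving_schedule(8, 1))  # [(8, 1), (4, 2), (2, 4), (1, 8)]
```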

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ROBERT LEWIS KULP/Examiner, Art Unit 2124                                                                                                                                                                                                        
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124