DETAILED ACTION
This non-final rejection is responsive to the request for continued examination filed 05 January 2021.
Claims 1, 10, and 19 are amended. Claims 7 and 16 are cancelled. No claims have been added or withdrawn. Therefore, claims 1-6, 8-15, and 17-20 are presently pending.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 05 January 2021 has been entered.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-4, 6, 9-13, 15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Efficient Hyper-parameter Optimization for NLP Applications,” 2015, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2112-2117) (“Wang”) in view of Hutter et al. (“Sequential Model-Based Optimization for General Algorithm Configuration,” 2011, LION 5, LNCS 6683, pp. 507-523) (“Hutter”).
Regarding claim 1, Wang teaches a method for providing hyperparameter tuning relating to a predictive learning model, comprising: 
identifying a number of rounds to perform hyperparameter tuning evaluations (Wang, p. 2114, Section 2.3 and Algorithm 1, “This multi-stage algorithm subsumes the standard Bayesian optimization algorithm as a special case when the total number of stages S=1. In our case, for datasets used at stages 1, …, S−1, we use random sampling of full training data to get subsets of data required at these initial stages, while stage S has full data.”); 
performing the hyperparameter tuning evaluations for the identified number of rounds (Wang, p. 2114, Algorithm 1 discloses performing hyperparameter tuning evaluations for the identified number of rounds, which is denoted by number of stages S.), 
each round of the hyperparameter tuning evaluations including the steps of: 
retrieving a selected number of hyperparameter value sets (Wang, p. 2114, Section 2.3 and Algorithm 1, “During each stage s, the k best configurations (based on validation accuracy) [a selected number of hyperparameter value sets] passed from the previous stage¹ are first evaluated on the current stage’s training data $T_{train}^{s}$. … ¹A special case is the initial stage. We adopt the convention that a Sobol sequence is used to initialize the first stage. The value k for the first stage is the number of points in the Sobol sequence.”); 
evaluating the selected hyperparameter value sets against a training data for the predictive learning model for a selected duration (Wang, p. 2113, Section 2.1, “Let $\lambda = \{\lambda_1, \ldots, \lambda_m\}$ denote the hyperparameters of a machine learning algorithm, and let $\{\Lambda_1, \ldots, \Lambda_m\}$ denote their respective domains. When trained with $\lambda$ on training data $T_{train}$, the validation accuracy on $T_{valid}$ is denoted as $L(\lambda, T_{train}, T_{valid})$.” Wang, p. 2114, Section 2.3 and Algorithm 1, “During each stage s, the k best configurations [the selected number of hyperparameter value sets] (based on validation accuracy) passed from the previous stage are first evaluated on the current stage’s training data $T_{train}^{s}$ [a training data, where the number of samples in the training data discloses a selected duration].”); 
calculating a running [validation accuracy] based on the evaluation of the selected hyperparameter value sets (Wang, p. 2113, Section 2.2, “A common acquisition function is the expected improvement, EI, over best validation accuracy seen so far L* [a running validation accuracy]: $a(\lambda, V) = \int_{-\infty}^{\infty} \max(L - L^{*}, 0)\, p_V(L \mid \lambda)\, dL$ where $p_V(L \mid \lambda)$ denotes the probability of accuracy L given configuration λ, which is encoded by the probabilistic regression model V. The acquisition function is used to identify the next candidate (the one with the highest expected improvement over current best L*).” Wang, p. 2114, Section 2.3 and Algorithm 1, “the standard Bayesian Optimization algorithm [which involves calculating the running validation accuracy] are initialized with these k settings [the selected hyperparameter value sets] and applied for Y_s−k iterations on $T_{train}^{s}$. … After running all S stages the algorithm terminates, and outputs the configuration with the highest validation accuracy from all hyper-parameters explored by all stages (including the initialization points explored by the first stage).”); 
creating suggested hyperparameter value sets based on the calculated running [validation accuracy] (Wang, pp. 2113, Section 2.2, “Model-based Bayesian Optimization starts with an initial set of hyperparameter settings λ1, … λn, where each setting denotes a set of assignments to all hyperparameters. These initial settings are then evaluated on the validation data and their accuracies are recorded. The algorithm then proceeds in rounds to iteratively fit a probabilistic regression model V to the recorded accuracies. A new hyperparameter configuration is then suggested [creating suggested hyperparameter value sets] by the regression model V with the help of acquisition function [based on the calculated running validation accuracy]. Then the accuracy of the new setting is evaluated on validation data, which leads to the next iteration.” Wang, p. 2114, Section 2.3, Algorithm 1 discloses this process in the Bayesian Optimization step.); 
evaluating the suggested hyperparameter value sets against the training data for the predictive learning model for the selected duration (Wang, p. 2114, Algorithm 1 discloses evaluating the suggested value sets against the training data for the predictive model for the selected duration: “$L_j = \text{Evaluate } L(\lambda_j, T_{train}^{s}, T_{valid})$.”); 
increasing the selected duration (Wang, p. 2114, Section 2.3 and Algorithm 1, “The multi-stage algorithm as shown in Algorithm 1 is an extension of the standard Bayesian Optimization (Section 2.2) to enable speed on large-scale datasets. It proceeds in multiple stages of Bayesian Optimization with increasingly amounts of training data $|T_{train}^{1}| \leq \ldots \leq |T_{train}^{S}|$.” The increasing amounts of training data per stage disclose increasing the selected duration.); and 
selecting a number of hyperparameter value sets from the selected hyperparameter value sets and the suggested hyperparameter value sets (Wang, p. 2114, Section 2.3 and Algorithm 1, “the top k configurations [a number of hyperparameter value sets from the selected and suggested hyperparameter value sets] based on validation accuracy are used to initialize the next stage’s run.”); and 
after performing the hyperparameter tuning evaluations for the identified number of rounds, selecting the best performing hyperparameter value set as indicated by a lowest expected variance when evaluated against the training data (Wang, p. 2114, Section 2.3 and Algorithm 1, “After running all S stages the algorithm terminates [after performing the hyperparameter tuning evaluations for the identified number of rounds], and outputs the configuration with the highest validation accuracy [selecting the best performing hyperparameter value set] from all hyper-parameters explored by all stages (including the initialization points explored by the first stage) [when evaluated against the training data].” It is implicit that the highest validation accuracy indicates a lowest error rate, or a lowest expected variance.).
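For illustration only, the round structure mapped to Wang's Algorithm 1 above can be sketched as follows. This is a minimal sketch under stated assumptions, not Wang's implementation: the function names `evaluate` and `multi_stage_search`, the toy one-dimensional objective, and the stage sizes are all hypothetical, and plain uniform random sampling is swapped in both for the Sobol initialization and for the model-based suggestion step.

```python
import random


def evaluate(config, n_train):
    # Toy stand-in (assumption) for the validation accuracy
    # L(lambda, T_train^s, T_valid): it peaks at config = 0.3 and
    # improves slightly as the training subset grows.
    return 1.0 - (config - 0.3) ** 2 - 0.1 / n_train


def multi_stage_search(stage_sizes, iters_per_stage, k, seed=0):
    rng = random.Random(seed)
    # Initial stage: k starting points. Uniform sampling is substituted
    # here for the Sobol sequence described in Wang.
    carried = [rng.random() for _ in range(k)]
    explored = []  # (score, config) pairs collected from every stage
    for n_train, total_iters in zip(stage_sizes, iters_per_stage):
        # Evaluate the k carried-over configurations on this stage's data.
        scored = [(evaluate(c, n_train), c) for c in carried]
        # Suggest and evaluate new configurations for the remaining
        # total_iters - k iterations. Random suggestions replace the
        # acquisition-function step of Bayesian optimization.
        for _ in range(total_iters - k):
            c = rng.random()
            scored.append((evaluate(c, n_train), c))
        scored.sort(reverse=True)
        carried = [c for _, c in scored[:k]]  # top k seed the next stage
        explored.extend(scored)
    # After all stages, output the best configuration seen in any stage.
    return max(explored)[1]


best = multi_stage_search(stage_sizes=[300, 1000], iters_per_stage=[10, 10], k=3)
```

In this sketch the growing entries of `stage_sizes` play the role of the increasing selected duration, and `k` plays the role of the number of value sets carried between rounds.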
Wang discloses the method comprising calculating a running validation accuracy based on the evaluation of the selected hyperparameter value sets and creating suggested hyperparameter value sets based on the calculated running validation accuracy. 
Wang does not disclose the “running variance” portions of the method, comprising:
…
calculating a running variance based on the evaluation of the selected hyperparameter value sets; [and]
creating suggested hyperparameter value sets based on the calculated running variance.
….
However, Hutter discloses the method comprising:
calculating a running variance based on the evaluation of the selected hyperparameter value sets (Hutter, pp. 514-515, Section 4.3, “To quantify how promising a configuration θ is, it uses the model’s predictive distribution for θ to compute its expected positive improvement (EI(θ)) over the best configuration seen so far (the incumbent). … Specifically, we use the $E[I_{exp}]$ criterion [based on the evaluation of the selected hyperparameter value sets] introduced in [14] for log-transformed costs; given the predictive mean $\mu_\theta$ and variance $\sigma_\theta^2$ of the log-transformed cost of a configuration θ.”); [and]
creating suggested hyperparameter value sets based on the calculated running variance (Hutter, p. 515, Section 4.3, “We compute EI for all [configurations] used in previous target algorithm runs, pick the ten configurations with maximal EI [based on the calculated running variance], and initialize a local search at each of them. To seamlessly handle mixed categorical/numerical parameter spaces, we use a randomized one-exchange neighbourhood, including the set of all configurations that differ in the value of exactly one discrete parameter, as well as four random neighbours for each numerical parameter [creating suggested hyperparameter value sets]. … we use a best improvement search, evaluating EI for all neighbours at once; we stop each local search once none of the neighbours has larger EI.”).
Wang discloses the calculation of a running validation accuracy but does not disclose calculating a running variance that is used in creating suggested hyperparameter value sets. However, Hutter is also directed to calculating an expected improvement in selecting promising configurations for a model and discloses the use of a running variance value in computing this expected improvement. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the running calculation in Wang to include a running variance, as disclosed in Hutter. Doing so “offers an automatic tradeoff between exploitation (focusing on known good parts of the space) and exploration (gathering more information in unknown parts of the space)” and allows evaluations “based on the model predictions $\mu_\theta$ and $\sigma_\theta^2$ without running the target algorithm”; further, “batch model predictions (and thus batch EI computations) for a set of N configurations are much cheaper than separate predictions for N configurations” (Hutter, p. 515, Section 4.3). 
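To make concrete how a predictive variance can enter an expected-improvement calculation, the following hedged sketch evaluates the integral form quoted from Wang under the added assumption that the predictive distribution is Gaussian with a given mean and variance. The function name `expected_improvement` is hypothetical, the sketch maximizes accuracy as in Wang's formula, and it is not the $E[I_{exp}]$ criterion for log-transformed costs that Hutter actually uses.

```python
import math


def expected_improvement(mean, variance, best_so_far):
    # Closed form of the integral of max(L - L*, 0) * p(L) dL when p is
    # Gaussian N(mean, variance); the Gaussian form is an assumption made
    # for this sketch.
    if variance <= 0.0:
        # No uncertainty: improvement is just the positive part of the gap.
        return max(mean - best_so_far, 0.0)
    std = math.sqrt(variance)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


# Two candidates with the same predicted mean below the incumbent:
# the one with the larger variance has the larger expected improvement.
ei_uncertain = expected_improvement(mean=0.8, variance=0.04, best_so_far=0.9)
ei_certain = expected_improvement(mean=0.8, variance=0.0001, best_so_far=0.9)
```

The comparison at the end illustrates the exploitation and exploration tradeoff quoted from Hutter: a larger variance raises the expected improvement of a candidate whose predicted mean is otherwise unremarkable.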

Regarding claim 2, Wang in view of Hutter teaches the method of claim 1, wherein the number of rounds is based on a number of hyperparameters to evaluate (Wang, p. 2113, Section 1, “The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings.” Wang, p. 2114, Section 2.3, “In practice, the number of stages S [the number of rounds] and the value of k depend on the quantity of the data and the quality of stage-wise model [which includes a number of hyperparameters to evaluate].”).

Regarding claim 3, Wang in view of Hutter teaches the method of claim 1, wherein the selected duration includes two iterations of evaluating the selected hyperparameter value sets and the suggested hyperparameter value sets against a training data for the predictive learning model (Wang, p. 2114, Section 2.3 and Algorithm 1, “During each stage s, the k best configurations (based on validation accuracy) passed from the previous stage are first evaluated on the current stage’s training data $T_{train}^{s}$ [where the combined number of samples in $T_{train}^{s}$ and $T_{valid}$ includes at least two iterations of evaluating], and then the standard Bayesian Optimization algorithm are initialized with these k settings and applied for Y_s−k iterations on $T_{train}^{s}$ (discounting the k evaluations done earlier in the stage), where Y_s is the total number of iterations for stage s. Then the top k configurations based on validation accuracy are used to initialize the next stage’s run [where the combined number of samples in $T_{train}^{s}$ and $T_{valid}$ includes at least two iterations of evaluating]. … In our experiments, we empirically choose their values to be S=2 and k=3 which result in a good balance between accuracy and speed on the given datasets.”).

Regarding claim 4, Wang in view of Hutter teaches the method of claim 3, wherein the selected duration increases linearly (Wang, p. 2115, Section 3.1, in one experiment, “the multi-stage method uses the same 30% training data at the initial stage, and full training data at the subsequent stage.” An increase from 30% to 100% of training data discloses the selected duration increasing linearly.).

Regarding claim 6, Wang in view of Hutter teaches the method of claim 1, wherein the number of rounds is based on a temporal factor for identifying a finalist grouping of hyperparameters (Wang, p. 2114, Section 2.3, “The multi-stage algorithm as shown in Algorithm 1 is an extension of the standard Bayesian Optimization (Section 2.2) to enable speed on large-scale datasets. … After running all S stages the algorithm terminates, and outputs the configuration with the highest validation accuracy from all hyper-parameters explored by all stages (including the initialization points explored by the first stage) [identifying a finalist grouping of hyperparameters]. … In our experiments, we empirically choose their values to be S=2 [example of the number of rounds] and k=3 which result in a good balance between accuracy and speed [a temporal factor] on the given datasets.”).

Regarding claim 9, Wang in view of Hutter teaches the method of claim 1, wherein selecting the number of hyperparameter value sets from the selected hyperparameter value sets and the suggested hyperparameter value sets further comprises selecting a specified percentage of the hyperparameter value sets (Wang, p. 2114, Section 2.3 and Algorithm 1, “the top k configurations [the number of hyperparameter value sets from the selected and suggested hyperparameter value sets] based on validation accuracy are used to initialize the next stage’s run.” It is implicit that the top k configurations out of the Y_s number of configurations, as shown in Algorithm 1, discloses a specified percentage of the hyperparameter value sets.).

Regarding claims 10-13, 15, and 18, claims 10-13, 15, and 18 are directed to a system for providing hyperparameter tuning relating to a predictive learning model, comprising a processing unit and a memory including computer readable instructions, which when executed by the processing unit, cause the system to be operable to perform the method recited in claims 1-4, 6, and 9, respectively. Therefore, the rejections of claims 1-4, 6, and 9 are applied to claims 10-13, 15, and 18.
In addition, Wang teaches, “The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings. … The new multi-stage Bayesian Optimization is a generalization of the standard Bayesian Optimization for hyper-parameter learning. It is designed to scale standard Bayesian Optimization to large amounts of training data.” (Wang, p. 2113, Sections 1 and 2). It is implicit that a processing unit and memory including computer readable instructions would be necessary to perform the disclosed method in Wang.

Regarding claims 19-20, claims 19-20 are directed to a computer readable storage device including computer readable instructions, which when executed by a processing unit, perform steps for providing hyperparameter tuning relating to a predictive learning model, comprising the method recited in claims 1 and 3, respectively. Therefore, the rejections of claims 1 and 3 are applied to claims 19-20. 
In addition, Wang teaches, “The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings. … The new multi-stage Bayesian Optimization is a generalization of the standard Bayesian Optimization for hyper-parameter learning. It is designed to scale standard Bayesian Optimization to large amounts of training data.” (Wang, p. 2113, Sections 1 and 2).  It is implicit that a processing unit executing computer readable instructions in a computer readable storage device would be necessary to perform the disclosed method in Wang.

Claims 5, 8, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Hutter, further in view of Luo et al. (“MLBCD: a machine learning tool for big clinical data,” 2015, Health Information Science and Systems, 3:3, pp. 1-19) (“Luo”).
Regarding claim 5, Wang in view of Hutter teaches the method of claim 3.
Neither Wang nor Hutter teaches the method, wherein the selected duration increases exponentially.
However, Luo teaches the method, wherein the selected duration increases exponentially (Luo, p. 9, The training and test samples, “As shown in Fig. 3, the training sample expands from one round to the next. An effective expansion method is to increase the training size exponentially, e.g., double the training size each round.”).
The combination of Wang and Hutter and the disclosure of Luo are directed to sequential optimization for identifying optimal hyper-parameter values. Wang discloses increasing the selected duration linearly but not exponentially. However, Luo discloses increasing the selected duration exponentially. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the increase of selected duration in Wang to increase exponentially, as disclosed in Luo, to yield predictable results of increasing the selected duration.
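For illustration only, the exponential expansion quoted from Luo ("double the training size each round") may be sketched as follows. The function name, the initial size, the number of rounds, and the cap on the sample size are hypothetical and are not taken from Luo, Wang, Hutter, or the claims.

```python
def exponential_schedule(initial_size, num_rounds, max_size, factor=2):
    """Return one training-sample size per round.

    The size is multiplied by `factor` each round (factor=2 doubles it)
    and is capped at `max_size`, the assumed size of the full data set.
    """
    sizes = []
    size = initial_size
    for _ in range(num_rounds):
        sizes.append(min(size, max_size))
        size *= factor
    return sizes


# Hypothetical values: start at 100 samples, 5 rounds, 1000 samples available.
schedule = exponential_schedule(100, 5, 1000)
```

Under these assumed values, the size doubles each round until it reaches the cap, in contrast to a linear schedule that adds a fixed increment per round.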

Regarding claim 8, Wang in view of Hutter teaches the method of claim 1.
Neither Wang nor Hutter explicitly discloses the method, wherein the number of rounds is based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters.
However, Luo teaches the method, wherein the number of rounds is based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters (Luo, p. 9, The accuracy difference threshold, “The accuracy difference threshold τ is used to eliminate unpromising machine learning algorithms and identify unpromising combinations of hyper-parameter values.” Luo, pp. 10-11, Iterations of the search process, “We repeat the above process [see section A subsequent round that is not the final one] for a pre-determined number of rounds (e.g., 5) until the accuracy difference threshold τ reaches a pre-determined minimum value, such as 0.05. … After τ reaches the pre-determined minimum value, each pair of a remaining promising algorithm and a combination of hyper-parameter values has similar potential. The pair [including each of the hyperparameters identified in a finalist grouping of hyperparameters] achieving the highest accuracy is the best one found.”).
The combination of Wang and Hutter and the disclosure of Luo are directed to sequential optimization for identifying optimal hyper-parameter values. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the number of rounds in Wang to be based on a confidence of each of the hyperparameters identified in a finalist grouping of hyperparameters, as disclosed in Luo. One would be motivated to do so, because as “the training sample expands over rounds, we will have an increasingly better idea of the potential of an algorithm and/or combination of hyper-parameter values. To use this property to expedite the process of narrowing down the search space, τ is decreased over rounds. One approach is to perform linear decrease, such as by 0.07 per round, until τ reaches a pre-determined minimum value, such as 0.05.” (Luo, p. 10, The accuracy difference threshold).
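For illustration only, the linear decrease of the accuracy difference threshold τ quoted from Luo (0.07 per round, down to a minimum of 0.05) may be sketched as follows. The starting value of τ, the candidate names, and the accuracies are hypothetical. The elimination rule shown (drop a candidate whose accuracy trails the best by more than τ) is an assumption about how such a threshold could be applied, not a statement of Luo's exact procedure.

```python
def threshold_schedule(initial_tau, step=0.07, min_tau=0.05):
    """Return tau for each round, decreasing linearly until min_tau is reached."""
    taus = []
    tau = initial_tau
    while tau > min_tau:
        taus.append(tau)
        # Round to suppress floating-point drift in the repeated subtraction.
        tau = round(tau - step, 10)
    taus.append(min_tau)  # the final round uses the minimum threshold
    return taus


def prune(accuracies, tau):
    """Keep candidates whose accuracy is within tau of the best (assumed rule)."""
    best = max(accuracies.values())
    return {name: acc for name, acc in accuracies.items() if best - acc <= tau}


# Hypothetical starting threshold and candidate accuracies.
taus = threshold_schedule(0.33)
survivors = prune({"a": 0.90, "b": 0.80, "c": 0.60}, tau=0.12)
```

In this sketch the number of rounds follows from how quickly τ reaches its minimum: a starting value of 0.33 yields five rounds, and the candidates that remain once τ reaches the minimum form the final grouping.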

Regarding claims 14 and 17, claims 14 and 17 are directed to a system for providing hyperparameter tuning relating to a predictive learning model, comprising a processing unit and a memory including computer readable instructions, which when executed by the processing unit, cause the system to be operable to perform the method recited in claims 5 and 8, respectively. Therefore, the rejections made to claims 5 and 8 are applied to claims 14 and 17.
In addition, Wang teaches, “The key intuition behind the proposed approach is that both dataset size and search space of hyperparameter can be large, and applying the Bayesian Optimization algorithm on the data can be both expensive and unnecessary, since many evaluated candidates may not even be within range of best final settings. … The new multi-stage Bayesian Optimization is a generalization of the standard Bayesian Optimization for hyper-parameter learning. It is designed to scale standard Bayesian Optimization to large amounts of training data.” (Wang, p. 2113, Sections 1 and 2). It is implicit that a processing unit and memory including computer readable instructions would be necessary to perform the disclosed method in Wang.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Provost et al. (“Efficient Progressive Sampling,” 1999, ACM, KDD-99, pp. 23-32) (“Provost”) discloses “methods for progressive sampling” and “how best to take into account prior expectations of accuracy convergence” (Provost, Abstract).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CATHERINE F LEE whose telephone number is (571)270-7487.  The examiner can normally be reached on Monday thru Friday, 10:00AM-6:00PM EDT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.F.L./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124