DETAILED ACTION
This is the first office action regarding application number 16/384,588, filed April 15, 2019.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Information Disclosure Statement
The following references listed in the Information Disclosure Statements have not been considered, for the reasons given below:
IDS 10/27/2020: “Machine Learning Approaches for Time Series Data”, dated May 19, 2019, 25 pages. It is not clear where this reference originates, as the reference itself does not list an author or a date, and the examiner was unable to find its origin through a Google search. Applicant is asked to clarify the origin of the reference and, if necessary, provide an updated copy of the reference with the appropriate author and date for further consideration.
IDS 10/27/2020: Ng, “Data preprocessing for machine learning: options and recommendations”, dated Jun 22, 2020, 12 pages. The author of this reference does not appear to be Andrew Ng, as Andrew Ng merely provides a source quotation used in the Introduction section of the reference. There is also no date associated with this reference. Examiner found a variation of this reference through a Google search, last 

Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: Figure 9 BARE HARDWARE block is missing reference character 920, which is specified in paragraph [0110]. 
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification
The disclosure is objected to because of the following informalities:
Paragraph [0041]: The last sentence should reference Figure 1 step 208 instead of step 206 (“Then, inferencing occurs in step 208.”). Appropriate correction is required.
Paragraph [0042]: The first sentence should reference Figure 1 step 206 instead of step 208 (“Steps 202, 204, and 206 are preparatory …”). Appropriate correction is required.

Claim Objections
Claims 5 and 17 are objected to because of the following informality: Both claims contain the limitation “training the trainable regressor with the second dataset and a most accurate hyperparameter configuration of said measuring said accuracy”. The term “said measuring said accuracy” should be cleaned up (i.e., remove one of the “said” terms) to indicate that “a most accurate hyperparameter configuration” is based on the earlier measuring accuracy limitation found in the same respective claim. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.


Claims 6, 8, 18, and 20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for 
Regarding Claims 6 and 18,
Both claims recite the limitation “said inferred duration is relative to the reference duration”. The term “is relative to” results in this limitation being broadly interpreted to cover all possible relationships between an inferred duration and a reference duration. However, paragraph [0078] of the specification only provides two examples of a relationship between an inferred duration and a reference duration: “For example, a normalized duration may be a ratio of empirical time to reference time, … In an embodiment, normalized duration is a percent deviation of empirical time from reference time.” and hence does not have support to broadly cover all relationships between an inferred duration and a reference duration. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this limitation present in the specification, this claim limitation found in Claims 6 and 18 fails to comply with the written description requirement.
Claims 7-8 and Claims 19-20 depend from Claims 6 and 18, respectively, and as such would ordinarily inherit the written description issue found in Claims 6 and 18. However, Claims 7 and 19 further narrow the limitation “said inferred duration is relative to the reference duration” to comprise either one of two limitations (“a percent deviation of said duration from the reference duration, or a ratio of said duration to the reference duration”), and as such do not inherit the written description issue from their respective parent Claims 6 and 18. Claims 8 and 20 do not perform this same narrowing of the claim limitation “said inferred duration is relative to the reference duration”, and hence Claims 8 and 20 are also rejected for the same lack of written description by virtue of dependency.
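For illustration, the two relationships quoted from paragraph [0078] (a ratio, and a percent deviation) can be sketched as follows. This is a minimal sketch only: the function names are hypothetical, and the percent-deviation formula shown is an assumed conventional definition, since the specification's exact formula is not reproduced in this action.

```python
def duration_ratio(empirical_seconds: float, reference_seconds: float) -> float:
    """Ratio of empirical time to reference time (hypothetical helper)."""
    return empirical_seconds / reference_seconds


def percent_deviation(empirical_seconds: float, reference_seconds: float) -> float:
    """Percent deviation of empirical time from reference time.

    Assumed definition: 100 * (empirical - reference) / reference.
    """
    return 100.0 * (empirical_seconds - reference_seconds) / reference_seconds


# Example values are invented for illustration.
ratio = duration_ratio(30.0, 20.0)
deviation = percent_deviation(30.0, 20.0)
```

Both functions express one duration relative to a reference duration; other relationships (for example, a plain difference) are equally conceivable, which is the breadth the rejection addresses.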

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 6-7, 9-11, 13, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter .
Regarding Claim 1, Reif teaches
A method comprising: 
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model (Examiner’s note: Under its broadest reasonable interpretation, a landmark configuration is identified as a set of known hyperparameters (and their corresponding values) for an associated machine learning algorithm used for prediction. Reif Table 1 shows a list of target classifiers and their associated hyperparameters (with corresponding range values for each hyperparameter value), where each classifier and their corresponding hyperparameters correspond to a “… landmark configuration … that each contain a plurality of values for a plurality of hyperparameters of a machine learning model” and the plurality of target classifiers and their associated hyperparameter values correspond to “a plurality of landmark configurations” (Reif p.266 Section 5 Evaluation, 1st paragraph: “The used classifiers as well as their optimized parameters are listed in Table 1.”).): 
…
measuring a duration of a plurality of durations spent training, based on a dataset, the ML model (Examiner’s note: Measuring the prediction run-time of each target classifier using predefined combination sets of hyperparameter values (Reif p.263 Section 3 Run-time of a Grid Search) and associated meta-features of a known dataset, corresponding to “measuring a duration of a plurality of durations spent training, based on a dataset, the ML model” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.”).); 
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on (Reif p.265 Figure 3: examiner’s note: Under its broadest reasonable interpretation, “predicting, by a trainable regressor, an inferred duration needed to train the ML model based on …” is interpreted as referencing both training and trained aspects, where the first part of the claim limitation of “predicting, by a trainable regressor, an inferred duration …” refers to using the trainable regressor (once trained) to make an inferred duration prediction, while the second part of the claim limitation of “… needed to train the ML model based on …” refers to the training data needed to train the trainable regressor. Referring to Reif Figure 3 Application section, a time prediction model performs a prediction of a time x, where the time prediction model is based on the regression learner that was trained in Reif Figure 3 Training section, such that the time prediction model represents a trainable regressor, corresponding to “predicting by a trainable regressor, an inferred duration …”. The training of the regression learner involves providing meta-feature data and associated measured run-times as input training data into the regression learner, thus corresponding to “… needed to train the ML model based on …”, with the elements of the input training data indicated in further detail by the subsequent claim elements (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. … After the learning phase, the resulting model can be used to predict the run-time of an unknown dataset … The overall approach is illustrated in Figure 3.”).): 
…
a plurality of values, based on the dataset, of a plurality of meta-features (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, analyzing dataset instances to generate meta-features, where the meta-features represent properties of the respective dataset (Reif p.262 Section 2 Meta-Learning: “…meta-learning is based on features of datasets. These features are often called meta-features. They describe properties of a dataset … Simple meta-features use directly accessible properties like the number of samples, the number of attributes or the number of classes. More sophisticated features are statistical measures …”). A detailed list of meta-features (grouped by category) is provided in Reif Section 4.1 Traditional Meta-Features (corresponding to “a plurality of values, based on the dataset, of a plurality of meta-features”), with the application of each of these groups of meta-features as part of the input training data to train a regression learner (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).), and 
said plurality of durations (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, measuring the run-time of a target classifier that is learning optimized hyperparameters for each dataset and associated meta-features, and using the associated measured run-times (corresponding to “said plurality of durations”) as part of the input training data to train a regression learner (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).) and 
the values of the plurality of landmark configurations (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, measuring the run-times to train a particular classifier to learn optimized hyperparameters (and their associated values) for each dataset and associated meta-features, where the hyperparameter values defined based on the interval ranges shown in Reif Table 1 correspond to “the values of the plurality of landmark configurations” (Reif p.263 Section 3 Run-Time of a Grid Search, 1st paragraph: “Since the performance of most classifiers depends on parameter values, the parameters are usually optimized. A simple and often used method for parameter optimization is grid search. All predefined combinations of parameter values are evaluated to determine the best of them. … different parameter combinations require different amounts of time. The plot shows the run-time of training the Ripper classifier for different combinations of its two parameters sample ratio and pureness.” and Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).).
Reif p.265 Figure 3) in order to measure the respective run-times, Reif does not explicitly teach
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model: 
… configuring the ML model based on the landmark configuration; …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on: 
a proposed configuration of the ML model, …
Sturlaugson teaches
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model: 
… configuring the ML model based on the landmark configuration (Sturlaugson Figure 2, elements 30, 32: examiner’s note: A machine learning model 32 within an experiment module 30 in a machine learning system, with a set of hyperparameters and their values, corresponding to “a landmark configuration” containing “a plurality of values for a plurality of hyperparameters” (Sturlaugson paragraph [0020]: “The machine learning models 32 include a machine learning algorithm and one or more associated parameter values for the machine learning algorithm.”), with each different type of machine learning algorithm providing different hyperparameters and associated values (Sturlaugson paragraph [0022], corresponding to “a plurality of landmark configurations”), each of which can be loaded onto a machine learning model, thus corresponding to “configuring the ML model based on the landmark configuration” (Sturlaugson paragraph [0017]: “Generally, machine learning systems 10 are configured to calculate and/or to estimate the performance of one or more machine learning algorithms configured with one or more specific parameters (also referred to as hyper-parameters) with respect to a given set of data. The machine learning algorithm along with its associated specific parameter values form, at least in part, the machine learning model 32 …”).); …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on:
a proposed configuration of the ML model (Sturlaugson Figure 2, elements 30, 32: examiner’s note: Under its broadest reasonable interpretation, “a proposed configuration of the ML model” refers to a selected set of hyperparameters and its associated values being used under the training phase (as indicated by the phrase “… needed to train the ML model based on …”). The experiment module 30 performs the experiments/trials to test each machine learning model 32, under both training and evaluation phases, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters, with one of these trials loading a set of associated parameters corresponding to “a proposed configuration of the ML model” (Sturlaugson paragraphs [0033]-[0034]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … Experiment module 30 may be configured to automatically and/or autonomously design and carry out the specified experiments (also called trials) to test each of the machine learning models 32. … For example, the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection.”), where the results of the trials are reported as performance results (Sturlaugson paragraph [0042]: “The performance result for each machine learning model 32 … may include an indicator, value, and/or result related to …  an accuracy,… . Additionally or alternatively, the indicator, value, and/or result may be related to computational efficiency, memory required, and/or execution speed.”).), …
Both Reif and Sturlaugson are analogous art since they both teach training and evaluating machine learning algorithms using an associated set of hyperparameters.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ML model (i.e., a target classifier and a predetermined combination set of associated hyperparameters) identified in the training phase taught in Reif and perform the training steps for the ML model using the experiment module taught in Sturlaugson as a way to train and measure the associated run-time of the ML model. The motivation to combine is taught in Sturlaugson, since this method allows a large number of different combinations of machine learning algorithms and their associated hyperparameters to be trained and evaluated in an automated fashion, which provides a user the ability to tailor a machine learning model for various applications and datasets by identifying and selecting a best trained machine learning model based on a performance measurement (Sturlaugson paragraphs [0003]-[0005]: “… a broad array of machine learning algorithm are available, with new algorithms the subject of active research. … The large number of machine learning options available to address a problem makes it difficult to choose the best option or even a well-performing option. The amount, type, and quality of data affect the accuracy and stability of training and the resultant trained models. Further, problem-specific considerations, such as tolerance of errors (e.g., false positives, false negatives) scalability, and execution speed, limit the acceptable choices. … Therefore, there exists a need for comparing machine learning models for applicability to various specific problems.”).  
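The training scheme attributed to Reif in the Claim 1 mapping (one training instance per known dataset, pairing that dataset's meta-features with a measured run-time used as the target variable) can be sketched roughly as below. This is a simplified sketch under stated assumptions: it uses a single meta-feature (number of samples) and ordinary least squares, whereas the regression learner and meta-feature set actually used in Reif are not specified in this action. All names and numbers are hypothetical.

```python
# Hypothetical training set: one row per known dataset, pairing one
# meta-feature (number of samples) with the measured run-time in seconds.
known_datasets = [(1_000, 2.0), (2_000, 4.1), (4_000, 7.9), (8_000, 16.0)]


def fit_runtime_model(rows):
    """Fit run-time = slope * n_samples + intercept by least squares."""
    xs = [n for n, _ in rows]
    ys = [t for _, t in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The returned callable plays the role of the trained time-prediction model.
    return lambda n_samples: slope * n_samples + intercept


predict_runtime = fit_runtime_model(known_datasets)
# Predicted run-time for an unseen dataset with 6,000 samples.
estimate = predict_runtime(6_000)
```

After fitting, the model is applied to the meta-features of an unknown dataset, mirroring the Training and Application sections of Reif Figure 3 as characterized above.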
Regarding Claim 6, Reif in view of Sturlaugson teaches
The method of Claim 1 wherein: 
the plurality of landmark configurations comprises a reference configuration (Reif p.267 Table 1: examiner’s note: Under its broadest reasonable interpretation, a “reference configuration” is interpreted as a landmark configuration in which the measured run-time was performed for the target classifier, serving as a baseline to perform future predictions. Reif Table 1 shows a list of target classifiers and their associated hyperparameters (with corresponding range values for each hyperparameter value), where each classifier is identified as a simple learner used to predict more sophisticated classifiers, and as such, each target classifier and its associated hyperparameter values represent a reference configuration (thus corresponding to “the plurality of landmark configurations comprises a reference configuration”) (Reif pp.265-266 Section 4.2 Time-Based Meta-Features, 1st paragraph: “Landmarking have been successfully used in the past for different meta-learning approaches [15][4][2][9]. The approach use performance values of simple classifier for predicting the performance of more sophisticated algorithms. Analogically, we use the run-time of the same simple learners for predicting the run-time of a sophisticated classifier. The used classifiers are Naïve Bayes, One-Nearest Neighbor, and Decision Stumps.” and Reif p.266 Section 5 Evaluation, 1st paragraph: “We evaluated the presented approach on real world datasets from the UCI machine learning repository [1] and StatLib[18]. The run-time of a grid search for five different classifiers are investigated. The used classifiers as well as their optimized parameters are listed in Table 1.”).); 
said duration spent training is a reference duration when said landmark configuration is said reference configuration (Reif p.268 Normalized Absolute Error, 1st paragraph: examiner’s note: Under its broadest reasonable interpretation, the contingent limitation “when said landmark configuration is said reference configuration” in this method claim is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given to claim language that follows a contingent clause when the condition is not required to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled. 
Under its broadest reasonable interpretation, a “reference duration” is interpreted as the measured run-time for a landmark configuration for a target classifier serving as a baseline to perform future predictions. As identified by this claim limitation (“said duration spent training is a reference duration”), the measuring of the prediction run-time of each target classifier from Reif Table 1 corresponds to a “reference duration”, with each of the measured run-times performed during evaluation of these target classifiers representing a respective “reference configuration” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.” and Reif p.266 Section 5 Evaluation, 1st paragraph: “We evaluated the presented approach on real world datasets from the UCI machine learning repository [1] and StatLib[18]. The run-time of a grid search for five different classifiers are investigated. The used classifiers as well as their optimized parameters are listed in Table 1. ... we only considered datasets with a run-time of the grid search in a defined interval. … Additionally, datasets with a run-time of the algorithm greater than 24 hours have been neglected as well because of the computational effort.”).); 
a normalized duration, of a plurality of normalized durations, is based on said duration relative to the reference duration (Reif p.268 Table 3; Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: Under its broadest reasonable interpretation, the “said duration relative to the reference duration” is interpreted as a relationship between “said duration” (interpreted as an inferred duration from Claim 1) and a “reference duration” (interpreted as the measured run-time for a landmark configuration for a target classifier serving as a baseline to perform future predictions). Performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation (corresponding to “a normalized duration”). A normalized absolute error is calculated based on a prediction run-time $t_p$ (corresponding to an “inferred duration”) and a measured baseline run-time $t_b$ (corresponding to a “reference duration”), where the error calculation equation itself is in the form of a ratio of normalized durations (i.e., the dividend is based on the predicted run-time normalized to the measured run-time, while the divisor is based on the baseline run-time normalized to the measured run-time), with the calculated normalized absolute errors for each target classifier and associated meta-features shown in Reif Table 3 (corresponding to “a normalized duration, of a plurality of normalized durations…”) (Reif p.268 Normalized Absolute Error, 1st – 2nd paragraphs: “… the normalized absolute error was determined that serves as a comparison to a baseline. The absolute error of the prediction by the presented approach is divided by the absolute error of the prediction by a baseline method: $e = |t_m - t_p| / |t_m - t_b|$, where $t_m$ is the actual measured time, $t_p$ the predicted time of the presented approach, and $t_b$ the time predicted by the baseline method. For the baseline method, the predicted run-time is simply the average run-time of the classifier. Hence, the baseline method predicts the same run-time for every dataset. … Table 4 shows the normalized absolute errors.”).); 
said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations (Reif p.268 Table 3; Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: Performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation (corresponding to “a normalized duration”). A normalized absolute error is calculated based on a prediction time t_p (corresponding to an “inferred duration”) and a measured baseline time t_b (corresponding to a “reference duration”), where the error calculation equation itself is in the form of a ratio of normalized durations (i.e., the dividend is based on the predicted run-time subtracted (i.e., normalized) from the measured run-time, while the divisor is based on the baseline run-time subtracted (i.e., normalized) from the measured run-time), with the calculated normalized absolute errors for each target classifier and associated meta-features shown in Reif Table 3 (corresponding to “said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations”).); 
said inferred duration is relative to the reference duration (Reif p.268 Table 3; Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: Under its broadest reasonable interpretation, the “said duration relative to the reference duration” is interpreted as a relationship between “said duration” (interpreted as an inferred duration from Claim 1) and a “reference duration” (interpreted as the landmark configuration for a target classifier serving as a baseline to perform future predictions). Performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation (corresponding to “a normalized duration”). A normalized absolute error is calculated based on a prediction run-time t_p (corresponding to an “inferred duration”) and a measured baseline run-time t_b (corresponding to a “reference duration”), thus corresponding to “said inferred duration is relative to the reference duration”, with the calculated normalized absolute errors for each target classifier and associated meta-features shown in Reif Table 3).).  
Regarding Claim 7, Reif in view of Sturlaugson teaches
The method of Claim 6 wherein said duration relative to the reference duration comprises: 
a percent deviation of said duration from the reference duration, or 
a ratio of said duration to the reference duration (Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: Under its broadest reasonable interpretation, the “said duration relative to the reference duration” is interpreted as a relationship between “said duration” (interpreted as an inferred duration from Claim 1) and a “reference duration” (interpreted as the landmark configuration in which the measured time was performed for a target classifier). Performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation. A normalized absolute error is calculated based on a prediction run-time t_p (corresponding to an “inferred duration”) and a measured baseline run-time t_b (corresponding to a “reference duration”), where the error calculation equation itself is in the form of a ratio (thus corresponding to “wherein said duration relative to the reference duration comprises … a ratio of said duration to the reference duration”) (Reif p.268 Normalized Absolute Error, 1st – 2nd paragraphs: “… the normalized absolute error was determined that serves as a comparison to a baseline. The absolute error of the prediction by the presented approach is divided by the absolute error of the prediction by a baseline method:
e = |t_m - t_p| / |t_m - t_b|, where t_m is the actual measured time, t_p the predicted time of the presented approach, and t_b the time predicted by the baseline method. For the baseline method, the predicted run-time is simply the average run-time of the classifier. Hence, the baseline method predicts the same run-time for every dataset. … Table 4 shows the normalized absolute errors.”).).  
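For illustration only, the normalized absolute error quoted from Reif p.268 can be sketched as below. The function name, the run-time values, and the predictions are hypothetical and are not taken from Reif; the sketch only assumes, as the quoted passage states, that the baseline prediction t_b is the average measured run-time of the classifier.

```python
from statistics import mean


def normalized_absolute_error(t_m: float, t_p: float, t_b: float) -> float:
    """Return e = |t_m - t_p| / |t_m - t_b| (hypothetical helper name)."""
    baseline_error = abs(t_m - t_b)
    if baseline_error == 0:
        # The ratio is undefined when the baseline is exactly right.
        raise ValueError("baseline prediction equals measured time")
    return abs(t_m - t_p) / baseline_error


# Hypothetical measured run-times (seconds) of one classifier on three datasets.
measured = [10.0, 20.0, 60.0]
# Hypothetical run-times predicted by a regression model for the same datasets.
predicted = [12.0, 18.0, 50.0]

# Baseline as described in the quoted passage: the average measured run-time,
# i.e. the same prediction for every dataset.
t_b = mean(measured)

errors = [normalized_absolute_error(m, p, t_b) for m, p in zip(measured, predicted)]
print(errors)
```

Under this definition, a value below 1 means the prediction is closer to the measured run-time than the baseline prediction is.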
Regarding Claim 9, Reif in view of Sturlaugson teaches
The method of Claim 1 wherein the plurality of landmark configurations comprises, for each numeric hyperparameter of said plurality of hyperparameters: 
a landmark configuration having said plurality of values that contains a minimum value for said hyperparameter (Reif p.267 Table 1: examiner’s note: According to the Merriam-Webster dictionary, the term “and/or” indicates that two words or expressions are to be taken together or individually, and as such, the claim limitations in this claim connected by this term will be treated as an “or” in the context of this claim. Reif Table 1 contains a set of target classifiers with their associated hyperparameters and interval values (corresponding to “the plurality of landmark configurations”). For example, for a k-NN classifier, the k parameter (corresponding to “numeric hyperparameter of said plurality of hyperparameters”) has a range of [1, 1000], where the interval values are expressed as a minimum and maximum value (thus corresponding to “for each numeric hyperparameter of said plurality of hyperparameters: a landmark configuration having said plurality of values that contains a minimum value for said hyperparameter”).

[Image: Reif p.267 Table 1 (media_image1.png)]

), 
a landmark configuration having said plurality of values that contains a maximum value for said hyperparameter, and/or 
a landmark configuration having said plurality of values that contains a value for said hyperparameter that is halfway between two of: 
said minimum value, said maximum value, and a default value.  
Regarding Claim 10, Reif in view of Sturlaugson teaches
The method of Claim 1 wherein the plurality of landmark configurations comprises, 
for each hyperparameter of said plurality of hyperparameters that is categorical, landmark configurations that each have said plurality of values that contains a distinct value for said hyperparameter (Reif p.267 Table 1: examiner’s note: Reif Table 1 (displayed at the end of Claim 9) shows hyperparameters and their associated interval values for a set of target classifiers, where a plurality of predetermined combinations of hyperparameters and their values can be generated (corresponding to “the plurality of landmark configurations”). The interval ranges indicate either numerical or non-numerical values (where the hyperparameter with non-numerical ranges correspond to “… hyperparameter of said plurality of hyperparameters that is categorical …”). For example, a k-NN classifier with the non-numeric hyperparameter ‘weighted vote’ can take possible values {yes, no}, thus corresponding to “for each hyperparameter of said plurality of hyperparameters that is categorical, landmark configurations that each have said plurality of values that contains a distinct value for said hyperparameter”.).  
Regarding Claim 11, Reif in view of Sturlaugson teaches
The method of Claim 1 wherein said plurality of meta-features comprises: 
a count of features of said dataset, 
a count of numeric features of said dataset (Reif p.265 Section 4.1 Traditional Meta-Features: examiner’s note: According to the Merriam-Webster dictionary, the term “and/or” indicates that two words or expressions are to be taken together or individually, and as such, the claim limitations in this claim connected by this term will be treated as an “or” in the context of this claim. Reif Section 4.1 groups the set of meta-features that can be identified for a dataset into four categories, where the Simple meta-features category includes various counts, such as the number of attributes and the number of numerical attributes (Reif p.265 Section 4.1 Traditional Meta-Features, 1st-2nd paragraphs: “As a first set of meta-features, we used typical measures from the previously mentioned groups. This set includes the following 34 meta-features: Simple meta-features, number of samples, number of classes, number of attributes, number of nominal attributes, number of numerical attributes …”).), 
a value, of a feature of said dataset, that is: 
a minimum, a maximum, a mean, and/or a quantile, a total count of examples within said dataset, 
a majority count of examples within said dataset having a majority label of a plurality of labels of said dataset, 
a minority count of examples within said dataset having a minority label of a plurality of labels of said dataset, and/or 
a ratio of two of: said total count, said majority count, and said minority count.  
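For illustration only, the simple count-based meta-features recited in Claim 11 and listed in Reif Section 4.1 (number of samples, classes, attributes, nominal attributes, numerical attributes) might be computed as sketched below. The function name, the dictionary keys, and the toy dataset are hypothetical; Reif does not disclose an implementation, and treating a column as numeric only when every value is a non-boolean number is an assumption of this sketch.

```python
from collections import Counter
from numbers import Number


def simple_meta_features(rows, labels):
    """Compute count-based meta-features for a dataset given as row tuples."""
    n_attributes = len(rows[0]) if rows else 0
    # Assumption: a column is numeric if all of its values are numbers (bools excluded).
    n_numeric = sum(
        all(isinstance(r[i], Number) and not isinstance(r[i], bool) for r in rows)
        for i in range(n_attributes)
    )
    label_counts = Counter(labels)
    return {
        "n_samples": len(rows),
        "n_attributes": n_attributes,
        "n_numeric": n_numeric,
        "n_nominal": n_attributes - n_numeric,
        "n_classes": len(label_counts),
        "majority_count": max(label_counts.values(), default=0),
        "minority_count": min(label_counts.values(), default=0),
    }


# Hypothetical toy dataset: two numeric columns and one nominal column.
rows = [(1.0, "a", 3), (2.5, "b", 4), (0.5, "a", 7)]
labels = ["x", "x", "y"]
meta = simple_meta_features(rows, labels)
print(meta)
```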
Regarding Claim 13, Reif teaches
One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors (Examiner’s note: The training and evaluation of the machine learning models are performed using a computer (containing memory storage) identified by its processor type (AMD Opteron) running a single-threaded program within an open-source package (RapidMiner), where both the program and the open-source package are understood to be stored in memory or a computer-readable medium located on the computer (Reif p.266 Section 5 Evaluation: “The complete evaluation was done using RapidMiner [13]. It is an open source data mining and pattern recognition framework implemented in Java. All times have been measured on an AMD Opteron using a single-threaded program.”).), cause: 
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model: 
…
measuring a duration of a plurality of durations spent training, based on a dataset, the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.); 
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.): 
…
a plurality of values, based on the dataset, of a plurality of meta-features (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.), and 
said plurality of durations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) and 
the values of the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.).  
	While Reif indicates that target classifiers are being trained using a set of hyperparameters through a grid search (Reif p.265 Figure 3) in order to measure the respective run-times, Reif does not explicitly teach
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model:
… configuring the ML model based on the landmark configuration; …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on:
a proposed configuration of the ML model, …
	Sturlaugson teaches
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model:
… configuring the ML model based on the landmark configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on:
a proposed configuration of the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) …
Both Reif and Sturlaugson are analogous art since they both teach training and evaluating machine learning algorithms using an associated set of hyperparameters.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ML model (i.e., a target classifier and a predetermined combination set of associated hyperparameters) identified in the training phase taught in Reif and perform the training steps for the ML model using the experiment module taught in Sturlaugson as a way to train and measure the associated run-time of the ML model. The motivation to combine is taught in Sturlaugson, since this method allows a large number of different combinations of machine learning algorithms and their associated hyperparameters to be trained and evaluated in an automated fashion, which provides a user the ability to tailor a machine learning model for various applications and datasets by identifying and selecting a best trained machine learning model based on a performance measurement (Sturlaugson paragraphs [0003]-[0005]: “… a broad array of machine learning algorithm are available, with new algorithms the subject of active research. … The large number of machine learning options available to address a problem makes it difficult to choose the best option or even a well-performing option. The amount, type, and quality of data affect the accuracy and stability of training and the resultant trained models. Further, problem-specific considerations, such as tolerance of errors (e.g., false positives, false negatives) scalability, and execution speed, limit the acceptable choices. … Therefore, there exists a need for comparing machine learning models for applicability to various specific problems.”).  
Regarding Claim 18, Reif in view of Sturlaugson teaches
The one or more non-transitory computer-readable media of Claim 13 wherein: 
the plurality of landmark configurations comprises a reference configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
said duration spent training is a reference duration when said landmark configuration is said reference configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
a normalized duration, of a plurality of normalized durations, is based on said duration relative to the reference duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
said inferred duration is relative to the reference duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.).  
Regarding Claim 19, Reif in view of Sturlaugson teaches
The one or more non-transitory computer-readable media of Claim 18 wherein said duration relative to the reference duration comprises: 
a percent deviation of said duration from the reference duration, or 
a ratio of said duration to the reference duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 7, and hence is rejected under similar rationale.).  
Claims 2-4 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred as Sturlaugson] as applied to Claims 1 and 13; in further view of Hutter et al., Algorithm runtime prediction: Methods & evaluation, Artificial Intelligence 206 (2014), Elsevier B.V. 2013, pp.79-111 [hereafter referred as Hutter].
Regarding Claim 2, Reif in view of Sturlaugson as applied to Claim 1 teaches
The method of Claim 1 wherein: 
said dataset is a first dataset (Examiner’s note: Measuring the prediction run-time of each target classifier using predefined combination sets of hyperparameter values (Reif p.263 Section 3 Run-time of a Grid Search) and associated meta-features of a known dataset, where the known dataset (from Claim 1) corresponds to “said dataset” and is assigned the role of “a first dataset” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.”).); 
a plurality of exploratory configurations is larger than the plurality of landmark configurations ([Reif p.263 Section 3 Run-time of a Grid Search: examiner’s note: A combination set of hyperparameter values derived from the set of target classifiers and their respective interval ranges (shown in Reif Table 1) are applied for each machine learning model, corresponding to “a plurality of exploratory configurations” (Reif p.263 Section 3 Run-time of a Grid Search, 1st paragraph: “Since the performance of most classifiers depends on parameter values, the parameters are usually optimized. … All predefined combinations of parameter values are evaluated to determine the best of them. … different parameter combinations require different amounts of time. The plot shows the run-time of training the Ripper classifier for different combinations of its two parameters sample ratio and pureness.”).] [Sturlaugson Figure 2, elements 30, 32: examiner’s note: The experiment module 30 generating multiple combinations of hyperparameters for each machine learning algorithm based on the original set of associated hyperparameters, where the original set of associated hyperparameters represent “a plurality of landmark configurations”, and each generated combination of hyperparameters represent “a plurality of exploratory configurations”. Given that multiple combinations of hyperparameters are generated from an original set of associated hyperparameters with interval ranges, this satisfies the condition of “a plurality of exploratory configurations is larger than the plurality of landmark configurations” (Sturlaugson paragraph [0034]: “… the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. The experiment module 30 may apply these range(s) and/or set(s) to identify a group of machine learning models 32. 
That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … the selection of machine learning models 32 may identify an artificial neural network as (one of) the machine learning algorithm(s) and associated parameters as 10-20 nodes and a learning rate decay of 0 or 0.01. The experiment module 30 may interpret this selection as at least four machine learning models: an artificial neural network with 10 nodes and a learning rate decay of 0, an artificial neural network with 10 nodes and a learning rate decay of 0.01, an artificial neural network with 20 nodes and a learning rate decay of 0, and an artificial neural network with 20 nodes and a learning rate decay of 0.01.”).]); 
the method further comprises: 
for each exploratory configuration of the plurality of exploratory configurations that each contain a plurality of values for said plurality of hyperparameters (Sturlaugson Figure 2, elements 30, 32: examiner’s note: The experiment module 30 performs the experiments/trials to test each machine learning model 32, under both training and evaluation phases, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters, with one of these trials loading a set of associated parameters corresponding to “an exploratory configuration that each contain a plurality of values for said plurality of hyperparameters” (Sturlaugson paragraphs [0033]-[0034]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … Experiment module 30 may be configured to automatically and/or autonomously design and carry out the specified experiments (also called trials) to test each of the machine learning models 32. … For example, the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection.”).): 
configuring the ML model based on the exploratory configuration (Sturlaugson Figure 2, elements 30, 32: examiner’s note: The experiment module 30 performs the experiments/trials to test each machine learning model 32, generating a machine learning model 32 for each unique combination of associated parameters (corresponding to “configuring the ML model based on the exploratory configuration”) (Sturlaugson paragraphs [0033]-[0034]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … Experiment module 30 may be configured to automatically and/or autonomously design and carry out the specified experiments (also called trials) to test each of the machine learning models 32. … For example, the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection.”).); 
measuring a second duration spent training, based on a second dataset, the ML model (Sturlaugson Figure 2, elements 30, 32: examiner’s note: The experiment module 30 performs the experiments/trials to test each machine learning model 32, under both training and evaluation phases, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters (corresponding to “an exploratory configuration”), where the experiment module further subdivides a dataset into training datasets and evaluation datasets, where the training datasets (corresponding to “a second dataset”) are used to further train the selected machine learning algorithm and its associated exploratory configuration (Sturlaugson paragraphs [0034]-[0036]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … Experiment module 30 may be configured, optionally for each machine learning model 32 independently, to divide the dataset into a training dataset (a subset of the dataset) and an evaluation dataset (another subset of the dataset). The same training dataset and evaluation dataset may be used for one or more, optionally all, of the machine learning models 32. 
… The experiment module 30 may be configured to train the machine learning model(s) 32 with the respective training dataset(s) (to produce a trained model) … ”), where the results of the training produces a performance result that is related to execution speed (which is interpreted as a measured run-time, and corresponds to “measuring a second duration spent training, based on a second dataset, the ML model”) (Sturlaugson paragraph [0042]: “The performance result for each machine learning model 32 … may include an indicator, value, and/or result related to …  an accuracy,… . Additionally or alternatively, the indicator, value, and/or result may be related to computational efficiency, memory required, and/or execution speed.”).); and 
generating, within a plurality of training tuples, a training tuple (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, a set of measured run-times and associated meta-features corresponding to known datasets are used to form a set of input training data (corresponding to “generating, within a plurality of training tuples, a training tuple”) to a regression learner.) based on: 
the second duration (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, measuring the run-time of a target classifier that is learning optimized hyperparameters for each dataset and associated meta-features, and using the associated measured run-times (corresponding to “said plurality of durations”) as part of the input training data (corresponding to “a training tuple”) to train a regression learner (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).), …
… a plurality of values, based on the second dataset, of said plurality of meta-features (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, analyzing dataset instances to generate meta-features, where the meta-features are based on dataset features, where this method of analyzing dataset instances to generate meta-features is independent of any dataset (and hence can apply to a “first dataset”, a “second dataset”, etc.) (Reif p.262 Section 2 Meta-Learning: “…meta-learning is based on features of datasets. These features are often called meta-features. They describe properties of a dataset … Simple meta-features use directly accessible properties like the number of samples, the number of attributes or the number of classes. More sophisticated features are statistical measures …”). The meta-features (grouped by category) are listed in Reif Section 4.1 Traditional Meta-Features (corresponding to “a plurality of values, based on the second dataset, of said plurality of meta-features”) (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).); 
training the trainable regressor based on the plurality of training tuples (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, a set of measured run-times and associated meta-features corresponding to known datasets are used to form a set of input training data to a regression learner (corresponding to “training the trainable regressor based on the plurality of training tuples”).).
While Reif in view of Sturlaugson teaches generating a plurality of training tuples containing measured run-times and a plurality of meta-features associated with known datasets as input training data into a regression learner, Reif in view of Sturlaugson does not explicitly teach
the method further comprises: …
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration, …
Hutter teaches
the method further comprises: …
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration (Examiner’s note: Training an empirical performance model representing an EPM regression model (Hutter p.79 Section 1 Introduction, 1st paragraph: “… a considerable body of work has shown how to use supervised machine learning methods to build regression models … we refer to such models as empirical performance models (EPMs).” and Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “… EPMs can predict any type of performance measure that can be evaluated in single algorithm runs, such as runtime, …”) by constructing input training data with parameter configurations θ_i (Hutter p.81 Section 2.2 Related Work on Predicting Runtime of Parameterized Algorithms, 1st paragraph: “… parameters can be treated as additional inputs to the model … and a model can be learned in the standard way.” and Hutter p.82 Section 3.1 Preliminaries, 1st paragraph: “We define the configuration space of a parameterized algorithm with k parameters θ_1, …, θ_k”), a set of feature vectors z_i representing problem-specific instance features (“meta-features”), and corresponding performance value y_i representing a measured run-time (Hutter p.82 Section 3.1 Preliminaries, 3rd paragraph: “…we focus on runtime as a performance measure…”), with this input training data for the EPM (including the parameter configurations) corresponding to “generating, within a plurality of training tuples, a training tuple based on: … the plurality of values of the exploratory configuration, …” (Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “To construct an EPM for an algorithm 𝒜 with configuration space Θ on an instance set Π, we run 𝒜 on various combinations of configurations θ_i ∈ Θ and instances π_i ∈ Π, and record the resulting performance values y_i. We record the k-dimensional parameter configuration θ_i and the m-dimensional feature vector z_i of the instance used in the i-th run, and combine them to form a p = k + m-dimensional vector of predictor variables x_i = [θ_i^T, z_i^T]^T. The training data for our regression models is then simply {(x_1, y_1), …, (x_n, y_n)}.”).), …
Both Reif in view of Sturlaugson and Hutter are analogous art since they both teach using regression algorithms to predict run-time for a machine learning model based on a set of hyperparameters, measured run-times, and associated meta-features corresponding to known datasets.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the training tuple for the regression learner taught in Reif in view of Sturlaugson and enhance it to include the hyperparameter values associated with a machine learning model as taught in Hutter as a way to perform run-time predictions for a machine learning model. The motivation to combine is taught in Hutter, since performing predictions on a model trained over its entire distribution (including its meta-features and hyperparameter values) allows a more accurate assessment of a model’s confidence at a particular input, where a selected model with a higher confidence results in the model having a better than average expectation of making more accurate predictions (Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “Given an algorithm 𝒜 with configuration space Θ and a distribution of instances with feature space ℱ, an EPM is a stochastic process f: ℐ→∆(ℝ) that defines a probability distribution over performance measures for each combination of a parameter configuration θ∈Θ of 𝒜 and a problem instance with features z∈ ℱ. The prediction of an entire distribution allows us to assess the model’s confidence at a particular input, which is essential, e.g., in model-based algorithm configuration [7,6,58,55].”).
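For illustration only, the training-tuple construction quoted from Hutter Section 3.1 can be sketched as follows. This is a minimal sketch and not code from Reif, Sturlaugson, Hutter, or the application: the function name make_training_tuple, the variable names, and all numeric values are hypothetical. It shows only how the k configuration values θ_i and the m meta-feature values z_i are concatenated into one predictor vector x_i and paired with the measured run-time y_i.

```python
from typing import List, Sequence, Tuple


def make_training_tuple(
    theta: Sequence[float], z: Sequence[float], runtime: float
) -> Tuple[List[float], float]:
    """Build one (x_i, y_i) pair: x_i is theta_i followed by z_i, y_i is the run-time."""
    # Concatenate the k hyperparameter values and the m meta-feature values (p = k + m).
    x = [float(v) for v in theta] + [float(v) for v in z]
    return x, float(runtime)


# Hypothetical observations: (hyperparameter values, dataset meta-features, measured seconds).
observed_runs = [
    ((0.1, 10.0), (1000.0, 20.0, 2.0), 1.5),
    ((1.0, 100.0), (1000.0, 20.0, 2.0), 4.2),
    ((0.1, 10.0), (50000.0, 8.0, 5.0), 37.0),
]

# The resulting list plays the role of {(x_1, y_1), ..., (x_n, y_n)}.
training_data = [make_training_tuple(t, z, y) for t, z, y in observed_runs]
```

Any regression learner that accepts fixed-length numeric vectors could then be fit on training_data; the choice of learner is outside this sketch.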
Regarding Claim 3, Reif in view of Sturlaugson, in further view of Hutter teaches
The method of Claim 2 wherein:
the second dataset is the first dataset, 
the second dataset is larger than the first dataset, 
the second dataset is a subsample of the first dataset (Sturlaugson Figure 2, elements 30, 32: examiner’s note: The experiment module 30 performs the experiments/trials to test each machine learning model 32, under both training and evaluation phases, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters, where the experiment module further subdivides a dataset (“a first dataset”) into training datasets and evaluation datasets, where the training datasets (corresponding to “a second dataset is a subsample of the first dataset”) are used to further train the selected machine learning algorithm and its associated exploratory configuration (Sturlaugson paragraphs [0034]-[0036]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … Experiment module 30 may be configured, optionally for each machine learning model 32 independently, to divide the dataset into a training dataset (a subset of the dataset) and an evaluation dataset (another subset of the dataset). The same training dataset and evaluation dataset may be used for one or more, optionally all, of the machine learning models 32. … The experiment module 30 may be configured to train the machine learning model(s) 32 with the respective training dataset(s) (to produce a trained model) … ”).), or 
said proposed configuration is contained in: said plurality of exploratory configurations, and/or said plurality of landmark configurations.  
Regarding Claim 4, Reif in view of Sturlaugson, in further view of Hutter teaches
The method of Claim 2 wherein said training the trainable regressor comprises 
measuring accuracy of the trainable regressor based on: 
mean-squared error (MSE), 
coefficient of determination (R2), 
Pearson correlation coefficient (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: examiner’s note: According to the Merriam-Webster dictionary, the term “and/or” indicates that two words or expressions are to be taken together or individually, and as such, the claim limitations in this claim connected by this term will be treated as an “or” in the context of this claim. Prediction performances of the different sets of target classifiers (corresponding to a classifier accuracy, Reif p.266 3rd paragraph (Section 4.2 Time-Based Meta-Features): “…If the traditional-meta-features are calculated anyway, e.g., for predicting the accuracy of classifiers…”) running predetermined combination sets of hyperparameters (“landmark configurations”) and their predicted run-times are evaluated using a Pearson product moment correlation coefficient (corresponding to “wherein said training the trainable regressor comprises measuring accuracy of the trainable regressor based on: … Pearson correlation coefficient”) (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: “The Pearson product moment correlation coefficient (PMCC) of the actual run-time and the predicted run-time was calculated. The correlation between two variables X and Y is defined as ρ_{X,Y} = E[(X − μ_x)(Y − μ_y)] / (σ_x σ_y). … The results are values in the interval [-1, 1]. … Table 3 shows the correlations coefficients for all five target classifiers and the investigated sets of meta-features.”).), and/or
Spearman rank correlation.  
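For illustration only, the correlation defined in the Reif Section 5.1 quotation above, ρ_{X,Y} = E[(X − μ_x)(Y − μ_y)] / (σ_x σ_y), can be computed between actual and predicted run-times as in the following minimal sketch. The function name pearson and the sample values are hypothetical, and population (divide-by-n) moments are assumed; this is not code from Reif.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation using population moments."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # E[(X - mu_x)(Y - mu_y)]
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sigma_x * sigma_y)


# Hypothetical actual vs. predicted run-times in seconds.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
rho = pearson(actual, predicted)
```

A value near 1 indicates that predicted run-times rise with actual run-times; the result always lies in the interval [-1, 1] stated in the quotation.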
Regarding Claim 14, Reif in view of Sturlaugson as applied to Claim 13 teaches
The one or more non-transitory computer-readable media of Claim 13 wherein: 
said dataset is a first dataset (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
a plurality of exploratory configurations is larger than the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
the instructions further cause: 
for each exploratory configuration of the plurality of exploratory configurations that each contain a plurality of values for said plurality of hyperparameters (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.): 
configuring the ML model based on the exploratory configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
measuring a second duration spent training, based on a second dataset, the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); and 
generating, within a plurality of training tuples, a training tuple (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.) based on: 
the second duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.), … 
… a plurality of values, based on the second dataset, of said plurality of meta-features (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
training the trainable regressor based on the plurality of training tuples (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.).  
While Reif in view of Sturlaugson teaches generating a plurality of training tuples containing measured run-times and a plurality of meta-features associated with known datasets as input training data into a regression learner, Reif in view of Sturlaugson does not explicitly teach
the instructions further cause: 
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration, …
Hutter teaches
the instructions further cause: 
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.), …
Both Reif in view of Sturlaugson and Hutter are analogous art since they both teach using regression algorithms to predict run-time for a machine learning model based on a set of hyperparameters, measured run-times, and associated meta-features corresponding to known datasets.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the training tuple for the regression learner taught in Reif in view of Sturlaugson and enhance it to include the hyperparameter values associated with a machine learning model as taught in Hutter as a way to perform run-time predictions for a machine learning model. The motivation to combine is taught in Hutter, since performing predictions on a model trained over its entire distribution (including its meta-features and hyperparameter values) allows a more accurate assessment of a model’s confidence at a particular input, where a selected model with a higher confidence results in the model having a better than average expectation of making more accurate predictions (Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “Given an algorithm 𝒜 with configuration space Θ and a distribution of instances with feature space ℱ, an EPM is a stochastic process f: ℐ→∆(ℝ) that defines a probability distribution over performance measures for each combination of a parameter configuration θ∈Θ of 𝒜 and a problem instance with features z∈ ℱ. The prediction of an entire distribution allows us to assess the model’s confidence at a particular input, which is essential, e.g., in model-based algorithm configuration [7,6,58,55].”).
Regarding Claim 15, Reif in view of Sturlaugson, in further view of Hutter teaches
The one or more non-transitory computer-readable media of Claim 14 wherein: 
the second dataset is the first dataset, 
the second dataset is larger than the first dataset, 
the second dataset is a subsample of the first dataset (This claim limitation is similar in scope to a corresponding claim limitation in Claim 3, and hence is rejected under similar rationale.), or 
said proposed configuration is contained in: said plurality of exploratory configurations, and/or said plurality of landmark configurations.  
Regarding Claim 16, Reif in view of Sturlaugson, in further view of Hutter teaches
The one or more non-transitory computer-readable media of Claim 14 wherein said training the trainable regressor comprises 
measuring accuracy of the trainable regressor based on: 
mean-squared error (MSE), 
coefficient of determination (R2), 
Pearson correlation coefficient (This claim limitation is similar in scope to a corresponding claim limitation in Claim 4, and hence is rejected under similar rationale.), and/or 
Spearman rank correlation.  
Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred as Sturlaugson], in further view of Hutter et al., Algorithm runtime prediction: Methods & evaluation, Artificial Intelligence 206 (2014), Elsevier B.V. 2013, pp.79-111 [hereafter referred as Hutter] as applied to Claims 2 and 14; in even further view of Kobayashi et al., U.S. PGPUB 2017/0061329, published 3/2/2017 [hereafter referred as Kobayashi].
Regarding Claim 5, Reif in view of Sturlaugson, in further view of Hutter as applied to Claim 2 teaches
The method of Claim 2, wherein said training the trainable regressor comprises: 
… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor ([Sturlaugson Figure 2, elements 32, 36, 38: examiner’s note: A machine learning model 32 may include a macro-procedure which is an ensemble of micro-procedures (where each micro-procedure is a trainable machine learning model), and the macro-procedure can include a machine learning algorithm (corresponding to the “trainable regressor”) and associated parameter values that are independent from those used in each micro-procedure, thus corresponding to “… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor” (Sturlaugson paragraph [0023]: “Machine learning model 32 may be a macro-procedure 36 that combines the outcomes of an ensemble of micro-procedures 38. Each micro-procedure 38 includes a machine learning algorithm and its associated parameter values. Optionally, each micro-procedure 38 includes a different combination of machine learning algorithm and associated parameter values.” and Sturlaugson paragraph [0026]: “Macro-procedures 36 may include a machine learning algorithm and associated parameter values that are independent and/or distinct from the micro-procedures 38.”).]):
training the trainable regressor with a subset of the second dataset and the hyperparameter configuration (Sturlaugson Figure 2, elements 30, 32, 36, 38: examiner’s note: Each macro-procedure with its independent set of hyperparameter values (corresponding to the “trainable regressor”) uses the same training and evaluation datasets used in training each micro-procedure (Sturlaugson paragraph [0040]: “Experiment module 30 is configured to train each of the machine learning models 32 using supervised learning to produce a trained model for each machine learning model. Experiment module 30 is configured to evaluate and/or to validate each trained model to produce a performance result for each machine learning model. Evaluation and/or validation may be performed by applying the trained model to the respective evaluation dataset and comparing the trained model results to the known output values. For machine learning models 32 which are macro-procedures 36, the experiment module 30 may be configured to generate a trained macro-procedure by independently training each micro-procedure 38 of the macro-procedure 36 to produce an ensemble of trained micro-procedures and, if the macro procedure 36 itself includes a machine learning algorithm, training the macro-procedure 36 with the ensemble of trained micro-procedures 38.”), which can be further subdivided into multiple subsets of data through cross-validation (thus corresponding to “training the trainable regressor with a subset of the second dataset and the hyperparameter configuration”) (Sturlaugson paragraph [0041]: “Evaluation and/or validation may be performed by cross validation (multiple rounds of validation) … Cross validation is a process in which the original dataset is divided multiple times (to form multiple training datasets and corresponding evaluation datasets), the machine learning model 32 is trained and evaluated with each division (each training dataset and corresponding evaluation dataset) to produce an evaluation result for each division, and the evaluation results are combined to produce the performance result…”).); and 
measuring accuracy of the trainable regressor based on said hyperparameter configuration (Examiner’s note: “The performance result for each machine learning model 32 and/or the individual evaluation results for each round of validation may include an indicator, value, and/or result related to … an accuracy …”).); 
training the trainable regressor with the second dataset (Sturlaugson Figure 2, elements 30, 32, 36, 38: examiner’s note: Each macro-procedure with its independent set of hyperparameter values (corresponding to the “trainable regressor”) uses the same training and evaluation datasets used in training each micro-procedure, thus corresponding to “training the trainable regressor with the second dataset…” (Sturlaugson paragraph [0040]: “Experiment module 30 is configured to train each of the machine learning models 32 using supervised learning to produce a trained model for each machine learning model. Experiment module 30 is configured to evaluate and/or to validate each trained model to produce a performance result for each machine learning model. Evaluation and/or validation may be performed by applying the trained model to the respective evaluation dataset and comparing the trained model results to the known output values. For machine learning models 32 which are macro-procedures 36, the experiment module 30 may be configured to generate a trained macro-procedure by independently training each micro-procedure 38 of the macro-procedure 36 to produce an ensemble of trained micro-procedures and, if the macro procedure 36 itself includes a machine learning algorithm, training the macro-procedure 36 with the ensemble of trained micro-procedures 38.”).) …
While Reif in view of Sturlaugson, in further view of Hutter teaches optimizing hyperparameter values for the regression learner through a grid search, which requires searching all possibilities of a hyperparameter space for the best set of hyperparameters (corresponding to “for each hyperparameter configuration … of the trainable regressor”; Reif p.267 2nd paragraph: “The presented approach was evaluated by a leave-one-out cross-validation for every algorithm. We used the regression variant of a Support Vector Machine, the ϵ-SVR, as meta-learning scheme. The parameters γ and C of the ϵ-SVR have been optimized by a grid search. LibSVM[7] was used as implementation.”), Reif in view of Sturlaugson, in further view of Hutter does not explicitly teach
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring said accuracy.  
Kobayashi teaches
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring said accuracy (Kobayashi Figure 18, elements 137, 138; Figure 19, steps S71..S72..S79, S80, S81: A machine learning service performing searching for an optimized set of hyperparameters during training of a machine learning model (where the model can be a regression model such as a random forest, thus corresponding to “a trainable regressor”, Kobayashi paragraph [0092]: “The machine learning device 100 is able to use a plurality of machine learning algorithms. … Examples of the machine learning algorithms include … a random forest.”), where the searching is performed using a hyperparameter adjustment unit to produce a set of hyperparameter vectors (through a grid search) that achieved the best prediction performance based on sets of hyperparameter vectors found in earlier learning steps (corresponding to “… a plurality of hyperparameter configurations…”, Kobayashi paragraph [0209]: “… the hyperparameter adjustment unit 137 generates a hyperparameter vector applied to a machine learning algorithm to be executed by the step execution unit 138. Grid search or random search may be used to generate the hyperparameter vector.” and Kobayashi paragraph [0211]: “… the hyperparameter adjustment unit 137 may perform the search by starting with a hyperparameter vector θ_{j=i}, that achieved the best prediction performance in the last learning step…”), and the search flow is controlled through a step execution unit which extracts a set of hyperparameter vectors from the hyperparameter adjustment unit over multiple learning steps (corresponding to “for each hyperparameter configuration of a plurality of hyperparameter configurations …”). The step execution unit also performs cross-validation using training datasets, and repeats the steps of generating sets of hyperparameter vectors and cross-validation over H iterations (Kobayashi paragraphs [0214], [0215]-[0227]; Kobayashi Figure 19, steps S71..S72..S79, S80, S81), in order to produce a set of H predictions with corresponding hyperparameter vectors, where the iteration that has the best prediction performance (corresponding to an accuracy, Kobayashi paragraph [0055]: “The prediction performance of an individual model indicates the accuracy thereof, namely, indicates the capability of accurately predicting results of unknown cases.”) is selected, thereby outputting a selected machine learning model with the best prediction performance and an optimized set of hyperparameters, resulting in “… training the trainable regressor with … a most accurate hyperparameter configuration of said measuring said accuracy” (Kobayashi paragraph [0214]: “ … Next, the step execution unit 138 selects a model that indicates the best prediction performance from a plurality of models that correspond to the plurality of hyperparameter vectors. The step execution unit 138 outputs the selected model, the prediction performance thereof, the hyperparameter vector used to generate the model, and the execution time.”).).  
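For illustration only, the general pattern discussed above (for each hyperparameter configuration, train on subsets of the data, measure accuracy by cross-validation, then retrain on the full dataset with the most accurate configuration) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Kobayashi, Reif, or the application: a one-dimensional ridge regression through the origin stands in for the trainable regressor, its penalty lam stands in for the hyperparameter configuration, and every function name and data value is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fit_ridge_1d(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form weight of y ~ w * x with L2 penalty lam (stand-in regressor)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean-squared error of the predictions w * x against y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs: Sequence[float], ys: Sequence[float], lam: float, folds: int = 3) -> float:
    """Average held-out MSE over `folds` interleaved train/evaluation splits."""
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        held = [i for i in range(len(xs)) if i % folds == f]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        total += mse(w, [xs[i] for i in held], [ys[i] for i in held])
    return total / folds


def grid_search(
    xs: Sequence[float], ys: Sequence[float], grid: List[float], folds: int = 3
) -> Tuple[float, float]:
    """Score every configuration, then retrain on all data with the most accurate one."""
    scores: Dict[float, float] = {lam: cross_val_mse(xs, ys, lam, folds) for lam in grid}
    best = min(scores, key=scores.get)  # lowest cross-validated error
    return best, fit_ridge_1d(xs, ys, best)


# Hypothetical noiseless data following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_lam, final_w = grid_search(xs, ys, [0.0, 1.0, 10.0])
```

Because the hypothetical data are noiseless, the unpenalized configuration recovers the slope exactly and is selected; with noisy data a nonzero penalty could score better.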
Both Reif in view of Sturlaugson, in further view of Hutter, and Kobayashi are analogous art since they both teach training machine learning algorithms using cross-validation techniques.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the machine learning training of the macro-procedure machine learning algorithm taught in Reif in view of Sturlaugson, in further view of Hutter and enhance it to include the step execution unit and hyperparameter adjustment unit taught in Kobayashi as a way to train a machine learning model to produce an optimized set of hyperparameter values. The motivation to combine is taught in Kobayashi, as a way to automate the training, evaluation, and selection of a machine learning model using large datasets by starting with a model with a known prediction performance trained within a given period of time, and to improve the prediction performance of a machine learning model by using incrementally larger datasets until the desired prediction performance is achieved, thereby saving computational resources and shortening the time to learn a model (Kobayashi paragraphs [0004]-[0005]: “In machine learning, it is preferable that the accuracy of an individual learned model, namely, the capability of accurately predicting results of unknown cases (which may be referred to as a prediction performance) be high. If a larger size of training data is used in learning, a model indicating a higher prediction performance is obtained. However, if a larger size of training data is used, more time is needed to learn a model. Thus, progressive sampling has been proposed as a method for efficiently obtaining a model indicating a practically sufficient prediction performance. … With the progressive sampling, first, a computer learns a model by using a small size of training data. Next, by using test data indicating a known case different from the training data, the computer compares a result predicted by the model with the known result and evaluates the prediction performance of the learned model. If the prediction performance is not sufficient, the computer learns a model again by using a larger size of training data than the size of the last training data. The computer repeats this procedure until a sufficiently high prediction performance is obtained. In this way, the computer can avoid using an excessively large size of training data and can shorten the time needed to learn a model.”).
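For illustration only, the progressive sampling procedure described in the quoted Kobayashi passage can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Kobayashi; the through-origin linear model, the helper names (make_point, fit_slope, mse, progressive_sampling), the synthetic data, the size schedule, and the threshold are all hypothetical and introduced here.

```python
# Hypothetical synthetic data: y is roughly 3*x plus small deterministic noise.
def make_point(i):
    noise = ((i * 7) % 5 - 2) * 0.5
    return float(i), 3.0 * i + noise

train_pool = [make_point(i) for i in range(1, 33)]   # training data
test_set = [make_point(i) for i in range(33, 41)]    # known cases not used in training

def fit_slope(points):
    """Least-squares fit of y ~ w*x (no intercept); returns w."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return sxy / sxx

def mse(w, points):
    """Mean squared error of the model y = w*x on the given points."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

def progressive_sampling(pool, test, sizes, threshold):
    """Learn on increasingly large training subsets until the test error
    is at or below the threshold, or the size schedule is exhausted."""
    history = []
    w = None
    for size in sizes:
        w = fit_slope(pool[:size])   # learn a model on the first `size` points
        err = mse(w, test)           # evaluate prediction performance on test data
        history.append((size, err))
        if err <= threshold:         # sufficient performance: stop growing the data
            break
    return w, history

sizes = [2, 4, 8, 16, 32]            # assumed schedule of training-set sizes
threshold = 1.0                      # assumed "sufficient" test error
model, history = progressive_sampling(train_pool, test_set, sizes, threshold)
final_size, final_err = history[-1]
print(final_size, final_err)
```

The loop stops at the first training-set size whose test error meets the threshold, which is the sense in which the quoted passage describes avoiding an excessively large size of training data.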
Regarding Claim 17, Reif in view of Sturlaugson, in further view of Hutter as applied to Claim 14 teaches
The one or more non-transitory computer-readable media of Claim 14 wherein said training the trainable regressor comprises: 
… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.):
training the trainable regressor with a subset of the second dataset and the hyperparameter configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.); and 
measuring accuracy of the trainable regressor based on said hyperparameter configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.);
training the trainable regressor with the second dataset (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.) …
While Reif in view of Sturlaugson, in further view of Hutter teaches optimizing hyperparameter values for the regression learner through a grid search (Reif p.267 2nd paragraph: “The presented approach was evaluated by a leave-one-out cross-validation for every algorithm. We used the regression variant of a Support Vector Machine, the ϵ-SVR, as meta-learning scheme. The parameters γ and C of the ϵ-SVR have been optimized by a grid search. LibSVM[7] was used as implementation.”), Reif in view of Sturlaugson, in further view of Hutter does not explicitly teach
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring said accuracy.  
Kobayashi teaches
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring said accuracy (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.).  
Both Reif in view of Sturlaugson, in further view of Hutter, and Kobayashi are analogous art since they both teach training machine learning algorithms using cross-validation techniques.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the machine learning training of the macro-procedure machine learning algorithm taught in Reif in view of Sturlaugson, in further view of Hutter and enhance it to include the step execution unit and hyperparameter adjustment unit taught in Kobayashi as a way to train a machine learning model to produce an optimized set of hyperparameter values. The motivation to combine is taught in Kobayashi, as a way to automate the training, evaluation, and selection of a machine learning model using large datasets by starting with a model with a known prediction performance trained within a given period of time, and to improve the prediction performance of a machine learning model by using incrementally larger datasets until the desired prediction performance is achieved, thereby saving computational resources and shortening the time to learn a model (Kobayashi paragraphs [0004]-[0005]: “In machine learning, it is preferable that the accuracy of an individual learned model, namely, the capability of accurately predicting results of unknown cases (which may be referred to as a prediction performance) be high. If a larger size of training data is used in learning, a model indicating a higher prediction performance is obtained. However, if a larger size of training data is used, more time is needed to learn a model. Thus, progressive sampling has been proposed as a method for efficiently obtaining a model indicating a practically sufficient prediction performance. … With the progressive sampling, first, a computer learns a model by using a small size of training data. Next, by using test data indicating a known case different from the training data, the computer compares a result predicted by the model with the known result and evaluates the prediction performance of the learned model. If the prediction performance is not sufficient, the computer learns a model again by using a larger size of training data than the size of the last training data. The computer repeats this procedure until a sufficiently high prediction performance is obtained. In this way, the computer can avoid using an excessively large size of training data and can shorten the time needed to learn a model.”).
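For illustration only, the limitations recited in Claim 17 (for each hyperparameter configuration, training with a subset of the dataset and measuring accuracy, then training with the full dataset and the most accurate configuration) can be sketched as follows. This is a minimal sketch under stated assumptions; it is neither the implementation of Kobayashi nor that of the application. The one-parameter ridge regressor, the helper names fit_ridge and mse, the data, the train/validation split, and the candidate penalty values are hypothetical.

```python
# Hypothetical data: y is roughly 2*x with small alternating noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9, 14.1, 15.9]

def fit_ridge(x, y, lam):
    """Fit y ~ w*x with an L2 penalty `lam` on w; returns the weight w."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def mse(w, x, y):
    """Mean squared error of y = w*x; lower error is treated as higher accuracy."""
    return sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x)

configs = [0.0, 1.0, 10.0]             # assumed plurality of hyperparameter configurations
train_x, train_y = xs[:6], ys[:6]      # subset of the dataset used for training
val_x, val_y = xs[6:], ys[6:]          # held-out points used to measure accuracy

scores = {}
for lam in configs:
    w = fit_ridge(train_x, train_y, lam)   # train with the subset and this configuration
    scores[lam] = mse(w, val_x, val_y)     # measure accuracy for this configuration

best_lam = min(scores, key=scores.get)     # most accurate configuration
final_w = fit_ridge(xs, ys, best_lam)      # retrain on the full dataset with it
print(best_lam, final_w)
```

A single hold-out split is used here for brevity; the cross-validation described in the cited references would repeat the inner loop over several splits and average the scores.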
Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson] as applied to Claims 6 and 18; in further view of Raschka, Sebastian, Machine Learning FAQ: What is the difference between Pearson R and Simple Linear Regression?, retrieved from web.archive.org (https://web.archive.org/web/20160402054319/http://sebastianraschka.com:80/faq/docs/pearson-r-vs-linear-regr.html), dated 04/02/2016 [hereafter referred to as Raschka].
Regarding Claim 8, Reif in view of Sturlaugson as applied to Claim 6 teaches
The method of Claim 6,
wherein said predicting based on the plurality of durations comprises predicting based on … two landmark configurations of the plurality of landmark configurations (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: examiner’s note: Prediction performances of the different sets of target classifiers running predetermined combination sets of hyperparameters (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: “The Pearson product moment correlation coefficient (PMCC) of the actual run-time and the predicted run-time was calculated. The correlation between two variables X and Y is defined as $\rho_{X,Y} = E[(X - \mu_x)(Y - \mu_y)] / (\sigma_x \sigma_y)$. … The results are values in the interval [-1, 1]. … Table 3 shows the correlations coefficients for all five target classifiers and the investigated sets of meta-features.”).).  
While Reif in view of Sturlaugson teaches a Pearson correlation coefficient, Reif in view of Sturlaugson does not explicitly teach
wherein said predicting based on the plurality of durations comprises predicting based on a slope of said normalized duration of two landmark configurations of the plurality of landmark configurations.  
Raschka teaches
wherein said predicting based on the plurality of durations comprises predicting based on a slope of said normalized duration of two landmark configurations of the plurality of landmark configurations (Examiner’s note: A Pearson correlation coefficient is shown to be a standardized slope (thus corresponding to “… a slope of said normalized duration …”) (Raschka Simple Linear Regression: “ … To show how the correlation coefficient r factors in, let’s rewrite it as $\mathrm{cov}(x,y)/\sigma_x^2 = \mathrm{cov}(x,y)/(\sigma_x \sigma_y) \times \sigma_y/\sigma_x$, where the first term is equal to r, which we defined earlier; we can now see that we could use the “linear correlation coefficient” to compute the slope of the line as $b = r_{x,y} \, \sigma_y/\sigma_x$. … So, essentially, the linear correlation coefficient (Pearson’s r) is just the standardized slope of a simple linear regression line (fit).”).).  
Both Reif in view of Sturlaugson and Raschka are analogous art since they both teach the Pearson correlation coefficient in the context of regression learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the performance measurement based on the Pearson correlation coefficient calculation of the actual and predicted run-times taught in Reif in view of Sturlaugson and treat it as a slope of normalized duration during regression line analysis as taught in Raschka as a way to perform regression line analysis for the calculated performance measurements associated with each target classifier. The motivation to combine is taught in Raschka, as a way to facilitate the calculation of regression line analysis, since standardizing variables to have mean 0 and standard deviation 1 avoids computing the y-axis intercept for a linear regression line when executing optimization algorithms based on linear regression, and allows the slope of a linear regression line to be the same as the correlation coefficient, thus simplifying the analysis and making analysis involving large datasets more computationally efficient (Raschka Standardizing Variables: “In practice, we often standardize our input variables … After standardization, our variables have the properties of a standard normal distribution with mean=0, and standard deviation 1. … This is also useful if we use optimization algorithms for multiple linear regression, such as gradient descent, instead of the closed-form solution (handy for working with large datasets). … Another advantage of this approach is that the slop is then exactly the same as the correlation coefficient, which saves another computational step.”).
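For illustration only, the relationship quoted from Raschka (slope $b = r_{x,y} \, \sigma_y/\sigma_x$, and slope equal to r after standardization) can be checked numerically as follows. This is a minimal sketch; the data values and the helper names (mean, pstdev, covariance, pearson_r, ols_slope, standardize) are hypothetical and are not taken from any cited reference.

```python
import math

def mean(v):
    return sum(v) / len(v)

def pstdev(v):
    """Population standard deviation."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

def covariance(x, y):
    """Population covariance of x and y."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pearson_r(x, y):
    """Pearson correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    return covariance(x, y) / (pstdev(x) * pstdev(y))

def ols_slope(x, y):
    """Slope of the simple linear regression of y on x: cov(x, y) / sigma_x^2."""
    return covariance(x, y) / pstdev(x) ** 2

def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(v), pstdev(v)
    return [(a - m) / s for a in v]

# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
b = ols_slope(x, y)                                   # slope on the raw variables
b_std = ols_slope(standardize(x), standardize(y))     # slope on standardized variables

print(abs(b - r * pstdev(y) / pstdev(x)) < 1e-9)      # b = r * sigma_y / sigma_x
print(abs(b_std - r) < 1e-9)                          # standardized slope equals r
```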
Regarding Claim 20, Reif in view of Sturlaugson as applied to Claim 18 teaches
The one or more non-transitory computer-readable media of Claim 18,
wherein said predicting based on the plurality of durations comprises predicting based on … two landmark configurations of the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 8, and hence is rejected under similar rationale.).
While Reif in view of Sturlaugson teaches a Pearson correlation coefficient, Reif in view of Sturlaugson does not explicitly teach
wherein said predicting based on the plurality of durations comprises predicting based on a slope of said normalized duration of two landmark configurations of the plurality of landmark configurations.  
Raschka teaches
wherein said predicting based on the plurality of durations comprises predicting based on a slope of said normalized duration of two landmark configurations of the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 8, and hence is rejected under similar rationale.).  
Both Reif in view of Sturlaugson and Raschka are analogous art since they both teach the Pearson correlation coefficient in the context of regression learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the performance measurement based on the Pearson correlation coefficient calculation of the actual and predicted run-times taught in Reif in view of Sturlaugson and treat it as a slope of normalized duration during regression line analysis as taught in Raschka as a way to perform regression line analysis for the calculated performance measurements associated with each target classifier. The motivation to combine is taught in Raschka, as a way to facilitate the calculation of regression line analysis, since standardizing variables to have mean 0 and standard deviation 1 avoids computing the y-axis intercept for a linear regression line when executing optimization algorithms based on linear regression, and allows the slope of a linear regression line to be the same as the correlation coefficient, thus simplifying the analysis and making analysis involving large datasets more computationally efficient (Raschka Standardizing Variables: “In practice, we often standardize our input variables … After standardization, our variables have the properties of a standard normal distribution with mean=0, and standard deviation 1. … This is also useful if we use optimization algorithms for multiple linear regression, such as gradient descent, instead of the closed-form solution (handy for working with large datasets). … Another advantage of this approach is that the slop is then exactly the same as the correlation coefficient, which saves another computational step.”).
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson] as applied to Claim 1; in further view of Feurer et al., Initializing Bayesian Hyperparameter Optimization via Meta-Learning, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp.1128-1135 [hereafter referred to as Feurer].
Regarding Claim 12, Reif in view of Sturlaugson as applied to Claim 1 teaches
The method of Claim 1,
wherein the trainable regressor is a random forest ([Reif p.262 Section 2 Meta-Learning, 5th paragraph: “Regression was also used for meta-learning … for each target classifier whose performance should be predicted, a separate regression model has to be trained. … Various meta-features and regression algorithms have been used to predict different performance measures of classification algorithms can be applied using various regression algorithms …”] [Sturlaugson Figure 2, elements 32, 36, 38: examiner’s note: The machine learning model consists of a macro-procedure which is an ensemble of micro-procedures (where each micro-procedure is a trainable machine learning model), and the Sturlaugson paragraph [0023]: “Machine learning model 32 may be a macro-procedure 36 that combines the outcomes of an ensemble of micro-procedures 38. Each micro-procedure 38 includes a machine learning algorithm and its associated parameter values. Optionally, each micro-procedure 38 includes a different combination of machine learning algorithm and associated parameter values.” and Sturlaugson paragraph [0026]: “Macro-procedures 36 may include a machine learning algorithm and associated parameter values that are independent and/or distinct from the micro-procedures 38. … Examples of macro-procedures 36 include an ensemble of learned decision trees (e.g., a random forest)…”).]) …
However, Reif in view of Sturlaugson does not explicitly teach
… the method further comprises using the random forest to rank features of said dataset by importance.  
Feurer teaches
... the method further comprises using the random forest to rank features of said dataset by importance (Feurer p.1130 Algorithm 2: examiner’s note: Performing training of a meta-learner using sets of hyperparameter configurations and associated meta-features of datasets using the meta-learning-based initialization variant of SMBO algorithm (MI-SMBO, described in Feurer p.1130 Figure 2), where the regressor model used is a random forest (Feurer p.1130 col.1 4th paragraph-p.1130 col.2 4th paragraph (Initializing SMBO with Configurations Suggested by Meta-Learning): “… we assume that each dataset $D_i$ can be described by a set of F metafeatures $m^i = (m^i_1, \dots, m^i_F)$. … we precompute the metafeatures for all training datasets $D_1, \dots, D_N$, along with the best configurations ($\hat{\theta}_1, \dots, \hat{\theta}_N$). Given a new dataset $D_{N+1}$, we then measure its distances to all previous datasets $D_i$ using a distance measure d: 𝓓 x 𝓓 → ℝ. … the measure we use (in the following denoted as $d_c$) is the negative Spearman correlation coefficient between the ranked results of a fixed set of n hyperparameter configurations on both datasets … compute $d_c(D_i, D_j)$ for all 1≤i, j≤N and use regression to learn a function R … we implemented R using a random forest because of its robustness and speed.”). Referring to Algorithm 2 line 1, datasets are sorted by increasing distance to $D_{N+1}$ based on a distance metric (where a distance metric of increasing distances is interpreted to indicate the features between datasets are more distant, or less related, and as such, a sorted list of datasets based on a distance metric according to increasing distance corresponds to “using the random forest to rank features of said dataset by importance”) (Feurer p.1130 col.1 4th paragraph-p.1130 col.2 4th paragraph: “Sort dataset indices 𝛑(1), …, 𝛑(N) by increasing distance to $D_{N+1}$, i.e., (𝛑(i) ≤ 𝛑(j)) ⇔ (d($D_{N+1}$, $D_i$) ≤ d($D_{N+1}$, $D_j$))”).).  
Both Reif in view of Sturlaugson and Feurer are analogous art since they both teach performing meta-learning training over a set of hyperparameters and associated meta-features of datasets using a regressor model.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ϵ-SVR regressor model taught in Reif in view of Sturlaugson and replace it with the random forest model as taught in Feurer as a way to perform meta-learning training and evaluation over a set of hyperparameters and associated meta-features of datasets. The motivation to combine is taught in Feurer, since the MI-SMBO algorithm performs model selection and hyperparameter optimization starting from promising configurations that performed well on similar datasets, thus potentially speeding up the overall search for hyperparameters and reducing the computation time taken to train a machine learning model, resulting in improved computational efficiency and minimized computation time during training phase of a machine learning model (Feurer Abstract: “Model selection and hyperparameter optimization is crucial in applying machine learning to a novel dataset. Recently, a sub-community of machine learning has focused on solving this problem with Sequential Model-based Bayesian Optimization (SMBO), demonstrating substantial successes in many applications. However, for computationally expensive algorithms the overhead of hyperparameter optimization can still be prohibitive. In this paper we mimic a strategy human domain experts use: speed up optimization by starting from promising configurations that performed well on similar datasets.”).
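For illustration only, the distance-based ordering quoted from Feurer (negative Spearman correlation as the distance $d_c$, followed by sorting previous datasets by increasing distance to the new dataset) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Feurer: it computes $d_c$ directly from known results instead of learning it with a random forest over metafeatures, it assumes no tied results, and the helper names (ranks, spearman, d_c), the dataset labels, and the result values are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties among the values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def d_c(a, b):
    """Distance between two datasets: negative Spearman correlation of the
    results of the same fixed set of hyperparameter configurations."""
    return -spearman(a, b)

# Hypothetical results of n = 4 fixed configurations on each dataset.
new_dataset = [0.10, 0.20, 0.30, 0.40]
previous = {
    "D1": [0.15, 0.25, 0.35, 0.45],   # same ranking as the new dataset
    "D2": [0.45, 0.35, 0.25, 0.15],   # reversed ranking
    "D3": [0.20, 0.10, 0.30, 0.40],   # one adjacent pair swapped
}

distances = {name: d_c(new_dataset, res) for name, res in previous.items()}
ordering = sorted(previous, key=lambda name: distances[name])  # increasing distance
print(ordering)  # ['D1', 'D3', 'D2']
```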

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332.  The examiner can normally be reached on Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121                                                                                                                                                                                                        



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121