DETAILED ACTION
This is the response to applicant’s amendment action regarding application number 16/384,588, filed April 15, 2019.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
The amendment filed October 27, 2021 has been entered. Examiner acknowledges receipt of Amendments to Application 16/384,588, which include: Amendments to the Specification p.2, Amendments to the Drawings p.3 and Appendix (1 page), Amendments to the Claims pp.4-11, and Remarks pp.12-18 (containing applicant’s amendments). 
Regarding applicant’s Remarks on p.12, examiner has acknowledged Claims 1, 3-9, 11, 13, and 15-20 have been amended. Claims 1-20 remain pending in the application. 
Examiner has acknowledged applicant’s Amendments to the Specification, and they have overcome the specification objections previously set forth in the Non-Final Office Action mailed September 16, 2021. 
Examiner has acknowledged applicant’s Amendments to the Drawings and Appendix (1 page), and they have overcome the drawing objection previously set forth in the Non-Final Office Action mailed September 16, 2021. 
Regarding applicant’s Remarks on p.12, examiner acknowledges applicant’s Amendments to the Claims have removed the informalities from Claims 5 and 17 that were identified in the Non-Final Office Action mailed September 16, 2021. 
Regarding applicant’s Remarks on p.12, examiner acknowledges applicant’s Amendments to the Claims have removed the respective claim limitations from Claims 6 and 18 that were identified in the Non-Final Office Action mailed September 16, 2021 as failing to comply with the written description requirement. Therefore the respective §112(a) rejections previously set forth in the Non-Final Office Action mailed September 16, 2021 have been withdrawn. 
Examiner has noted that the applicant did not address the two issues found in the Information Disclosure Statement dated 10/27/2020, that were identified in the Non-Final Office Action mailed September 16, 2021. These two issues are listed in the relevant section below. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 16/384,588, which include: Remarks pp.12-18 (containing applicant’s arguments). 
Regarding applicant’s Remarks on pp.12-17 for Claims 1, 6-7, 9-11, 13, and 18-19 under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, 2011 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson]; for Claims 2-4 and 14-16 under 35 U.S.C. 103 as being unpatentable over Reif in view of Sturlaugson, in further view of Hutter et al., Algorithm Runtime Prediction: Methods & Evaluation, arXiv:1211.0906v2, published October 26, 2013 [hereafter referred to as Hutter]; for Claims 5 and 17 under 35 U.S.C. 103 as being unpatentable over Reif in view of Sturlaugson, in further view of Hutter, in even further view of Kobayashi et al., U.S. PGPUB 2017/0061329, published 3/2/2017 [hereafter referred to as Kobayashi]; for Claims 8 and 20 under 35 U.S.C. 103 as being unpatentable over Reif in view of Sturlaugson, in further view of Raschka, Sebastian, Machine Learning FAQ: What is the difference between Pearson R and Simple Linear Regression?, retrieved from web.archive.org, dated 04/02/2016 [hereafter referred to as Raschka]; and for Claim 12 under 35 U.S.C. 103 as being unpatentable over Reif in view of Sturlaugson, in further view of Feurer et al., Initializing Bayesian Hyperparameter Optimization via Meta-Learning, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015 [hereafter referred to as Feurer], examiner acknowledges applicant’s arguments, has considered them, and has found them to be not persuasive. Examiner has also noted that applicant has amended the claims in a manner that necessitates further examination and re-evaluation of the amended and related original claims. The updated claim mappings according to the applicant’s amended claims are provided in the relevant sections indicated below. 
Regarding applicant’s Remarks on p.13:
“A. A range is not a value 
Claim 1 in its present form recites "each value of the plurality of values of the landmark configuration corresponds to a distinct hyperparameter" per the specification (0030). The Office action (page 7) alleges "range...for each hyperparameter". A range is not a value. Reif lacks the claimed landmark configuration.” 
Examiner has considered this argument but finds the argument to be not persuasive. Reif Table 1 summarizes the classifiers and their corresponding optimized parameters (i.e., hyperparameters), with those parameters shown in terms of interval ranges and steps, where the numeric intervals denote the start and end value for each parameter/hyperparameter, while the steps for those numeric intervals denote the next numeric value within the interval range that is being identified. A person having ordinary skill in the art would be able to read and understand this table, with the given numeric interval range and defined steps for each hyperparameter, as a representation to express values for the corresponding hyperparameter that were optimized by a grid search for each respective classifier. In fact, applicant’s own specification uses this same range notation to define and summarize example hyperparameter values, as indicated in applicant’s specification [0016]: “For example, in the case of a support vector machines classifier (SVC) ML model, important hyperparameters to tune are gamma and C. In many practical cases, tuning ranges for gamma and C are [0.1, 10.0] and [1.0, 1000.0], respectively. Those are continuous ranges of real numbers, meaning that both ranges, although bounded, still contain an infinite amount of values to explore.”; applicant’s specification [0054]: “… For example, even when landmark configurations (i.e., hyperparameter values) are constant …”; and applicant’s specification [0082]: “… Hyperparameter 711 may have a natural range. For example, an MLP may have as few as one or two layers or as many layers as an implementation allows … Thus, hyperparameter 711 may have an integer range of 1-10, with one being a minimum, ten being a maximum, and five being a midpoint/mean of those two extremes. Those values of 1, 5, and 10 are shown.”. 
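For illustration only, the following Python sketch shows how an interval given with a number of steps can be read as a list of discrete hyperparameter values, and how those values combine into a grid of configurations. It is a minimal sketch under stated assumptions: the function name expand_range, the step count of 3, and the even (linear) spacing are hypothetical and are not taken from Reif or from applicant’s specification; only the gamma and C intervals are those quoted from specification [0016].

```python
from itertools import product


def expand_range(start, end, steps):
    """Return `steps` evenly spaced values from start to end, inclusive.

    Hypothetical helper: linear spacing is an assumption made for this sketch.
    """
    if steps < 2:
        return [start]
    width = (end - start) / (steps - 1)
    return [start + i * width for i in range(steps)]


# Intervals quoted from specification [0016]; the step count of 3 is assumed.
gamma_values = expand_range(0.1, 10.0, 3)
c_values = expand_range(1.0, 1000.0, 3)

# Each grid entry holds one value per hyperparameter.
grid = [{"gamma": g, "C": c} for g, c in product(gamma_values, c_values)]
print(len(grid), "configurations")
```

Under these assumptions, each entry of the grid holds exactly one value for each hyperparameter, while the interval and step notation is only a compact way of writing the full list of values.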
Examiner has annotated Reif Table 1 below to illustrate one example (Ripper classifier with hyperparameters ‘sample ratio’, ‘prune benefit’, ‘pureness’, and ‘criterion’, with the different values within the identified range shown). As indicated above, similar to the applicant’s specification, Reif uses this range notation to represent a plurality of values for each hyperparameter, as described in Reif pp.263-264 Section 3 Run-Time of a Grid Search: “A simple and often used method for parameter optimization is a grid search. All predefined combinations of parameter values are evaluated to determine the best of them.”, which are the result of the methodology that is used to analyze all possible combinations described in Reif p.264 3rd paragraph: “… Figure 2 shows the run-time for the typically optimized parameters γ and C of a Support Vector Machine (SVM). As visible in Figure 2(a), a higher value of the kernel parameter γ leads to a shorter run-time for the diabetes dataset, whereas for the prnn_synth dataset in Figure 2(b), a higher value of γ results in a longer run-time.”, using the applicable datasets as described in Reif p.266 Section 5 Evaluation: “We evaluated the presented approach on real world datasets from the UCI machine learning repository [1] and StatLib [18]. The run-time of a grid search for five different classifiers are investigated. The used classifiers as well as their optimized parameters are listed in Table 1.”

[Image: examiner-annotated Reif Table 1 (media_image1.png)]

Regarding applicant’s Remarks on p.13:
“B. Wrong kind of landmark 
Claim 1 recites "configuring the ML model based on the landmark configuration". Reif does not use a landmark for ML model configuration. Reif (Abstract) teaches "landmarking features". Reif (Table 3) explains "feature sets that include the proposed time-landmarking features". Reif does not use features for ML model configuration. Reif has the wrong kind of landmark. Reif is mischaracterized.”
Examiner has considered this argument but finds the argument to be not persuasive. Based on applicant’s claim 1, examiner notes that the term “landmark configuration” includes a plurality of hyperparameter values, and a measured duration (of a plurality of durations) spent training for the machine-learning model, where each of these durations is interpreted as a run-time training duration for the machine-learning model. Under its broadest reasonable interpretation, the term “feature” defined in Merriam-Webster dictionary is something that is “an interesting or important part, quality, ability, etc.”, which are exactly what the applicant is identifying as being part of a landmark configuration. Applicant’s specification uses the term “feature” to describe hyperparameters and training duration, as indicated in  [0016]-[0017]: “Hyperparameter features that affect model training time for multiple datasets may be collected. These features are a collection of hyperparameter settings for an ML model of interest. … For example, in the case of a support vector machines classifier (SVC) ML model, important hyperparameters to tune are gamma and C. … In an embodiment, these configuration landmarks and their benchmark training times are encoded as features into a feature vector with which a prediction regressor may be trained or otherwise applied.”. As for the argument that Reif does not use “features” for ML model configuration, examiner notes that the grid search optimization taught in Reif requires identification and selection of hyperparameters and their associated values to be applied to a machine-learning or neural network model in order to determine an optimal set of hyperparameter combinations, where these hyperparameters and their associated values are features for a dataset, as shown in Reif p.265 Figure 3. 
This method of identifying and selecting hyperparameters to determine an optimized combination of hyperparameters is described in Reif pp.263-264 Figures 1 and 2; Section 3 Run-Time of a Grid Search: “… as visible for the cloud dataset in Figure 1(a), different parameter combinations require different amounts of time. The plot shows the run-time of training the Ripper classifier for different combinations of its two parameters sample ratio and pureness. … Furthermore, the measured run-times can differ even more between multiple datasets. Figure 2 shows the run-time for the typically optimized parameters γ and C of a Support Vector Machine (SVM). As visible in Figure 2(a), a higher value of the kernel parameter γ leads to a shorter run-time for the diabetes dataset, whereas for the prnn_synth dataset in Figure 2(b), a higher value of γ results in a longer run-time.”. Furthermore, the measurement of a run-time for an optimized hyperparameter combination found during the grid search is explicitly taught in Reif p.264 Section 4 Methodology, and this single measured run-time value itself is also incorporated as an additional feature (along with a set of meta-features associated with respective datasets) as shown in the annotated Reif Figure 3 below. Hence Reif does indeed explicitly teach specific functionality and methodology within the same scope as the applicant’s claim scope, and the citation of Reif does not constitute a mischaracterization.

[Image: examiner-annotated Reif Figure 3 (media_image2.png)]

Regarding applicant’s Remarks on p.14:
“C. Duration of single training 
Claim 1 in its present form recites "measuring a duration...spent training...the ML model once" per the specification (0035). The Office action (page 7) says "using predefined combination sets of hyperparameter values (Reif p.263 Section 3 Run-time of a Grid Search)", where sets is plural, which means that Reif's measured duration necessarily entails multiple trainings. Reif cannot use a single training because Reif times one search that necessarily entails multiple trainings per the express purpose of Reif whose Title is "Prediction of...Time including Parameter Optimization", where optimization means search that necessarily entails multiple trainings, which is why Reif (Introduction) says "time could mean...several weeks or even longer." Reif (cited Table 1) says "Steps" that mean how many different values of one hyperparameter may be tried during optimization. Table 1 shows each hyperparameter has at least nine steps and one hyperparameter has 100 steps, which causes many different hyperparameters configurations in a single optimization run. Furthermore for each hyperparameters configuration visited during the grid search, Reif (page 267) says, "The presented approach was evaluated by a leave-one-out cross-validation" that requires multiple training runs for an individual hyperparameters configuration and a validation run that all contribute to a single grid search time. Thus, Reif's "run-time" is not analogous to the claimed time "spent training...once".”
Examiner has considered this argument but finds the argument to be not persuasive. Applicant’s claim limitation in Claim 1 is recited as such: “for each landmark configuration … configuring the ML model based on the landmark configuration; and measuring a duration of a plurality of durations spent training, based on a dataset, the ML model”, which is interpreted to indicate that a single runtime duration is measured and used as a feature for a corresponding dataset and meta-feature set, but that one runtime duration represents one of a plurality of runtime durations that are being measured for each dataset provided to a machine-learning model loaded with a particular landmark configuration (i.e. an optimized hyperparameter combination of values). As discussed and shown earlier in Reif p.265 Figure 3, the measured run-time for an optimized hyperparameter combination is measured once per dataset, and is used as an additional feature along with the associated meta-features for the corresponding dataset, and further explained in Reif p.264 Section 4 Methodology: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”. Hence, contrary to the applicant’s Remarks, Reif does not indicate measuring multiple run-times per dataset, and Reif does not indicate that multiple measured run-times contribute to a single grid search time.
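For illustration only, the following Python sketch mirrors the structure described in the quoted passage of Reif Section 4: each training instance describes one dataset and contains that dataset’s meta-features together with one measured run-time used as the target variable. It is a minimal sketch under stated assumptions: the names measure_run_time, build_training_set, and timed_procedure, the single toy meta-feature, and the use of sorting as a stand-in for the timed procedure are all hypothetical and are not taken from Reif.

```python
import time


def measure_run_time(timed_procedure, dataset):
    """Return the wall-clock seconds spent running the procedure once."""
    start = time.perf_counter()
    timed_procedure(dataset)
    return time.perf_counter() - start


def build_training_set(datasets, extract_meta_features, timed_procedure):
    """Build one (meta-features, run-time) instance per dataset."""
    rows = []
    for dataset in datasets:
        features = extract_meta_features(dataset)
        target = measure_run_time(timed_procedure, dataset)
        rows.append((features, target))
    return rows


# Toy stand-ins (assumptions): two small datasets, one meta-feature, and
# sorting in place of the procedure whose run-time is being measured.
datasets = [[3, 1, 2], [7, 5, 6, 4]]
rows = build_training_set(
    datasets,
    extract_meta_features=lambda d: {"n_instances": len(d)},
    timed_procedure=sorted,
)
print(len(rows), "training instances")
```

Under these assumptions, a regression model would then be fitted on the meta-features with the measured run-time as the target, one row per dataset.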
Regarding applicant’s Remarks on pp.14-16:
“D. Frustration of purpose 
In MPEP "2143.01 Suggestion or Motivation To Modify the References" is section "V. The Proposed Modification Cannot Render The Prior Art Unsatisfactory For Its Intended Purpose". As explained above, the express purpose of Reif (Title) is predicting grid search time that necessarily entails multiple trainings. Reif (Introduction) warns "the actual run-time often depends on...the exact parameter values of the algorithm." Reif says "exact parameter values" for an important reason: varying hyperparameter(s) causes acceleration or deceleration. Thus, Reif's grid search that, unlike a single training, traverses a multidimensional hyperparameters space necessarily entails accelerations and decelerations beyond the scope of any single training. 
Reif (Introduction) expressly disclaims, "Since most algorithms contain parameters that influence their performance, they are typically optimized. Therefore, we do not predict the time of one training...but the time needed for a grid search of the most important parameters. This includes multiple training and application phases." Reif never measures (nor learns based on) the time of an individual training that Claim 1 requires. Reif's time necessarily also "includes...and application" that means Reif's time combines training and subsequent use, which is not the claimed time "spent training". Reif (cited FIG. 3) expressly says "Measure Time Grid Search". That is why Reif's non-analogous time is instead referred to as "run-time" and why the Office action (page 7) quotes "Run-time of a Grid Search". 
Reif (cited section 3: Run-time of a Grid Search) explains "different parameter combinations require different amounts of time ... If, for example, the time of the lowest values of the parameters have been used, the total run-time will be significantly underestimated. Also other parameters of widely used classifiers obviously influence their run-time, e.g. the maximal depth of a decision tree or the learning rate of a Multilayer Perceptron (MLP)." Reif expressly warns of a "learning rate" hyperparameter as shown in Table 1. 
Modifying Reif to time and learn from an individual training instead of a grid search and excluding "application" time would render Reif unable to predict grid search time that is the sole purpose of Reif. Thus, modifying Reif to time and learn from an individual training instead of a grid search and excluding application time frustrates Reif's intended purpose. Therefore, there is no motivation to combine Reif, and combining Reif is illegal.”
Examiner has considered this argument but finds the argument to be not persuasive. Under the broadest reasonable interpretation of the applicant’s claim language, the teachings identified in the Reif reference (as well as other supporting secondary references) are shown as being within the same scope as the claim language provided by the applicant. See MPEP 2111. Examiner also notes that part of applicant’s above arguments relies on Reif’s explanation of the prior art in the Introduction section, which is used by Reif to explain the current status of the prior art before Reif identifies its main purpose in the paper, which is identified in Reif p.261 4th paragraph: “In this paper, a method for predicting actual run-times of classification algorithms is presented.” Examiner further notes that paragraph [0058] of applicant’s specification also indicates usage of a grid: “… Various embodiments may generate exploratory configurations 341-342 randomly, greedily (i.e., gradient following), or some combination of both. In an embodiment, exploratory configurations 341-342 are instead generated according to uniform intervals along a regular grid in hyperspace.”, which is consistent with what is found in the Reif reference. Examiner further notes that applicant’s Claims 1 and 2 indicate training a model and measuring a duration spent training, which under its broadest reasonable interpretation, is taught in Reif, as Reif performs this grid search during training (refer to Reif p.265 Figure 3), and measures its run-time in the context of training, where each of these run-times is measured once for each dataset and provided as training data along with the corresponding meta-feature set for the associated dataset. 
Regarding applicant’s Remarks on p.16:
“E. Validation is not training 
The Office action (pages 11-12) alleges, "Sturlaugson teaches...duration needed to train" that is a mischaracterization. On October 12, 2021 the Office filed an interview summary that alleges "Sturlaugson paragraphs [0040], [0042], [0057]...evaluating are performed to produce performance results". Performance results are not caused by training. Sturlaugson's performance results are caused by validation (i.e. evaluation 124 below). The skilled person knew that: a) performance results are the sole purpose of validation, b) validation is not training, and c) validation uses a model that was already trained. Sturlaugson says "validate each trained model to produce a performance result" (cited 0040), "performance result...for each round of validation" (cited 0042), and "evaluation 124 (i.e., rounds of validation)...evaluation 124 to produce the performance result" (cited 0057). Thus, the interview summary incorrectly relies on a phase other than training. The Office action (page 40) is similarly inaccurate, "Sturlaugson paragraph [0040]...validate [already] trained model to produce a performance result ...validation...through cross-validation" that is the wrong phase of a model's lifecycle. The Office action mentions "evaluate", "evaluating", and "evaluation" sixty-one times, which is a technical problem because Sturlaugson (cited 0057) teaches "evaluation 124 (i.e., rounds of validation)" that is not training. Thus, the Office action (page 13) alleging "training produces a performance result" is a mischaracterization.”
Examiner has considered this argument but finds the argument to be not persuasive. Sturlaugson teaches an experiment module that performs training of selected machine learning models, as clearly stated in Sturlaugson [0033]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module to produce a performance result for each machine learning model 32.” and Sturlaugson [0040]: “Experiment module 30 is configured to train each of the machine learning models 32 using supervised learning to produce a trained model for each machine learning model.”. The selection of the machine learning models is further explained and taught in Sturlaugson [0034]: “The selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or set of one or more associated parameters to test. The experiment module 30 may apply these range(s) and/or set(s) to identify a group of machine learning models 32. … As an example, the selection of machine learning models 32 may identify an artificial neural network as (one of) the machine learning algorithm(s) and associated parameters as 10-20 nodes and a learning rate decay of 0 or 0.01. The experiment module 30 may interpret this selection as at least four machine learning models: an artificial neural network with 10 nodes and a learning rate decay of 0, an artificial neural network with 10 nodes and a learning rate decay of 0.01, an artificial neural network with 20 nodes and a learning rate decay of 0, and an artificial neural network with 20 nodes and a learning rate decay of 0.01.”. 
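For illustration only, the following Python sketch reproduces the arithmetic of the example quoted from Sturlaugson [0034], in which two node counts and two learning rate decay values yield four model configurations. It is a minimal sketch under stated assumptions: the dictionary layout and the names selection and models are hypothetical, and reading “10-20 nodes” as the two endpoint values 10 and 20 follows the four models listed in the quoted passage rather than any stated rule.

```python
from itertools import product

# Parameter values from the example quoted from Sturlaugson [0034].
selection = {
    "nodes": [10, 20],
    "learning_rate_decay": [0, 0.01],
}

# The Cartesian product of the value lists gives one configuration per model.
names = list(selection)
models = [dict(zip(names, combo)) for combo in product(*selection.values())]

for model in models:
    print(model)
```

Under these assumptions, the sketch enumerates the same four combinations that the quoted passage lists.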
Sturlaugson [0042] is cited to clarify the relevance of the performance result indicated in Sturlaugson [0033] in the context of training to generate a performance result, which Sturlaugson [0042] identifies as encompassing an execution speed (“The performance result for each machine learning model … may include an indicator, value, and/or result related to …. Additionally or alternatively, the indicator, value, and/or result may be related to … execution speed.”), which is interpreted as a measured run-time duration. Sturlaugson [0036] further indicates that this training is done using training datasets: “Experiment module 30 may be configured, optionally for each machine learning model 32 independently, to divide the dataset into a training dataset (a subset of the dataset) and an evaluation dataset (another subset of the dataset). … The experiment module 30 may be configured to train the machine learning model(s) 32 with the respective training dataset(s) (to produce a trained model) and to evaluate the machine learning model(s) 32 with the respective evaluation dataset(s).”. A person with ordinary skill in the art would read and understand that although the above cited text mentions performing evaluation (with evaluation subsets), the same cited text also indicates training using training subsets, using the same experiment module framework to perform both training and evaluation on selected machine learning models loaded with a combination of hyperparameter values representing a configuration. Furthermore, a person with ordinary skill in the art would understand that cross-validation is a technique that forms training and test/evaluation subsets of data, where the identified training subsets are used for training, which is cited in Sturlaugson [0057]: “Training and evaluating 106 may include validation and/or cross validation (multiple rounds of validation), … as discussed with respect to experiment module 30. Training and evaluating 106 may include repeatedly dividing 120 the dataset to perform multiple rounds of training 122 and evaluation 124 ...”. In light of the applicant’s specification, examiner has also found in applicant’s specification paragraph [0054] that applicant acknowledges the use of cross-validation in the context of training, in particular, to divide datasets into training subsets: “… In an embodiment, dataset 305 may be folded (e.g. for cross validation) or otherwise divided into training subsets, and each subset may or may not have its own landmark configurations. For example, even when landmark configurations (i.e., hyperparameter values) are constant, different datasets may yield different training durations for a same landmark configuration. Thus multiple training folds, each with its own meta-feature values and landmark times, may yield a rich corpus (i.e. training tuples 371-372) from which trainable regressor 380 may achieve high accuracy without overfitting.” (See also applicant’s specification [0087]: “Fifty-fold cross-validation separates datasets into training and test subsets.”), and therefore this use of cross-validation is a well-known and conventional technique to divide datasets into training sets for use in training a machine learning model.
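For illustration only, the following Python sketch shows the conventional k-fold form of cross-validation referred to above, in which a dataset is repeatedly divided into a training subset and an evaluation subset. It is a minimal sketch under stated assumptions: the function name k_fold_splits, the interleaved fold assignment, and the toy dataset of ten items with five folds are hypothetical and are not taken from Sturlaugson or from applicant’s specification.

```python
def k_fold_splits(items, k):
    """Yield (training_subset, evaluation_subset) pairs for k folds.

    Hypothetical helper: items are assigned to folds in interleaved order.
    """
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        training = [
            x for j, fold in enumerate(folds) if j != held_out for x in fold
        ]
        yield training, folds[held_out]


# Toy dataset (assumption): ten items divided into five folds.
splits = list(k_fold_splits(list(range(10)), 5))
for training, evaluation in splits:
    print(len(training), "training items,", len(evaluation), "evaluation items")
```

Under these assumptions, every round uses a training subset for training and a disjoint subset for evaluation, and each item is held out exactly once across the rounds.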
Regarding applicant’s Remarks on pp.16-17:
“F. Runtime is not training time 
The Office action (page 13) alleges "a performance result that is related to execution speed (interpreted as a measured run-time)". As explained earlier herein for Reif, run-time is not training time, especially because Sturlaugson's performance result is produced by validation, not training. Sturlaugson is mischaracterized.”
Examiner has considered this argument but finds the argument to be not persuasive. As indicated earlier, Sturlaugson produces a performance result during training (Sturlaugson [0033]), where in Sturlaugson [0042] a performance result includes execution time. Under its broadest reasonable interpretation, execution time and run-time are synonymous terms, and hence measuring execution time of a machine-learning model during training is interpreted as being functionally equivalent to measuring the run-time of a machine-learning model during training.
Regarding applicant’s Remarks on p.18:
“G. Experiment module does not predict and is not an ML model
The Office action (page 11) acknowledges, "Reif does not explicitly teach... predicting...duration needed to train the ML model based on: a proposed configuration of the ML model". However, Sturlaugson does not teach that "predicting...based on". 
The Office action (page 12) alleges "(...needed to train the ML model based on) The experiment module 30 ... Sturlaugson paragraphs [0033]-[0034]: Experiment module 30". However, experiment module 30 is not an ML model and does not predict, which is why Sturlaugson (0061) teaches, "Building 114 may be performed after comparing the machine learning models with the performance comparison statistics and selecting one or more of the machine learning models to deploy." Comparing and selecting are not predicting. Sturlaugson is mischaracterized.”
Examiner has considered this argument but finds the argument to be not persuasive. As indicated earlier, the experiment module taught in Sturlaugson [0033]-[0034] contains selected machine learning models loaded with combinations of hyperparameter values. As indicated in Sturlaugson [0034], these machine learning models are algorithms with a corresponding range and/or set of one or more associated parameters, with Sturlaugson [0034] further providing an example of an algorithm and its associated hyperparameter combination. As indicated earlier, the experiment module contains machine learning models, and the experiment module (along with its associated machine learning models) is used as the common framework to perform both training and evaluation phases on the machine learning models using training and evaluation subsets of data, where this prediction is taught in Sturlaugson [0056]: “Evaluating 124 includes evaluating each trained model with the corresponding evaluation dataset, e.g., as discussed with respect to experiment module 30. The trained model is applied to the evaluation dataset to produce a result (a prediction) for each of the input values of the evaluation dataset …”. While the remainder of Sturlaugson [0056] does discuss comparison of prediction results to an expected value, this comparison is done after a prediction value is generated using an evaluation dataset on a trained machine learning model, and does not impact the generation of the actual prediction value itself. 
A person having ordinary skill in the art would understand that the evaluation phase requires a trained machine learning model, and applying evaluation subsets of data to a trained machine learning model will produce a result, which is identified in Sturlaugson [0042] to include a predictive value: “The performance result for each machine learning model 32 and/or the individual evaluation results … may include … an indicator, a value, … a positive predictive value, … a negative predictive value …”

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 6-7, 9-11, 13, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson].













Regarding amended Claim 1, Reif teaches
(Currently Amended) A method comprising: 
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model (Examiner’s note: Under its broadest reasonable interpretation, a landmark configuration is identified as a set of known hyperparameters (and their corresponding values) for an associated machine learning algorithm used for prediction. Reif Table 1 shows a list of target classifiers and their associated hyperparameters (with corresponding range values for each hyperparameter). As indicated earlier, Reif Table 1 summarizes the classifiers and their corresponding optimized parameters/hyperparameters, with those parameters shown in terms of interval ranges and steps, where the numeric intervals denote the start and end values for each parameter/hyperparameter, while the steps (for those numeric intervals) denote the next numeric value within the interval range that is being identified. A person having ordinary skill in the art would be able to read and understand this table, with the given numeric interval range and defined steps for each hyperparameter, as a representation to express values for the corresponding hyperparameter that were optimized by a grid search for each respective classifier. Hence, as indicated earlier in the above response to applicant’s arguments, each classifier and its corresponding optimized combination of hyperparameters corresponds to a “… landmark configuration … that each contain a plurality of values for a plurality of hyperparameters of a machine learning model”, with the plurality of target classifiers and their associated hyperparameter values corresponding to “a plurality of landmark configurations” (Reif p.266 Section 5 Evaluation, 1st paragraph: “The used classifiers as well as their optimized parameters are listed in Table 1.”).): 
… measuring a duration of a plurality of durations spent training, based on a dataset, the ML model once (Examiner’s note: As indicated earlier in the above response to applicant’s arguments, Reif teaches measuring a single training run-time of each target classifier performing a grid search for each dataset of a plurality of datasets, where an associated meta-features set is extracted from each dataset, and a single measured training run-time is generated corresponding to “measuring a duration of a plurality of durations spent training, based on a dataset, the ML model once” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.”).); 
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on (Reif p.265 Figure 3: examiner’s note: Under its broadest reasonable interpretation, “predicting, by a trainable regressor, an inferred duration needed to train the ML model based on …” is interpreted as referencing both training and trained aspects, where the first part of the claim limitation of “predicting, by a trainable regressor, an inferred duration …” refers to using the trainable regressor (once trained) to make an inferred duration prediction, while the second part of the claim limitation of “… needed to train the ML model based on …” refers to the training data needed to train the trainable regressor. Referring to Reif Figure 3 Application section, Reif teaches a time prediction model performs a prediction of a time x, where the time prediction model is based on the regression learner that was trained in Reif Figure 3 Training section, such that the time prediction model represents a trainable regressor, corresponding to “predicting by a trainable regressor, an inferred duration …”. The training of the regression learner involves providing meta-feature data and associated measured training run-times as input training data into the regression learner, thus corresponding to “… needed to train the ML model based on …”, with the elements of the input training data indicated in further detail by the subsequent claim elements (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. … After the learning phase, the resulting model can be used to predict the run-time of an unknown dataset … The overall approach is illustrated in Figure 3.”).): 
… a plurality of values, based on the dataset, of a plurality of meta-features (Reif p.265 Figure 3: examiner’s note: Under its broadest reasonable interpretation, the values referenced in this claim limitation refer to the values found in the training dataset (as indicated by the preceding phrase “needed to train the ML model based on …”). Referring to Reif Figure 3 Training section, Reif teaches analyzing dataset instances to generate meta-features, where the meta-features represent properties of the respective dataset (Reif p.262 Section 2 Meta-Learning: “…meta-learning is based on features of datasets. These features are often called meta-features. They describe properties of a dataset … Simple meta-features use directly accessible properties like the number of samples, the number of attributes or the number of classes. More sophisticated features are statistical measures …”). A detailed list of meta-features (grouped by category) is provided in Reif Section 4.1 Traditional Meta-Features (corresponding to “a plurality of values, based on the dataset, of a plurality of meta-features”), with each of these groups of meta-features applied as part of the input training data to train a regression learner (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).), and 
said plurality of durations (Reif p.265 Figure 3: examiner’s note: Under its broadest reasonable interpretation, the plurality of durations referenced in this claim limitation refers to the plurality of durations found in the training dataset (as indicated by the preceding phrase “needed to train the ML model based on …”). Referring to Reif Figure 3 Training section, Reif teaches measuring the run-time of a target classifier that is learning optimized hyperparameters for each dataset (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).) and 
the values of the plurality of landmark configurations (Reif p.265 Figure 3: examiner’s note: Under its broadest reasonable interpretation, the values referenced in this claim limitation refer to the values found in the training dataset (as indicated by the preceding phrase “needed to train the ML model based on …”). Referring to Reif Figure 3 Training section, Reif teaches measuring the run-times to train a particular classifier to learn optimized hyperparameters (and their associated values) for each dataset and associated meta-features, where the hyperparameter values defined based on the interval ranges shown in Reif Table 1 correspond to “the values of the plurality of landmark configurations” (Reif p.263 Section 3 Run-Time of a Grid Search, 1st paragraph: “Since the performance of most classifiers depends on parameter values, the parameters are usually optimized. A simple and often used method for parameter optimization is grid search. All predefined combinations of parameter values are evaluated to determine the best of them. … different parameter combinations require different amounts of time. The plot shows the run-time of training the Ripper classifier for different combinations of its two parameters sample ratio and pureness.” and Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).).  
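For illustration only, the training and application phases described in the Reif passages quoted above (meta-features of known datasets paired with measured run-times, then used to predict the run-time for an unknown dataset) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Reif’s implementation: the quoted passages do not identify the regression learner, so a 1-nearest-neighbour regressor is substituted here, and the names `KNOWN_DATASETS`, `log_scale`, and `predict_runtime`, as well as all numeric values, are hypothetical.

```python
import math

# Hypothetical training set: one entry per known dataset.
# Each entry pairs simple meta-features (samples, attributes, classes)
# with a measured run-time in seconds. All numbers are invented.
KNOWN_DATASETS = [
    ((150, 4, 3), 2.0),
    ((1000, 20, 2), 35.0),
    ((5000, 10, 5), 180.0),
    ((20000, 50, 2), 2400.0),
]


def log_scale(meta):
    """Log-transform the counts so that large values do not dominate the distance."""
    return tuple(math.log10(value) for value in meta)


def predict_runtime(meta, training=KNOWN_DATASETS):
    """Assumed stand-in regressor: return the run-time of the nearest known dataset."""
    target = log_scale(meta)
    nearest = min(training, key=lambda row: math.dist(log_scale(row[0]), target))
    return nearest[1]


# Application phase: predict the run-time for an unseen dataset.
print(predict_runtime((1200, 18, 2)))
```

The run-time is the target variable and the meta-features are the inputs, matching the roles described in the quoted Section 4 passage; any other regression learner could be substituted for the nearest-neighbour lookup.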
	While Reif indicates that target classifiers are being trained using a set of hyperparameters through a grid search (Reif p.265 Figure 3) in order to measure the respective run-times, Reif does not explicitly teach
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model: 
… configuring the ML model based on the landmark configuration, wherein each value of the plurality of values of the landmark configuration corresponds to a distinct hyperparameter of the plurality of hyperparameters of the ML model; …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on: … a proposed configuration of the ML model, …
	Sturlaugson teaches
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model: 
… configuring the ML model based on the landmark configuration, wherein each value of the plurality of values of the landmark configuration corresponds to a distinct hyperparameter of the plurality of hyperparameters of the ML model (Sturlaugson Figure 2, elements 30, 32: examiner’s note: As indicated earlier, Sturlaugson teaches a machine learning model 32 within an experiment module 30 in a machine learning system, where the machine learning model includes specific machine learning algorithms and a range and/or set of one or more associated parameters, where these parameters and their values correspond to a set of hyperparameters and their values, with each different type of machine learning algorithm providing different hyperparameters and associated values, each of which can be loaded onto a machine learning model (Sturlaugson [0022]; [0017]: “Generally, machine learning systems 10 are configured to calculate and/or to estimate the performance of one or more machine learning algorithms configured with one or more specific parameters (also referred to as hyper-parameters) with respect to a given set of data. The machine learning algorithm along with its associated specific parameter values form, at least in part, the machine learning model 32 …”; [0020]: “The machine learning models 32 include a machine learning algorithm and one or more associated parameter values for the machine learning algorithm.”; and [0033]-[0034]: “The selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or set of one or more associated parameters to test. The experiment module 30 may apply these range(s) and/or set(s) to identify a group of machine learning models 32. That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. 
As an example, the selection of machine learning models 32 may identify an artificial neural network as (one of) the machine learning algorithm(s) and associated parameters as 10-20 nodes and a learning rate decay of 0 or 0.01. The experiment module 30 may interpret this selection as at least four machine learning models: an artificial neural network with 10 nodes and a learning rate decay of 0, an artificial neural network with 10 nodes and a learning rate decay of 0.01, an artificial neural network with 20 nodes and a learning rate decay of 0, and an artificial neural network with 20 nodes and a learning rate decay of 0.01.”); …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on: … a proposed configuration of the ML model (Sturlaugson Figure 2, elements 30, 32: examiner’s note: Under its broadest reasonable interpretation, “a proposed configuration of the ML model” refers to a selected set of hyperparameters and its associated values being used under the training phase (as indicated by the phrase “needed to train the ML model based on …”). As indicated earlier, Sturlaugson teaches using the experiment module 30 framework to perform training of a selected machine learning model, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters, where this selection involves loading a set of associated parameters to a machine learning model, thus corresponding to “a proposed configuration of the ML model” (Sturlaugson [0033]-[0034]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … Experiment module 30 may be configured to automatically and/or autonomously design and carry out the specified experiments (also called trials) to test each of the machine learning models 32. … For example, the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection.”). 
As indicated earlier, Sturlaugson teaches that the performance result generated by the training includes a value or indicator that is related to execution speed (interpreted as a measured run-time) (Sturlaugson [0042]: “The performance result for each machine learning model 32 … may include an indicator, value, and/or result related to … an accuracy, … . Additionally or alternatively, the indicator, value, and/or result may be related to computational efficiency, memory required, and/or execution speed.”).), …
Both Reif and Sturlaugson are analogous art since they both teach training and evaluating machine learning algorithms using an associated set of hyperparameters.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ML model (i.e., a target classifier and a predetermined combination set of associated hyperparameters) identified in the training phase taught in Reif and perform the training steps for the ML model using the experiment module taught in Sturlaugson as a way to train and measure the associated run-time of the ML model. The motivation to combine is taught in Sturlaugson, since this method allows a large number of different combinations of machine learning algorithms and their associated hyperparameters to be trained and evaluated in an automated fashion, which provides a user the ability to tailor a machine learning model for various applications and datasets by identifying and selecting a best trained machine learning model based on a performance measurement (Sturlaugson [0003]-[0005]: “… a broad array of machine learning algorithms are available, with new algorithms the subject of active research. … The large number of machine learning options available to address a problem makes it difficult to choose the best option or even a well-performing option. The amount, type, and quality of data affect the accuracy and stability of training and the resultant trained models. Further, problem-specific considerations, such as tolerance of errors (e.g., false positives, false negatives), scalability, and execution speed, limit the acceptable choices. … Therefore, there exists a need for comparing machine learning models for applicability to various specific problems.”).  
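For illustration only, the expansion of a parameter selection into one machine learning model per unique combination, as described in the Sturlaugson [0034] example quoted above (10 or 20 nodes; learning rate decay of 0 or 0.01), can be sketched as follows. This is a minimal sketch and not Sturlaugson’s implementation: the variable names and dictionary keys are hypothetical, and only the two endpoint node counts named in the quoted example are enumerated.

```python
from itertools import product

# Parameter values taken from the quoted Sturlaugson [0034] example.
node_counts = [10, 20]
learning_rate_decays = [0, 0.01]

# One candidate configuration per unique combination of parameter values.
# The dictionary keys are hypothetical labels chosen for this sketch.
configurations = [
    {"algorithm": "artificial neural network", "nodes": nodes, "learning_rate_decay": decay}
    for nodes, decay in product(node_counts, learning_rate_decays)
]

for configuration in configurations:
    print(configuration)
```

Each resulting dictionary holds one value per distinct hyperparameter, which is the sense in which the quoted passage speaks of a model “for each unique combination of parameters”.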
Regarding amended Claim 6, Reif in view of Sturlaugson teaches
(Currently Amended) The method of Claim 1 wherein: 
the plurality of landmark configurations comprises a reference configuration (Reif p.267 Table 1: examiner’s note: Under its broadest reasonable interpretation, a “reference configuration” is interpreted as a landmark configuration in which the measured run-time was performed for the target classifier, serving as a baseline to perform future predictions. Reif Table 1 teaches a list of target classifiers and their associated hyperparameters (with corresponding range values for each hyperparameter), where each classifier is identified as a simple learner used to predict more sophisticated classifiers, and as such, each target classifier and its associated hyperparameter values represent a reference configuration (thus corresponding to “the plurality of landmark configurations comprises a reference configuration”) (Reif pp.265-266 Section 4.2 Time-Based Meta-Features, 1st paragraph: “Landmarking have been successfully used in the past for different meta-learning approaches [15][4][2][9]. The approach use performance values of simple classifier for predicting the performance of more sophisticated algorithms. Analogically, we use the run-time of the same simple learners for predicting the run-time of a sophisticated classifier. The used classifiers are Naïve Bayes, One-Nearest Neighbor, and Decision Stumps.” and Reif p.266 Section 5 Evaluation, 1st paragraph: “We evaluated the presented approach on real world datasets from the UCI machine learning repository [1] and StatLib[18]. The run-time of a grid search for five different classifiers are investigated. The used classifiers as well as their optimized parameters are listed in Table 1.”).); 
said duration spent training is a reference duration when said landmark configuration is said reference configuration (Reif p.268 Normalized Absolute Error, 1st paragraph: examiner’s note: Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause that effectively renders the subsequent claim language to not be performed because the condition precedent (“when said landmark configuration is said reference configuration”) is not required to be met. Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given for the subsequent claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled. 
Under its broadest reasonable interpretation, a “reference duration” is interpreted as the measured run-time for a landmark configuration for a target classifier serving as a baseline to perform future predictions. As identified by this claim limitation (“said duration spent training is a reference duration”), the measuring of the prediction run-time of each target classifier from Reif Table 1 corresponds to a “reference duration”, with each of the measured run-times performed during evaluation of these target classifiers representing a respective “reference configuration” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.” and Reif p.266 Section 5 Evaluation, 1st paragraph: “We evaluated the presented approach on real world datasets from the UCI machine learning repository [1] and StatLib[18]. The run-time of a grid search for five different classifiers are investigated. The used classifiers as well as their optimized parameters are listed in Table 1. ... we only considered datasets with a run-time of the grid search in a defined interval. … Additionally, datasets with a run-time of the algorithm greater than 24 hours have been neglected as well because of the computational effort.”).); 
a normalized duration, of a plurality of normalized durations, is based on said duration relative to the reference duration (Reif p.268 Table 3; Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: Under its broadest reasonable interpretation, the “said duration relative to the reference duration” is interpreted as a relationship between “said duration” (interpreted as an inferred duration from Claim 1) and a “reference duration” (interpreted as the landmark configuration for a target classifier serving as a baseline to perform future predictions). Reif teaches performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation. A normalized absolute error is calculated based on a prediction time $t_p$ (corresponding to an “inferred duration”) and a measured baseline run-time $t_b$ (corresponding to a “reference duration”), where the error calculation equation itself is in the form of a ratio of normalized durations (i.e., the dividend is based on the predicted run-time normalized to the measured run-time, while the divisor is based on the baseline run-time normalized to the measured run-time), with the calculated normalized absolute errors for each target classifier and associated meta-features shown in Reif Table 3 (corresponding to “a normalized duration, of a plurality of normalized durations…”) (Reif p.268 Normalized Absolute Error, 1st – 2nd paragraphs: “… the normalized absolute error was determined that serves as a comparison to a baseline. The absolute error of the prediction by the presented approach is divided by the absolute error of the prediction by a baseline method: $e = |t_m - t_p| / |t_m - t_b|$, where $t_m$ is the actual measured time, $t_p$ the predicted time of the presented approach, and $t_b$ the time predicted by the baseline method. For the baseline method, the predicted run-time is simply the average run-time of the classifier. Hence, the baseline method predicts the same run-time for every dataset. … Table 4 shows the normalized absolute errors.”).); 
said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations (Reif p.268 Table 3; Reif p.268 Normalized Absolute Error, 1st-2nd paragraphs: examiner’s note: As indicated earlier, Reif teaches performances of the different sets of target classifiers and their predicted run-times are evaluated using a normalized absolute error calculation (corresponding to “a normalized duration”). A normalized absolute error is calculated based on a prediction time $t_p$ (corresponding to an “inferred duration”) and a measured baseline time $t_b$ (corresponding to a “reference duration”), where the error calculation equation itself is in the form of a ratio of normalized durations (i.e., the dividend is based on the predicted run-time subtracted (i.e., normalized) from the measured run-time, while the divisor is based on the baseline run-time subtracted (i.e., normalized) from the measured run-time), with the calculated normalized absolute errors for each target classifier and associated meta-features shown in Reif Table 3 (corresponding to “said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations”).).  
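For illustration only, the normalized absolute error quoted above from Reif Section 5.2, e = |t_m - t_p| / |t_m - t_b|, with the baseline t_b taken as the average run-time of the classifier, can be computed as in the following minimal sketch. The function name and all run-time values are hypothetical and do not come from Reif.

```python
from statistics import mean


def normalized_absolute_error(t_m, t_p, t_b):
    """e = |t_m - t_p| / |t_m - t_b|; undefined when t_m equals t_b."""
    return abs(t_m - t_p) / abs(t_m - t_b)


# Hypothetical measured run-times (seconds) of one classifier on three datasets.
measured = [40.0, 100.0, 160.0]

# Per the quoted passage, the baseline predicts the average run-time for every dataset.
t_b = mean(measured)

# Hypothetical model prediction of 50 s for the dataset measured at 40 s.
e = normalized_absolute_error(t_m=40.0, t_p=50.0, t_b=t_b)
print(e)
```

A value of e below 1 indicates that the prediction is closer to the measured time than the baseline is; a value above 1 indicates the opposite.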
Regarding amended Claim 7, Reif in view of Sturlaugson teaches
(Currently Amended) The method of Claim 6 wherein said duration relative to the reference duration comprises: a percent deviation of said duration from the reference duration (Reif pp.268-269 Section 5.2 Normalized Absolute Error: examiner’s note: Under its broadest reasonable interpretation, a deviation defined by Merriam-Webster dictionary is a difference between a value in a distribution and a fixed number. In light of applicant’s specification [0078], the specification only indicates that the “normalized duration is a percent deviation of empirical time from reference time” and does not contain an explicit formula to calculate a percent deviation. As indicated earlier, Reif calculates a normalized absolute error, where the normalized absolute error is based on a difference between an actual measured time                         
                            
                                
                                    t
                                
                                
                                    m
                                
                            
                        
                     (where this measured time is a value within a range of possible time values and hence represents a distribution) and a measured baseline run-time                         
                            
                                
                                    t
                                
                                
                                    b
                                
                            
                        
(corresponding to a “reference duration”). Given this information, a person having ordinary skill in the art would be able to take these values (i.e., the difference between $t_b$ and $t_m$, i.e., $t_m - t_b$, and $t_b$ itself) and compute a percentage value by taking this computed difference and dividing by the measured baseline run-time $t_b$ to produce a percent deviation from this measured run-time $t_b$.).  
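For illustration only, the percent-deviation arithmetic described above (the difference between the measured run-time and the baseline run-time, divided by the baseline run-time) can be sketched as follows. The function name `percent_deviation` and the sample run-times are hypothetical and are not taken from Reif or from the application.

```python
def percent_deviation(t_m: float, t_b: float) -> float:
    """Percent deviation of a measured run-time t_m from a baseline run-time t_b."""
    if t_b <= 0:
        raise ValueError("baseline run-time t_b must be positive")
    # (t_m - t_b) is the computed difference; dividing by t_b normalizes it.
    return (t_m - t_b) / t_b * 100.0


# Hypothetical run-times in seconds: baseline 40 s, measured 50 s.
example = percent_deviation(50.0, 40.0)
```

A positive result indicates a run slower than the baseline, and a negative result indicates a faster one.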
Regarding amended Claim 9, Reif in view of Sturlaugson teaches
(Currently Amended) The method of Claim 1 wherein the plurality of landmark configurations comprises, for each numeric hyperparameter of said plurality of hyperparameters, at least one selected from the group consisting of: 
a landmark configuration having said plurality of values that contains a minimum value for said hyperparameter (Reif p.267 Table 1: examiner’s note: As indicated earlier, Reif Table 1 teaches a set of target classifiers with their associated hyperparameters and interval values (corresponding to “the plurality of landmark configurations”). For example, for a k-NN classifier, the k parameter (corresponding to “numeric parameter of said plurality of hyperparameters”) has an interval range of [1, 1000] with logarithmic scale containing 100 steps, where the values 1 and 1000 are expressed as minimum and maximum values respectively (thus corresponding to “for each numeric hyperparameter of said plurality of hyperparameters: a landmark configuration having said plurality of values that contains a minimum value for said hyperparameter”). Similarly, for the Ripper classifier, the ‘pureness’ parameter has an interval 

[Image media_image3.png: Reif p.267 Table 1]

), 
a landmark configuration having said plurality of values that contains a maximum value for said hyperparameter, and
a landmark configuration having said plurality of values that contains a value for said hyperparameter that is halfway between two of: said minimum value, said maximum value, and a default value.  
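For illustration only, the three kinds of landmark values recited in Claim 9 (a minimum, a maximum, and a value halfway between two of the minimum, maximum, and default) can be sketched as follows. The function name `landmark_values` and the default value are hypothetical; only the interval [1, 1000] for the k parameter comes from the Reif Table 1 mapping above, and this sketch is not Reif's method.

```python
def landmark_values(minimum, maximum, default=None):
    """Return candidate landmark values for one numeric hyperparameter."""
    # Minimum, maximum, and the value halfway between them.
    values = [minimum, maximum, (minimum + maximum) / 2]
    if default is not None:
        # Values halfway between the default and each end of the interval.
        values.append((minimum + default) / 2)
        values.append((default + maximum) / 2)
    return values


# Interval [1, 1000] for the k-NN parameter k, as cited from Reif Table 1.
k_landmarks = landmark_values(1, 1000)
```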
Regarding original Claim 10, Reif in view of Sturlaugson teaches
(Original) The method of Claim 1 wherein the plurality of landmark configurations comprises, 
for each hyperparameter of said plurality of hyperparameters that is categorical, landmark configurations that each have said plurality of values that contains a distinct value for said hyperparameter (Reif p.267 Table 1: examiner’s note: As indicated earlier, Reif Table 1 (displayed at the end of Claim 9) teaches hyperparameters and their associated interval values for a set of target classifiers, where a plurality of predetermined combinations of hyperparameters and their values can be generated (corresponding to “the plurality of landmark configurations”). The interval ranges indicate either numerical or non-numerical values (where the hyperparameters with non-numerical ranges correspond to “… hyperparameter of said plurality of hyperparameters that is categorical …”). For example, a k-NN classifier with the non-numeric hyperparameter ‘weighted vote’ can take possible values {yes, no}, thus ).  
Regarding amended Claim 11, Reif in view of Sturlaugson teaches
(Currently Amended) The method of Claim 1 wherein said plurality of meta-features comprises at least one selected from the group consisting of:  
a) a value, of a feature of said dataset, that is at least one selected from the group consisting of: a minimum, a maximum, a mean, and a quantile (Reif p.265 Section 4.1 Traditional Meta-Features: examiner’s note: Reif teaches model-based meta-features as part of its list of meta-features, where these model-based meta-features for a decision tree include minimum, maximum, mean values for the following meta-features: length of a branch, number of nodes in a level, number of occurrences of attributes in a split.), 
b) a majority count of examples within said dataset having a majority label of a plurality of labels of said dataset, 
c) a minority count of examples within said dataset having a minority label of a plurality of labels of said dataset, and
d) a ratio of two selected from the group consisting of: a total count of examples within said dataset, said majority count, and said minority count.  
Regarding amended Claim 13, Reif teaches
(Currently Amended) One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors (Examiner’s note: Reif teaches performing the training and evaluation of the machine learning models using a computer (containing memory storage) identified by its processor type (AMD Opteron) running a single-threaded program within an open-source package (RapidMiner), where both the program and the open-source package are understood to be stored in memory or on a computer-readable medium located on the computer (Reif p.266 Section 5 Evaluation: “The complete evaluation was done using RapidMiner [13]. It is an open source data mining and pattern recognition framework implemented in Java. All times have been measured on an AMD Opteron using a single-threaded program.”).), cause: 
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.): 
… measuring a duration of a plurality of durations spent training, based on a dataset, the ML model once (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.); 
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.): 
… a plurality of values, based on the dataset, of a plurality of meta-features (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.), and 
said plurality of durations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) and 
the values of the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.).  
	While Reif indicates that target classifiers are being trained using a set of hyperparameters through a grid search (Reif p.265 Figure 3) in order to measure the respective run-times, Reif does not explicitly teach
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model:
… configuring the ML model based on the landmark configuration, wherein each value of the plurality of values of the landmark configuration corresponds to a distinct hyperparameter of the plurality of hyperparameters of the ML model; …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on:
a proposed configuration of the ML model, …
Sturlaugson teaches
for each landmark configuration of a plurality of landmark configurations that each contain a plurality of values for a plurality of hyperparameters of a machine learning (ML) model:
… configuring the ML model based on the landmark configuration, wherein each value of the plurality of values of the landmark configuration corresponds to a distinct hyperparameter of the plurality of hyperparameters of the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) …
predicting, by a trainable regressor, an inferred duration needed to train the ML model based on:
a proposed configuration of the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 1, and hence is rejected under similar rationale.) …
Both Reif and Sturlaugson are analogous art since they both teach training and evaluating machine learning algorithms using an associated set of hyperparameters.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ML model (i.e., a target classifier and a predetermined combination set of associated hyperparameters) identified in the training phase taught in Reif and perform the training steps for the ML model using the experiment module taught in Sturlaugson as a way to train and measure the associated run-time of the ML model. The motivation to combine is taught in Sturlaugson, as provided in the prior art claim mapping of Claim 1 recited above.  
Regarding amended Claim 18, Reif in view of Sturlaugson teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 13 wherein: 
the plurality of landmark configurations comprises a reference configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
said duration spent training is a reference duration when said landmark configuration is said reference configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
a normalized duration, of a plurality of normalized durations, is based on said duration relative to the reference duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.); 
said predicting based on the plurality of durations comprises predicting based on the plurality of normalized durations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 6, and hence is rejected under similar rationale.).  
Regarding amended Claim 19, Reif in view of Sturlaugson teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 18 wherein said duration relative to the reference duration comprises: a percent deviation of said duration from the reference duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 7, and hence is rejected under similar rationale.).  
Claims 2-4 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson] as applied to Claims 1 and 13; in further view of Hutter et al., Algorithm runtime prediction: Methods & evaluation, Artificial Intelligence 206 (2014), Elsevier B.V. 2013, pp.79-111 [hereafter referred to as Hutter].
Regarding original Claim 2, Reif in view of Sturlaugson as applied to Claim 1 teaches
The method of Claim 1 wherein: 
said dataset is a first dataset (Examiner’s note: Reif teaches measuring the prediction run-time of each target classifier using predefined combination sets of hyperparameter values (Reif p.263 Section 3 Run-time of a Grid Search) and associated meta-features of a known dataset, where the known dataset (from Claim 1) corresponds to “said dataset” and is assigned the role of “a first dataset” (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable.”).); 
a plurality of exploratory configurations is larger than the plurality of landmark configurations ([Reif p.263 Section 3 Run-time of a Grid Search: examiner’s note: Reif teaches a combination set of hyperparameter values derived from the set of target classifiers and their respective interval ranges (shown in Reif Table 1) are applied for each machine learning model, corresponding to “a plurality of exploratory configurations” (Reif p.263 Section 3 Run-time of a Grid Search, 1st paragraph: “Since the performance of most classifiers depends on parameter values, the parameters are usually optimized. … All predefined combinations of parameter values are evaluated to determine the best of them. … different parameter combinations require different amounts of time. The plot shows the run-time of training the Ripper classifier for different combinations of its two parameters sample ratio and pureness.”).] [Sturlaugson Figure 2, elements 30, 32: examiner’s note: Sturlaugson teaches the experiment module 30 generating multiple combinations of hyperparameters for each machine learning algorithm based on the original set of associated hyperparameters, where the original set of associated hyperparameters represent “a plurality of landmark configurations”, and each generated combination of hyperparameters represent “a plurality of exploratory configurations”. Given that multiple combinations of hyperparameters are generated from an original set of associated hyperparameters with interval ranges, this satisfies the condition of “a plurality of exploratory configurations is larger than the plurality of landmark configurations” (Sturlaugson [0034]: “… the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. The experiment module 30 may apply these range(s) and/or set(s) to identify a group of machine learning models 32. 
That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … the selection of machine learning models 32 may identify an artificial neural network as (one of) the machine learning algorithm(s) and associated parameters as 10-20 nodes and a learning rate decay of 0 or 0.01. The experiment module 30 may interpret this selection as at least four machine learning models: an artificial neural network with 10 nodes and a learning rate decay of 0, an artificial neural network with 10 nodes and a learning rate decay of 0.01, an artificial neural network with 20 nodes and a learning rate decay of 0, and an artificial neural network with 20 nodes and a learning rate decay of 0.01.”).]); 
the method further comprises: 
for each exploratory configuration of the plurality of exploratory configurations that each contain a plurality of values for said plurality of hyperparameters (Sturlaugson Figure 2, elements 30, 32: examiner’s note: As indicated earlier, Sturlaugson teaches using the experiment module 30 to perform training for selected machine learning models, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters, whereby the process of selecting the set of possible associated parameters to carry out the “experiments” or “trials” to test each possible machine learning model correspond to “an exploratory configuration that each contain a plurality of values for said plurality of hyperparameters” (Sturlaugson [0033]-[0034]).): 
configuring the ML model based on the exploratory configuration (Sturlaugson Figure 2, elements 30, 32: examiner’s note: As indicated earlier, Sturlaugson teaches using the experiment module 30 framework to perform training for selected machine learning models, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters (corresponding to “an exploratory configuration”), whereby selecting a particular exploratory configuration and its associated machine learning algorithm to test at each particular trial involves “configuring the ML model based on the exploratory configuration” (Sturlaugson [0033]-[0034]).); 
measuring a second duration spent training, based on a second dataset, the ML model (Sturlaugson Figure 2, elements 30, 32: examiner’s note: Sturlaugson teaches the experiment module 30 framework to perform the experiments/trials to train each machine learning model 32, and selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters (corresponding to “an exploratory configuration”), where the experiment module further subdivides a dataset into training datasets and evaluation datasets, where the training datasets (corresponding to “a second dataset”) are used to further train the selected machine learning algorithm and its associated exploratory configuration (Sturlaugson [0034]-[0036]: “Experiment module 30 of the machine learning system 10 is configured to test (e.g., to train and evaluate) each of the machine learning models 32 of the selection of machine learning models 32 provided by the data input module 20 to produce a performance result for each machine learning model 32. … That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … Experiment module 30 may be configured, optionally for each machine learning model 32 independently, to divide the dataset into a training dataset (a subset of the dataset) and an evaluation dataset (another subset of the dataset). The same training dataset … may be used for one or more, optionally all, of the machine learning models 32. … The experiment module 30 may be configured to train the machine learning model(s) 32 with the respective training dataset(s) (to produce a trained model) … ”). 
The results of the training produces a performance result that is related to execution speed (which is interpreted as a measured run-time, and corresponds to “measuring a second duration spent training, based on a second dataset, the ML model”) (Sturlaugson [0042]: “The performance result for each machine learning model 32 … may include an indicator, value, and/or result related to …  an accuracy,… . Additionally or alternatively, the indicator, value, and/or result may be related to computational efficiency, memory required, and/or execution speed.”).); and 
generating, within a plurality of training tuples, a training tuple (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, Reif teaches a set of measured run-times and associated meta-features corresponding to known datasets are used to form a set of input training data (corresponding to “generating, within a plurality of training tuples, a training tuple”) to a regression learner.) based on: 
the second duration (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, Reif teaches measuring the run-time of a target classifier that is learning optimized hyperparameters for each dataset and associated meta-features, and using the associated measured run-times (corresponding to “said plurality of durations”) as part of the input training data (corresponding to “a training tuple”) to train a regression learner (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).), …
… a plurality of values, based on the second dataset, of said plurality of meta-features (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, Reif teaches analyzing dataset instances to generate meta-features, where the meta-features are based on dataset features, where this method of analyzing dataset instances to generate meta-features is independent of any dataset (and hence can apply to a “first dataset”, a “second dataset”, etc.) (Reif p.262 Section 2 Meta-Learning: “…meta-learning is based on features of datasets. These features are often called meta-features. They describe properties of a dataset … Simple meta-features use directly accessible properties like the number of samples, the number of attributes or the number of classes. More sophisticated features are statistical measures …”). A list of meta-features (grouped by category) is listed in Reif Section 4.1 Traditional Meta-Features (corresponding to “a plurality of values, based on a second dataset, of a plurality of meta-features”), with the application of each of these groups of meta-features as part of the input training data (corresponding to “a training tuple”) to train a regression learner associated with a particular classifier (Reif p.264 Section 4 Methodology, 1st paragraph: “For each target classifier, whose run-time should be predicted, a separate regression model is trained. The training data for the learning scheme consists of the knowledge about known datasets. Each instance of the training set describes one dataset. It contains the meta-features of the dataset and the measured run-time of the considered target classifier. The run-time is used as the target variable. … The overall approach is illustrated in Figure 3.”).); 
training the trainable regressor based on the plurality of training tuples (Reif p.265 Figure 3: examiner’s note: Referring to Reif Figure 3 Training section, Reif teaches a set of measured run-times and associated meta-features corresponding to known datasets are used to form a set of input training data to a regression learner (corresponding to “training the trainable regressor based on the plurality of training tuples”).).  
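For illustration only, the expansion of a parameter selection into one model per unique combination, as quoted from Sturlaugson [0034] above, can be sketched as follows. The function name `expand_grid` and the dictionary keys are hypothetical, and reducing the range “10-20 nodes” to its two endpoints is an assumption that follows the four-model example in the quoted passage.

```python
from itertools import product


def expand_grid(parameter_values):
    """Return one configuration (a dict) per unique combination of values."""
    names = list(parameter_values)
    combos = product(*(parameter_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Endpoints of the quoted ranges: 10 or 20 nodes, decay of 0 or 0.01.
selection = {"nodes": [10, 20], "learning_rate_decay": [0.0, 0.01]}
models = expand_grid(selection)
```

With two values for each of the two parameters, the expansion yields four configurations, matching the count given in the quoted passage.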
While Reif in view of Sturlaugson teaches generating a plurality of training tuples containing measured run-times and a plurality of meta-features associated with known datasets as input training data into a regression learner, Reif in view of Sturlaugson does not explicitly teach
the method further comprises: …
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration, …
Hutter teaches
the method further comprises: …
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration (Examiner’s note: Hutter teaches training an empirical performance model representing an EPM regression model (Hutter p.79 Section 1 Introduction, 1st paragraph: “… a considerable body of work has shown how to use supervised machine learning methods to build regression models … we refer to such models as empirical performance models (EPMs).” and Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “… EPMs can predict any type of performance measure that can be evaluated in single algorithm runs, such as runtime, …”) by constructing input training data with parameter configurations $\theta_i$ (Hutter p.81 Section 2.2 Related Work on Predicting Runtime of Parameterized Algorithms, 1st paragraph: “… parameters can be treated as additional inputs to the model … and a model can be learned in the standard way.” and Hutter p.82 Section 3.1 Preliminaries, 1st paragraph: “We define the configuration space of a parameterized algorithm with k parameters $\theta_1, \ldots, \theta_k$”), a set of feature vectors $z_i$ representing problem-specific instance features (“meta-features”), and performance values $y_i$ representing a measured run-time (Hutter p.82 Section 3.1 Preliminaries, 3rd paragraph: “…we focus on runtime as a performance measure…”), with this input training data for the EPM (including the parameter configurations) corresponding to “generating, within a plurality of training tuples, a training tuple based on: … the plurality of values of the exploratory configuration, …” (Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “To construct an EPM for an algorithm 𝒜 with configuration space $\Theta$ on an instance set $\Pi$, we run 𝒜 on various combinations of configurations $\theta_i \in \Theta$ and instances $\pi_i \in \Pi$, and record the resulting performance values $y_i$. We record the k-dimensional parameter configuration $\theta_i$ and the m-dimensional feature vector $z_i$ of the instance used in the i-th run, and combine them to form a p = k + m-dimensional vector of predictor variables $x_i = [\theta_i^T, z_i^T]^T$. The training data for our regression models is then simply $\{(x_1, y_1), \ldots, (x_n, y_n)\}$.”).), …
Both Reif in view of Sturlaugson and Hutter are analogous art since they both teach using regression algorithms to predict run-time for a machine learning model based on a set of hyperparameters, measured run-times, and associated meta-features corresponding to known datasets.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the training tuple for the regression learner taught in Reif in view of Sturlaugson and enhance it to include the hyperparameter values associated with a machine learning model as taught in Hutter as a way to perform run-time predictions for a machine learning model. The motivation to combine is taught in Hutter, since performing predictions on a model trained over its entire distribution (including its meta-features and hyperparameter values) allows a more accurate assessment of a model’s confidence at a particular input, where a selected model with a higher confidence results in the model having a better than average expectation of making more accurate predictions (Hutter p.82 Section 3.1 Preliminaries, 2nd paragraph: “Given an algorithm 𝒜 with configuration space Θ and a distribution of instances with feature space ℱ, an EPM is a stochastic process f: ℐ→∆(ℝ) that defines a probability distribution over performance measures for each combination of a parameter configuration θ∈Θ of 𝒜 and a problem instance with features z∈ ℱ. The prediction of an entire distribution allows us to assess the model’s confidence at a particular input, which is essential, e.g., in model-based algorithm configuration [7,6,58,55].”).
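For illustration only, the construction of training data described in the Hutter passage quoted above (concatenating each k-dimensional parameter configuration with the m-dimensional feature vector of the instance, and pairing the result with the measured performance value) can be sketched as follows. The function name `build_training_tuples` and all numeric values are hypothetical and are not taken from Hutter, Reif, or Sturlaugson.

```python
def build_training_tuples(runs):
    """Turn (theta, z, y) runs into (x, y) pairs, where x = theta followed by z."""
    tuples = []
    for theta, z, y in runs:
        # Predictor vector of dimension p = k + m.
        x = list(theta) + list(z)
        tuples.append((x, y))
    return tuples


# Hypothetical runs: theta = hyperparameter values, z = dataset meta-features,
# y = measured run-time in seconds.
runs = [
    ([10, 0.0], [1000, 20], 3.2),
    ([20, 0.01], [1000, 20], 5.7),
]
training_data = build_training_tuples(runs)
```

Each resulting pair can then serve as one training tuple for a regressor that predicts run-time from both the configuration and the meta-features.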
Regarding amended Claim 3, Reif in view of Sturlaugson, in further view of Hutter teaches
(Currently Amended) The method of Claim 2 wherein one selected from the group consisting of:
a) the second dataset is the first dataset, 
b) the second dataset is larger than the first dataset, 
c) said proposed configuration is contained in at least one selected from the group consisting of: said plurality of exploratory configurations, and said plurality of landmark configurations (Sturlaugson Figure 2, elements 30, 32: examiner’s note: As indicated earlier, Sturlaugson teaches using the experiment module 30 to perform training for selected machine learning models, by selecting the specific machine learning algorithm and a range and/or set of one or more associated parameters. The selected set of possible associated parameters is used to carry out the “experiments” or “trials” that test each possible machine learning model, where each possible combination represents an exploratory configuration, and selection of one of the possible combinations represents a proposed configuration (Sturlaugson [0033] and [0034]: “… the selection of machine learning models 32 received by the data input module 20 may include specific machine learning algorithms and a range and/or a set of one or more associated parameters to test. The experiment module 30 may apply these range(s) and/or set(s) to identify a group of machine learning models 32. That is, the experiment module 30 may generate a machine learning model 32 for each unique combination of parameters specified by the selection. … the selection of machine learning models 32 may identify an artificial neural network as (one of) the machine learning algorithm(s) and associated parameters as 10-20 nodes and a learning rate decay of 0 or 0.01. The experiment module 30 may interpret this selection as at least four machine learning models: an artificial neural network with 10 nodes and a learning rate decay of 0, an artificial neural network with 10 nodes and a learning rate decay of 0.01, an artificial neural network with 20 nodes and a learning rate decay of 0, and an artificial neural network with 20 nodes and a learning rate decay of 0.01.”).).  
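For illustration only, the enumeration described in Sturlaugson [0034] (one model per unique combination of parameter values) can be sketched as a Cartesian product. The function name and dictionary keys below are hypothetical and do not appear in Sturlaugson; only the parameter values (10 or 20 nodes, learning rate decay of 0 or 0.01) are taken from the quoted passage.

```python
from itertools import product


def enumerate_configurations(parameter_grid):
    """Return one dict per unique combination of the listed parameter values."""
    names = list(parameter_grid)
    value_lists = [parameter_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical key names; the values follow the example quoted from Sturlaugson [0034].
grid = {"nodes": [10, 20], "learning_rate_decay": [0, 0.01]}
configurations = enumerate_configurations(grid)

for configuration in configurations:
    print(configuration)
```

Under this reading, each printed dictionary plays the role of one exploratory configuration, and picking any single dictionary from the list corresponds to a proposed configuration contained in that plurality.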
Regarding amended Claim 4, Reif in view of Sturlaugson, in further view of Hutter teaches
 The method of Claim 2 wherein said training the trainable regressor comprises measuring accuracy of the trainable regressor based on at least one selected from the group consisting of: 
mean-squared error (MSE) (Hutter pp.93-94 Section 6.2 Experimental Setup: examiner’s note: Hutter teaches using quantitative metrics to assess a model’s performance on data that has not been used to train the models, where these metrics include using the root mean squared error (RMSE) to evaluate the mean predictions and predictive variances against the true performance values.), 
coefficient of determination (R²), and 
Spearman rank correlation.  
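For illustration only, the three accuracy measures recited in the claim can be sketched in plain Python as follows. This is a minimal sketch using the standard textbook definitions, not code from Hutter or from the application; all function names and the sample run-times are hypothetical. Note that Hutter reports the root mean squared error (RMSE), which is the square root of the MSE computed here.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between measured and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(actual, predicted):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical measured and predicted training durations (seconds).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 4.0]
print(mean_squared_error(measured, predicted), r_squared(measured, predicted))
```

MSE and R² penalize the size of each prediction error, while the Spearman measure only checks whether the predicted durations are ordered the same way as the measured ones.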
Regarding original Claim 14, Reif in view of Sturlaugson as applied to Claim 13 teaches
(Original) The one or more non-transitory computer-readable media of Claim 13 wherein: 
said dataset is a first dataset (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
a plurality of exploratory configurations is larger than the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
the instructions further cause: 
for each exploratory configuration of the plurality of exploratory configurations that each contain a plurality of values for said plurality of hyperparameters (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.): 
configuring the ML model based on the exploratory configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
measuring a second duration spent training, based on a second dataset, the ML model (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); and 
generating, within a plurality of training tuples, a training tuple (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.) based on: 
the second duration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.), … 
… a plurality of values, based on the second dataset, of said plurality of meta-features (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.); 
training the trainable regressor based on the plurality of training tuples (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.).  
While Reif in view of Sturlaugson teaches generating a plurality of training tuples containing measured run-times and a plurality of meta-features associated with known datasets as input training data into a regression learner, Reif in view of Sturlaugson does not explicitly teach
the instructions further cause: 
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration, …
Hutter teaches
the instructions further cause: 
… generating, within a plurality of training tuples, a training tuple based on: 
… the plurality of values of the exploratory configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 2, and hence is rejected under similar rationale.), …
Both Reif in view of Sturlaugson and Hutter are analogous art since they both teach using regression algorithms to predict run-time for a machine learning model based on a set of hyperparameters, measured run-times, and associated meta-features corresponding to known datasets.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the training tuple for the regression learner taught in Reif in view of Sturlaugson and enhance it to include the hyperparameter values associated with a machine learning model taught in Hutter as a way to perform run-time predictions for a machine learning model. The motivation to combine is taught in Hutter, as provided in the prior art claim mapping of Claim 2 recited above.
Regarding amended Claim 15, Reif in view of Sturlaugson, in further view of Hutter teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 14 wherein one selected from the group consisting of: 
a) the second dataset is the first dataset, 
b) the second dataset is larger than the first dataset, and
c) said proposed configuration is contained in at least one selected from the group consisting of: said plurality of exploratory configurations, and said plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 3, and hence is rejected under similar rationale.).  
Regarding amended Claim 16, Reif in view of Sturlaugson, in further view of Hutter teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 14 wherein said training the trainable regressor comprises measuring accuracy of the trainable regressor based on at least one selected from the group consisting of: 
mean-squared error (MSE) (This claim limitation is similar in scope to a corresponding claim limitation in Claim 4, and hence is rejected under similar rationale.), 
coefficient of determination (R²), and
Spearman rank correlation.  
Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson], in further view of Hutter et al., Algorithm runtime prediction: Methods & evaluation, Artificial Intelligence 206 (2014), Elsevier B.V. 2013, pp.79-111 [hereafter referred to as Hutter].
Regarding amended Claim 5, Reif in view of Sturlaugson, in further view of Hutter as applied to Claim 2 teaches
(Currently Amended) The method of Claim 2, wherein said training the trainable regressor comprises: 
… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor ([Sturlaugson Figure 2, elements 32, 36, 38: examiner’s note: Sturlaugson teaches a machine learning model 32 to include a macro-procedure which is an ensemble of micro-procedures (where each micro-procedure is a trainable machine learning model), and the macro-procedure can include a machine learning algorithm (corresponding to the “trainable regressor”) and associated parameter values that are independent from those used in each micro-procedure, thus corresponding to “… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor” (Sturlaugson [0023]: “Machine learning model 32 may be a macro-procedure 36 that combines the outcomes of an ensemble of micro-procedures 38. Each micro-procedure 38 includes a machine learning algorithm and its associated parameter values. Optionally, each micro-procedure 38 includes a different combination of machine learning algorithm and associated parameter values.” and Sturlaugson [0026]: “Macro-procedures 36 may include a machine learning algorithm and associated parameter values that are independent and/or distinct from the micro-procedures 38.”).]):
training the trainable regressor with a subset of the second dataset and the hyperparameter configuration (Sturlaugson Figure 2, elements 30, 32, 36, 38: examiner’s note: Sturlaugson teaches that each macro-procedure (which is a machine learning model, Sturlaugson [0023]-[0026]) with its independent set of hyperparameter values (corresponding to the “trainable regressor”) uses the same training datasets used in training each micro-procedure (which are also machine learning models with different combinations of machine learning algorithms and associated parameter values) (Sturlaugson [0040]: “Experiment module 30 is configured to train each of the machine learning models 32 using supervised learning to produce a trained model for each machine learning model. … For machine learning models 32 which are macro-procedures 36, the experiment module 30 may be configured to generate a trained macro-procedure by independently training each micro-procedure 38 of the macro-procedure 36 to produce an ensemble of trained micro-procedures and, if the macro procedure 36 itself includes a machine learning algorithm, training the macro-procedure 36 with the ensemble of trained micro-procedures 38.”), and these training datasets can be further subdivided into multiple subsets of data through cross-validation (thus corresponding to “training the trainable regressor with a subset of the second dataset and the hyperparameter configuration”) (Sturlaugson [0041]: “… Cross validation is a process in which the original dataset is divided multiple times (to form multiple training datasets and corresponding evaluation datasets), the machine learning model 32 is trained and evaluated with each division (each training dataset and corresponding evaluation dataset) to produce an evaluation result for each division …”).); and 
measuring accuracy of the trainable regressor based on said hyperparameter configuration (Sturlaugson [0042]: examiner’s note: As indicated earlier, Sturlaugson teaches the performance result produced by training a machine learning model contains a value or indicator related to an accuracy. (“The performance result for each machine learning model 32 … may include an indicator, value, and/or result related to … an accuracy …”).); 
training the trainable regressor with the second dataset (Sturlaugson Figure 2, elements 30, 32, 36, 38: examiner’s note: As indicated earlier, Sturlaugson teaches that each macro-procedure with its independent set of hyperparameter values (corresponding to the “trainable regressor”) uses the same training datasets used in training each micro-procedure, thus corresponding to “training the trainable regressor with the second dataset…” (Sturlaugson [0040]: “Experiment module 30 is configured to train each of the machine learning models 32 using supervised learning to produce a trained model for each machine learning model. … For machine learning models 32 which are macro-procedures 36, the experiment module 30 may be configured to generate a trained macro-procedure by independently training each micro-procedure 38 of the macro-procedure 36 to produce an ensemble of trained micro-procedures and, if the macro procedure 36 itself includes a machine learning algorithm, training the macro-procedure 36 with the ensemble of trained micro-procedures 38.”).) …
While Reif in view of Sturlaugson, in further view of Hutter teaches optimizing hyperparameter values for the regression learner through a grid search, which requires searching all possibilities of a hyperparameter space for the best set of hyperparameters (corresponding to “for each hyperparameter configuration … of the trainable regressor”; Reif p.267 2nd paragraph: “The presented approach was evaluated by a leave-one-out cross-validation for every algorithm. We used the regression variant of a Support Vector Machine, the ϵ-SVR, as meta-learning scheme. The parameters γ and C of the ϵ-SVR have been optimized by a grid search. LibSVM[7] was used as implementation.”), Reif in view of Sturlaugson, in further view of Hutter does not explicitly teach
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring accuracy.  
Kobayashi teaches
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring accuracy (Kobayashi Figure 18, elements 137, 138; Figure 19, steps S71..S72..S79, S80, S81: examiner’s note: Kobayashi teaches a machine learning service performing searching for an optimized set of hyperparameters during training of a machine learning model (where the model can be a regression model such as a random forest, thus corresponding to “a trainable regressor”, Kobayashi [0092]: “The machine learning device 100 is able to use a plurality of machine learning algorithms. … Examples of the machine learning algorithms include … a random forest.”), where the searching is performed using a hyperparameter adjustment unit to produce a set of hyperparameter vectors (through a grid search) that achieved the best prediction performance based on sets of hyperparameter vectors found in earlier learning steps (corresponding to “… a plurality of hyperparameter configurations…”, Kobayashi [0209]: “… the hyperparameter adjustment unit 137 generates a hyperparameter vector applied to a machine learning algorithm to be executed by the step execution unit 138. Grid search or random search may be used to generate the hyperparameter vector.” and Kobayashi [0211]: “… the hyperparameter adjustment unit 137 may perform the search by starting with a hyperparameter vector θ_{j=i}, that achieved the best prediction performance in the last learning step…”), and the search flow is controlled through a step execution unit which extracts a set of hyperparameter vectors from the hyperparameter adjustment unit over multiple learning steps (corresponding to “for each hyperparameter configuration of a plurality of hyperparameter configurations …”). The step execution unit also performs cross-validation using training datasets, and repeats the steps of generating sets of hyperparameter vectors and cross-validation over H iterations (Kobayashi [0214], [0215]-[0227]; Kobayashi Figure 19, steps S71..S72..S79, S80, S81), in order to produce a set of H predictions with corresponding hyperparameter vectors, where the iteration that has the best prediction performance (corresponding to an accuracy, Kobayashi [0055]: “The prediction performance of an individual model indicates the accuracy thereof, namely, indicates the capability of accurately predicting results of unknown cases.”) is selected, thereby outputting a selected machine learning model with the best prediction performance and an optimized set of hyperparameters, resulting in “… training the trainable regressor with … a most accurate hyperparameter configuration of said measuring said accuracy” (Kobayashi [0214]: “ … Next, the step execution unit 138 selects a model that indicates the best prediction performance from a plurality of models that correspond to the plurality of hyperparameter vectors. The step execution unit 138 outputs the selected model, the prediction performance thereof, the hyperparameter vector used to generate the model, and the execution time.”).).  
Both Reif in view of Sturlaugson, in further view of Hutter and Kobayashi are analogous art since they both teach training machine learning algorithms using cross-validation techniques.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the machine learning training of the macro-procedure machine learning algorithm taught in Reif in view of Sturlaugson, in further view of Hutter and enhance it to include the step execution unit and hyperparameter adjustment unit taught in Kobayashi as a way to train a machine learning model to produce an optimized set of hyperparameter values. The motivation to combine is taught in Kobayashi, as a way to automate the training, evaluation, and selection of machine learning model using large datasets by starting with a model with a known prediction performance trained within a (Kobayashi [0004]-[0005]: “In machine learning, it is preferable that the accuracy of an individual learned model, namely, the capability of accurately predicting results of unknown cases (which may be referred to as a prediction performance) be high. If a larger size of training data is used in learning, a model indicating a higher prediction performance is obtained. However, if a larger size of training data is used, more time is needed to learn a model. Thus, progressive sampling has been proposed as a method for efficiently obtaining a model indicating a practically sufficient prediction performance. … With the progressive sampling, first, a computer learns a model by using a small size of training data. Next, by using test data indicating a known case different from the training data, the computer compares a result predicted by the model with the known result and evaluates the prediction performance of the learned model. If the prediction performance is not sufficient, the computer learns a model again by using a larger size of training data than the size of the last training data. The computer repeats this procedure until a sufficiently high prediction performance is obtained. 
In this way, the computer can avoid using an excessively large size of training data and can shorten the time needed to learn a model.”).
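For illustration only, the sequence mapped above (train on subsets of the data for each hyperparameter configuration, measure accuracy, then train on the full dataset with the most accurate configuration) can be sketched as a grid search with cross-validation. This is a minimal sketch under stated assumptions, not Kobayashi's or Reif's implementation: the one-weight penalized regressor, the function names, the fold assignment, and the data are all hypothetical stand-ins.

```python
def fit_weight(xs, ys, penalty):
    """Fit y ~ w * x with an L2 penalty; a stand-in for any trainable regressor."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def held_out_error(weight, xs, ys):
    """Mean squared error of the fitted weight on held-out data (lower is more accurate)."""
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_and_retrain(xs, ys, penalties, folds=3):
    """Score every hyperparameter value by cross-validation, then retrain on all data."""
    scores = {}
    for penalty in penalties:
        errors = []
        for fold in range(folds):
            # Every `folds`-th sample is held out; the rest form the training subset.
            train = [i for i in range(len(xs)) if i % folds != fold]
            held = [i for i in range(len(xs)) if i % folds == fold]
            weight = fit_weight([xs[i] for i in train], [ys[i] for i in train], penalty)
            errors.append(held_out_error(weight, [xs[i] for i in held], [ys[i] for i in held]))
        scores[penalty] = sum(errors) / len(errors)
    best = min(scores, key=scores.get)
    # Final training uses the whole dataset with the most accurate configuration.
    return best, fit_weight(xs, ys, best), scores


# Hypothetical data generated from y = 2x, so the unpenalized fit is exact.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_penalty, final_weight, cv_scores = select_and_retrain(xs, ys, [0.0, 1.0, 10.0])
print(best_penalty, final_weight)
```

The grid here has a single hyperparameter with three candidate values; a grid over two hyperparameters, such as the γ and C of the ϵ-SVR mentioned in Reif, would loop over every pair of values in the same way.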
Regarding amended Claim 17, Reif in view of Sturlaugson, in further view of Hutter as applied to Claim 14 teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 14 wherein said training the trainable regressor comprises: 
… hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.):
training the trainable regressor with a subset of the second dataset and the hyperparameter configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.); and 
measuring accuracy of the trainable regressor based on said hyperparameter configuration (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.);
training the trainable regressor with the second dataset (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.) …
While Reif in view of Sturlaugson, in further view of Hutter teaches optimizing hyperparameter values for the regression learner through a grid search (Reif p.267 2nd paragraph: “The presented approach was evaluated by a leave-one-out cross-validation for every algorithm. We used the regression variant of a Support Vector Machine, the ϵ-SVR, as meta-learning scheme. The parameters γ and C of the ϵ-SVR have been optimized by a grid search. LibSVM[7] was used as implementation.”), Reif in view of Sturlaugson, in further view of Hutter does not explicitly teach
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring accuracy.  
Kobayashi teaches
for each hyperparameter configuration of a plurality of hyperparameter configurations of the trainable regressor: … training … a most accurate hyperparameter configuration of said measuring accuracy (This claim limitation is similar in scope to a corresponding claim limitation in Claim 5, and hence is rejected under similar rationale.).  
Both Reif in view of Sturlaugson, in further view of Hutter and Kobayashi are analogous art since they both teach training machine learning algorithms using cross-validation techniques.
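It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the machine learning training of the macro-procedure machine learning algorithm taught in Reif in view of Sturlaugson, in further view of Hutter and enhance it to include the step execution unit and hyperparameter adjustment unit taught in Kobayashi as a way to train a machine learning model to produce an optimized set of hyperparameter values. The motivation to combine is taught in Kobayashi, as provided in the prior art claim mapping of Claim 5 recited above.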
Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson] as applied to Claims 6 and 18; in further view of Raschka, Sebastian, Machine Learning FAQ: What is the difference between Pearson R and Simple Linear Regression?, retrieved from web.archive.org (https://web.archive.org/web/20160402054319/http://sebastianraschka.com:80/faq/docs/pearson-r-vs-linear-regr.html), dated 04/02/2016 [hereafter referred to as Raschka].
Regarding amended Claim 8, Reif in view of Sturlaugson as applied to Claim 6 teaches
(Currently Amended) The method of Claim 6,
wherein said predicting based on the plurality of durations comprises predicting based on … two landmark configurations of the plurality of landmark configurations (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: examiner’s note: Reif teaches prediction performances of the different sets of target classifiers running predetermined combination sets of hyperparameters (“landmark configurations”) and their predicted run-times are evaluated using a Pearson product moment correlation coefficient, where the Pearson correlation coefficient calculation consists of comparing two variables X and Y. In the context of run-times, these two variables represent the actual run-time and a predicted run-time (corresponding to “the plurality of durations”), both of which are based on a machine learning algorithm and its respective hyperparameter values (thus corresponding to “wherein said predicting based on the plurality of durations comprises prediction based on … two landmark configurations of the plurality of landmark configurations”) (Reif p.268 Section 5.1 Correlation, 1st-2nd paragraphs: “The Pearson product moment correlation coefficient (PMCC) of the actual run-time and the predicted run-time was calculated. The correlation between two variables X and Y is defined as ρ_{X,Y} = E[(X − μ_x)(Y − μ_y)] / (σ_x σ_y). … The results are values in the interval [-1, 1]. … Table 3 shows the correlations coefficients for all five target classifiers and the investigated sets of meta-features.”).).  
While Reif in view of Sturlaugson teaches a Pearson correlation coefficient, Reif in view of Sturlaugson does not explicitly teach
wherein said predicting … comprises predicting based on a slope of normalized durations ...  
Raschka teaches
wherein said predicting … comprises predicting based on a slope of normalized durations (Examiner’s note: Raschka teaches a Pearson correlation coefficient represents a standardized slope, thus corresponding to “… a slope of normalized durations …” (Raschka Simple Linear Regression: “ … To show how the correlation coefficient r factors in, let’s rewrite it as cov(x,y)/σ_x^2 = cov(x,y)/(σ_x σ_y) × σ_y/σ_x, where the first term is equal to r, which we defined earlier; we can now see that we could use the “linear correlation coefficient” to compute the slope of the line as b = r_{x,y} σ_y/σ_x. … So, essentially, the linear correlation coefficient (Pearson’s r) is just the standardized slope of a simple linear regression line (fit).”).) … 
Both Reif in view of Sturlaugson and Raschka are analogous art since they both teach Pearson correlation coefficient in the context of regression learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the performance measurement based on the Pearson correlation coefficient calculation of the actual and predicted run-times taught in Reif in view of Sturlaugson and treat it as a slope of normalized duration during regression line analysis as taught in Raschka as a way to perform regression line analysis for the calculated performance measurements associated with each target classifier. The motivation to combine is taught in Raschka, as a way to facilitate the calculation of regression line analysis, since standardizing the variables to mean 0 and standard deviation 1 avoids computing the y-axis intercept for a linear regression line when executing optimization algorithms based on linear regression, and makes the slope of the linear regression line equal to the correlation coefficient, thus simplifying the analysis of large datasets and making it more computationally efficient (Raschka Standardizing Variables: “In practice, we often standardize our input variables … After standardization, our variables have the properties of a standard normal distribution with mean=0, and standard deviation 1. … This is also useful if we use optimization algorithms for multiple linear regression, such as gradient descent, instead of the closed-form solution (handy for working with large datasets). … Another advantage of this approach is that the slope is then exactly the same as the correlation coefficient, which saves another computational step.”).
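For illustration only, the relationship quoted from Raschka (slope b = r σ_y/σ_x, and slope equal to r once both variables are standardized) can be checked numerically. This is a minimal sketch using the population definitions of covariance and standard deviation; the function names and the two lists of durations are hypothetical and are not taken from Reif, Raschka, or the application.

```python
from statistics import mean, pstdev


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    return mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


def pearson_r(xs, ys):
    """Pearson correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


def regression_slope(xs, ys):
    """Slope of the simple linear regression line: cov(x, y) / sigma_x ** 2."""
    return covariance(xs, ys) / pstdev(xs) ** 2


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(value - center) / spread for value in values]


# Hypothetical durations (seconds) measured for two configurations on five datasets.
durations_a = [1.0, 2.0, 3.0, 4.0, 5.0]
durations_b = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(durations_a, durations_b)
slope_raw = regression_slope(durations_a, durations_b)
slope_standardized = regression_slope(standardize(durations_a), standardize(durations_b))
print(r, slope_raw, slope_standardized)
```

On the raw durations the slope differs from r by the factor σ_y/σ_x, while on the standardized durations the two values coincide up to floating-point rounding, which is the sense in which the correlation coefficient is a slope of normalized values.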
Regarding amended Claim 20, Reif in view of Sturlaugson as applied to Claim 18 teaches
(Currently Amended) The one or more non-transitory computer-readable media of Claim 18,
wherein said predicting based on the plurality of durations comprises predicting based on … two landmark configurations of the plurality of landmark configurations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 8, and hence is rejected under similar rationale.).
While Reif in view of Sturlaugson teaches a Pearson correlation coefficient, Reif in view of Sturlaugson does not explicitly teach
wherein said predicting … comprises predicting based on a slope of normalized durations …  
Raschka teaches
wherein said predicting … comprises predicting based on a slope of normalized durations (This claim limitation is similar in scope to a corresponding claim limitation in Claim 8, and hence is rejected under similar rationale.) …
Reif in view of Sturlaugson and Raschka are analogous art since both teach the Pearson correlation coefficient in the context of regression learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the performance measurement based on the Pearson correlation coefficient calculation of the actual and predicted run-times taught in Reif in view of Sturlaugson and treat it as a slope of normalized durations during regression line analysis as taught in Raschka as a way to perform regression line analysis for the calculated performance measurements associated with each target classifier. The motivation to combine is taught in Raschka, as provided in the prior art claim mapping of Claim 8 recited above.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Reif et al., Prediction of Classifier Training Time Including Parameter Optimization, in Bach et al., KI 2011: Advances in Artificial Intelligence, LNAI 7006, Springer-Verlag Berlin Heidelberg 2011, pp.260-271 [hereafter referred to as Reif] in view of Sturlaugson et al., U.S. PGPUB 2016/0358099, published 12/8/2016 [hereafter referred to as Sturlaugson] as applied to Claim 1; in further view of Feurer et al., Initializing Bayesian Hyperparameter .
Regarding original Claim 12, Reif in view of Sturlaugson as applied to Claim 1 teaches
(Original) The method of Claim 1,
wherein the trainable regressor is a random forest ([Reif p.262 Section 2 Meta-Learning, 5th paragraph: “Regression was also used for meta-learning … for each target classifier whose performance should be predicted, a separate regression model has to be trained. … Various meta-features and regression algorithms have been used to predict different performance measures of classification algorithms can be applied using various regression algorithms …”] [Sturlaugson Figure 2, elements 32, 36, 38: examiner’s note: As indicated earlier, Sturlaugson teaches the machine learning model consists of a macro-procedure, which is an ensemble of micro-procedures (where each micro-procedure is a trainable machine learning model), and the macro-procedure can include a machine learning algorithm such as a random forest (Sturlaugson [0023]: “Machine learning model 32 may be a macro-procedure 36 that combines the outcomes of an ensemble of micro-procedures 38. Each micro-procedure 38 includes a machine learning algorithm and its associated parameter values. Optionally, each micro-procedure 38 includes a different combination of machine learning algorithm and associated parameter values.” and Sturlaugson [0026]: “Macro-procedures 36 may include a machine learning algorithm and associated parameter values that are independent and/or distinct from the micro-procedures 38. … Examples of macro-procedures 36 include an ensemble of learned decision trees (e.g., a random forest)…”).]) …
However, Reif in view of Sturlaugson does not explicitly teach
… the method further comprises using the random forest to rank features of said dataset by importance.  
Feurer teaches
... the method further comprises using the random forest to rank features of said dataset by importance (Feurer p.1130 Algorithm 2: examiner’s note: Performing training of a meta-learner using sets of hyperparameter configurations and associated meta-features of datasets using the meta-learning-based initialization variant of SMBO algorithm (MI-SMBO, described in Feurer p.1130 Figure 2), where Feurer p.1130 col.1 4th paragraph-p.1130 col.2 4th paragraph (Initializing SMBO with Configurations Suggested by Meta-Learning): “… we assume that each dataset $D_i$ can be described by a set of F metafeatures $m^i = (m^i_1, \ldots, m^i_F)$. … we precompute the metafeatures for all training datasets $D_1, \ldots, D_N$, along with the best configurations ($\hat{\theta}_1, \ldots, \hat{\theta}_N$). Given a new dataset $D_{N+1}$, we then measure its distances to all previous datasets $D_i$ using a distance measure $d: \mathcal{D} \times \mathcal{D} \to \mathbb{R}$. … the measure we use (in the following denoted as $d_c$) is the negative Spearman correlation coefficient between the ranked results of a fixed set of n hyperparameter configurations on both datasets … compute $d_c(D_i, D_j)$ for all $1 \le i, j \le N$ and use regression to learn a function R … we implemented R using a random forest because of its robustness and speed.”). Referring to Algorithm 2 line 1, datasets are sorted by increasing distance to $D_{N+1}$ based on a distance metric (where a distance metric of increasing distances is interpreted to indicate the features between datasets are more distant, or less related, and as such, a sorted list of datasets based on a distance metric according to increasing distance corresponds to “using the random forest to rank features of said dataset by importance”) (Feurer p.1130 col.1 4th paragraph-p.1130 col.2 4th paragraph: “Sort dataset indices $\pi(1), \ldots, \pi(N)$ by increasing distance to $D_{N+1}$, i.e., $(\pi(i) \le \pi(j)) \Leftrightarrow (d(D_{N+1}, D_i) \le d(D_{N+1}, D_j))$”).).  
Reif in view of Sturlaugson and Feurer are analogous art since both teach performing meta-learning training over a set of hyperparameters and associated meta-features of datasets using a regressor model.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the ϵ-SVR regressor model taught in Reif in view of Sturlaugson and replace it with the random forest model as taught in Feurer as a way to perform meta-learning training and evaluation over a set of hyperparameters and associated meta-features of datasets. The motivation to combine is taught in Feurer, since the MI-SMBO algorithm performs model selection and hyperparameter optimization starting from promising configurations that performed well on similar datasets, thus potentially speeding up the overall search for hyperparameters and reducing the computation time taken to train a machine learning model, resulting in improved computational efficiency during the training phase of a machine learning model (Feurer Abstract: “Model selection and hyperparameter optimization is crucial in applying machine learning to a novel dataset. Recently, a sub-community of machine learning has focused on solving this problem with Sequential Model-based Bayesian Optimization (SMBO), demonstrating substantial successes in many applications. However, for computationally expensive algorithms the overhead of hyperparameter optimization can still be prohibitive. In this paper we mimic a strategy human domain experts use: speed up optimization by starting from promising configurations that performed well on similar datasets.”).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121                                                                                                                                                                                                        


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121