Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-9, 12, and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation “the plurality of hyperparameters” in the part of the claim beginning with “for each hyperparameter of the plurality of hyperparameters.” There is insufficient antecedent basis for this limitation in the claim. The previously-recited term “plurality of previously stored hyperparameter values” does not provide antecedent for the above limitation because values and hyperparameters are different things. For purposes of examination, “the” has been regarded as “a.”
Claim 9 recites the limitation “the one or more previously stored hyperparameters.” There is insufficient antecedent basis for this limitation in the claim. The previously-recited term “plurality of previously stored hyperparameter values” in claim 1 does not provide antecedent for the above limitation, because values and hyperparameters are different things.
Claim 12 recites the limitation “the training set” in the phrase “a size of the training set.” There is insufficient antecedent basis for this limitation in the claim. For purposes of examination, “training set” has been interpreted to be “dataset.” This rejection can be overcome by amending “training set” to “dataset.” 
Claim 19 recites the limitation “the one or more previously stored hyperparameters.” There is insufficient antecedent basis for this limitation in the claim. The previously-recited term “plurality of previously stored hyperparameter values” in claim 17 does not provide antecedent for the above limitation, because values and hyperparameters are different things. For purposes of examination, “the” has been disregarded.
Claims dependent from one or more of the above discussed claims are also rejected for the same reasons, since these dependent claims incorporate the indefinite recitations of their parent claims without curing the deficiencies thereof. Therefore, claims 2-8 are rejected due to their dependencies. 

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 10-15 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea and the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea.

Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
	Claim 10 recites a method comprising “receiving a dataset having a data schema; generating metadata based on properties of the dataset; selecting…based on the metadata, a machine learning model suitable for application to the dataset.”
	Claim 13 recites a method comprising “receiving a selection of a machine learning model; identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence on one or more performance metrics of the selected machine learning model; selecting, based on the identified degree of influence for each hyperparameter, hyperparameter values for each of the plurality of hyperparameters to use in conjunction with the selected machine learning model.”
These limitations, under the broadest reasonable interpretation, cover performance of the steps in the mind but for the recitation of generic computer components. The claims do not recite any specific methodology that requires a level of computational complexity that would preclude the above limitations from being a mental process. Therefore, claims 10 and 13 recite an abstract idea in the form of a mental process.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
The judicial exception recited in claims 10 and 13 is not integrated into a practical application. 
Claim 10 recites the additional element of “by a computer processor” but this element is merely a computer component recited at a high level of generality, such that it amounts to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application. 
Claims 10 and 13 respectively recite “training the selected machine learning model using the dataset” and “training the selected machine learning model using the selected hyperparameter values for each of the plurality of hyperparameters.” However, these limitations constitute insignificant extra-solution activity. As stated in MPEP § 2106.05(g): “The term ‘extra-solution activity’ can be understood as activities incidental to the primary process or product that are merely a nominal or tangential addition to the claim.” Here, the primary process is determining the hyperparameters. Therefore, the training process is merely incidental to the primary process. Furthermore, the instant limitations of training do not reflect an improvement to technology in part because no specific technical methodology for selecting hyperparameters is recited in the claims.
Therefore, the above limitations do not integrate the judicial exception into a practical application. 
It is noted that the preamble recitations in claims 10 and 13 do not include “additional elements” for purposes of this analysis. In these claims, a structurally complete invention is recited in the claim body, and the claim body does not reference the preamble recitations. Thus, in these claims, the preamble only states a purpose or intended use and is not a limitation of the claim. Furthermore, the preamble recitation of “determining,” even if it were a method step, is itself a mental process and not an additional element. 
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
The claims do not include additional elements that are sufficient for the claims to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements amount to no more than mere instructions to apply the judicial exception using a generic computer component. 
With respect to “training the selected machine learning model using the dataset” and “training the selected machine learning model using the selected hyperparameter values for each of the plurality of hyperparameters,” these elements merely constitute well understood, routine, conventional activity. The training of machine learning models, which would involve some model setting (i.e., hyperparameter values), is well understood, routine, and conventional activity in the art, and such training necessarily involves some hyperparameter that defines the model being trained. For evidence, see, e.g., US 2020/0311572 A1, paragraph [0030] (“learning may be controlled by hyperparameters...the training process is enhanced by Nesterov's momentum, and smoothed by L2 regularization. Other well-known training methods could be used in a similar way”) and [0064] (“Stochastic gradient descent with batch update is a common method for training deep neural networks and is well known”); US 2016/0182553 A1, paragraph [0076] (“As well-known, machine learning involves the construction of algorithms that learn from data.”); Van Rijn et al., “Hyperparameter Importance Across Datasets,” arXiv:1710.04725v2 [stat.ML] 29 May 2018, § 1, paragraph 3 (“For many well-known algorithms, there already exists some intuition about which hyperparameters impact performance most”).
The remaining dependent claims do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception. 
Dependent claim 11 recites “executing a secondary…model using the metadata as input to the secondary…model, the secondary machine learning model returning the selection of the machine learning model and suitable hyperparameter values for use with the machine learning model.” These limitations, under the broadest reasonable interpretation, cover performance of the steps in the mind but for the recitation of generic computer components. Therefore, these limitations are mental processes. With respect to the additional element of “machine learning” in the expression “secondary machine learning model,” the Examiner notes that the claim does not recite any details of the structure of the machine learning model. Therefore, the mere recitation of “machine learning” amounts to no more than “generally linking the use of a judicial exception to a particular technological environment or field of use” (MPEP § 2106.04(d)(I)), namely the technological environment or field of machine learning. Therefore, this element does not integrate the abstract idea into a practical application. Furthermore, the element of “machine learning” is well understood, routine, conventional activity. For evidence, see, e.g., US 2020/0311572 A1, paragraph [0029] (“deep feed-forward neural network (a DNN), a type of machine learning system that is well known to those skilled in the art of machine learning”); US 2016/0182553 A1, paragraph [0076] (“As well-known, machine learning involves the construction of algorithms that learn from data.”); and US 2016/0110657 A1, paragraph [0097] (“conventional machine learning methods”).
Dependent claims 12 and 14 merely recite further characteristics of a mental process, and thus do not include any “additional elements” for purposes of the Step 2A Prong Two and Step 2B analysis. 
Dependent claim 15 recites “executing a secondary…model using the plurality of hyperparameters associated with the selected machine learning model as input, the secondary machine learning model returning a ranking of the plurality of hyperparameters according to the degree of influence on the one or more performance metrics of the selected machine learning model.” These limitations, under the broadest reasonable interpretation, cover performance of the steps in the mind but for the recitation of generic computer components. Therefore, these limitations are mental processes. With respect to the additional element of “machine learning” in the expression “secondary machine learning model,” the element does not integrate the abstract idea into a practical application for the same reasons discussed above for claim 11.
Therefore, the rejected claims are directed to a judicial exception and do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception. Therefore, these claims are not patent-eligible under § 101. 

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

1.	Claims 10 and 12 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Achin et al. (US 2018/0060738 A1) (“Achin”). 
As to claim 10, Achin teaches a computer-implemented method of determining one or more suitable hyperparameters for a machine learning model in an automated machine learning system [[0101]: “Fitting the predictive models to the prediction problem's dataset(s) may include tuning one or more hyper-parameters of the predictive modeling procedure”; [0131]: “search for modeling solutions in either manual mode or automatic mode”; [0135]: “to tune the parameters of a predictive model or the hyper-parameters of a modeling technique.” It is noted that the claim defines a structurally complete invention in the claim body, and the claim body does not reference the preamble recitations. Thus, the preamble only states a purpose or intended use and is not a limitation of the claim. Nonetheless, the preamble recitations are taught by the cited reference for the above reasons.], the method comprising:
receiving a dataset having a data schema; [[0119]: “the exploration engine 110 prompts the user to select the dataset for the predictive modeling problem to be solved… the rules for mapping the target data schemas into the desired dataset schema.” See also [0120]: “each column of the matrix may correspond to a variable, and each row of the matrix may correspond to an observation” (i.e., row and column schema with respect to variables in the data.)]
generating metadata based on properties of the dataset; [[0122]: “exploration engine 110 evaluates the dataset. This evaluation may include calculating the characteristics of the dataset.” See also [0057] (“Characteristics of a dataset may include, without limitation, the dataset's width, height, sparseness, or density…”); [0058] (“characteristics of a dataset include statistical properties of the dataset's variables”). In general, such characteristics read on the limitation of “metadata” because they are data that describes other data. See also [0120]: “The exploration engine 110 may attach relevant metadata to the variables, including metadata obtained from the original source (e.g., explicitly specified data types) and/or metadata generated during the loading process (e.g., the variable's apparent data types; whether the variables appear to be numerical, ordinal, cardinal, or interpreted types; etc.).”]
selecting, by a computer processor, based on the metadata, a machine learning model suitable for application to the dataset; [[0080] teaches: “the suitability of a predictive modeling procedure for a prediction problem may be determined based on characteristics of the dataset.” [0085]: “may select the M modeling procedures most similar to the modeling procedure at issue…” [0136]: “the selected modeling techniques may be executed using the partitioned data to evaluate the search space.” Furthermore, “machine learning” models are taught in  [0203]: “the predictive modeling system may offer a set of predictive models, including traditional regression models, neural networks, and other machine learning models (e.g., random forests, boosted trees, support vector machines).” See also [0095].] and
training the selected machine learning model using the dataset. [[0140]: “execution of a set of modeling techniques may comprise training one or more models on a same data sample extracted from the dataset.” [0101]: “Fitting the predictive models to the prediction problem's dataset(s)” (i.e., training). Use of the dataset for training is described in [0134]: “predictive modeling system 100 may partition the dataset (or suggest a partitioning of the dataset) into a training set and a ‘holdout’ test set. In some embodiments, the training set is further partitioned into K folds for cross-validation.” Note that in K-fold cross-validation, different parts of the original dataset are rotated as training and test sets.]
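For illustration only, the rotation of training and test portions in K-fold cross-validation noted above can be sketched as follows. This is a minimal sketch and not code from Achin; the function name `k_fold_splits` and the contiguous, unshuffled fold layout are assumptions made for the example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of k folds.

    Hypothetical helper for illustration; folds are contiguous and unshuffled.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        # Every sample outside the current test fold is used for training.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(6, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold and in the training portion of every other fold, which is the sense in which parts of the original dataset are rotated.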

As to claim 12, Achin teaches the method of claim 10, wherein the metadata includes at least one selected from the group consisting of:
a size of the training set, a number of features in the dataset, a percentage of types of data fields in the dataset, a type of classification problem, a variance of types of data fields in the dataset, and an indication whether features of the dataset follow a statistical distribution. [[0057]: “the number of targets and/or features in the dataset,” corresponding to the claim recitation of “a number of features in the dataset”; see also [0058]: “the properties of the distribution of each variable's values or class membership; cardinality of the variables,” corresponding to the claim recitation of “an indication whether features of the dataset follow a statistical distribution”]
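For illustration only, dataset characteristics of the kind discussed above (dataset size, number of features, percentage of data field types) can be computed as descriptive metadata along the following lines. This is a minimal sketch, not Achin's implementation; the function name `dataset_metadata`, the record layout, and the two-way numeric/categorical typing rule are assumptions made for the example.

```python
def dataset_metadata(rows):
    """Compute simple descriptive metadata for a list of dict records.

    Hypothetical helper for illustration only.
    """
    if not rows:
        raise ValueError("dataset is empty")
    columns = list(rows[0].keys())
    type_counts = {}
    for col in columns:
        values = [row[col] for row in rows]
        # Assumption: a column is "numeric" only if every value is an int or
        # float (bool excluded); anything else is treated as "categorical".
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        kind = "numeric" if is_numeric else "categorical"
        type_counts[kind] = type_counts.get(kind, 0) + 1
    return {
        "n_rows": len(rows),
        "n_features": len(columns),
        "type_percentages": {
            kind: 100.0 * count / len(columns) for kind, count in type_counts.items()
        },
    }


example = [
    {"age": 31, "income": 52000.0, "city": "Austin"},
    {"age": 45, "income": 61000.0, "city": "Boston"},
]
meta = dataset_metadata(example)
print(meta)
```

The returned dictionary is data that describes other data, which is the sense in which such characteristics are mapped to “metadata.”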

2.	Claims 13-15 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Van Rijn et al., “Hyperparameter Importance Across Datasets,” arXiv:1710.04725v2 [stat.ML] 29 May 2018 (“Van Rijn”).
As to claim 13, Van Rijn teaches a method of determining one or more suitable hyperparameters for a machine learning model in an automated machine learning system, [Abstract: “automated hyperparameter optimization methods are by now routinely used.” § 1, paragraph 2: “reliable automatic machine learning (AutoML) systems, which – given a new dataset D – determine a custom combination of algorithm and hyperparameters that performs well on D.” It is noted that the claim defines a structurally complete invention in the claim body, and the claim body does not reference the preamble recitations. Thus, the preamble only states a purpose or intended use and is not a limitation of the claim. Nonetheless, the preamble recitations are taught by the cited reference for the above reasons.] the method comprising:
receiving a selection of a machine learning model; [§ 4, paragraph 1: “Given…an algorithm with configuration space Θ.” Here, the “algorithm” corresponds to a machine learning model. Examples of machine learning models include “random forests, Adaboost, and SVMs” (§ 5, paragraph 1).]
identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence on one or more performance metrics of the selected machine learning model; [In general, § 1, paragraph 4 teaches: “given an algorithm, we aim to answer the following two questions: (1) Which of the algorithm’s hyperparameters matter most for empirical performance?” For more detail, see § 3, paragraphs 1-2: “determines how much each hyperparameter (and each combination of hyperparameters) contributes to the variance of ŷ across the algorithm’s hyperparameter space Θ…Algorithm A has n hyperparameters with domains Θ1, . . . ,Θn and configuration space Θ = Θ1 × ... × Θn. Let N = {1, … , n} be the set of all hyperparameters of A.” Note that ŷ is the “predictions ŷ for the performance of arbitrary hyperparameter settings.”]
selecting, based on the identified degree of influence for each hyperparameter, hyperparameter values for each of the plurality of hyperparameters to use in conjunction with the selected machine learning model; [§ 7, “Now that we know which hyperparameters are important, the next natural question is which values they should be set to in order to likely obtain good performance.” § 8, paragraph 2: “In order to determine which hyperparameter values tend to yield good performance, we fitted kernel density estimators to hyperparameter values that performed well on other datasets.” For example, § 7, paragraph 2 teaches: “For both types of SVMs, the best performance can typically be achieved with low values of the gamma hyperparameter” (referring to FIG. 6). The Examiner notes that the term “based on” does not require any specific type of relationship between the selection and the degree of influence.] and
training the selected machine learning model using the selected hyperparameter values for each of the plurality of hyperparameters. [The datasets are used for training, as described in § 4.3, paragraph 1 (“training datasets”), § 5, paragraph 3 (“training data”). The procedure is described in more detail in § 7, paragraphs 3-4: “Hyperband was ran with the following hyperparameters: 5 brackets, smax = 4, η = 2 and R = |D(i)| (the number of data points of dataset D(i)). Each optimizer was ran with 10 different random seeds, and we report the average of their results. For each dataset, Figure 7 shows the difference in predictive accuracy between the two procedures.” See also § 7, paragraph 5: “For each dataset, the Hyperband procedures are ranked by their final performance on the test set (the best procedure obtaining the lower rank, and an equal rank in case of a draw).” Note that “Hyperband” is a tuning and training algorithm, and the above-quoted part of the reference teaches that actual models were built (i.e., trained) on the hyperparameters in order for the models to be evaluated on a test set.]
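For illustration only, the idea of attributing a share of performance variance to each hyperparameter, and ranking hyperparameters by that share, can be sketched as follows. This is a simplified stand-in and not Van Rijn's procedure: it evaluates a toy performance function on a small full grid and considers single-hyperparameter (main) effects only. The function name `main_effect_importance`, the grid values, and the toy performance function are all assumptions made for the example.

```python
from itertools import product
from statistics import mean, pvariance


def main_effect_importance(grid, performance):
    """Fraction of performance variance attributable to each hyperparameter alone.

    grid: dict mapping hyperparameter name -> list of distinct candidate values.
    performance: callable taking a dict of hyperparameter values -> float.
    Hypothetical helper for illustration only.
    """
    names = list(grid)
    configs = [dict(zip(names, combo)) for combo in product(*grid.values())]
    scores = [performance(c) for c in configs]
    total_var = pvariance(scores)
    importance = {}
    for name in names:
        # Marginal performance: average over all settings of the other hyperparameters.
        marginals = [
            mean(s for c, s in zip(configs, scores) if c[name] == value)
            for value in grid[name]
        ]
        importance[name] = pvariance(marginals) / total_var if total_var else 0.0
    return importance


# Toy performance function (an assumption): strongly driven by "gamma", weakly by "C".
def toy_performance(config):
    return 0.5 + 0.4 * config["gamma"] + 0.1 * config["C"]


grid = {"gamma": [0.0, 1.0], "C": [0.0, 1.0]}
imp = main_effect_importance(grid, toy_performance)
ranking = sorted(imp, key=imp.get, reverse=True)
print(imp, ranking)
```

Because the toy function is additive, the two fractions sum to one; with interacting hyperparameters the main effects alone would account for less than the total variance.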

As to claim 14, Van Rijn teaches the method of claim 13, wherein the one or more performance metrics includes at least one selected from the group consisting of: accuracy, error, precision, recall, area under the receiver operating characteristic (ROC) curve, and area under the precision recall curve. [Van Rijn, § 3, paragraph 6: “performance yi (e.g., accuracy or AUC score) of an algorithm”; § 4.1, last paragraph: “a random search for maximizing accuracy.”]

As to claim 15, Van Rijn teaches the method of claim 13, wherein the identifying further comprises:
executing a secondary machine learning model using the plurality of hyperparameters associated with the selected machine learning model as input, the secondary machine learning model returning a ranking of the plurality of hyperparameters according to the degree of influence on the one or more performance metrics of the selected machine learning model. [Van Rijn, § 6, paragraphs 1 and 3: “determining the most important hyperparameters per classifier…this analysis is based on the performance data of 250,195 algorithm runs… The middle figure (e.g., Figure 2(b)) shows the results of the verification experiment. It shows the average rank of each run of random search, labeled with the hyperparameter whose value was fixed to a default value. A high rank implies poor performance compared to the other configurations, meaning that tuning this hyperparameter would have been important.” Note that “algorithm runs” refers to the use of machine learning models to perform verification. Such models read on “secondary machine learning model.”]

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


1.	Claims 1-5 are rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu et al. (US 2017/0223036 A1) (“Muddu”), Bauer et al. (US 2019/0385052 A1) (“Bauer”), Van Rijn, and Wistuba et al., “Two-Stage Transfer Surrogate Model for Automatic Hyperparameter Optimization” in P. Frasconi et al. (Eds.): ECML PKDD 2016, Part I, LNAI 9851, pp. 199–214, 2016 (hereinafter “Wistuba (2016)”).
As to claim 1, Achin teaches a computer-implemented method performed in an automated machine learning system, [[0101]: “Fitting the predictive models to the prediction problem's dataset(s) may include tuning one or more hyper-parameters of the predictive modeling procedure”; [0131]: “search for modeling solutions in either manual mode or automatic mode”; [0135]: “to tune the parameters of a predictive model or the hyper-parameters of a modeling technique.”] the method comprising:
receiving a first dataset having a first data schema; [[0119]: “the exploration engine 110 prompts the user to select the dataset for the predictive modeling problem to be solved… the rules for mapping the target data schemas into the desired dataset schema.” See also [0120]: “each column of the matrix may correspond to a variable, and each row of the matrix may correspond to an observation” (i.e., row and column schema with respect to variables in the data.)]
generating metadata based on properties of the dataset; [[0122]: “exploration engine 110 evaluates the dataset. This evaluation may include calculating the characteristics of the dataset.” See also [0057] (“Characteristics of a dataset may include, without limitation, the dataset's width, height, sparseness, or density…”); [0058] (“characteristics of a dataset include statistical properties of the dataset's variables”). In general, such characteristics read on the limitation of “metadata” because they are data that describes other data. See also [0120]: “The exploration engine 110 may attach relevant metadata to the variables, including metadata obtained from the original source (e.g., explicitly specified data types) and/or metadata generated during the loading process (e.g., the variable's apparent data types; whether the variables appear to be numerical, ordinal, cardinal, or interpreted types; etc.).”]
selecting, by a computer processor, based on the metadata, a machine learning model suitable for application to the dataset; [[0080] teaches: “the suitability of a predictive modeling procedure for a prediction problem may be determined based on characteristics of the dataset.” [0085]: “may select the M modeling procedures most similar to the modeling procedure at issue…” [0136]: “the selected modeling techniques may be executed using the partitioned data to evaluate the search space.” Furthermore, “machine learning” models are taught in  [0203]: “the predictive modeling system may offer a set of predictive models, including traditional regression models, neural networks, and other machine learning models (e.g., random forests, boosted trees, support vector machines).” See also [0095].]
training the selected machine learning model using the first selected group of hyperparameter values, the second selected group of hyperparameter values, and the dataset. [[0140]: “execution of a set of modeling techniques may comprise training one or more models on a same data sample extracted from the dataset.” [0101]: “Fitting the predictive models to the prediction problem's dataset(s) may include tuning one or more hyper-parameters of the predictive modeling procedure that generates the predictive model.” That is, training is performed after tuning (selecting) hyperparameter values. See also [0134]: “predictive modeling system 100 may partition the dataset (or suggest a partitioning of the dataset) into a training set and a ‘holdout’ test set. In some embodiments, the training set is further partitioned into K folds for cross-validation.” Note that in K-fold cross-validation, different parts of the original dataset are rotated as training and test sets.] 
Achin does not explicitly teach the limitation that the training is performed on “the first version of the selected machine learning model” and does not teach the following limitations:
identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence of the each hyperparameter on one or more performance metrics of the selected machine learning model;
identifying a first version of the selected machine learning model;
obtaining a plurality of previously stored hyperparameter values associated with the first version of the selected machine learning model based on:
	identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, and
	identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model;
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold;
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model:
selecting, based on the identified degree of influence for each associated hyperparameter and from the determined range of values, a first group of hyperparameter values; and
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is not in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model:
selecting, based on the identified degree of influence for each associated hyperparameter, a second group of hyperparameter values
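
For orientation only, the limitations listed above can be read as the following selection logic: hyperparameters shared between the two model versions draw candidate values from a range around a previously stored value, and the remaining hyperparameters draw from some other range. This is a minimal sketch and does not represent the applicant's implementation or any cited reference. The names `candidates` and `select_groups`, the interpretation of the threshold as a relative margin around the stored value, the use of default ranges for non-shared hyperparameters, and the rule that more influential hyperparameters receive more candidate values are all assumptions made for the example.

```python
def candidates(low, high, influence, max_points=5):
    """Evenly spaced candidate values in [low, high].

    Assumption: a higher degree of influence (0..1) yields more candidate points.
    """
    n = max(1, round(influence * max_points))
    if n == 1:
        return [(low + high) / 2]
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def select_groups(first_version_hps, second_version_hps, stored_values,
                  default_ranges, influence, threshold=0.2):
    """Split hyperparameters of the first version into two groups of candidate values.

    Hypothetical helper for illustration only.
    """
    first_group, second_group = {}, {}
    for name in first_version_hps:
        if name in second_version_hps:
            # Shared hyperparameter: range is the stored value plus or minus
            # a relative margin (assumed reading of "based on a threshold").
            value = stored_values[name]
            low, high = value * (1 - threshold), value * (1 + threshold)
            first_group[name] = candidates(low, high, influence[name])
        else:
            # Not shared: no stored value is available, so fall back to an
            # assumed default search range.
            low, high = default_ranges[name]
            second_group[name] = candidates(low, high, influence[name])
    return first_group, second_group


first_group, second_group = select_groups(
    first_version_hps=["learning_rate", "max_depth", "dropout"],
    second_version_hps={"learning_rate", "max_depth"},
    stored_values={"learning_rate": 0.1, "max_depth": 10},
    default_ranges={"dropout": (0.0, 0.5)},
    influence={"learning_rate": 0.6, "max_depth": 0.2, "dropout": 0.4},
)
print(first_group, second_group)
```
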


Muddu, in an analogous art, teaches “identifying a first version of the selected machine learning model” and the limitation that the training is performed on “the first version of the selected machine learning model.” Muddu pertains to “model training and deployment” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. Muddu generally teaches the handling of different versions of machine learning models. See [0283]: “The model store 1532 stores model states that represent machine learning models or versions of the machine learning models.”
In particular, Muddu teaches “identifying a first version of the selected machine learning model” and training “the first version of the selected machine learning model” [[0234]: “the identity resolution module 812 can initiate, for a given user, different versions of the machine learning model at different point of time…As events related to the given user arrive, versions of a machine learning model are initiated, trained, activated, (optionally) continually updated, and finally expired.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin with the teachings of Muddu by modifying the method of Achin to include “identifying a first version of the selected machine learning model” and to perform the training on the first version of the selected machine learning model. The motivation for doing so would have been to enable the multiple versions of a model to be trained. See Muddu, paragraph [0298] (“versioning of the machine learning models simultaneous training of different machine learning models using the same data to produce model states corresponding to different windows of training data sets”). 
Bauer, in an analogous art, teaches:
obtaining a plurality of previously stored hyperparameter values associated with the first version of the selected machine learning model based on: 
	identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model…
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold;
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model:
selecting,…from the determined range of values, a first group of hyperparameter values; and
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is not in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model…

Bauer generally teaches “methods for deep learning optimization” (see title) involving the use of a “hyperparameter design space” ([0028]). Therefore, Bauer is in the same field of endeavor as the claimed invention, namely machine learning.
	In particular, Bauer teaches:
obtaining a plurality of previously stored hyperparameter values [[0051]: “selecting the deep learning model configuration with the highest result metric as the output deep learning model configuration”; For example, [0080]: “deep learning model configuration that corresponds to the point 406(3) having the highest result metric.” Note that point 406(3), shown in FIG. 4B is in a two-dimensional space; thus, it includes a plurality of hyperparameter values, respectively corresponding to a plurality of hyperparameters.] associated with the first version of the selected machine learning model [The concept of the “first version” is analogously taught in [0042]: “generating 108 a second deep learning model configuration.” [0078]: “an even more optimized version of that deep learning model configuration may be found.” That is, the “second deep learning model,” which is an optimized version, corresponds to a first version. The above hyperparameter values are considered to be “associated with” this version, because they are used to configure this version.] based on: identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, [[0080]: “in response to the deep learning model configuration that corresponds to the point 406(3) having the highest result metric, the first sample space 408 may be based on the deep learning model that corresponds to the point 406(3).” Note that the point 406(3) corresponds to a “second version” (i.e., a second configuration), which has hyperparameters in common with the “first version”, i.e., the hyperparameters represented by the grid shown in FIG. 4B.]
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold. [[0080]: “the first sample space 408 may be based on the deep learning model that corresponds to the point 406(3).” The range of the first sample space 408, as shown in FIG. 4B, corresponds to a “range of values.” See [0063]: “The first sample space 208 may be smaller than the hyperparameter design space 200... The range of the first sample space for each dimension may include 50 possible values.” With respect to “threshold,” the boundaries of the first sample space constitute a threshold. This limitation is alternatively taught by [0041]: “selecting 106 the first sample space may include selecting the first sample space in response to the first metric exceeding the exploitation threshold”]
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model: [[0028]: “The hyperparameter design space may include all the parameters of the deep learning model or a subset of all of the parameters.” As shown in FIGS. 4A-4E, the search space may include multiple hyperparameters (dimensions 402, 404, see [0079]), that are in common among multiple versions (configurations) of a machine learning model (e.g., deep learning model).] selecting, from the determined range of values, a first group of hyperparameter values [[0081]: “a plurality of deep learning model configurations corresponding to the points 410(1)-(4). Each of the points 410 may be within the first sample space 408.”]
for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is not in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model: selecting a second group of hyperparameter values [[0028]: “The hyperparameter design space may include all the parameters of the deep learning model or a subset of all of the parameters.” This description of “subset” implicitly teaches that the hyperparameter selection method taught in the reference does not account for all possible parameters of the model. Therefore, the reference teaches that there may be hyperparameters that are configured without the use of the sampling space. It is implied that the values of these other hyperparameters are selected, since all hyperparameters of a model need to be at some value when the hyperparameter is used.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin and Muddu with the teachings of Bauer by modifying the combination of Achin and Muddu to include the operations of “obtaining a plurality of previously stored hyperparameter values associated with the first version of the selected machine learning model…determining a range of values for one or more of the previously stored hyperparameter values based on a threshold; for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model: selecting,…from the determined range of values, a first group of hyperparameter values; and for each hyperparameter of the plurality of hyperparameters associated with the first version of the selected machine learning model that is not in common with a hyperparameter of the one or more hyperparameters associated with the second version of the selected machine learning model…” and such that the training is based on said selected first and second groups of hyperparameter values. The motivation for doing so would have been to utilize existing hyperparameter values as a starting point to obtain a more optimized version of a machine learning model, as suggested by Bauer, paragraph [0078] (“by using an existing deep learning model configuration as a basis for optimization, an even more optimized version of that deep learning model configuration may be found”).
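For illustration only, the following Python sketch shows one way the cited passages of Bauer ([0051], [0063], [0080]) could be read: the configuration having the highest result metric is taken as a basis, and a smaller sample space is defined around it within the hyperparameter design space. The function names, the per-dimension half-width, and the uniform sampling are assumptions made for this sketch and are not taken from Bauer.

```python
import random


def select_sample_space(results, design_space, half_width):
    """Return a per-dimension (low, high) range centered on the best configuration.

    results:      list of (config_dict, result_metric) pairs already evaluated
    design_space: {name: (low, high)} bounds of the full design space
    half_width:   {name: width} assumed half-width of the smaller sample space
    """
    # Pick the configuration with the highest result metric.
    best_config, _ = max(results, key=lambda pair: pair[1])
    space = {}
    for name, (low, high) in design_space.items():
        center = best_config[name]
        # Clip the range so it stays inside the design space.
        space[name] = (max(low, center - half_width[name]),
                       min(high, center + half_width[name]))
    return space


def sample_configs(space, n, seed=0):
    """Draw n configurations uniformly from within the sample space."""
    rng = random.Random(seed)
    return [{name: rng.uniform(low, high) for name, (low, high) in space.items()}
            for _ in range(n)]
```

Under these assumptions, every configuration returned by `sample_configs` lies within the range derived from the best prior configuration, which parallels the statement in [0081] that each of the points 410 may be within the first sample space 408.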
Wistuba (2016), in an analogous art, teaches obtaining the plurality of previously stored hyperparameter values “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” Wistuba (2016) relates to “automatic hyperparameter optimization” (see title), and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Wistuba (2016) teaches obtaining the plurality of previously stored hyperparameter values “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” [Page 206: “Table 1. The list of all meta-features used by us.” § 4.4, paragraph 2: “The most popular way to describe data sets is by utilizing meta-features. These are simple, statistical or information theoretic properties extracted from the data set. The similarity between two data sets, as defined in Eq. 9, is then dependent on the Euclidean distance between the meta-features of the corresponding data sets.” § 1, last paragraph: “rank the hyperparameter configurations for the new data set, considering the similarity between the new data set and the previous ones.” That is, the hyperparameter configurations that are being ranked correspond to a “second version of the selected machine learning model,” and the meta-features of the “new data set” correspond to the “first data schema.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, and Bauer with the teachings of Wistuba (2016) by modifying the obtaining of the plurality of previously stored hyperparameter values to be “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” The motivation would have been to “use knowledge of the performance of an algorithm on given other data sets to automatically accelerate the hyperparameter optimization for a new data set.” (Wistuba (2016), abstract).
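For illustration only, the following Python sketch shows a similarity comparison that depends on the Euclidean distance between meta-feature vectors, as described in the quoted passage of Wistuba (2016), § 4.4. Eq. 9 of the reference is not reproduced in this action, so the sketch simply orders prior data sets by distance; the function names and the choice of meta-features are assumptions made for this sketch.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length meta-feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_prior_datasets(new_meta, prior_meta):
    """Return the names of prior data sets, most similar (smallest distance) first.

    new_meta:   meta-feature vector of the new data set
    prior_meta: {dataset_name: meta-feature vector} for previously seen data sets
    """
    return sorted(prior_meta,
                  key=lambda name: euclidean_distance(new_meta, prior_meta[name]))
```

Under these assumptions, hyperparameter configurations stored for the data sets at the front of the returned list would be the first candidates considered for the new data set.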
Van Rijn teaches the remaining limitations of “identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence of the each hyperparameter on one or more performance metrics of the selected machine learning model”; selection of the first group of hyperparameter values “based on the identified degree of influence for each associated hyperparameter.” Van Rijn generally pertains to “automated hyperparameter optimization methods” (abstract, first sentence), and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Van Rijn teaches identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence of the each hyperparameter on one or more performance metrics of the selected machine learning model [In general, § 1, paragraph 4 teaches: “given an algorithm, we aim to answer the following two questions: (1) Which of the algorithm’s hyperparameters matter most for empirical performance?” For more detail, see § 3, paragraphs 1-2: “determines how much each hyperparameter (and each combination of hyperparameters) contributes to the variance of ŷ across the algorithm’s hyperparameter space Θ…Algorithm A has n hyperparameters with domains Θ1, . . . ,Θn and configuration space Θ = Θ1 × ... × Θn. Let N = {1, … , n} be the set of all hyperparameters of A.” Note that ŷ denotes the “predictions ŷ for the performance of arbitrary hyperparameter settings.”] and selection of the first and second groups of hyperparameter values “based on the identified degree of influence for each associated hyperparameter” [§ 7, “Now that we know which hyperparameters are important, the next natural question is which values they should be set to in order to likely obtain good performance.” § 8, paragraph 2: “In order to determine which hyperparameter values tend to yield good performance, we fitted kernel density estimators to hyperparameter values that performed well on other datasets.” For example, § 7, paragraph 2 teaches: “For both types of SVMs, the best performance can typically be achieved with low values of the gamma hyperparameter” (referring to FIG. 6).].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, and Wistuba (2016) with the teachings of Van Rijn by performing the further operation of “identifying, for each hyperparameter of a plurality of hyperparameters associated with the selected machine learning model, a degree of influence of the each hyperparameter on one or more performance metrics of the selected machine learning model” and by modifying the selection of the first and second groups of hyperparameter values to each be “based on the identified degree of influence for each associated hyperparameter.” The motivation for doing so would have been to determine the most important hyperparameters (Van Rijn, abstract: “to determine the most important hyperparameters”) and to select suitable values of such hyperparameters for good performance (Van Rijn, § 7: “Now that we know which hyperparameters are important, the next natural question is which values they should be set to in order to likely obtain good performance”).
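For illustration only, the following Python sketch shows how a share of performance variance might be attributed to each hyperparameter individually, in the sense of the quoted statement that the analysis “determines how much each hyperparameter…contributes to the variance of ŷ.” The sketch is a simplified main-effect calculation over a balanced grid of runs and is not represented to be the procedure Van Rijn actually uses; the function name and input layout are assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance


def main_effect_importance(runs, names):
    """Fraction of total performance variance explained by each hyperparameter alone.

    runs:  list of (config_dict, performance) pairs covering a balanced full grid
    names: hyperparameter names to analyze
    """
    total = pvariance([perf for _, perf in runs])
    importance = {}
    for name in names:
        # Group performance by the value of this hyperparameter.
        groups = defaultdict(list)
        for config, perf in runs:
            groups[config[name]].append(perf)
        # Marginal performance: average over all other hyperparameters.
        marginal = [mean(values) for values in groups.values()]
        importance[name] = pvariance(marginal) / total if total > 0 else 0.0
    return importance
```

Under these assumptions, a hyperparameter with a larger returned fraction has a greater degree of influence on the performance metric; interaction effects between hyperparameters are not attributed by this simplified calculation.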

As to claim 2, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the metadata includes at least one selected from the group consisting of:
a size of the training set, a number of features in the dataset, a percentage of types of data fields in the dataset, a type of classification problem, a variance of types of data fields in the dataset, and an indication whether features of the dataset follow a statistical distribution. [Achin, [0057]: “the number of targets and/or features in the dataset,” corresponding to the claim recitation of “a number of features in the dataset”; see also [0058]: “the properties of the distribution of each variable's values or class membership; cardinality of the variables,” corresponding to the claim recitation of “an indication whether features of the dataset follow a statistical distribution”]

As to claim 3, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the selecting a machine learning model comprises:
executing a secondary machine learning model, the secondary machine learning model returning selection of the first version of the selected machine learning model and returning suitable machine learning hyperparameter values for use with the first version of the selected machine learning model. [[0087]: “exploration engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on the output of a ‘meta’ machine-learning model, which may be trained to determine the suitability of a modeling procedure for a prediction problem”]
The thus-far combination of references does not explicitly teach the limitations that the execution of the secondary machine learning model is “based on the metadata as input,” and the secondary model “returning suitable machine learning hyperparameter values for use with the first version of the selected machine learning model.”
Wistuba (2016) further teaches executing a secondary machine learning model [§ 4, paragraph 1: “two-stage surrogate model.”] “based on the metadata as input” [Metadata is described in § 4.4, paragraph 2 (“Description Using Meta-features”): “The most popular way to describe data sets is by utilizing meta-features. These are simple, statistical or information theoretic properties extracted from the data set.” See Table 1 on page 206 for example metadata. Such meta-features for a new data set are used as input to the surrogate model as described in § 4, paragraphs 1-2: “The first stage of the surrogate model approximates the hyperparameter response functions of a new data set and each data set from the meta-data individually with Gaussian processes… For approximating the hyperparameter response function fD, any machine learning model can be used.”] and the secondary model “returning suitable machine learning hyperparameter values for use with the first version of the selected machine learning model.” [§ 4.3, paragraph 1: “The second stage combines all models of the first stage within one surrogate model Ψ to rank the different hyperparameter configurations and predict the uncertainty about the ranking. The predicted score of a hyperparameter configuration is determined.” See also § 5.3: “The average rank ranks the tuning strategies per data set according to the best found hyperparameter configuration.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated, into the thus-far combination of references, the above further teachings of Wistuba (2016) by modifying the selecting operation such that the execution of the secondary machine learning model is “based on the metadata as input,” and that the secondary model also performs the operation of “returning suitable machine learning hyperparameter values for use with the first version of the selected machine learning model,” so as to arrive at each and every limitation of the instant claim. The motivation for doing so would have been to perform “the task of hyperparameter tuning as well as the task of combined algorithm selection and hyperparameter tuning” (Wistuba (2016), § 6, paragraph 1), particularly in a manner that utilizes meta-knowledge to accelerate hyperparameter search (Wistuba (2016), § 6, paragraph 1: “we propose a two-stage transfer surrogate for using meta-knowledge to accelerate the search with the SMBO framework.”).

As to claim 4, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the one or more performance metrics includes at least one selected from the group consisting of: accuracy, error, precision, recall, area under the receiver operating characteristic (ROC) curve, and area under the precision recall curve. [Van Rijn, § 3, paragraph 6: “performance yi (e.g., accuracy or AUC score) of an algorithm”; § 4.1, last paragraph: “a random search for maximizing accuracy.”]

As to claim 5, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the identifying a degree of influence further comprises: 
executing a secondary machine learning model using the plurality of hyperparameters associated with the first version of the selected machine learning model as input, the secondary machine learning model returning a ranking of the plurality of hyperparameters according to the degree of influence on the one or more performance metrics of the first version of the selected machine learning model. [Van Rijn, § 6, paragraphs 1 and 3: “determining the most important hyperparameters per classifier…this analysis is based on the performance data of 250,195 algorithm runs… The middle figure (e.g., Figure 2(b)) shows the results of the verification experiment. It shows the average rank of each run of random search, labeled with the hyperparameter whose value was fixed to a default value. A high rank implies poor performance compared to the other configurations, meaning that tuning this hyperparameter would have been important.” Note that “algorithm runs” refers to the use of machine learning models to perform verification. Such models read on “secondary machine learning model.”] 

2.	Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu, Bauer, Van Rijn, and Wistuba (2016), and further in view of Jimenez et al., “Finding Optimal Model Parameters by Discrete Grid Search,” in E. Corchado et al. (Eds.): Innovations in Hybrid Intelligent Systems, ASC 44, pp. 120–127, 2007 (“Jimenez”) and Bergstra et al., “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research 13 (2012) 281-305 (“Bergstra”).
As to claim 6, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the selecting, based on the identified degree of influence for each associated hyperparameter, further comprises: identifying the hyperparameter value for each of the plurality of hyperparameters [This limitation is taught by the combination of references for the reasons discussed for the limitation of “selecting…a second group of hyperparameter values.”]
The combination of references does not teach the limitation that identifying hyperparameter values is “based on a search, the search having a variable granularity, wherein the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the first version of the selected machine learning model.”
Jimenez, in an analogous art, teaches the limitation of “based on a search, the search having a variable granularity.” Jimenez relates to “finding optimal model parameters by discrete grid search” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Jimenez teaches identifying hyperparameter values based on a search, [§ 2, paragraph 1: “our method is based on the grid-search algorithm proposed in [citation]. Instead of assuming a two-dimensional grid with two continuous parameters, we generalize the method to M possible discrete parameters.”] the search having a variable granularity [§ 2, paragraph 1: “The parameter discretization is performed by fixing minimum and maximum values for each one, and by selecting a resolution level δi that determines how many possible values of the parameter will be considered for model construction. Assume that each parameter…to be optimally chosen can be discretized to values in a range {ai, ai + δ, …, ai + Hiδi = bi}…” Here, the concept of “granularity” is disclosed in the form of, for example, the number of sampling points for a particular parameter, which is determined based on the resolution level and the range ai to bi. With respect to the limitation of “variable” granularity, as quoted above, in the general case, each has a different respective discretization {ai, ai + δ, …, ai + Hiδi = bi} and hence, a different “granularity” for its respective dimension in the search grid. For example, as shown in Table 1 (described in § 3, paragraph 1), the hyperparameters NH and log10 μ for a multi-layer perceptron have different values of a, b, and δ; therefore, these hyperparameters have a different granularity in the form of a different number of sampling points. Similarly in the case of an SVM model (as described in § 3, paragraph 1) the three other parameters shown in Table 1 have different granularities. It is noted that the discretization is based on a user’s “discretization choices” (see § 3, paragraph 1) and thus may vary among different hyperparameters for any arbitrary reason.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn with the teachings of Jimenez by modifying the operation of identifying hyperparameter values to be “based on a search, the search having a variable granularity,” for the purpose of implementing “an algorithm for finding optimal parameters that works with no specific information about the underlying model” (Jimenez, abstract).
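For illustration only, the following Python sketch enumerates the discretized values of a parameter from a minimum a, a maximum b, and a resolution level δ, following the discretization quoted from Jimenez, § 2, paragraph 1. The function name is an assumption, and the numeric values in the usage note are hypothetical and are not taken from Table 1 of Jimenez.

```python
def discretize(a, b, delta):
    """Return the grid values [a, a + delta, ..., a + H*delta], where a + H*delta == b.

    a, b:  minimum and maximum values fixed for the parameter
    delta: resolution level; a smaller delta yields more sampling points
    """
    steps = round((b - a) / delta)  # H, the number of steps from a to b
    if steps < 0 or abs(a + steps * delta - b) > 1e-9:
        raise ValueError("b must equal a + H*delta for a non-negative integer H")
    return [a + k * delta for k in range(steps + 1)]
```

Under these assumptions, `discretize(2, 20, 2)` yields 10 sampling points while `discretize(-4, 0, 1)` yields 5, so two parameters searched on the same grid would have different granularities in their respective dimensions.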
Bergstra, in an analogous art, suggests “wherein the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the first version of the selected machine learning model.” Bergstra generally pertains to “strategies for hyper-parameter optimization” (abstract, first sentence) and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Bergstra suggests “wherein the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the first version of the selected machine learning model” [Page 283, second-to-bottom paragraph: “Ψ of interest are more sensitive to changes in some dimensions than others…Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality.” Page 284, first paragraph: “different subspaces are important, and to different degrees. A grid with sufficient granularity to optimizing hyper-parameters.” See FIG. 1, which illustrates the concept of important and unimportant parameters, and its relation to the “granularity” concept mentioned above. That is, this reference teaches, to one of ordinary skill in the art, that granularity should generally be sufficient depending on the “importance” or “sensitivity” of the hyperparameter, which are analogous to the “degree of influence” taught by Van Rijn. This teaching suggests the instant limitation, which recites a correspondence between granularity and the degree of influence. The Examiner notes that while this reference primarily pertains to random search, its teachings are also applicable to grid search.] 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), Van Rijn, and Jimenez with the teachings of Bergstra by modifying the search such that “the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the first version of the selected machine learning model.” The motivation for doing so would have been to configure the grid search so that it has “sufficient granularity to optimizing hyper-parameters” (Bergstra, page 284, first paragraph).
As to claim 7, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the selecting, based on the identified degree of influence for each associated hyperparameter and from the determined range of values, a first group of hyperparameter values, further comprises: identifying a hyperparameter value within the determined range of values for one or more of the hyperparameters of the first version of the selected machine learning model [This limitation is taught by the combination of references for the reasons discussed for the limitation of “selecting…a first group of hyperparameter values.”]
The combination of references does not teach the limitation that identifying hyperparameter values is “based on a search, the search having a variable granularity, wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.”
Jimenez, in an analogous art, teaches the limitation of “based on a search, the search having a variable granularity.” Jimenez relates to “finding optimal model parameters by discrete grid search” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Jimenez teaches identifying hyperparameter values based on a search, [§ 2, paragraph 1: “our method is based on the grid-search algorithm proposed in [citation]. Instead of assuming a two-dimensional grid with two continuous parameters, we generalize the method to M possible discrete parameters.”] the search having a variable granularity [§ 2, paragraph 1: “The parameter discretization is performed by fixing minimum and maximum values for each one, and by selecting a resolution level δi that determines how many possible values of the parameter will be considered for model construction. Assume that each parameter…to be optimally chosen can be discretized to values in a range $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$…” Here, the concept of “granularity” is disclosed in the form of, for example, the number of sampling points for a particular parameter, which is determined based on the resolution level and the range ai to bi. With respect to the limitation of “variable” granularity, as quoted above, in the general case, each parameter has a different respective discretization $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$ and hence, a different “granularity” for its respective dimension in the search grid. For example, as shown in Table 1 (described in § 3, paragraph 1), the hyperparameters NH and $\log_{10}\mu$ for a multi-layer perceptron have different values of a, b, and δ; therefore, these hyperparameters have a different granularity in the form of a different number of sampling points. Similarly in the case of an SVM model (as described in § 3, paragraph 1) the three other parameters shown in Table 1 have different granularities. It is noted that the discretization is based on a user’s “discretization choices” (see § 3, paragraph 1) and thus may vary among different hyperparameters for any arbitrary reason.]
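For illustration only, the per-parameter discretization quoted from Jimenez can be sketched in Python as follows. The function name discretize, the parameter names, and the numeric ranges are hypothetical and are not taken from Table 1 of Jimenez. The sketch merely shows that a different a, b, and δ per parameter yields a different number of sampling points per dimension of the search grid.

```python
from itertools import product


def discretize(a, b, delta):
    """Return the values a, a + delta, ..., a + H*delta = b for one parameter."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    h = round((b - a) / delta)  # H: number of steps between a and b
    if h < 0 or abs(a + h * delta - b) > 1e-9:
        raise ValueError("delta must evenly divide the range [a, b]")
    return [a + k * delta for k in range(h + 1)]


# Hypothetical (a, b, delta) triples; NOT the Table 1 values of Jimenez.
spec = {
    "n_hidden": (5, 50, 5),
    "log10_mu": (-3.0, 0.0, 1.0),
}

# One list of sampling points per parameter (one grid dimension each).
grid = {name: discretize(a, b, delta) for name, (a, b, delta) in spec.items()}

# Granularity expressed as the number of sampling points per parameter.
points_per_param = {name: len(values) for name, values in grid.items()}

# The full search grid is the Cartesian product of the per-parameter values.
all_configs = list(product(*grid.values()))
```

In this sketch the two parameters receive different numbers of sampling points solely because their ranges and resolution levels differ, which is the sense of “variable granularity” relied upon above.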
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn with the teachings of Jimenez by modifying the operation of identifying hyperparameter values to be “based on a search, the search having a variable granularity,” for the purpose of implementing “an algorithm for finding optimal parameters that works with no specific information about the underlying model” (Jimenez, abstract).
Bergstra, in an analogous art, suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” Bergstra generally pertains to “strategies for hyper-parameter optimization” (abstract, first sentence) and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Bergstra suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model” [Page 283, second-to-bottom paragraph: “Ψ of interest are more sensitive to changes in some dimensions than others…Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality.” Page 284, first paragraph: “different subspaces are important, and to different degrees. A grid with sufficient granularity to optimizing hyper-parameters.” See FIG. 1, which illustrates the concept of important and unimportant parameters, and its relation to the “granularity” concept mentioned above. That is, this reference teaches, to one of ordinary skill in the art, that granularity should generally be sufficient depending on the “importance” or “sensitivity” of the hyperparameter, which are analogous to the “degree of influence” taught by Van Rijn. This teaching suggests the instant limitation, which recites a correspondence between granularity and the degree of influence. The Examiner notes that while this reference primarily pertains to random search, its teachings are also applicable to grid search.] 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), Van Rijn, and Jimenez with the teachings of Bergstra by modifying the search such that “the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” The motivation for doing so would have been to configure the grid search so that it has “sufficient granularity to optimizing hyper-parameters” (Bergstra, page 284, first paragraph).
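As a non-limiting illustration of what a correspondence between granularity and degree of influence might look like, the following Python sketch assigns more grid points to hyperparameters with higher importance scores. The linear scaling rule, the function name points_for_importance, the point limits, and the scores are all assumptions made for this example. Neither Bergstra nor Van Rijn is asserted to disclose this particular rule.

```python
def points_for_importance(importance, min_points=2, max_points=10):
    """Map each hyperparameter's importance score to a number of grid points.

    Assumed rule (illustrative only): scale linearly between min_points and
    max_points relative to the most important hyperparameter.
    """
    top = max(importance.values())
    if top <= 0:
        # No hyperparameter has measurable influence: use the coarsest grid.
        return {name: min_points for name in importance}
    span = max_points - min_points
    return {
        name: min_points + round(span * score / top)
        for name, score in importance.items()
    }


# Hypothetical importance scores (for example, fractions of explained variance).
scores = {"learning_rate": 1.0, "n_hidden": 0.5, "momentum": 0.0}
granularity = points_for_importance(scores)
```

Under this assumed rule, the most influential hyperparameter is searched on the finest grid and a hyperparameter with no measured influence is searched on the coarsest grid.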

As to claim 8, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, wherein the selecting, based on the identified degree of influence for each associated hyperparameter, a second group of hyperparameter values, further comprises: identifying a hyperparameter value within the determined range of values for one or more of the hyperparameters of the first version of the selected machine learning model [This limitation is taught by the combination of references for the reasons discussed for the limitation of “selecting…a second group of hyperparameter values.”].
The combination of references does not teach the limitation that identifying hyperparameter values is “based on a search, the search having a variable granularity, wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.”
Jimenez, in an analogous art, teaches the limitation of “based on a search, the search having a variable granularity.” Jimenez relates to “finding optimal model parameters by discrete grid search” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Jimenez teaches identifying hyperparameter values based on a search, [§ 2, paragraph 1: “our method is based on the grid-search algorithm proposed in [citation]. Instead of assuming a two-dimensional grid with two continuous parameters, we generalize the method to M possible discrete parameters.”] the search having a variable granularity [§ 2, paragraph 1: “The parameter discretization is performed by fixing minimum and maximum values for each one, and by selecting a resolution level δi that determines how many possible values of the parameter will be considered for model construction. Assume that each parameter…to be optimally chosen can be discretized to values in a range $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$…” Here, the concept of “granularity” is disclosed in the form of, for example, the number of sampling points for a particular parameter, which is determined based on the resolution level and the range ai to bi. With respect to the limitation of “variable” granularity, as quoted above, in the general case, each parameter has a different respective discretization $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$ and hence, a different “granularity” for its respective dimension in the search grid. For example, as shown in Table 1 (described in § 3, paragraph 1), the hyperparameters NH and $\log_{10}\mu$ for a multi-layer perceptron have different values of a, b, and δ; therefore, these hyperparameters have a different granularity in the form of a different number of sampling points. Similarly in the case of an SVM model (as described in § 3, paragraph 1) the three other parameters shown in Table 1 have different granularities. It is noted that the discretization is based on a user’s “discretization choices” (see § 3, paragraph 1) and thus may vary among different hyperparameters for any arbitrary reason.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn with the teachings of Jimenez by modifying the operation of identifying hyperparameter values to be “based on a search, the search having a variable granularity,” for the purpose of implementing “an algorithm for finding optimal parameters that works with no specific information about the underlying model” (Jimenez, abstract).
Bergstra, in an analogous art, suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” Bergstra generally pertains to “strategies for hyper-parameter optimization” (abstract, first sentence) and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Bergstra suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model” [Page 283, second-to-bottom paragraph: “Ψ of interest are more sensitive to changes in some dimensions than others…Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality.” Page 284, first paragraph: “different subspaces are important, and to different degrees. A grid with sufficient granularity to optimizing hyper-parameters.” See FIG. 1, which illustrates the concept of important and unimportant parameters, and its relation to the “granularity” concept mentioned above. That is, this reference teaches, to one of ordinary skill in the art, that granularity should generally be sufficient depending on the “importance” or “sensitivity” of the hyperparameter, which are analogous to the “degree of influence” taught by Van Rijn. This teaching suggests the instant limitation, which recites a correspondence between granularity and the degree of influence. The Examiner notes that while this reference primarily pertains to random search, its teachings are also applicable to grid search.] 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), Van Rijn, and Jimenez with the teachings of Bergstra by modifying the search such that “the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” The motivation for doing so would have been to configure the grid search so that it has “sufficient granularity to optimizing hyper-parameters” (Bergstra, page 284, first paragraph).

3.	Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu, Bauer, Van Rijn, and Wistuba (2016), and further in view of Wistuba et al., “Hyperparameter Search Space Pruning – A New Component for Sequential Model-Based Hyperparameter Optimization” in A. Appice et al. (Eds.): ECML PKDD 2015, Part II, LNAI 9285, pp. 104–119, 2015 (“Wistuba (2015)”).
As to claim 9, the combination of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn teaches the method of claim 1, but does not teach the further limitations of the instant claim.
Wistuba (2015), in an analogous art, teaches the further limitations. Wistuba (2015) pertains to “automatic hyperparameter optimization” (see title) and is therefore in the same field of endeavor as the claimed invention. Wistuba (2015) teaches a method that prunes a hyperparameter space for subsequent search. As shown in Algorithm 2 (page 109), the pruned hyperparameter space is defined based on a threshold δ. 
In particular, Wistuba (2015) teaches wherein a size of the threshold varies based on a degree of influence of the one or more previously stored hyperparameters on one or more performance metrics of the first version of the selected machine learning model. [Page 109, Algorithm 2 and paragraph 2: “The ν |G| hyperparameter configurations with little potential define regions where no improvement is predicted. Hence, the pruned hyperparameter space is defined as the set of hyperparameter configurations that are not within an δ-region of these low-potential hyperparameter configurations (Line 3).” Note that in line 3 of the algorithm, δ is a threshold that defines a pruned hyperparameter space. “Low-potential” corresponds to a degree of influence on a performance metric of the version of the model that is to be optimized in Algorithm 1 (page 107). See also § 4.1, paragraph 1: “the potential is the predicted improvement when choosing λ over the hyperparameter configurations already evaluated.”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), and Van Rijn with the teachings of Wistuba (2015) by implementing the features that “a size of the threshold varies based on a degree of influence of the one or more previously stored hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” The motivation would have been to vary the search space “to avoid unnecessary function evaluations in regions where we do not expect any improvements” (Wistuba (2015), § 4, paragraph 1).
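For illustration only, the δ-region pruning described in the quoted passage of Wistuba (2015) can be sketched in Python as follows. The use of Euclidean distance, the function name prune_search_space, and the example configurations are assumptions made for this sketch. The sketch does not reproduce Algorithm 2 of Wistuba (2015), which also involves predicting the potential of each configuration.

```python
import math


def prune_search_space(candidates, low_potential, delta):
    """Keep candidates that lie outside the delta-region of every
    low-potential configuration (Euclidean distance is an assumption)."""
    return [
        c for c in candidates
        if all(math.dist(c, p) > delta for p in low_potential)
    ]


# Hypothetical hyperparameter configurations as points in a 2-D space.
candidates = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
low_potential = [(0.5, 0.5)]  # configurations predicted to yield no improvement

# A larger delta removes a larger region around each low-potential point.
pruned = prune_search_space(candidates, low_potential, delta=1.0)
```

In this sketch, δ acts as the threshold that defines the pruned hyperparameter space: increasing δ excludes more candidate configurations from the subsequent search.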

4.	Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Wistuba (2016).
As to claim 11, Achin teaches the method of claim 10, wherein the selecting a machine learning model further comprises:
executing a secondary machine learning model, the secondary machine learning model returning the selection of the machine learning model. [[0087]: “exploration engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on the output of a ‘meta’ machine-learning model, which may be trained to determine the suitability of a modeling procedure for a prediction problem”]
Achin does not explicitly teach the limitations that the execution of the secondary machine learning model is “using the metadata as input to the secondary machine learning model,” and that the secondary model returns “suitable hyperparameter values for use with the machine learning model.”
Wistuba (2016), in an analogous art, teaches the above limitations. Wistuba (2016) teaches a “surrogate model for automatic hyperparameter optimization” (see title), and is therefore in the same field of endeavor. 
In particular, Wistuba (2016) teaches executing a secondary machine learning model [§ 4, paragraph 1: “two-stage surrogate model.”] “using the metadata as input to the secondary machine learning model” [Metadata is described in § 4.4, paragraph 2 (“Description Using Meta-features”): “The most popular way to describe data sets is by utilizing meta-features. These are simple, statistical or information theoretic properties extracted from the data set.” See Table 1 on page 206 for example metadata. Such meta-features for a new data set are used as input to the surrogate model as described in § 4, paragraphs 1-2: “The first stage of the surrogate model approximates the hyperparameter response functions of a new data set and each data set from the meta-data individually with Gaussian processes… For approximating the hyperparameter response function fD, any machine learning model can be used.”] and the secondary model “returning… suitable hyperparameter values for use with the machine learning model.” [§ 4.3, paragraph 1: “The second stage combines all models of the first stage within one surrogate model Ψ to rank the different hyperparameter configurations and predict the uncertainty about the ranking. The predicted score of a hyperparameter configuration is determined.” See also § 5.3: “The average rank ranks the tuning strategies per data set according to the best found hyperparameter configuration.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin with the teachings of Wistuba (2016) by modifying the selecting operation such that the execution of the secondary machine learning model is by “using the metadata as input to the secondary machine learning model,” and that the secondary model returns “suitable hyperparameter values for use with the machine learning model,” so as to arrive at each and every limitation of the instant claim. The motivation for doing so would have been to perform “the task of hyperparameter tuning as well as the task of combined algorithm selection and hyperparameter tuning” (Wistuba (2016), § 6, paragraph 1), particularly in a manner that utilizes meta-knowledge to accelerate hyperparameter search (Wistuba (2016), § 6, paragraph 1: “we propose a two-stage transfer surrogate for using meta-knowledge to accelerate the search with the SMBO framework.”).

5.	Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Van Rijn in view of Jimenez.
As to claim 16, Van Rijn teaches the method of claim 13, but does not teach the further limitations of the instant claim. 
Jimenez, in an analogous art, teaches the limitation of “identifying a hyperparameter value for each of the plurality of hyperparameters based on a search, the search having a variable granularity.” Jimenez relates to “finding optimal model parameters by discrete grid search” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Jimenez teaches identifying a hyperparameter value for each of the plurality of hyperparameters based on a search, [§ 2, paragraph 1: “our method is based on the grid-search algorithm proposed in [citation]. Instead of assuming a two-dimensional grid with two continuous parameters, we generalize the method to M possible discrete parameters.”] the search having a variable granularity [§ 2, paragraph 1: “The parameter discretization is performed by fixing minimum and maximum values for each one, and by selecting a resolution level δi that determines how many possible values of the parameter will be considered for model construction. Assume that each parameter…to be optimally chosen can be discretized to values in a range $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$…” Here, the concept of “granularity” is disclosed in the form of, for example, the number of sampling points for a particular parameter, which is determined based on the resolution level and the range ai to bi. With respect to the limitation of “variable” granularity, as quoted above, in the general case, each parameter has a different respective discretization $\{a_i,\ a_i + \delta,\ \ldots,\ a_i + H_i\delta_i = b_i\}$ and hence, a different “granularity” for its respective dimension in the search grid. For example, as shown in Table 1 (described in § 3, paragraph 1), the hyperparameters NH and $\log_{10}\mu$ for a multi-layer perceptron have different values of a, b, and δ; therefore, these hyperparameters have a different granularity in the form of a different number of sampling points. Similarly in the case of an SVM model (as described in § 3, paragraph 1) the three other parameters shown in Table 1 have different granularities. It is noted that the discretization is based on a user’s “discretization choices” (see § 3, paragraph 1) and thus may vary among different hyperparameters for any arbitrary reason.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Van Rijn with the teachings of Jimenez by modifying the selecting to include “identifying a hyperparameter value for each of the plurality of hyperparameters based on a search, the search having a variable granularity.” The motivation would have been to implement “an algorithm for finding optimal parameters that works with no specific information about the underlying model” (Jimenez, abstract).
Bergstra, in an analogous art, suggests “wherein the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the selected machine learning model.” Bergstra generally pertains to “strategies for hyper-parameter optimization” (abstract, first sentence) and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Bergstra suggests “wherein the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the selected machine learning model” [Page 283, second-to-bottom paragraph: “Ψ of interest are more sensitive to changes in some dimensions than others…Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality.” Page 284, first paragraph: “different subspaces are important, and to different degrees. A grid with sufficient granularity to optimizing hyper-parameters.” See FIG. 1, which illustrates the concept of important and unimportant parameters, and its relation to the “granularity” concept mentioned above. That is, this reference teaches, to one of ordinary skill in the art, that granularity should generally be sufficient depending on the “importance” or “sensitivity” of the hyperparameter, which are analogous to the “degree of influence” taught by Van Rijn. This teaching suggests the instant limitation, which recites a correspondence between granularity and the degree of influence. The Examiner notes that while this reference primarily pertains to random search, its teachings are also applicable to grid search.] 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), Van Rijn, and Jimenez with the teachings of Bergstra by modifying the search such that “the granularity of the search corresponds to the degree of influence of each of the plurality of hyperparameters on the one or more performance metrics of the selected machine learning model.” The motivation for doing so would have been to configure the grid search so that it has “sufficient granularity to optimizing hyper-parameters” (Bergstra, page 284, first paragraph).

6.	Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu, Bauer, and Wistuba (2016).
As to claim 17, Achin teaches a method of determining one or more suitable hyperparameters for a machine learning model in an automated machine learning system, [[0101]: “Fitting the predictive models to the prediction problem's dataset(s) may include tuning one or more hyper-parameters of the predictive modeling procedure”; [0131]: “search for modeling solutions in either manual mode or automatic mode”; [0135]: “to tune the parameters of a predictive model or the hyper-parameters of a modeling technique.”] the method comprising:
receiving selection of a machine learning model; [[0080] teaches: “the suitability of a predictive modeling procedure for a prediction problem may be determined based on characteristics of the dataset.” [0085]: “may select the M modeling procedures most similar to the modeling procedure at issue…” [0136]: “the selected modeling techniques may be executed using the partitioned data to evaluate the search space.” Furthermore, “machine learning” models are taught in  [0203]: “the predictive modeling system may offer a set of predictive models, including traditional regression models, neural networks, and other machine learning models (e.g., random forests, boosted trees, support vector machines).” See also [0095].]
receiving a first dataset having a first data schema; [[0119]: “the exploration engine 110 prompts the user to select the dataset for the predictive modeling problem to be solved… the rules for mapping the target data schemas into the desired dataset schema.” See also [0120]: “each column of the matrix may correspond to a variable, and each row of the matrix may correspond to an observation” (i.e., row and column schema with respect to variables in the data.)]
selecting values for one or more hyperparameters of the selected machine learning model [[0101]: “Fitting the predictive models to the prediction problem's dataset(s) may include tuning one or more hyper-parameters of the predictive modeling procedure that generates the predictive model.” That is, training is performed after tuning (selecting) hyperparameter values.]
training the selected machine learning model using the selected values. [[0140]: “execution of a set of modeling techniques may comprise training one or more models on a same data sample extracted from the dataset.” See also [0134]: “predictive modeling system 100 may partition the dataset (or suggest a partitioning of the dataset) into a training set and a ‘holdout’ test set. In some embodiments, the training set is further partitioned into K folds for cross-validation.” Note that in K-fold cross-validation, different parts of the original dataset are rotated as training and test sets.]
Achin does not explicitly teach the limitation that the training is performed on “the first version of the selected machine learning model” and the limitation that the selecting is “from the determined range of values.” Achin also does not teach the following limitations:
identifying a first version of the selected machine learning model;
receiving a plurality of previously stored hyperparameter values associated with the selected machine learning model based on:
	identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, and
	identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model;
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold;
selecting values for one or more hyperparameters of the selected machine learning model from the determined range of values

Muddu, in an analogous art, teaches “identifying a first version of the selected machine learning model” and the limitation that the training is performed on “the first version of the selected machine learning model.” Muddu pertains to “model training and deployment” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. Muddu generally teaches the handling of different versions of machine learning models. See [0283]: “The model store 1532 stores model states that represent machine learning models or versions of the machine learning models.”
In particular, Muddu teaches “identifying a first version of the selected machine learning model” and training “the first version of the selected machine learning model” [[0234]: “the identity resolution module 812 can initiate, for a given user, different versions of the machine learning model at different point of time…As events related to the given user arrive, versions of a machine learning model are initiated, trained, activated, (optionally) continually updated, and finally expired.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin with the teachings of Muddu by modifying the method of Achin to include “identifying a first version of the selected machine learning model” and to perform the training on the first version of the selected machine learning model. The motivation for doing so would have been to enable the multiple versions of a model to be trained. See Muddu, paragraph [0298] (“versioning of the machine learning models simultaneous training of different machine learning models using the same data to produce model states corresponding to different windows of training data sets”). 
Bauer, in an analogous art, teaches:
receiving a plurality of previously stored hyperparameter values associated with the selected machine learning model based on:
	identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, and
	identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model;
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold;
[selecting…] from the determined range of values

Bauer generally teaches “methods for deep learning optimization” (see title) involving the use of a “hyperparameter design space” ([0028]). Therefore, Bauer is in the same field of endeavor as the claimed invention, namely machine learning.
	In particular, Bauer teaches:
receiving a plurality of previously stored hyperparameter values associated with the selected machine learning model [[0051]: “selecting the deep learning model configuration with the highest result metric as the output deep learning model configuration”; For example, [0080]: “deep learning model configuration that corresponds to the point 406(3) having the highest result metric.” Note that point 406(3), shown in FIG. 4B is in a two-dimensional space; thus, it includes a plurality of hyperparameter values, respectively corresponding to a plurality of hyperparameters.] based on: identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, [[0080]: “in response to the deep learning model configuration that corresponds to the point 406(3) having the highest result metric, the first sample space 408 may be based on the deep learning model that corresponds to the point 406(3).” Note that the point 406(3) corresponds to a “second version” (i.e., a second configuration), which has hyperparameters in common with the “first version”, i.e., the hyperparameters represented by the grid shown in FIG. 4B.]
determining a range of values for one or more of the previously stored hyperparameter values based on a threshold [[0080]: “the first sample space 408 may be based on the deep learning model that corresponds to the point 406(3).” The range of the first sample space 408, as shown in FIG. 4B, corresponds to a “range of values.” See [0063]: “The first sample space 208 may be smaller than the hyperparameter design space 200... The range of the first sample space for each dimension may include 50 possible values.” With respect to “threshold,” the boundaries of the first sample space constitute a threshold. This limitation is alternatively taught by [0041]: “selecting 106 the first sample space may include selecting the first sample space in response to the first metric exceeding the exploitation threshold”]
selecting values for one or more hyperparameters of the selected machine learning model from the determined range of values [[0028]: “The hyperparameter design space may include all the parameters of the deep learning model or a subset of all of the parameters.” As shown in FIGS. 4A-4E, the search space may include multiple hyperparameters (dimensions 402, 404, see [0079]), that are in common among multiple versions (configurations) of a machine learning model (e.g., deep learning model). With respect to the act of “selecting,” [0081] teaches: “a plurality of deep learning model configurations corresponding to the points 410(1)-(4). Each of the points 410 may be within the first sample space 408.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin and Muddu with the teachings of Bauer by modifying the combination of Achin and Muddu to include the operations of “receiving a plurality of previously stored hyperparameter values associated with the selected machine learning model based on: identifying a second version of the selected machine learning model having one or more hyperparameters in common with the first version of the selected machine learning model, and identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model; determining a range of values for one or more of the previously stored hyperparameter values based on a threshold” and such that the selecting of the values for one or more hyperparameters of the selected machine learning model is “from the determined range of values.” The motivation for doing so would have been to utilize existing hyperparameter values as a starting point to obtain a more optimized version of a machine learning model, as suggested by Bauer, paragraph [0078] (“by using an existing deep learning model configuration as a basis for optimization, an even more optimized version of that deep learning model configuration may be found”).
Wistuba (2016), in an analogous art, teaches obtaining the plurality of previously stored hyperparameter values “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” Wistuba (2016) relates to “automatic hyperparameter optimization” (see title), and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Wistuba (2016) teaches obtaining the plurality of previously stored hyperparameter values “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” [Page 206: “Table 1. The list of all meta-features used by us.” § 4.4, paragraph 2: “The most popular way to describe data sets is by utilizing meta-features. These are simple, statistical or information theoretic properties extracted from the data set. The similarity between two data sets, as defined in Eq. 9, is then dependent on the Euclidean distance between the meta-features of the corresponding data sets.” § 1, last paragraph: “rank the hyperparameter configurations for the new data set, considering the similarity between the new data set and the previous ones.” That is, the hyperparameter configurations that are being ranked correspond to a “second version of the selected machine learning model,” and the meta-features of the “new data set” correspond to the “first data schema.”]
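For illustration only, the idea of ordering previously seen data sets by the Euclidean distance between their meta-features can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Wistuba (2016): Eq. 9 of that reference is not reproduced in this action, so the sketch ranks by raw distance only. The function names, the data set names, and the meta-feature vectors are hypothetical, and the meta-features are assumed to be already scaled to comparable ranges.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two meta-feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("meta-feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def rank_by_similarity(new_meta, stored_meta):
    """Return stored data set names ordered from most to least similar.

    Assumption: a smaller distance means a higher similarity. The exact
    similarity function (Eq. 9 of the reference) is not modeled here.
    """
    return sorted(
        stored_meta,
        key=lambda name: euclidean_distance(new_meta, stored_meta[name]),
    )

# Hypothetical meta-feature vectors, assumed pre-scaled to [0, 1].
stored = {
    "dataset_a": [0.2, 0.5, 0.1],
    "dataset_b": [0.9, 0.1, 0.8],
    "dataset_c": [0.25, 0.45, 0.15],
}
new = [0.2, 0.5, 0.2]

ranking = rank_by_similarity(new, stored)
print(ranking)
```

Hyperparameter configurations stored for the data sets at the front of such a ranking would be the natural first candidates for the new data set.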
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, and Bauer with the teachings of Wistuba (2016) by modifying the obtaining of the plurality of previously stored hyperparameter values to be “based on…identifying a similarity between the first data schema and a second data schema of a second dataset associated with the second version of the selected machine learning model.” The motivation would have been to “use knowledge of the performance of an algorithm on given other data sets to automatically accelerate the hyperparameter optimization for a new data set.” (Wistuba (2016), abstract).

7.	Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu, Bauer, and Wistuba (2016), and further in view of Jimenez and Bergstra.
As to claim 18, the combination of Achin, Muddu, Bauer, and Wistuba (2016) teaches the method of claim 17, wherein the selecting values for the one or more hyperparameters of the selected machine learning model further comprises: identifying a hyperparameter value within the determined range of values for one or more of the hyperparameters of the selected machine learning model [This limitation is taught by the combination of references for the reasons discussed for the limitation of “selecting values…from the determined range of values.”].
The combination of references does not teach the remaining limitations of “based on a search, the search having a variable granularity, wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the selected machine learning model.”
Jimenez, in an analogous art, teaches the limitation of “based on a search, the search having a variable granularity.” Jimenez relates to “finding optimal model parameters by discrete grid search” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Jimenez teaches identifying hyperparameter values based on a search, [§ 2, paragraph 1: “our method is based on the grid-search algorithm proposed in [citation]. Instead of assuming a two-dimensional grid with two continuous parameters, we generalize the method to M possible discrete parameters.”] the search having a variable granularity [§ 2, paragraph 1: “The parameter discretization is performed by fixing minimum and maximum values for each one, and by selecting a resolution level $\delta_i$ that determines how many possible values of the parameter will be considered for model construction. Assume that each parameter…to be optimally chosen can be discretized to values in a range $\{a_i, a_i+\delta, \ldots, a_i+H_i\delta_i=b_i\}$…” Here, the concept of “granularity” is disclosed in the form of, for example, the number of sampling points for a particular parameter, which is determined based on the resolution level and the range $a_i$ to $b_i$. With respect to the limitation of “variable” granularity, as quoted above, in the general case, each has a different respective discretization $\{a_i, a_i+\delta, \ldots, a_i+H_i\delta_i=b_i\}$ and hence, a different “granularity” for its respective dimension in the search grid. For example, as shown in Table 1 (described in § 3, paragraph 1), the hyperparameters NH and $\log_{10}\mu$ for a multi-layer perceptron have different values of a, b, and δ; therefore, these hyperparameters have a different granularity in the form of a different number of sampling points. Similarly in the case of an SVM model (as described in § 3, paragraph 1) the three other parameters shown in Table 1 have different granularities. It is noted that the discretization is based on a user’s “discretization choices” (see § 3, paragraph 1) and thus may vary among different hyperparameters for any arbitrary reason.]
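For illustration only, the per-parameter discretization quoted above (a minimum a, a maximum b, and a resolution level δ for each hyperparameter) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Jimenez's implementation: the function name, the hyperparameter names, and the numeric ranges are hypothetical and are not the values of Table 1 of the reference.

```python
from fractions import Fraction

def discretize(a, b, delta):
    """Return the grid {a, a+delta, ..., a+H*delta = b} for one hyperparameter.

    Fractions are used so that the check "b - a is a whole multiple of delta"
    is exact rather than subject to floating-point rounding.
    """
    a, b, delta = Fraction(str(a)), Fraction(str(b)), Fraction(str(delta))
    if delta <= 0 or b < a:
        raise ValueError("require delta > 0 and a <= b")
    steps, remainder = divmod(b - a, delta)
    if remainder != 0:
        raise ValueError("b - a must be a whole multiple of delta")
    return [float(a + k * delta) for k in range(int(steps) + 1)]

# Hypothetical (a, b, delta) choices; NOT the Table 1 values of the reference.
grid_spec = {
    "n_hidden": (5, 50, 5),     # 10 sampling points
    "log10_mu": (-3, 0, 0.5),   # 7 sampling points
}

grids = {name: discretize(*spec) for name, spec in grid_spec.items()}
points_per_dim = {name: len(values) for name, values in grids.items()}
print(points_per_dim)
```

Because each hyperparameter carries its own range and resolution level, the number of sampling points, and hence the granularity of its dimension of the search grid, can differ from one hyperparameter to the next.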
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, and Wistuba (2016) with the teachings of Jimenez by modifying the operation of identifying hyperparameter values to be “based on a search, the search having a variable granularity,” for the purpose of implementing “an algorithm for finding optimal parameters that works with no specific information about the underlying model” (Jimenez, abstract).
Bergstra, in an analogous art, suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the selected machine learning model.” Bergstra generally pertains to “strategies for hyper-parameter optimization” (abstract, first sentence) and is therefore in the same field of endeavor as the claimed invention, namely machine learning.
In particular, Bergstra suggests “wherein the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the selected machine learning model” [Page 283, second-to-bottom paragraph: “Ψ of interest are more sensitive to changes in some dimensions than others…Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality.” Page 284, first paragraph: “different subspaces are important, and to different degrees. A grid with sufficient granularity to optimizing hyper-parameters.” See FIG. 1, which illustrates the concept of important and unimportant parameters, and its relation to the “granularity” concept mentioned above. That is, this reference teaches, to one of ordinary skill in the art, that granularity should generally be sufficient depending on the “importance” or “sensitivity” of the hyperparameter, which are analogous to the “degree of influence” taught by Van Rijn. This teaching suggests the instant limitation, which recites a correspondence between granularity and the degree of influence. The Examiner notes that while this reference primarily pertains to random search, its teachings are also applicable to grid search.] 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, Wistuba (2016), and Jimenez with the teachings of Bergstra by modifying the search such that “the granularity of the search corresponds to a degree of influence of each of the plurality of hyperparameters on one or more performance metrics of the first version of the selected machine learning model.” The motivation for doing so would have been to configure the grid search so that it has “sufficient granularity to optimizing hyper-parameters” (Bergstra, page 284, first paragraph).

8.	Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Achin in view of Muddu, Bauer, and Wistuba (2016), and further in view of Wistuba (2015).
As to claim 19, the combination of Achin, Muddu, Bauer, and Wistuba (2016) teaches the method of claim 17, but does not teach the further limitations of the instant claim. 
Wistuba (2015), in an analogous art, teaches the further limitations. Wistuba (2015) pertains to “automatic hyperparameter optimization” (see title), and is therefore in the same field of endeavor as the claimed invention. Wistuba (2015) teaches a method that prunes a hyperparameter space for subsequent search. As shown in Algorithm 2 (page 109), the pruned hyperparameter space is defined based on a threshold δ. 
In particular, Wistuba (2015) teaches wherein a size of the threshold varies based on a degree of influence of the one or more previously stored hyperparameters on one or more performance metrics of the selected machine learning model. [Page 109, Algorithm 2 and paragraph 2: “The ν |G| hyperparameter configurations with little potential define regions where no improvement is predicted. Hence, the pruned hyperparameter space is defined as the set of hyperparameter configurations that are not within an δ-region of these low-potential hyperparameter configurations (Line 3).” Note that in line 3 of the algorithm, δ is a threshold that defines a pruned hyperparameter space. “Low-potential” corresponds to a degree of influence on a performance metric of the version of the model that is to be optimized in Algorithm 1 (page 107). See also § 4.1, paragraph 1: “the potential is the predicted improvement when choosing λ over the hyperparameter configurations already evaluated.”].
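For illustration only, the pruning step quoted above (discarding configurations that fall within a δ-region of low-potential configurations) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Algorithm 2 of Wistuba (2015): the function name and the sample points are hypothetical, Euclidean distance is assumed as the distance measure, "within a δ-region" is assumed to mean a distance of at most δ, and the step that identifies low-potential configurations from predicted improvement is not modeled.

```python
import math

def prune(candidates, low_potential, delta):
    """Keep only candidates lying outside the delta-region of every
    low-potential configuration.

    Assumptions: Euclidean distance; a candidate is "within" a region
    when its distance to the low-potential point is <= delta.
    """
    return [
        c for c in candidates
        if all(math.dist(c, lp) > delta for lp in low_potential)
    ]

# Hypothetical two-dimensional hyperparameter configurations.
candidates = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.5), (1.0, 1.0)]
low_potential = [(0.0, 0.0)]

kept_small = prune(candidates, low_potential, delta=0.2)
kept_large = prune(candidates, low_potential, delta=0.8)
print(kept_small)
print(kept_large)
```

In this sketch a larger threshold removes more of the search space, which shows how the size of the threshold controls the extent of the pruned hyperparameter space.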
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Achin, Muddu, Bauer, and Wistuba (2016) with the teachings of Wistuba (2015) by implementing the features that “a size of the threshold varies based on a degree of influence of the one or more previously stored hyperparameters on one or more performance metrics of the selected machine learning model.” The motivation would have been to vary the search space “to avoid unnecessary function evaluations in regions where we do not expect any improvements” (Wistuba (2015), § 4, paragraph 1).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following documents evidence the state of the art.
Feurer et al., “Initializing Bayesian Hyperparameter Optimization via Meta-Learning,” Association for the Advancement of Artificial Intelligence (2015) teaches the use of various types of metadata for hyperparameter optimization.
Gomes et al., “Combining Meta-Learning and Search Techniques to SVM Parameter Selection,” 2010 Eleventh Brazilian Symposium on Neural Networks teaches the use of various types of metadata for hyperparameter optimization.
Li et al., “Hyperband: Bandit-based Configuration Evaluation for Hyperparameter Optimization,” ICLR 2017 teaches methods for hyperparameter evaluation.
Koch et al. (US 2018/0240041 A1) teaches grid search techniques.
Dirac et al. (US 10,474,926 B1) teaches the use of existing models as starting points.
Chen (US 2019/0362222 A1) teaches determining similarity based on metadata, in the context of model building using historical models.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 8:30 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Y.D.H./Examiner, Art Unit 2124

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124