DETAILED ACTION
This Action is in response to amendments and arguments filed 7 October 2021 for application 16/046993 filed 15 June 2018. Currently claims 1-20 are pending. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 18 October 2021 have been fully considered but they are not persuasive.
 
The Applicants Specifically Argue:
Applicant respectfully asserts that claims 1-20 do not recite a mental process because the human mind is not equipped to perform the claim elements. For example, claim 1 recites elements including "generating a first deep learning model configuration" and "generating a second deep learning model configuration." Deep learning models are complex machine learning models. The Specification states that "training a deep learning model is a computationally intense process that may take a long time." The human mind is not equipped to generate such complex computational models. Claim 1 also recites "calculating a first result metric for the first deep learning model configuration" and "calculating a second result metric for the second deep learning model configuration." Calculating the result metrics can include running the deep 
Thus, the claims do not recite a judicial exception in the form of an abstract idea, and, therefore, the claims are patent-eligible at Step 2A, Prong One of the Alice/Mayo analysis. 
 
Examiner Response
The Examiner respectfully disagrees.  As set forth in both the NOFA and the current Office Action, the Examiner maintains that the claims do recite mental steps that fall in the mental processes group because they entail observations, evaluations, judgments, or opinions, or entail mental functions that can be performed using pen and paper.  The selection of parameters for determining a learning model configuration according to a calculated metric involves mental steps of evaluation and judgment, while the generation of a model configuration involves mental steps of formulating a (mathematical) model (a process that can be performed with pen and paper); the recitation of deep learning is appropriately evaluated at step 2A in the analysis.

The Applicants Further Argue:
Step 2A - Prong Two 
Even if claims 1-20 were to recite a judicial exception, claims 1-20 are then to be evaluated to determine whether the claims integrate the judicial exception into a practical application at Alice/Mayo Step 2A, Prong Two. MPEP § 2106.04(II)(A)(2). If the claim elements reflect an improvement in the functioning of a computer or an improvement to another technology or technical field, then the claim integrates the alleged judicial exception into a practical application and, thus, imposes a meaningful limit on the judicial exception and no further analysis is required because the claim is eligible at Step 2A, Prong Two. MPEP § 2106.04(d)(1). Applicant respectfully submits that the claims are directed to an improvement to computing technology or a technical field and are integrated into a practical application. … [0005] Deep learning models are adept at solving a wide number of problems such as speech recognition or image classification. However, for a deep learning model configuration to be effective, its parameters should be configured correctly. One way to configuring a deep learning model configuration includes training the deep learning model configuration. However, training a deep learning model is a computationally intense process that may take a long time. While some time may be saved by setting up an initial configuration for the deep learning model configuration, there can be thousands or even millions of possible initial configurations. …However, as identified in par. [0005] of the Specification, this type of approach is computationally intense and can take a long time to complete. The approach of the claimed subject matter, on the other hand, overcomes the disadvantages of the conventional approach described in par. [0005]. As discussed in par. [0034], randomly generating the deep learning model configuration is "quicker ... than using some learning algorithms." 
Furthermore, "randomly generating parameters may not be affected by local minima as some learning algorithms, like backpropagation, may be." Par. [0078] further explains. "By randomly generating deep learning model configurations and by focusing the generation around deep learning models with high result metrics, an optimized initial deep learning model configuration can be found." 
Claim 1 embodies this improved approach described in the Specification, …
Step 2B 
Even if the claims were to be directed to a judicial exception and not be integrated into a practical application, the claims are then evaluated to determine whether the claims recite additional elements that amount to significantly more than the alleged judicial exception. MPEP § 2106.05(1). One consideration when making this determination is determining whether any additional claim elements are "well-understood, routine, or conventional activity." MPEP § 2106.05(d). Applicant respectfully asserts that claims 1-20 recite additional elements that are not well-understood, routine, nor conventional. 
As discussed in the previous subsection, conventional deep learning model configuration involves a human configuring certain parameters and training the deep learning model to configure other parameters. However, the claims of the Instant Application configure a deep learning model by generating the deep learning model, selecting a sample space around the deep learning model (the sample space including ranges of parameter values centered around the deep learning models current parameter values), selecting a configuration within the sample space, and evaluating the newly selected deep learning model. This vastly different approach is not well-understood, conventional, nor routine. Furthermore, even if these additional elements were to be well-understood, conventional, or routine when considered individually, the combination of these elements amount to an inventive concept. See MPEP § 2106.05(d)(1)(3). This is because the combination of these elements, as they are specifically arranged in claims 1-20, provides a method to generate a deep learning model configuration faster and more efficiently, something that is not found in the claim elements individually nor in a conventional arrangement of such claim elements. 

Examiner Response
The Examiner respectfully disagrees.  As set forth in both the NOFA and the current Office Action, the recitation of “deep learning” in the claims to perform the mental steps (including steps that may be performed with pen and paper) is at a high level of generality, which simply links the function performed by the mental steps to the particular machine-learning technological environment. While the Examiner acknowledges the complexity of designing and training deep learning models, the claims do not recite any function performed by those models themselves; i.e., a function or property which would integrate the mental steps into a practical application. Without the recitation of sufficient additional details on the function of the deep learning models (image classification, for instance, from a deep learning model derived from the recited hyperparameter optimization framework), the mere recitation of a deep learning model fails to integrate the judicial exception into a practical application (step 2A, Prong 2) and is therefore also considered insignificant extra-solution activity at step 2B.

The Applicants Further Argue:
Thus, Ye does not disclosure "[1] the first sample space includes a plurality of dimensions, wherein each dimension corresponds to a parameter of the selected first deep learning model configuration, [2] each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and [3] the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the selected first deep learning model configuration" as recited in amended claim 1. As mentioned before, independent claims 13 and 17 were amended to recite analogous subject matter to the amendments to claim 1.

Examiner Response
Applicant’s arguments with respect to the amended independent claims have been considered but are moot because the new ground of rejection of the independent claims in view of Yao and Varadarajan does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-20 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea, see Alice Corporation Pty. Ltd. v. CLS Bank International, et al, 573 U.S. (2014).
As an initial matter, according to the first part of the Alice analysis (Step 1), the claims were determined to be directed to one of the four statutory categories: an article of manufacture, a method/process (claims 1-20), a machine/system/product, and/or a composition of matter.
Secondly, because the claims were determined to be within one of the four categories (i.e., process, machine, manufacture, or composition of matter), it must be determined if the claims are directed to a judicial exception (i.e., law of nature, natural phenomenon, or abstract idea) (Step 2A). This step consists of a two-prong inquiry: (1) Does the claim recite an abstract idea, law of nature, or natural phenomenon? and (2) Does the claim recite additional elements that integrate the judicial exception into a practical application?
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims recite mathematical concepts and mental processes. This judicial exception is not integrated into a practical application because the generically recited computer elements do not add meaningful limitations. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as discussed in the following analysis.
Regarding independent claims 1, 13, and 17, the following analysis shows that the limitations recite the judicial exception of an abstract idea in the mathematical concepts and mental processes groups and do not recite additional elements that integrate the judicial exception into a practical application.

Claim 1 does not satisfy the two-Prong Test as explained in the analysis of each limitation below:
Step 2A
Prong 1: 
… method, comprising: generating a first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of formulating a model configuration.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
calculating a first result metric for the first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for the model configuration in which the calculation of the first result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
selecting a first sample space, wherein the first sample space is based on the first … model configuration, wherein each dimension corresponds to a parameter of the first … model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a sample space in which each dimension of the sample space has a range of values centered at a corresponding value of a  parameter of a model configuration.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
 generating a second … model configuration, wherein the second … model configuration is within the first sample space;  (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a second model configuration from a selected sample space.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
calculating a second result metric for the second … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for the (second) model configuration in which the calculation of the second result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of each dimension is centered on the current value of the corresponding parameter of the second … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a (second) sample space for/based on the (second) model configuration according to the mental step of comparing two metrics.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space.  (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of modifying a sample space for/based on the model configuration based on the result of the mental step of determining that the second result metric is less than or equal to the first result metric.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
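For illustration only, the sequence of steps recited in the limitations above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and is not a characterization of the Applicant's implementation: the function names, the two example parameters, the shrink factor of 0.9, the fixed number of steps, and the toy metric (a stand-in for whatever result metric is actually calculated) are all hypothetical and do not come from the claims or the Specification.

```python
import random

def sample_space(config, half_width):
    """One (low, high) range per parameter, centered on its current value."""
    return {name: (value - half_width, value + half_width)
            for name, value in config.items()}

def random_config(space, rng):
    """Draw one configuration uniformly from inside the sample space."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}

def toy_metric(config):
    """Hypothetical result metric: higher is better, peak at (0.1, 0.3)."""
    return -((config["learning_rate"] - 0.1) ** 2 + (config["dropout"] - 0.3) ** 2)

def search(initial, metric, half_width=1.0, shrink=0.9, steps=200, seed=0):
    rng = random.Random(seed)
    best, best_score = dict(initial), metric(initial)   # first configuration and metric
    space = sample_space(best, half_width)              # first sample space
    for _ in range(steps):
        candidate = random_config(space, rng)           # second configuration
        score = metric(candidate)                       # second result metric
        if score > best_score:
            # Second metric exceeds the first: re-center on the new configuration.
            best, best_score = candidate, score
        else:
            # Second metric does not exceed the first: reduce the sample space.
            half_width *= shrink
        space = sample_space(best, half_width)
    return best, best_score

if __name__ == "__main__":
    start = {"learning_rate": 0.5, "dropout": 0.9}
    print(search(start, toy_metric))
```

In this sketch the comparison of the two metrics is the only branch point: a better candidate moves the center of the ranges, and a worse or equal one narrows them around the unchanged center.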
Prong 2 (No): The claim recites additional elements:
A computer-implemented method:  The computers that perform the mental steps of calculating, selecting, generating, and reducing are recited at a high level of generality and are no more than mere instructions to apply the exception using a generic computer and, thereby, do not impose a meaningful limit on the judicial exception.
… deep learning …;  … deep learning …; … deep learning … deep learning …; … deep learning …;  … deep learning …; … deep learning …;   - The deep learning model is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment. 
None of these additional elements integrate the judicial exception into a practical application because the computing devices and the training of a machine learning model are recited at a high level of generality and correspond to generic computer functions.  
In addition, according to the second part of the Alice/Mayo test (step 2B), it must be determined if the claim as a whole recites something significantly more than the judicial exception, when its elements are considered both individually and as an ordered combination. The recitation in the preamble is insufficient to transform a judicial exception into a patentable invention because the preamble elements are recited at a high level of generality and are simply linked to a field of use, see MPEP 2106.05(h). The examiner further notes that the claim limitation(s) below are deemed insufficient to transform a judicial exception into a patentable invention, as described in the analysis that follows:
The elements in the limitations below are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
Generic computer implemented method, processing resources as noted above.
… deep learning …;  … deep learning …; … deep learning …; … deep learning … deep learning …;  … deep learning …; … deep learning …;– as noted above. 

As discussed in the step 1, 2A Prongs 1 and 2, and 2B analyses, the limitations of claim 1, examined individually or as an ordered combination, recite no meaningful limitations that amount to significantly more than the exception itself. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology. Therefore, when looking at the claim elements individually or as an ordered combination, claim 1 does not recite elements identified by the courts as "significantly more".

Claim 13 does not satisfy the two-Prong Test as explained in the analysis of each limitation below:
Step 2A
Prong 1: 
… method comprising: generating a plurality of first … model configurations; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of formulating model configurations.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
and calculating a first result metric for each of the plurality of first … model configurations; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for each model configuration in which the calculation of the first result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
selecting a first sample space, wherein the first sample space is based on the first … model configuration, wherein each dimension corresponds to a parameter of the first … model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a sample space in which each dimension of the sample space has a range of values centered at a corresponding value of a  parameter of a model configuration in which the centering of the sample space on each model configuration is also a mental process which can be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
generating a plurality of second … model configurations, wherein each second … model configuration is within the first sample space; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating second model configurations from a selected sample space.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
calculating a second result metric for each second … model configuration (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for the (second) model configuration in which the calculation of the second result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of each dimension is centered on the current value of the corresponding parameter of the second … model configuration corresponding to the second result metric that exceeds the first result metric; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a (second) sample space for/based on the (second) model configuration according to the mental step of comparing two metrics such that the range of that sample space is centered on a current value, in which the centering of the sample space on each model configuration is also a mental process which can be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
and in response to no result metric exceeding the first result metric, reducing the size of the first sample space. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of modifying a sample space for/based on the model configuration based on the result of the mental step of determining that the second result metric is less than or equal to the first result metric.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
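For illustration only, one round of the plural variant recited in the limitations above (several second configurations drawn from the same sample space, then either re-centering or shrinking) can be sketched as follows. This is a minimal sketch under stated assumptions and not the Applicant's implementation: the function name `batch_step`, the batch size, and the shrink factor are hypothetical, and the sketch assumes a single best first configuration has already been chosen as the center.

```python
import random

def batch_step(center, center_score, half_width, metric, rng, batch=8, shrink=0.5):
    """Sample `batch` configurations around `center`, then re-center or shrink."""
    # Sample space: one range per parameter, centered on the current value.
    space = {k: (v - half_width, v + half_width) for k, v in center.items()}
    # Plurality of second configurations, each inside the sample space.
    candidates = [{k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
                  for _ in range(batch)]
    # Second result metric for each candidate; keep the highest-scoring one.
    scored = [(metric(c), c) for c in candidates]
    top_score, top = max(scored, key=lambda pair: pair[0])
    if top_score > center_score:
        # A candidate beats the first metric: center the next space on it.
        return top, top_score, half_width
    # No candidate beats the first metric: keep the center, reduce the space.
    return center, center_score, half_width * shrink
```

Calling `batch_step` repeatedly with its own return values gives an iterative search; how many rounds to run and when to stop are left open here because the limitations above do not specify them.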
Prong 2 (No): The claim recites additional elements:
A computer-implemented method:  The computers that perform the mental steps of calculating, selecting, generating, and reducing are recited at a high level of generality and are no more than mere instructions to apply the exception using a generic computer and, thereby, do not impose a meaningful limit on the judicial exception.
… deep learning …;  … deep learning …; … deep learning … deep learning …; … deep learning …;  … deep learning …; … deep learning …; … deep learning …;   - The deep learning model is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment. 
None of these additional elements integrate the judicial exception into a practical application because the computing devices and the training of a machine learning model are recited at a high level of generality and correspond to generic computer functions.  
In addition, according to the second part of the Alice/Mayo test (step 2B), it must be determined if the claim as a whole recites something significantly more than the judicial exception, when its elements are considered both individually and as an ordered combination. The recitation in the preamble is insufficient to transform a judicial exception into a patentable invention because the preamble elements are recited at a high level of generality and are simply linked to a field of use, see MPEP 2106.05(h). The examiner further notes that the claim limitation(s) below are deemed insufficient to transform a judicial exception into a patentable invention, as described in the analysis that follows:
The elements in the limitations below are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
Generic computer implemented method, processing resources as noted above.
… deep learning …;  … deep learning …; … deep learning … deep learning …; … deep learning …;  … deep learning …; … deep learning …; … deep learning …;   – as noted above. 

As discussed in the step 1, 2A Prongs 1 and 2, and 2B analyses, the limitations of claim 13, examined individually or as an ordered combination, recite no meaningful limitations that amount to significantly more than the exception itself. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology. Therefore, when looking at the claim elements individually or as an ordered combination, claim 13 does not recite elements identified by the courts as "significantly more".

Regarding independent claim 17, the following analysis shows that the limitations recite the judicial exception of an abstract idea in the mathematical concepts and mental processes groups and do not recite additional elements that integrate the judicial exception into a practical application.
Claim 17 does not satisfy the two-Prong Test as explained in the analysis of each limitation below:
Step 2A
Prong 1: 
… method for improving a … model configuration, comprising: … a first … model configuration; The claim, under its broadest reasonable interpretation, recites mental steps of improving a model configuration, relative to an existing model configuration.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
calculating a first result metric for the first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for each model configuration in which the calculation of the first result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
selecting a first sample space, wherein the first sample space includes a plurality of dimensions, wherein each dimension corresponds to a parameter of the selected first … model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the first … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a sample space in which each dimension of the sample space has a range of values centered at a corresponding value of a  parameter of a model configuration in which the centering of the sample space on each model configuration is also a mental process which can be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
generating a second … model configuration, wherein the second … configuration is within the first sample space; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a second model configuration from a selected sample space.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
calculating a second result metric for the second … configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating a metric for the (second) model configuration in which the calculation of the second result metric is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of each dimension is centered on the current value of the corresponding parameter of the second … model configuration; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a (second) sample space for/based on the (second) model configuration according to the mental step of comparing two metrics in which the centering of the sample space on each model configuration is also a mental process which can be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of modifying a sample space for/based on the model configuration based on the result of the mental step of determining that the second result metric is less than or equal to the first result metric.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements:
Receiving - The function of receiving data is a mere data gathering step and the computers that perform that function are recited at a high level of generality that does not impose a meaningful limitation on the judicial exception. 
The computers that perform the mental steps of calculating, selecting, generating, and reducing are recited at a high level of generality and are no more than mere instructions to apply the exception using a generic computer and, thereby, do not impose a meaningful limit on the judicial exception.
… deep learning …;  … deep learning …; … deep learning …; … deep learning … deep learning …; … deep learning …;  … deep learning …; … deep learning …;   - The deep learning model is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
None of these additional elements integrate the judicial exception into a practical application because the computing devices and the training of a machine learning model are recited at a high level of generality and correspond to generic computer functions.  
In addition, according to the second part of the Alice/Mayo test (step 2B), it must be determined if the claim as a whole recites something significantly more than the judicial exception, when its elements are considered both individually and as an ordered combination. The recitation in the preamble is insufficient to transform a judicial exception into a patentable invention because the preamble elements are recited at a high level of generality and are simply linked to a field of use, see MPEP 2106.05(h). The examiner further notes that the claim limitation(s) below are deemed insufficient to transform a judicial exception into a patentable invention, as described in the analysis that follows:
The elements in the limitations below are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
Generic computer implemented method, processing resources as noted above.
… deep learning …;  … deep learning …; … deep learning …; … deep learning … deep learning …;  … deep learning …; … deep learning …;– as noted above.  
receiving… It is noted that the claimed extra-solution data gathering is acknowledged to be well-understood, routine, conventional activity (see, e.g., court recognized WURC examples in MPEP 2106.05(d)(II)(i)). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. 
As discussed in the Step 1, Step 2A (Prongs 1 and 2), and Step 2B analyses, the limitations of claim 17, examined individually or as an ordered combination, recite no meaningful limitations that amount to significantly more than the exception itself. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology. Therefore, when looking at the claim elements individually or as an ordered combination, claim 17 does not recite elements identified by the courts as "significantly more".

Furthermore, regarding dependent claims 2-12, which depend on claim 1, the disclosed limitations do not recite elements identified by the courts as "significantly more". The examiner notes that the dependent claim elements are deemed insufficient to transform a judicial exception into a patentable invention and are considered part of the abstract idea, as noted below:
Claim 2:
Step 2A
Prong 1 (Yes):
wherein the first … model configuration comprises an ensemble comprising at least two … model configurations.   (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of using at least two (distinct) model configurations in an ensemble of such model configurations (to form the overall model configuration).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning …;  … learning …;   - The deep learning and learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning …;  … learning …;   as noted above
Generic computer system, processing resources as noted above.
Claim 3:
Step 2A
Prong 1 (Yes):
wherein generating each of the first and second … model configurations comprises at least one of: generating a weight of an edge of a neural network model based on random number generation; generating an edge connecting two nodes of the neural network model, wherein the position of the edge is based on random number generation; generating a plurality of nodes of the neural network model, wherein the number of nodes is based on random number generation; and generating a plurality of layers of the neural network model, wherein the number of layers is based on random number generation. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a model configuration that corresponds to a selection of a number of layers, number of nodes, edge connectivity, and weights of a neural network in which the usage of random number generation to form this selection is a mental process that may be performed on pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning …;     - The deep learning is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning …;  as noted above
Generic computer system, processing resources as noted above.
Claim 4:
Step 2A
Prong 1 (Yes):
wherein calculating each of the first and second result metric comprises testing each of the first and second … model configurations on a testing dataset. (Yes)  The claim, under its broadest reasonable interpretation, recites mathematical calculation steps of calculating result metrics for the model configurations using a testing dataset.  The mere recitation of a generic computer device/system to perform these mathematical steps does not take the claim limitation out of the mental processes and mathematical concepts groups.
Prong 2 (No): The claim recites additional elements
… deep learning …;     - The deep learning is recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning …;  as noted above
Generic computer system, processing resources as noted above. 
Claim 5:
Step 2A
Prong 1 (Yes): wherein the first and second result metrics each comprise a metric based on at least one of: testing dataset accuracy;  overfit;  and underfit.  (Yes)  The claim, under its broadest reasonable interpretation, recites mathematical calculation steps of calculating result metrics for the model configurations using a testing dataset based on a determination of mathematical calculations of accuracy, overfit, and underfit.  The mere recitation of a generic computer device/system to perform these mathematical steps does not take the claim limitation out of the mental processes and mathematical concepts groups.
Prong 2 (No): The claim does not recite additional elements
Step 2B
The claim does not recite additional elements that the courts have identified as “significantly more” for the same reasons as pointed out in claim 1. 
Claim 6:
Step 2A
Prong 1 (Yes):
wherein reducing the size of the first sample space comprises reducing a size of the range of possible values for at least one dimension of the plurality of dimensions.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of reducing the size of sample spaces for model configurations which span multiple dimensions and ranges of parameters (the characterization of which can be done on pen and paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
Prong 2 (No): The claim does not recite additional elements
Step 2B
The claim does not recite additional elements that the courts have identified as “significantly more” for the same reasons as pointed out in claim 1.
Claim 7:
Step 2A
Prong 1 (Yes):
wherein a parameter of the … model configuration comprises at least one of: a weight of an edge of a neural network model; a position of an edge of the neural network model; a dropout node of the neural network model; a configuration of a plurality of … models of the … model; a number of layers of the neural network model; and number of nodes in a layer of the neural network model.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of specifying parameters of a model configuration that correspond to a selection of a number of layers, number of nodes, dropout nodes, and weights and positions of edges of a neural network and the selection of multiple (distinct) models in the overall model configuration.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning …… deep learning … machine learning …;     - The deep learning and machine learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning …… deep learning … machine learning …;   as noted above
Generic computer system, processing resources as noted above. 
Claim 8:
Step 2A
Prong 1 (Yes):
wherein the range of at least one dimension of the plurality of dimensions being centered on the current value of the corresponding parameter of the first … model configuration comprises the range of each dimension of the plurality of dimensions being centered on the current value of the corresponding parameter of the first … model configuration;  (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a separate sample space for each model configuration in which the centering of the sample space on each model configuration is also a mental process which can be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning …… deep learning … deep learning … deep learning;     - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning …… deep learning … deep learning … deep learning;   as noted above
Generic computer system, processing resources as noted above. 
Claim 9:
Step 2A
Prong 1 (Yes):
further comprising calculating an exploitation threshold; and wherein selecting a first sample space comprises selecting the first sample space in response to the first result metric exceeding the exploitation threshold.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of determining/selecting a threshold (exploitation) used for the mental steps of selecting a sample space according to a comparison of the result metric with that threshold such that the comparison may also be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim does not recite additional elements:
Step 2B: 
The claim does not recite additional elements that the courts have identified as “significantly more” for the same reasons as pointed out in claim 1.
Claim 10:
Step 2A
Prong 1 (Yes):
wherein the second result metric exceeding the first result metric comprises the second result metric exceeding the first result metric by a threshold amount, the threshold amount comprising an amount based on user configuration.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of determining if a result metric exceeds another result metric by more than a user-specified threshold amount such that this comparison may also be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
Prong 2 (No): The claim does not recite additional elements:
Step 2B: 
The claim does not recite additional elements that the courts have identified as “significantly more” for the same reasons as pointed out in claim 1. 
Claim 11:
Step 2A
Prong 1 (Yes):
further comprising: selecting a … model configuration as an output … model configuration; calculating a third result metric, wherein the third result metric is based on additional … set data and the output … model configuration; in response to determining that the third result metric is below a threshold result metric, selecting a third sample space.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a model configuration, calculating a result metric using another set of data applied to that configuration (with pen and paper), and selecting another sample space based on a comparison of the result metric with a threshold (also with pen and paper). The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
Prong 2 (No): The claim recites additional elements
… deep learning … training … deep learning … deep learning …;   - The deep learning models and the training are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … training … deep learning … deep learning …; as noted above
Generic computer system, processing resources as noted above. 
Claim 12:
Step 2A
Prong 1 (Yes):
wherein: generating the first … model configuration comprises generating the first … model using a first computational approach; and generating the second … model configuration comprises generating the second … model using a second computational approach.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating/designing multiple distinct model configurations using distinct computational approaches (which may be performed with pen and paper). The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.  
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning … deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning … deep learning … deep learning …;   -as noted above
Generic computer system, processing resources as noted above. 

Therefore, as a whole, claims 2-12 do not recite what the courts have identified as "significantly more".

Furthermore, regarding dependent claims 14-16, which depend on claim 13, the disclosed limitations do not recite elements identified by the courts as "significantly more". The examiner notes that the dependent claim elements are deemed insufficient to transform a judicial exception into a patentable invention and are considered part of the abstract idea, as noted below:
Claim 14:
Step 2A
Prong 1 (Yes):
wherein generating the plurality of first … model configurations comprises generating a predetermined number of … model configurations, wherein the predetermined number is based on a user-defined confidence interval.   (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a specified number of model configurations according to a user-specified confidence interval (the association between the number and the user-specified confidence interval is a mental step which also can be performed with pen and paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning … deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning …;    -as noted above
Generic computer system, processing resources as noted above. 
Claim 15:
Step 2A
Prong 1 (Yes):
wherein selecting the … model configuration from the plurality of first … model configurations comprises selecting the … model configuration with a result metric above a predetermined threshold.   (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of selecting a model configuration from a plurality of such configurations based on the mental step of comparing a result metric with a threshold (which may be performed with pen and paper). The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning … deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning … deep learning … deep learning …;    -as noted above
Generic computer system, processing resources as noted above. 
Claim 16:
Step 2A
Prong 1 (Yes):
wherein: generating the plurality of first … model configurations comprises generating at least two of the first … model configurations in parallel; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating multiple model configurations in parallel such that the parallelism may be achieved using pen and paper (e.g., multiple pieces of paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
calculating the result first metric for each first … model configuration comprises calculating at least two first result metrics in parallel; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating multiple result metrics in parallel such that the parallelism may be achieved using pen and paper (e.g., multiple pieces of paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
and generating the plurality of second … model configurations comprises generating at least two … model configurations in parallel; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating multiple model configurations in parallel such that the parallelism may be achieved using pen and paper (e.g., multiple pieces of paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
and calculating the result second metric for each first … model configuration comprises calculating at least two second result metrics in parallel.; (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of calculating multiple result metrics in parallel such that the parallelism may be achieved using pen and paper (e.g., multiple pieces of paper).  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group. 
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning … deep learning … deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning … deep learning … deep learning … deep learning …;    -as noted above
Generic computer system, processing resources as noted above. 

Therefore, as a whole, claims 14-16 do not recite what the courts have identified as "significantly more".

Furthermore, regarding dependent claims 18-20, which depend on claim 17, the disclosed limitations do not recite elements identified by the courts as "significantly more". The examiner notes that the dependent claim elements are deemed insufficient to transform a judicial exception into a patentable invention and are considered part of the abstract idea, as noted below:
Claim 18:
Step 2A
Prong 1 (Yes):
wherein generating the second … model configuration comprises adjusting a parameter of the … model configuration. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a model configuration by modifying/adjusting a parameter of that model configuration. The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning …;   -as noted above
Generic computer system, processing resources as noted above. 
Claim 19:
Step 2A
Prong 1 (Yes):
wherein adjusting a parameter the first … model configuration comprises at least one of: adjusting a weight of an edge of a neural network of the first … model configuration within a predetermined amount; adding an edge to a neural network of the first … model configuration; and removing an edge from a neural network of the first … model configuration. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of generating a model configuration by modifying characteristics of the edge connections of a neural network model, such as the weights (by a specified amount) and the presence or absence of edges, in which these mental steps may be performed with pen and paper.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning … deep learning … deep learning … deep learning …;   - The deep learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … deep learning … deep learning … deep learning …;   -as noted above
Generic computer system, processing resources as noted above. 
Claim 20:
Step 2A
Prong 1 (Yes):
wherein the first … model configuration comprises an ensemble comprising a plurality of … models, and adjusting a parameter of the first … model configuration comprises at least one of: removing a … model from the ensemble; and adding a … model to the ensemble. (Yes)  The claim, under its broadest reasonable interpretation, recites mental steps of using at least two (distinct) model configurations in an ensemble of such model configurations (to form the overall model configuration) and modifying the overall model configuration by adding or subtracting individual model configurations.  The mere recitation of a generic computer device/system to perform these mental steps does not take the claim limitation out of the mental processes group.
Prong 2 (No): The claim recites additional elements
… deep learning … learning … deep learning … learning … learning …;   - The deep learning and learning models are recited at a high level of generality that merely generally links the judicial exception to a particular technological environment.
Step 2B: 
The element in the limitations below is insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g):
… deep learning … learning … deep learning … learning … learning …;   -as noted above
Generic computer system, processing resources as noted above. 

Therefore, as a whole, claims 18-20 do not recite what the courts have identified as "significantly more".
In summary, as shown in the analysis above, claims 1-20 do not provide any additional elements that, when considered individually or as an ordered combination, amount to significantly more than the abstract idea identified. Therefore, as a whole, claims 1-20 do not recite what the courts have identified as "significantly more". In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology when the claims are considered individually or as an ordered combination.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


Claims 1, 3-13, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Yao et al. (“Pre-training the deep generative models with adaptive hyperparameter optimization”, Neurocomputing 247, 2017, pp. 144-155), hereinafter referred to as Yao, in view of Varadarajan et al. (US2019/0095818, filed 31 January 2018), hereinafter referred to as Varadarajan. 

In regards to claim 1, Yao teaches A computer-implemented method, comprising: generating a first deep learning model configuration; ([p. 148, Section 4, p. 148, Section 5, Figure 1, Figure 2, Algorithm 1], "The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters $\lambda_*$ very fast. The procedure is: we randomly sample $\{\lambda_*^{(j)}\}_{j=1}^{M_{\text{predict}}}$ within the interval centered by the last best hyperparameter setting $\lambda_{\text{best}}$, and get the posterior distribution of $L_*^{(j)}$ for each point of $\lambda_*^{(j)}$ by Eqs. (17) and (18)."; "we perform the experiments on MNIST digits dataset to demonstrate that our new method is much more efficient than the traditional BO methods. In the application of the unsupervised learning task of text clustering, the empirical results show that the DGMs with the adaptive hyperparameters can surpass the state-of-the-art."; wherein a computer-based hyperparameter optimization framework determines an (optimized) deep (Figures 1, 2) learning model configuration by iterating over successive deep model configurations such that any configuration identified as a current or candidate solution (lambda) at an iteration (in Algorithm 1, a given t for either the predict or the “trails” stage) is a first model configuration.)
calculating a first result metric for the first deep learning model configuration; ([pp. 147-148, Section 4, Algorithm 1], "$\mathrm{FEG}_f$ indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. $\mathrm{FEG}_o$ can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …."; wherein performance (model fit/validation loss) L_FEG is computed for a first learning model configuration (lambda^i or lambda^j in Algorithm 1 but also lambda_best in a broader sense) for either a candidate or current solution.)
selecting a first sample space, wherein the first sample space includes a plurality of dimensions, wherein each dimension corresponds to a parameter of the first deep learning model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the first deep learning model configuration; ([pp. 147-148, Section 4, p. 149, Section 5.1.2, Algorithm 1, Table 1, Table 2, Table 3], "The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters $\lambda_*$ very fast. The procedure is: we randomly sample $\{\lambda_*^{(j)}\}_{j=1}^{M_{\text{predict}}}$ within the interval centered by the last best hyperparameter setting $\lambda_{\text{best}}$, and get the posterior distribution of $L_*^{(j)}$ for each point of $\lambda_*^{(j)}$ by Eqs. (17) and (18)."; "At the pre-training phase, the initial setup of the hyperparameters optimized by the FEG is shown in Table 2. The hyperparameters initialization, which will be optimized by SMAC and TPE, are shown in Table 3."; wherein a neural network hyperparameter search space includes (neural network type-dependent) specified ranges (Tables 2 and 3) for continuous and categorical hyperparameters such that the range of each parameter dimension (Algorithm 1) is centered at the current best configuration value (lambda_best such as may have been determined from a previous iteration).)
generating a second deep learning model configuration, wherein the second deep learning model configuration is within the first sample space; ([pp. 147-148, Section 4, Algorithm 1], "The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters $\lambda_*$ very fast. The procedure is: we randomly sample $\{\lambda_*^{(j)}\}_{j=1}^{M_{\text{predict}}}$ within the interval centered by the last best hyperparameter setting $\lambda_{\text{best}}$, and get the posterior distribution of $L_*^{(j)}$ for each point of $\lambda_*^{(j)}$ by Eqs. (17) and (18)."; wherein, at a given iteration t, a set of candidate model configurations (lambda_i over M_trails or lambda_j over M_predict) are generated within the hyperparameter design sample space for that iteration, which is within the first sample space in the case in which the first sample space is centered at a best configuration (first model configuration) determined at a previous iteration.)
calculating a second result metric for the second deep learning model configuration; ([pp. 147-148, Section 4, Algorithm 1], "$\mathrm{FEG}_f$ indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. $\mathrm{FEG}_o$ can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …."; "The procedure is: we randomly sample $\{\lambda_*^{(j)}\}_{j=1}^{M_{\text{predict}}}$ within the interval centered by the last best hyperparameter setting $\lambda_{\text{best}}$, and get the posterior distribution of $L_*^{(j)}$ for each point of $\lambda_*^{(j)}$ by Eqs. (17) and (18)."; wherein performance (model fit/validation loss) L_FEG is computed for each learning model configuration (lambda^i or lambda_j in Algorithm 1) for a candidate solution.)
in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of each dimension is centered on the current value of the corresponding parameter of the second deep learning model configuration ([pp. 147-148, Section 4, Algorithm 1], "The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters $\lambda_*$ very fast. The procedure is: we randomly sample $\{\lambda_*^{(j)}\}_{j=1}^{M_{\text{predict}}}$ within the interval centered by the last best hyperparameter setting $\lambda_{\text{best}}$, and get the posterior distribution of $L_*^{(j)}$ for each point of $\lambda_*^{(j)}$ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>"; wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best (first) configuration such that if a candidate model configuration (second model configuration) has a larger performance metric (equation 19) than the current best (as well as any of the other candidate configurations), the hyperparameters for the (second) model configuration form the center of the parameter search space for the next iteration (i.e., for a second sample space).) 
 However, Yao does not explicitly teach and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space.  Yao does not disclose how or if the “short interval” around the best configuration is changed over successive iterations. 
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization teaches and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space. ([0095, 0112, Figure 2, Figure 3] During or at the end of epoch 111 , current value range 133A of explored hyperparameter 123 may be narrowed by adjusting the minimum and / or maximum of the range to exclude values of tuples that yielded inferior scores . The narrowed range may be propagated into the next epoch ., Within best pair A - B , one point ( A or B ) has a higher score than the other point . In an embodiment , the horizontal position of the higher scoring point is used to set the new minimum or maximum for the new value range of hyperparameter 123 ., wherein an adaptive iterative search hyperparameter optimization process identifies the best performing model configuration design sample from among a set of candidate model configurations corresponding to a hyperparameter range and, in response to this identification, determines a narrower range of hyperparameters (a reduced sample space) based on the relative performance metrics over the sample space such that this progressive narrowing includes a reduction based on any second configuration performance metric not exceeding the first configuration performance metric (e.g., B relative to A or either C or D relative to either A or B as shown in Figure 2 where it is noted that A and B configurations that have the higher metrics are carried into a subsequent/reduced search design space).)   
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to reduce the size of the first sample space in response to the second result metric not exceeding the first result metric. The modification would have been obvious because one of ordinary skill would have been motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy, using an automated, scalable search over the hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration (Varadarajan, [0007, 0008, 0067, 0068, 0069]).
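For illustration only, the combination relied upon for claim 1 can be sketched as a simple search loop: a candidate configuration is sampled inside a range centered on the current best configuration, the range is re-centered when the candidate scores higher, and the range is narrowed otherwise. The sketch is not taken from Yao or Varadarajan and is not the Applicant's implementation. The function name `recenter_or_shrink_search`, the `score` callback, the uniform sampling, and the halving factor are all assumptions chosen for readability.

```python
import random


def recenter_or_shrink_search(score, start, half_width, iters=50, shrink=0.5, seed=0):
    """Hypothetical sketch: local search over a box-shaped sample space.

    score      -- callable mapping a configuration dict to a number (higher is better)
    start      -- initial configuration, e.g. {"learning_rate": 0.1}
    half_width -- per-parameter half-width of the range centered on the current best
    """
    rng = random.Random(seed)
    best = dict(start)
    best_score = score(best)
    width = dict(half_width)
    for _ in range(iters):
        # Sample a candidate inside the range centered on the current best.
        cand = {k: rng.uniform(best[k] - width[k], best[k] + width[k]) for k in best}
        cand_score = score(cand)
        if cand_score > best_score:
            # Candidate metric exceeds the current metric: the next sample
            # space is centered on the candidate's parameter values.
            best, best_score = cand, cand_score
        else:
            # Candidate metric does not exceed the current metric: reduce
            # the range of every dimension (assumed shrink factor).
            width = {k: w * shrink for k, w in width.items()}
    return best, best_score, width
```

In this sketch the re-centering branch corresponds to the limitation mapped to Yao, and the narrowing branch corresponds to the limitation mapped to Varadarajan. An actual system would likely bound the narrowing so that the range does not collapse to zero.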

In regards to claim 3, the rejection of claim 1 is incorporated, and Yao further teaches wherein generating each of the first and second deep learning model configurations comprises at least one of: generating a weight of an edge of a neural network model based on random number generation; generating an edge connecting two nodes of the neural network model, wherein the position of the edge is based on random number generation; generating a plurality of nodes of the neural network model, wherein the number of nodes is based on random number generation; and generating a plurality of layers of the neural network model, wherein the number of layers is based on random number generation. ([pp. 147-148, Section 4, Algorithm 1, Table 2, Table 3], The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein the sample space over the hyperparameter dimensions is randomly sampled (i.e., each sampled hyperparameter in that space at a given iteration is based on random number generation), wherein this random sampling includes the number of layers of a neural network (Table 3), the number of nodes/units per layer (table 3), and the positional connectivity of two nodes according to the dropout probability (table 3), and wherein it is noted that the claim only requires one of the items in the list of items.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 4, the rejection of claim 1 is incorporated, and Yao further teaches wherein calculating each of the first and second result metric comprises testing each of the first and second deep learning model configurations on a testing dataset.  ([pp. 147-148, Section 4, Algorithm 1], Based on Eq. (8), we give the definition of two kinds free energy gaps, FEGf and FEGo as follows:… where vtrain and vvalid are the training set and validation set respectively, and vnoise is handcrafted noisy data with almost no features., wherein (as shown in algorithm 1), the performance metrics are determined by testing the model configuration on a validation data set (and wherein it is noted that the performance metric L^j_FEG is also being interpreted (along with L^i_FEG) as being based on a testing dataset since the probabilistic representation of the loss function is based on the results of the application of the validation data set).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 5, the rejection of claim 4 is incorporated, and Yao further teaches wherein the first and second result metrics each comprise a metric based on at least one of: testing dataset accuracy;  overfit; and underfit.  ([p. 145, Section 1, pp. 147-149, Section 4, Algorithm 1],  Here we combine two kinds of the FEGs, which are denoted as FEGf and FEGo. The FEGf is an indicator of the model fitting, while the FEGo can monitor the overfitting., FEGf indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. FEGo can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set., wherein the performance metric in the evaluation of each learning model configuration is the loss of FEG which is an indicator for overfitting, and wherein it is noted that only one of the three items in the list of items recited in the claims is required.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 6, the rejection of claim 1 is incorporated, and Yao does not further teach wherein reducing the size of the first sample space comprises reducing a size of the range of possible values for at least one dimension of the plurality of dimensions. Yao does not disclose how or if the “short interval” around the best configuration is changed over successive iterations. 
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization teaches wherein reducing the size of the first sample space comprises reducing a size of the range of possible values for at least one dimension of the plurality of dimensions. ([0095, 0112, Figure 2, Figure 3] During or at the end of epoch 111 , current value range 133A of explored hyperparameter 123 may be narrowed by adjusting the minimum and / or maximum of the range to exclude values of tuples that yielded inferior scores . The narrowed range may be propagated into the next epoch ., Within best pair A - B , one point ( A or B ) has a higher score than the other point . In an embodiment , the horizontal position of the higher scoring point is used to set the new minimum or maximum for the new value range of hyperparameter 123 ., wherein an adaptive iterative search hyperparameter optimization process identifies the best performing model configuration design sample from among a set of candidate model configurations corresponding to a hyperparameter range and, in response to this identification, determines a narrower range of hyperparameters (a reduced sample space over at least one hyperparameter dimension) based on the relative performance metrics over the sample space such that this progressive narrowing includes a reduction based on any second configuration performance metric not exceeding the first configuration performance metric (e.g., B relative to A or either C or D relative to either A or B as shown in Figure 2 where it is noted that A and B configurations that have the higher metrics are carried into a subsequent/reduced search design space).)   
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to reduce the size of the first sample space by reducing a size of the range of possible values for at least one dimension of the plurality of dimensions. The modification would have been obvious because one of ordinary skill would have been motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy, using an automated, scalable search over the hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for each of the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration (Varadarajan, [0007, 0008, 0067, 0068, 0069]).

In regards to claim 7, the rejection of claim 1 is incorporated, and Yao further teaches wherein a parameter of the deep learning model configuration comprises at least one of: a weight of an edge of a neural network model; a position of an edge of the neural network model; a dropout node of the neural network model; a configuration of a plurality of machine learning models of the deep learning model; a number of layers of the neural network model; and number of nodes in a layer of the neural network model.  ([pp. 147-148, Section 4, Algorithm 1, Table 2, Table 3], More specifically, the procedure is: in each epoch, we first fix the hyperparameters to learn the weights based on the traditional training procedure of the DGMs, then fix the model with these identical weights to infer the optimal hyperparameters by using GP, in which we need a new holdout loss….The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. 
(17) and (18)., wherein the parameters/hyperparameters which characterize each model configuration include dropout connectivity in the neural network (dropout probability in table 2 which is being interpreted as characterizing both a dropout node in a layer as well as the position of an edge associated with the connectivity modified by the dropout, a number of layers (table 3), a number of nodes in a given layer (table 3), a configuration of a plurality of learning models (tables 2 and 3 – DBN and DBM), and a weight of an edge in the neural network by virtue of the dropout connectivity (i.e., a weight of zero or non-zero) but with the adjustment of the weights also determined through the learning of weights (including momentum parameters) and the interplay between the weights and the optimization of the learning model configuration, and wherein it is noted that the claim only requires one of the items in the list of items.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 8, the rejection of claim 1 is incorporated, and Yao further teaches wherein the range of at least one dimension of the plurality of dimensions being centered on the current value of the corresponding parameter of the first deep learning model configuration comprises the range of each dimension of the plurality of dimensions being centered on the current value of the corresponding parameter of the first deep learning model configuration. ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>, wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best (first) configuration such that if a candidate model configuration (second model configuration)  has a larger performance metric (equation 19) than the current best (as well as of any of the other candidate configurations), the hyperparameters for the (second) model configuration form the center of the parameter search space for the next iteration (i.e., for a second sample space).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 9, the rejection of claim 1 is incorporated, and Yao further teaches further comprising calculating an exploitation threshold; and wherein selecting a first sample space comprises selecting the first sample space in response to the first result metric exceeding the exploitation threshold. ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>, wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best (first) configuration such that if a candidate model configuration (second model configuration) has a larger performance metric (equation 19) than the current best (as well as of any of the other candidate configurations), the hyperparameters for the (second) model configuration form the center of the parameter search space for the next iteration (i.e., for a second sample space) such that the (exploitation) threshold for determining the (first) sample space for a given iteration (t) is the computed best performing performance metrics associated with the previous best configuration (equation 19) (in other words, the iterative search process is purely exploitational).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 10, the rejection of claim 1 is incorporated, and Yao does not further teach wherein the second result metric exceeding the first result metric comprises the second result metric exceeding the first result metric by a threshold amount, the threshold amount comprising an amount based on user configuration. Yao does not teach that the previously best performance metric must be exceeded by a threshold amount to reconfigure the search space according to the parameters of the best (second model configuration) determined by the second metric exceeding the first metric. 
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization teaches wherein the second result metric exceeding the first result metric comprises the second result metric exceeding the first result metric by a threshold amount, the threshold amount comprising an amount based on user configuration. ([0176, 0177,  0179, 0180, Figure 2, Figure 3]  Thus , categorical hyperparameter 421 is usually not explored while epoch ( s ) occur . However , a radical improvement to a best score of a numeric hyperparameter such as 422 during an epoch may indicate discovery of a new subspace of the configuration hyperspace that imposes a new performance regime ., The new performance regime may favor ( i . e . score higher ) a different value of categorical hyperparameter 421 than best value 441 . When epoch 411 ends , if new best score 472B for numeric hyperparameter 422 exceeds old best score 472A by more than absolute or percent threshold 430 , then computer 400 detects that some or all categorical hyperparameters , such as 421 , need spontaneous exploration . , When threshold 430 is crossed , exploration of categorical hyperparameter ( s ) is triggered . As with an epoch , tuples are generated with constants for best values of other hyperparameters ., A distinct tuple for each possible value of categorical hyperparameter 421 is generated . 
These tuples are scored , which may cause best value 441 to be surpassed by a different category value that has a better score , which is publicized for use by numeric hyperparameters when they start their next epoch ., wherein, a hyperparameter design space is re-focused (re-centered) if a substantially significant performance improvement is observed as determined by the score associated with the best performing model configuration exceeding the previous best model configuration score by a threshold amount (by a percentage amount or by an absolute amount the selection of either one of which is a user configuration) such that this substantially significant improvement leads to an exploration over categorical hyperparameters to identify a (local) optimization over the categorical variables so that the re-configuration of the design space (second design sample space) is centered about the optimized categorical parameters but also (and alternatively) more generally re-focused/re-centered about the numerical hyperparameters associated with the substantially significant performance improvement.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to center the range of each dimension on the current value of the corresponding parameter of the second deep learning model configuration in response to the second result metric exceeding the first result metric, wherein the second result metric exceeding the first result metric comprises the second result metric exceeding the first result metric by a user-configured threshold amount. The modification would have been obvious because one of ordinary skill would have been motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy, using an automated, scalable search over the hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration, including selective exploration and exploitation of a categorical parameter design space when substantial performance improvements are detected in a user-designed search framework (Varadarajan, [0007, 0008, 0067, 0068, 0069, 0177]).
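For illustration only, the threshold comparison discussed for claim 10 (a new best score exceeding the old best score by more than an absolute or percent threshold) can be sketched as a single predicate. The sketch is not code from Varadarajan. The function name `exceeds_by_threshold`, its `mode` parameter, and the convention that a percent threshold is measured against the magnitude of the old score are assumptions.

```python
def exceeds_by_threshold(new_score, old_score, threshold, mode="absolute"):
    """Hypothetical check: does new_score beat old_score by more than threshold?

    mode="absolute": the improvement itself must exceed `threshold`.
    mode="percent":  the improvement must exceed `threshold` percent of |old_score|.
    """
    improvement = new_score - old_score
    if mode == "absolute":
        return improvement > threshold
    if mode == "percent":
        return improvement > abs(old_score) * threshold / 100.0
    raise ValueError("mode must be 'absolute' or 'percent'")
```

Under this sketch, the user configuration consists of the choice of `mode` and the value of `threshold`. A search loop would re-center the sample space only when the predicate returns true.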

In regards to claim 11, the rejection of claim 1 is incorporated, and Yao further teaches further comprising: selecting a deep learning model configuration as an output deep learning model configuration; calculating a third result metric, wherein the third result metric is based on … training set data and the output deep learning model configuration; … ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18). wherein, at a given iteration t, a set of candidate model configurations (lambda_i over M_trails) are generated within the hyperparameter design sample space for that iteration which is within a first sample space such that any of these candidate configurations or the current best configuration is an output configuration from the random hyperparameter sampling process (candidate configurations) or from the results of a previous iteration of evaluations (best configuration and wherein each selected candidate model configuration is trained on a training set (algorithm 1) and evaluated on a validation set to determine a performance metric that is then compared to the best performing metric (equation 19) to determine a new (third) sample space in a subsequent iteration (in other words, the metric is based on each “output” candidate configuration according to the testing/validation and, alternatively, each metric for the candidate configuration is based on the “output” best configuration from which the candidate is derived and from the testing/validation of the candidate configuration.) 
However, Yao does not explicitly teach … additional…; in response to determining that the third result metric is below a threshold result metric, selecting a third sample space.  Yao does not disclose multiple distinct training data sets used over multiple iterations of the hyperparameter search process.   
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization teaches selecting a deep learning model configuration as an output deep learning model configuration; calculating a third result metric, wherein the third result metric is based on additional training set data and the output deep learning model configuration; in response to determining that the third result metric is below a threshold result metric, selecting a third sample space. ([0095, 0112, 0220, 0223, Figure 2, Figure 3] During or at the end of epoch 111 , current value range 133A of explored hyperparameter 123 may be narrowed by adjusting the minimum and / or maximum of the range to exclude values of tuples that yielded inferior scores . The narrowed range may be propagated into the next epoch ., Within best pair A - B , one point ( A or B ) has a higher score than the other point . In an embodiment , the horizontal position of the higher scoring point is used to set the new minimum or maximum for the new value range of hyperparameter 123 ., With cross validation , the original dataset is equally partitioned at least three ways known as folds. Five folds is empirically best ., For example , dataset 660 is partitioned into equally sized folds 1 - 5 for reuse during five - way cross validation . 
Subsets of those reusable folds are used to make a distinct training dataset for each of at least training runs 621 - 622 that emit at least respective scores 631 - 632 ., wherein an adaptive iterative search hyperparameter optimization process identifies the best performing model configuration design sample from among a set of candidate model configurations corresponding to a hyperparameter range and, in response to this identification, determines a narrower range of hyperparameters (a reduced sample space) based on the relative performance metrics over the sample space such that this progressive narrowing includes a reduction based on any second configuration performance metric not exceeding the first configuration performance metric (e.g., B relative to A or either C or D relative to either A or B as shown in Figure 2 where it is noted that A and B configurations that have the higher metrics are carried into a subsequent/reduced search design space) and wherein the testing/validation process is based on an “additional” training set because more than one fold of a training set is used to generate the performance metric and also and alternatively because the cross-fold validation process does not exclude different sets of folds being applied to different hyperparameter configurations).)   
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to calculate a third result metric based on additional training set data and an output deep learning model configuration and to reduce the size of the first sample space in response to the third result metric not exceeding the first result metric. The modification would have been obvious because one of ordinary skill would have been motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy, using an automated, scalable search over the hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration while improving the accuracy of performance metrics through cross validation (Varadarajan, [0007, 0008, 0067, 0068, 0069, 0219]).

In regards to claim 12, the rejection of claim 1 is incorporated, and Yao further teaches wherein: generating the first deep learning model configuration comprises generating the first deep learning model using a first computational approach; and generating the second deep learning model configuration comprises generating the second deep learning model using a second computational approach. ([pp. 147-148, Section 4, Table 1, Table 2, Table 3], The hyperparameters we choose to optimize for both auto-encoder and DBN in this experiment are the momentum, the learning rate and the weight-cost, the three of the most important hyperparameters, wherein, among the hyperparameters in the search space (tables 1-3) subject to optimization at any given iteration (i.e., used to selectively and distinctly form a second model configuration or a first model configuration) include the momentum, learning rate, weight cost, and batch size, any of which is interpretable as a distinct computational approach since the particular value selected for any of those parameters determines an aspect of the computations used to determine the trained neural network (i.e., for a given training set will lead to a different trained neural network based on modification of the training computational process).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 1.

In regards to claim 13, Yao teaches A computer-implemented method comprising: generating a plurality of first deep learning model configurations; ([p. 148, Section 4, p. 148, Section 5, Figure 1, Figure 2, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., we perform the experiments on MNIST digits dataset to demonstrate that our new method is much more efficient than the traditional BO methods. In the application of the unsupervised learning task of text clustering, the empirical results show that the DGMs with the adaptive hyperparameters can surpass the state-of-the-art., wherein a computer-based hyperparameter optimization framework determines an (optimized) deep (Figures 1, 2) learning model configuration by iterating over successive deep model configurations such that the set of configurations at a given iteration (t) which includes an identified current or set of candidate solutions (lambda) (for either the predict or the “trails” stage) is a plurality of first model configurations.) and calculating a first result metric for each of the plurality of first deep learning model configurations; ([pp. 147-148, Section 4, Algorithm 1], FEGf indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. FEGo can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. 
Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …., wherein performance (model fit/validation loss) L_FEG is computed for each first learning model configuration (lambda^i or lambda^j  in algorithm 1 but also lambda_best in a broader sense) for either a set of candidate solutions or a current solution.) selecting a deep learning model configuration from the plurality of first deep learning model configurations; selecting a first sample space, wherein the first sample space includes a plurality of dimensions, wherein each dimension corresponds to a parameter of the selected first deep learning model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the selected first deep learning model configuration;([pp. 147-148, Section 4, p. 149, Section 5.1.2, Algorithm 1, Table 1, Table 2, Table 3], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., At the pre-training phase, the initial setup of the hyperparameters optimized by the FEG is shown in Table 2. 
The hyperparameters initialization, which will be optimized by SMAC and TPE, are shown in Table 3., wherein a neural network hyperparameter search space includes (neural network type-dependent) specified ranges (tables 2 and 3) for continuous and categorical hyperparameters such that the range of each parameter dimension (algorithm 1) is centered at the current best configuration value (lambda_best such as may have been determined/selected from a previous iteration from among the set of model configurations at that iteration)) generating a plurality of second deep learning model configurations, wherein each second deep learning model configuration is within the first sample space; ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein, at a given iteration t, a set of candidate model configurations (lambda_i over M_trails or lambda_j over M_predict) are generated within the hyperparameter design sample space for that iteration which is within the first sample space in the case in which the first sample space centered at a best configuration (first model configuration) determined at a previous iteration.)  calculating a second result metric for each second deep learning model configuration; ([pp. 147-148, Section 4, Algorithm 1], FEGf indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. FEGo can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. 
Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein performance (model fit/validation loss) L_FEG is computed for each learning model configuration (lambda^i or lambda_j in algorithm 1) for each of the set of candidate solutions.) in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of the at least one dimension is centered on the current value of the corresponding parameter of the second deep learning model configuration corresponding to the second result metric that exceeds the first result metric; ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>, wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best (first) configuration such that if a candidate model configuration (second model configuration)  has a larger performance metric (equation 19) than the current best (as well as of any of the other candidate configurations), the hyperparameters for the (second) model configuration form the center of the parameter search space for the next iteration (i.e., for a second sample space).) 
However, Yao does not explicitly teach and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space.  Yao does not disclose how or if the “short interval” around the best configuration is changed over successive iterations. 
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization, teaches and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space. ([0095, 0112, Figure 2, Figure 3] During or at the end of epoch 111, current value range 133A of explored hyperparameter 123 may be narrowed by adjusting the minimum and/or maximum of the range to exclude values of tuples that yielded inferior scores. The narrowed range may be propagated into the next epoch., Within best pair A-B, one point (A or B) has a higher score than the other point. In an embodiment, the horizontal position of the higher scoring point is used to set the new minimum or maximum for the new value range of hyperparameter 123., wherein an adaptive iterative search hyperparameter optimization process identifies the best performing model configuration design sample from among a set of candidate model configurations corresponding to a hyperparameter range and, in response to this identification, determines a narrower range of hyperparameters (a reduced sample space) based on the relative performance metrics over the sample space such that this progressive narrowing includes a reduction based on any second configuration performance metric not exceeding the first configuration performance metric (e.g., B relative to A or either C or D relative to either A or B as shown in Figure 2 where it is noted that A and B configurations that have the higher metrics are carried into a subsequent/reduced search design space).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to reduce the size of the first sample space in response to the second result metric not exceeding the first result metric. The modification would be obvious because one of ordinary skill would be motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy using automated scalable search over hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration (Varadarajan, [0007, 0008, 0067, 0068, 0069]).
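For illustration only, the combined behavior discussed above (recentering the sample space on a better-scoring candidate, otherwise reducing its size) can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of Yao or of Varadarajan: candidates are drawn uniformly rather than through a Gaussian process, the sample space is shrunk by a fixed factor, and the names adaptive_search, score, half_width and shrink are hypothetical.

```python
import random


def adaptive_search(score, center, half_width, n_candidates=8,
                    n_iters=20, shrink=0.5, seed=0):
    """Illustrative search loop; higher values of score(...) are better.

    center and half_width are lists with one entry per parameter dimension.
    Uniform sampling and the fixed shrink factor are assumptions made for
    this sketch only.
    """
    rng = random.Random(seed)
    best = list(center)
    best_metric = score(best)          # first result metric
    widths = list(half_width)
    for _ in range(n_iters):
        # Generate candidates inside the sample space centered on the
        # current best configuration.
        candidates = [
            [rng.uniform(c - w, c + w) for c, w in zip(best, widths)]
            for _ in range(n_candidates)
        ]
        # Second result metrics: keep the best-scoring candidate.
        cand_metric, cand = max(
            ((score(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        if cand_metric > best_metric:
            # Metric exceeded: recenter the sample space on the candidate.
            best, best_metric = cand, cand_metric
        else:
            # Metric not exceeded: reduce the size of the sample space.
            widths = [w * shrink for w in widths]
    return best, best_metric, widths
```

In this sketch the returned metric can never be lower than the metric of the starting configuration, and the half-widths can only stay the same or decrease.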

In regards to claim 15, the rejection of claim 13 is incorporated, and Yao further teaches wherein selecting the deep learning model configuration from the plurality of first deep learning model configurations comprises selecting the deep learning model configuration with a result metric above a predetermined threshold.  ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>, wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best configuration such that if a candidate model configuration (the selected model configuration) has a larger performance metric (equation 19) than the current best (as well as of any of the other candidate configurations in the set of first model configurations), the hyperparameters for that model configuration form the center of the parameter search space for the next iteration such that the threshold for determining that search space for a given iteration (t) is the best computed performance metric associated with the previous best configuration (equation 19).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 13.

In regards to claim 16, the rejection of claim 13 is incorporated, and Yao does not further teach wherein: generating the plurality of first deep learning model configurations comprises generating at least two of the first deep learning model configurations in parallel; calculating the result first metric for each first deep learning model configuration comprises calculating at least two first result metrics in parallel; and generating the plurality of second deep learning model configurations comprises generating at least two deep learning model configurations in parallel; and calculating the result second metric for each first deep learning model configuration comprises calculating at least two second result metrics in parallel. Although Yao teaches the determination of result metrics for each of the learning model configurations at each iteration in the search over the hyperparameter design space, he does not teach the computation of the metrics in parallel.
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization teaches wherein: generating the plurality of first deep learning model configurations comprises generating at least two of the first deep learning model configurations in parallel; calculating the result first metric for each first deep learning model configuration comprises calculating at least two first result metrics in parallel; and generating the plurality of second deep learning model configurations comprises generating at least two deep learning model configurations in parallel; and calculating the result second metric for each first deep learning model configuration comprises calculating at least two second result metrics in parallel. ([0206, 0209, 0210, Figure 2, Figure 6]  An ideal work distribution is shown for maximum horizontal scaling of a single epoch, such that each processor trains and scores one algorithm configuration in parallel., There are two processors per sampled point, which is important because each sampled point occurs in a pair of points …, With six available processors, then 6/2=three (equally spaced) values should be sampled., wherein the set of candidate model configurations (corresponding to the set of sampled hyperparameter values) at each epoch (with each epoch forming a different plurality of learning model configurations), are evaluated/scored in parallel.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to generate and compute in parallel result metrics for a plurality of learning configurations for both a first set of deep learning model configurations and a second set of deep learning model configurations. The modification would be obvious because one of ordinary skill would be motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy using automated scalable search over hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration/epoch by efficiently performing the training and evaluation of candidate model configurations in parallel (Varadarajan, [0007, 0008, 0067, 0068, 0069, 0206]).
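For illustration only, the parallel calculation of result metrics for a set of candidate configurations can be sketched as below. This is a minimal sketch, not the work distribution of Varadarajan: a thread pool from the Python standard library stands in for the separate processors described in the reference, and the names score_in_parallel, score and max_workers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def score_in_parallel(score, configurations, max_workers=4):
    """Score every configuration concurrently and return the metrics in the
    same order as the input list.

    A thread pool is an assumption made for this sketch; real training and
    scoring would more likely run on separate processes or machines.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order regardless of which task
        # finishes first.
        return list(pool.map(score, configurations))
```

The same helper could be called once for the plurality of first configurations and once for the plurality of second configurations.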

In regards to claim 17, Yao teaches A computer-implemented method for improving a deep learning model configuration, comprising: ([p. 148, Section 4, p. 148, Section 5, Figure 1, Figure 2, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., we perform the experiments on MNIST digits dataset to demonstrate that our new method is much more efficient than the traditional BO methods. In the application of the unsupervised learning task of text clustering, the empirical results show that the DGMs with the adaptive hyperparameters can surpass the state-ofthe-art., wherein a computer-based hyperparameter optimization framework determines an (optimized) deep (Figures 1, 2) learning model configuration by iterating over successive deep model configurations such that any configuration identified as a current or candidate solution (lambda) at an iteration (in algorithm 1 – a given t for either the predict or the “trails” stage)  is a first model configuration.) receiving a first deep learning model configuration; calculating a first result metric for the first deep learning model configuration; ([pp. 147-148, Section 4, Algorithm 1], FEGf indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. FEGo can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. 
Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …., wherein performance (model fit/validation loss) L_FEG is computed for a first learning model configuration (lambda^i or lambda^j received/selected for evaluation as shown in algorithm 1 but also lambda_best in a broader sense) for either a candidate or current solution.) selecting a first sample space, wherein the first sample space includes a plurality of dimensions, wherein each dimension corresponds to a parameter of the selected first deep learning model configuration, each dimension of the plurality of dimensions includes a range of possible values for the corresponding parameter, and the range of at least one dimension of the plurality of dimensions is centered on a current value of the corresponding parameter of the selected first deep learning model configuration; ([pp. 147-148, Section 4, p. 149, Section 5.1.2, Algorithm 1, Table 1, Table 2, Table 3], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., At the pre-training phase, the initial setup of the hyperparameters optimized by the FEG is shown in Table 2. 
The hyperparameters initialization, which will be optimized by SMAC and TPE, are shown in Table 3., wherein a neural network hyperparameter search space includes (neural network type-dependent) specified ranges (tables 2 and 3) for continuous and categorical hyperparameters such that the range of each parameter dimension (algorithm 1) is centered at the current best configuration value (lambda_best such as may have been determined from a previous iteration))  generating a second deep learning model configuration, wherein the second deep learning configuration is within the first sample space; ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein, at a given iteration t, a set of candidate model configurations (lambda_i over M_trails or lambda_j over M_predict) are generated within the hyperparameter design sample space for that iteration which is within the first sample space in the case in which the first sample space centered at a best configuration (first model configuration) determined at a previous iteration.) calculating a second result metric for the second deep learning configuration; ([pp. 147-148, Section 4, Algorithm 1], FEGf indicates how well model has been fitted to the data, because the weights tend to make the model more discriminative. FEGo can monitor whether the model has become over-fitting, because if the over-fitting happens, the weights will be more in favor of the training set than validation set…. Then we get final definition of the loss of FEG for the optimizer as follows: <equation 15> …. 
The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein performance (model fit/validation loss) L_FEG is computed for each learning model configuration (lambda^i or lambda_j in algorithm 1) for a candidate solution.) in response to the second result metric exceeding the first result metric, selecting a second sample space, wherein the range of the at least one dimension is centered on the current value of the corresponding parameter of the second deep learning configuration; ([pp. 147-148, Section 4, Algorithm 1], The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)…. Considering the computational efficiency, here we adopt the PI as below: <equation 19>, wherein, at a given iteration t, the performance metric for each candidate model configuration is compared with the current best (first) configuration such that if a candidate model configuration (second model configuration)  has a larger performance metric (equation 19) than the current best (as well as of any of the other candidate configurations), the hyperparameters for the (second) model configuration form the center of the parameter search space for the next iteration (i.e., for a second sample space).) 
However, Yao does not explicitly teach and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space.  Yao does not disclose how or if the “short interval” around the best configuration is changed over successive iterations. 
However, Varadarajan, in the analogous environment of neural network hyperparameter optimization, teaches and in response to the second result metric not exceeding the first result metric, reducing the size of the first sample space. ([0095, 0112, Figure 2, Figure 3] During or at the end of epoch 111, current value range 133A of explored hyperparameter 123 may be narrowed by adjusting the minimum and/or maximum of the range to exclude values of tuples that yielded inferior scores. The narrowed range may be propagated into the next epoch., Within best pair A-B, one point (A or B) has a higher score than the other point. In an embodiment, the horizontal position of the higher scoring point is used to set the new minimum or maximum for the new value range of hyperparameter 123., wherein an adaptive iterative search hyperparameter optimization process identifies the best performing model configuration design sample from among a set of candidate model configurations corresponding to a hyperparameter range and, in response to this identification, determines a narrower range of hyperparameters (a reduced sample space) based on the relative performance metrics over the sample space such that this progressive narrowing includes a reduction based on any second configuration performance metric not exceeding the first configuration performance metric (e.g., B relative to A or either C or D relative to either A or B as shown in Figure 2 where it is noted that A and B configurations that have the higher metrics are carried into a subsequent/reduced search design space).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan to reduce the size of the first sample space in response to the second result metric not exceeding the first result metric. The modification would be obvious because one of ordinary skill would be motivated to efficiently optimize neural network hyperparameters for designing neural networks with optimal accuracy using automated scalable search over hyperparameter design space to generate deep neural network models with superior performance by adaptively and repeatedly narrowing the search space for the hyperparameter ranges according to the relative performance of selectively sampled design configurations in each iteration (Varadarajan, [0007, 0008, 0067, 0068, 0069]).

In regards to claim 18, the rejection of claim 17 is incorporated, and Yao further teaches wherein generating the second deep learning model configuration comprises adjusting a parameter of the deep learning model configuration. ([pp. 147-148, Section 4, Algorithm 1, Table 2, Table 3], The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein each (second) learning model configuration at a given iteration is characterized by a (randomly sampled) modification of the best configuration as determined in a previous iteration with the particular hyperparameter design space shown in tables 2 and 3.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 17.

In regards to claim 19, the rejection of claim 18 is incorporated, and Yao further teaches wherein adjusting a parameter the first deep learning model configuration comprises at least one of: adjusting a weight of an edge of a neural network of the first deep learning model configuration within a predetermined amount; adding an edge to a neural network of the first deep learning model configuration; and removing an edge from a neural network of the first deep learning model configuration. ([pp. 147-148, Section 4, Algorithm 1, Table 2, Table 3], More specifically, the procedure is: in each epoch, we first fix the hyperparameters to learn the weights based on the traditional training procedure of the DGMs, then fix the model with these identical weights to infer the optimal hyperparameters by using GP, in which we need a new holdout loss….The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18).,  wherein the adjustment/modification of the current best learning model configuration hyperparameters in a particular iteration include parameters that characterize each model configuration include dropout connectivity in the neural network (dropout probability in table 2) which is being interpreted as characterizing both a dropout node in a layer as well as the position of an edge associated with the connectivity modified (added or removed) by the dropout and a weight of an edge in the neural network by virtue of the dropout connectivity (i.e., a weight of zero or non-zero) but with the weight adjustment also determined through the learning of weights (including momentum parameters) and the interplay between the weights and the optimization of the learning model configuration, and wherein it is noted that the claim only requires one of the items in the list of items.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao to incorporate the teachings of Varadarajan for the same reasons as pointed out for claim 17.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Yao, in view of Varadarajan, and in further view of Li et al. (“Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization”, Journal of Machine Learning Research 18 (2018), April, 2018, pp. 1-52), hereinafter referred to as Li.

In regards to claim 14, the rejection of claim 13 is incorporated, and Yao further teaches wherein generating the plurality of first deep learning model configurations comprises generating a predetermined number of deep learning model configurations, ….  ([p. 12, Section 3.4, p. 18, Section 4.2], Given some trials of the hyperparameter settings, training the model Aλ(i) (Vtrain) a little bit, we can get pairs {(λ(i),LFEG(t)(i))} Mtrails i=1 (Mtrails ≤ 20) … The posterior predictive distribution is also Gaussian, based on which, we can evaluate individual guesses of the hyperparameters λ∗ very fast. The procedure is: we randomly sample {λ(j) ∗ } Mpredict j=1 within the interval centered by the last best hyperparameter setting λbest, and get the posterior distribution of L(j) ∗ for each point of λ(j) ∗ by Eqs. (17) and (18)., wherein the number of learning model configurations is m_trails for determining the performance metric L^i_FEG for a given iteration (a number of learning model configurations) and wherein, alternatively, this number also corresponds to M_predict used to determine the best sampled configuration for a given iteration.) 
However, Yao and Varadarajan do not explicitly teach wherein the predetermined number is based on a user-defined confidence interval. Although the number of configurations sampled for a given iteration appears to be user-defined in the sense that it is a parameter specified by the analysts performing the evaluation of the framework, Yao does not disclose that this number is based on a confidence interval specified by the user.  Varadarajan teaches that the number of learning model configurations is based on the number of processors rather than on a confidence interval ([0209]).
However, Li, in the analogous environment of bandit-based hyperparameter optimization, teaches wherein generating the plurality of first deep learning model configurations comprises generating a predetermined number of deep learning model configurations, wherein the predetermined number is based on a user-defined confidence interval. ([pp. 7-8, Section 3.2, pp. 10-11, Section 3.5, p. 21, Section 4.4.2, p. 32, Section 5.4, Algorithm 1, Table 1, Theorem 8], Hyperband, shown in Algorithm 1, addresses this “n versus B/n” problem by considering several possible values of n for a fixed B, in essence performing a grid search over feasible value of n. Associated with each value of n is a minimum resource r that is allocated to all configurations before some are discarded; a larger value of n corresponds to a smaller r and hence more aggressive early-stopping…. The two inputs dictate how many different brackets are considered; specifically, smax + 1 different values for n are considered with smax = ⌊logη(R)⌋. Hyperband begins with the most aggressive bracket s = smax, which sets n to maximize exploration, subject to the constraint that at least one configuration is allocated R resources. Each subsequent bracket reduces n by a factor of approximately η until the final bracket, s = 0, in which every configuration is allocated R resources (this bracket simply performs classical random search)…. a function that returns a set of n i.i.d. samples from some distribution defined over the hyperparameter configuration space. In this work, we assume uniformly sampling of hyperparameters from a predefined space (i.e., hypercube with min and max bounds for each hyperparameter), which immediately yields consistency guarantees., If there is a range of possible values for R, a smaller R will give a result faster (since the budget B for each bracket is a multiple of R), but a larger R will give a better guarantee of successfully differentiating between the configurations…. 
Thus, one unit of resource can be interpreted as the minimum desired resource and R as the ratio between maximum resource and minimum resource., We believe prior knowledge about a task can be particularly useful for limiting the range of brackets explored by Hyperband. In our experience, aggressive early-stopping is generally safe for neural network tasks and even more aggressive early-stopping may be reasonable for larger data sets and longer training horizons. However, when pushing the degree of early-stopping by increasing s, one has to consider the additional overhead cost associated with examining more models., It is important to note that in the finite horizon setting, for all sufficiently large B (e.g. B > 3R) and all distributions F, the budget B of SuccessiveHalving should scale linearly with n ≃ ∆^(−β) log(1/δ) as ∆ → 0. …One consequence of this observation is that in the finite horizon setting it suffices to set B large enough to identify an ∆-good arm with just constant probability, say 1/10, and then repeat SuccessiveHalving m times to boost this constant probability to probability 1 − (9/10)^m. 
While in this theoretical treatment of Hyperband we grow B over time, in practice we recommend fixing B as a multiple of R as we have done in Section 3., wherein the number of learning model configuration generated at any iteration in an adaptive search over hyperparameter search space is determined according to resource and budget constraints such that the specification of the (user) input parameters to this hyperparameter search framework are configured (even if through trial and error by the user) to achieve a satisfactory level of performance and such that these (user-defined) specifications are directly associated with a level of confidence with which the performance metrics for one sampled configuration can be differentiated from another, especially for finding the best single configuration from the progressively decreased number of sampled configurations being applied to progressively more relevant search regions (Figure 1).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao and Varadarajan to incorporate the teachings of Li to generate a predetermined number of deep learning model configurations in which the predetermined number is based on a user-defined confidence interval. The modification would be obvious because one of ordinary skill would be motivated to efficiently optimize the search for neural network hyperparameters for designing neural network configurations with good (near state of the art) accuracy given resource and budget constraints through the optimal allocation of resources over the evaluation of the neural network configuration for attaining a sufficient level of confidence in metrics of evaluation for identifying optimal configurations  (Li, [Abstract, p. 2, Section 1, p. 32, Section 5.4, Figure 3, Figure 4]).
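For illustration only, the relationship quoted from Li above, in which a per-run success probability of 1/10 is boosted to 1 − (9/10)^m by repeating SuccessiveHalving m times, can be inverted to obtain the smallest m that meets a user-chosen confidence level. This is a worked arithmetic sketch, not a procedure disclosed by Li: m counts repetitions rather than model configurations, and the names repetitions_for_confidence, confidence and p_single are hypothetical.

```python
import math


def repetitions_for_confidence(confidence, p_single=0.1):
    """Smallest m such that 1 - (1 - p_single)**m >= confidence.

    p_single is the constant per-run success probability (1/10 in the quoted
    passage); confidence is a user-chosen level strictly between 0 and 1.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must lie strictly between 0 and 1")
    # Solve (1 - p_single)**m <= 1 - confidence for m; both logarithms are
    # negative, so the ratio is positive.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))
```

A higher requested confidence level yields a larger m, which is the sense in which a count derived this way depends on a user-defined confidence.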

Claims 2 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yao, in view of Varadarajan, and in further view of Levesque et al. (“Bayesian Hyperparameter Optimization for Ensemble Learning”, https://arxiv.org/pdf/1605.06394.pdf, arXiv:1605.06394v1 [cs.LG], 20 May 2016, pp. 1-10), hereinafter referred to as Levesque.

In regards to claim 2, the rejection of claim 1 is incorporated, and Yao and Varadarajan do not further teach wherein the first deep learning model configuration comprises an ensemble comprising at least two learning model configurations.  In other words, neither Yao nor Varadarajan explicitly discloses that the model configuration that is formed through the hyperparameter optimization framework is an ensemble of distinct model configurations.
However, Levesque, in the analogous environment of performing neural network model design by searching over neural network hyperparameters, teaches wherein the first deep learning model configuration comprises an ensemble comprising at least two learning model configurations ([pp. 2-3, Section 2.1, p. 3, Section 3, p. 4, Section 3.2, Algorithm 1, Figure 2], At each iteration, given a pool of trained classifiers H to select from, a new classifier is added to the ensemble, selected according to the minimum ensemble generalization error. At the first iteration, the classifier added is simply the single best classifier. At step t, given the ensemble E = {h_e1, h_e2, ..., h_e(t−1)}, the next classifier is chosen to minimize the empirical error on the validation dataset when added to E:…, We define the objective function to be the performance of a given ensemble E when it is augmented with a new classifier trained with hyperparameters γ, or h_γ. In other words, the objective function is the empirical error provided by adding a model h_γ to the ensemble E …, The ensemble E will in fact consist of m fixed positions, and at every iteration i, the classifier at position j = (i mod m) will be removed from the ensemble before finding hyperparameters which minimize Equation 5 – effectively optimizing the classifier at this position for the given iteration. At the end of an iteration the ensemble is updated again greedily, selecting the new best classifier (it could be the same classifier or a better one).
The whole procedure is described in Algorithm 1 and in Figure 2., wherein an optimal ensemble of distinct learning model configurations is determined by successively/iteratively determining the optimal configuration model for each iteration (through a model-based optimization process) and incorporating the best observed configuration (according to an objective function such as an empirical generalization error) as well as removing any current model member of that ensemble according to a (round-robin) evaluation process such that the resultant (deep) learning model configuration (first or second) comprises at least two distinct model configurations identified during the optimization process.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao and Varadarajan to incorporate the teachings of Levesque for the first deep learning model configuration to comprise an ensemble comprising at least two learning model configurations. The modification would be obvious because one of ordinary skill would be motivated to generate neural network models with superior generalization capacity through efficient and effective optimization of ensemble construction performed along with the optimization of learning model hyper-parameters (Levesque, [Abstract, p. 2, Section 2.1, p. 3, Section 3, Figure 7]).
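For illustration only, the round-robin ensemble update described in the cited passages of Levesque may be sketched as follows. This is a minimal sketch under stated assumptions: each classifier is represented only by its binary (0/1) predictions on a validation set, the ensemble is combined by majority vote, and the pool of trained classifiers is taken as given, so the hyperparameter optimization step that produces a new classifier is omitted. The names majority_vote_error and round_robin_update are hypothetical and do not appear in the reference.

```python
from typing import List, Sequence

Predictions = Sequence[int]  # one classifier's 0/1 predictions on the validation set


def majority_vote_error(members: List[Predictions], labels: Sequence[int]) -> float:
    """Empirical validation error of a binary ensemble under majority vote."""
    errors = 0
    for i, y in enumerate(labels):
        votes = sum(m[i] for m in members)
        prediction = 1 if 2 * votes > len(members) else 0  # ties resolve to 0
        errors += prediction != y
    return errors / len(labels)


def round_robin_update(
    ensemble: List[Predictions],
    pool: List[Predictions],
    iteration: int,
    labels: Sequence[int],
) -> List[Predictions]:
    """Remove the member at position j = iteration mod m, then greedily refill it."""
    m = len(ensemble)
    j = iteration % m
    rest = ensemble[:j] + ensemble[j + 1:]
    # Pick the pool classifier that minimizes the ensemble's validation error.
    best = min(pool, key=lambda h: majority_vote_error(rest + [h], labels))
    return ensemble[:j] + [best] + ensemble[j + 1:]


labels = [1, 0, 1, 1]
a, b, c, p = [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 1]
ensemble = [a, b, c]
updated = round_robin_update(ensemble, pool=[a, b, c, p], iteration=2, labels=labels)
print(majority_vote_error(ensemble, labels), majority_vote_error(updated, labels))
```

In this sketch the member at position 2 is removed and replaced by the pool classifier that yields the lowest validation error, so each iteration both removes a learning model from the ensemble and adds one to it.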

In regards to claim 20, the rejection of claim 18 is incorporated, and Yao and Varadarajan do not further teach wherein the first deep learning model configuration comprises an ensemble comprising a plurality of learning models, and adjusting a parameter of the first deep learning model configuration comprises at least one of: removing a learning model from the ensemble; and adding a learning model to the ensemble. In other words, neither Yao nor Varadarajan explicitly discloses that the model configuration that is formed through the hyperparameter optimization framework is an ensemble of distinct model configurations.
However, Levesque, in the analogous environment of performing neural network model design by searching over neural network hyperparameters, teaches wherein the first deep learning model configuration comprises an ensemble comprising a plurality of learning models, and adjusting a parameter of the first deep learning model configuration comprises at least one of: removing a learning model from the ensemble; and adding a learning model to the ensemble ([pp. 2-3, Section 2.1, p. 3, Section 3, p. 4, Section 3.2, Algorithm 1, Figure 2], At each iteration, given a pool of trained classifiers H to select from, a new classifier is added to the ensemble, selected according to the minimum ensemble generalization error. At the first iteration, the classifier added is simply the single best classifier. At step t, given the ensemble E = {h_e1, h_e2, ..., h_e(t−1)}, the next classifier is chosen to minimize the empirical error on the validation dataset when added to E:…, We define the objective function to be the performance of a given ensemble E when it is augmented with a new classifier trained with hyperparameters γ, or h_γ. In other words, the objective function is the empirical error provided by adding a model h_γ to the ensemble E …, The ensemble E will in fact consist of m fixed positions, and at every iteration i, the classifier at position j = (i mod m) will be removed from the ensemble before finding hyperparameters which minimize Equation 5 – effectively optimizing the classifier at this position for the given iteration. At the end of an iteration the ensemble is updated again greedily, selecting the new best classifier (it could be the same classifier or a better one).
The whole procedure is described in Algorithm 1 and in Figure 2., wherein an optimal ensemble of distinct learning model configurations is determined by successively/iteratively determining the optimal configuration model for each iteration (through a model-based optimization process) and incorporating the best observed configuration (according to an objective function such as an empirical generalization error) as well as removing any current model member of that ensemble according to a (round-robin) evaluation process such that the resultant (deep) learning model configuration (first or second) comprises at least two distinct model configurations identified during the optimization process.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yao and Varadarajan to incorporate the teachings of Levesque for the first deep learning model configuration to comprise an ensemble from which learning models are removed and to which learning models are added. The modification would be obvious because one of ordinary skill would be motivated to generate neural network models with superior generalization capacity through efficient and effective optimization of ensemble construction performed along with the optimization of learning model hyper-parameters (Levesque, [Abstract, p. 2, Section 2.1, p. 3, Section 3, Figure 7]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Larochelle et al. (“An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation”, Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 473-480) teach a fine-coarse grid search approach to hyperparameter optimization.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ROBERT LEWIS KULP/
Examiner, Art Unit 2124

/MIRANDA M HUANG/
Supervisory Patent Examiner, Art Unit 2124