DETAILED ACTION
This action is in response to the claims filed 08 February 2022 for application 15/856755 filed 28 December 2017. Currently, claims 1-8, 10-18 and 21 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 08 February 2022 has been entered.
 
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
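A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.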

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.


Claims 1-8, 10-18 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization) in view of Hutter et al. (An Efficient Approach for Assessing Hyperparameter Importance). 

Regarding claims 1, 11 and 21, Li discloses: An apparatus for training a neural network, the apparatus comprising:
a training data segmenter to generate a partial set of labeled training data from a set of labeled training data (“Here we consider the setting of a black-box batch training algorithm that takes a data set as input and outputs a model. In this setting, we treat the resource as the size of a random subset of the data set with R corresponding to the full data set size. Subsampling data set sizes using Hyperband, especially for problems with super-linear training times like kernel methods, can provide substantial speedups.” P8 ¶2);
a matrix constructor to create a design of experiments matrix identifying permutations of hyperparameters to be tested (“get_hyperparameter_configuration(n) – a function that returns a set of n i.i.d. samples from some distribution defined over the hyperparameter configuration space. In this work, we assume uniformly sampling of hyperparameters from a predefined space (i.e., hypercube with min and max bounds for each hyperparameter), which immediately yields consistency guarantees. However, the more aligned the distribution is towards high quality hyperparameters (i.e., a useful prior), the better Hyperband will perform (see Section 6 for further discussion).” P6 last ¶, Table 1, note: a hypercube of defined space for the hyperparameters is interpreted as a matrix constructor);
a training controller to cause a neural network trainer to train a neural network using a plurality of the permutations of the hyperparameters in the design of experiments matrix and the partial set of labeled training data, the training controller to access results of the training corresponding to each of the permutations of the (“We next present a concrete example to provide further intuition about Hyperband. We work with the MNIST data set and optimize hyperparameters for the LeNet convolutional neural network trained using mini-batch stochastic gradient descent (SGD). Our search space includes learning rate, batch size, and number of kernels for the two layers of the network as hyperparameters (details are shown in Table 2 in Appendix A).” p7 §2.3 ¶1, “Figure 3 shows an empirical comparison of the average test error across 70 trials of the individual brackets of Hyperband run separately as well as standard Hyperband. In practice, we do not know a priori which bracket s ∈ {0, . . . , 4} will be most effective in identifying good hyperparameters, and in this case neither the most (s = 4) nor least aggressive (s = 0) setting is optimal. However, we note that Hyperband does nearly as well as the optimal bracket (s = 3) and outperforms the baseline uniform allocation (i.e., random search), which is equivalent to bracket s = 0.” P7 §2.3 ¶3); and
a result comparator to: 
estimate interaction effects … of the hyperparameters (“The quality of a predictive model critically depends on its hyperparameter configuration, but it is poorly understood how these hyperparameters interact with each other to affect the resulting model.” P1 §1 ¶1, “Figure 4 shows an empirical comparison of the average test error across 70 trials of the individual brackets of Hyperband run separately as well as standard Hyperband. In practice, we do not know a priori which bracket s ∈ {0, . . . , 4} will be most effective in identifying good hyperparameters, and in this case neither the most (s = 4) nor least aggressive (s = 0) setting is optimal. However, we note that Hyperband does nearly as well as the optimal bracket (s = 3) and outperforms the baseline uniform allocation (i.e., random search), which is equivalent to bracket s = 0.” P7 §2.3 ¶3); and 
select a permutation of hyperparameters based on the results, the training controller to instruct the neural network trainer to train the neural network using the selected permutation of the hyperparameters and the labeled training data (“The idea behind the original SuccessiveHalving algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configuration remains. The algorithm allocates exponentially more resources to more promising configurations. Unfortunately, SuccessiveHalving requires the number of configurations n as an input to the algorithm. Given some finite budget B (e.g., an hour of training time to choose a hyperparameter configuration), B/n resources are allocated on average across the configurations.” P4 §2.1 ¶1).
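For illustration only, the uniform sampling and SuccessiveHalving steps quoted from Li above can be sketched as follows. This is a minimal sketch and is not code from Li or from the application: the function names, the toy loss, the single `learning_rate` hyperparameter and its bounds are all hypothetical, and the resource is simply doubled each round as one way of giving surviving configurations more budget.

```python
import random

def sample_configuration(bounds, rng):
    """Draw one configuration uniformly from a box of (min, max) bounds."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}

def toy_validation_loss(config, resource):
    """Hypothetical stand-in for training on `resource` units and validating.

    The loss is smallest near learning_rate = 0.1 and shrinks as more
    resource is allocated; it is not a real training run.
    """
    return abs(config["learning_rate"] - 0.1) + 1.0 / resource

def successive_halving(configs, min_resource, loss_fn):
    """Evaluate every survivor, keep the better half, double the resource, repeat."""
    resource = min_resource
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: loss_fn(c, resource))
        survivors = ranked[: max(1, len(ranked) // 2)]
        resource *= 2
    return survivors[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
bounds = {"learning_rate": (0.001, 1.0)}  # assumed search space
candidates = [sample_configuration(bounds, rng) for _ in range(8)]
best = successive_halving(candidates, min_resource=1, loss_fn=toy_validation_loss)
print(best)
```

With this toy loss the ranking at any fixed resource depends only on the distance of `learning_rate` from 0.1, so the surviving configuration is the sampled candidate closest to that value.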

However, Li does not explicitly disclose: estimate interaction effects based on a linear function of the hyperparameter using at least one of model coefficients or an error.

Hutter teaches: estimate interaction effects based on a linear function of the hyperparameter using at least one of model coefficients or an error (p4 §3.1, Functional ANOVA decomposes a linear function into additive components (coefficients) representing the importance/interaction effects of each hyperparameter, “In this work, we introduced an efficient approach for assessing the importance of the inputs to a blackbox function, and applied it to quantify the effect of algorithm hyperparameters. We first derived a novel linear-time algorithm for computing marginal predictions over arbitrary input dimensions in random forests and then showed how this algorithm can be used to quantify the importance of main effects and interaction effects through a functional ANOVA framework.” P8 §5 ¶1).
Li and Hutter are both in the same field of endeavor of hyperparameter optimization and are analogous. Li teaches exemplary hyperparameter tuning while Hutter teaches determining interaction effects between hyperparameters based on a linear function and coefficients. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the hyperparameter tuning as taught by Li with the modeled interaction effects as taught by Hutter. One would have been motivated to do so because using the models allows a user to quantify the importance of interaction effects in a black box model (Hutter p8 §5 ¶1).
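As a purely illustrative aid, the notion of estimating an interaction effect from the coefficients of a linear function can be sketched with a two-level, two-factor design. This is not Hutter's random-forest functional ANOVA and is not taken from the application; it is an ordinary least-squares fit of y = b0 + b1*x1 + b2*x2 + b12*x1*x2, and the function name and accuracy values are hypothetical.

```python
from itertools import product

def fit_two_factor_model(results):
    """Least-squares coefficients of y = b0 + b1*x1 + b2*x2 + b12*x1*x2.

    `results` maps coded settings (x1, x2), each -1 or +1, to a response.
    For a full two-level design the model columns are orthogonal, so each
    coefficient is the average of (column value * response).
    """
    n = len(results)
    b0 = sum(results.values()) / n
    b1 = sum(x1 * y for (x1, x2), y in results.items()) / n
    b2 = sum(x2 * y for (x1, x2), y in results.items()) / n
    b12 = sum(x1 * x2 * y for (x1, x2), y in results.items()) / n
    return {"intercept": b0, "x1": b1, "x2": b2, "x1*x2": b12}

# Hypothetical validation accuracies for the four runs (illustrative numbers only).
design = list(product((-1, 1), repeat=2))  # (-1,-1), (-1,1), (1,-1), (1,1)
accuracy = dict(zip(design, [0.70, 0.80, 0.74, 0.92]))
coefficients = fit_two_factor_model(accuracy)
print(coefficients)
```

In this sketch the `x1*x2` coefficient is the interaction estimate: a nonzero value means the effect of one hyperparameter on the response depends on the setting of the other.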

Regarding claims 2 and 12, Li discloses: The apparatus of claim 1, wherein the hyperparameters to be tested represent instructions on how the neural network is to be trained (“In recent years, machine learning models have exploded in complexity and expressibility at the cost of staggering computational costs and a growing number of tuning parameters that are difficult to set by standard optimization techniques. These ‘hyperparameters’ are inputs to machine learning algorithms that govern how the algorithm’s performance generalizes to new, unseen data; examples of hyperparameters include those that impact model architecture, amount of regularization, and learning rates.” P1 §1 ¶1).

Regarding claims 3 and 13, Li discloses: The apparatus of claim 1, wherein the partial set of labeled training data represents less than ten percent of the labeled training data (“Here we consider the setting of a black-box batch training algorithm that takes a data set as input and outputs a model. In this setting, we treat the resource as the size of a random subset of the data set with R corresponding to the full data set size. Subsampling data set sizes using Hyperband, especially for problems with super-linear training times like kernel methods, can provide substantial speedups.” P8 ¶2, “Note that R is also the number of configurations evaluated in the bracket that performs the most exploration, i.e., s = s_max. In practice one may want n ≤ n_max to limit overhead associated with training many configurations on a small budget, i.e., costs associated with initialization, loading a model, and validation. In this case, set s_max = ⌊log_η(n_max)⌋. Alternatively, one can redefine one unit of resource so that R is artificially smaller (i.e., if the desired maximum iteration is 100k, defining one unit of resource to be 100 iterations will give R = 1,000, whereas defining one unit to be 1k iterations will give R = 100). Thus, one unit of resource can be interpreted as the minimum desired resource and R as the ratio between maximum resource and minimum resource.” P8 §2.5 ¶3).
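The arithmetic in the quoted passage of Li can be worked through as follows. This is an illustrative sketch only: the function names are hypothetical, η is assumed to be an integer, and the 100k-iteration figures are the ones given in the quotation.

```python
def max_resource_ratio(max_iterations, iterations_per_unit):
    """R: the maximum resource expressed in units of the minimum resource."""
    return max_iterations // iterations_per_unit

def minimum_fraction(r):
    """Share of the maximum resource that one unit (the minimum) represents."""
    return 1 / r

def most_exploratory_bracket(n_max, eta):
    """s_max = floor(log_eta(n_max)), computed with integer arithmetic."""
    s = 0
    while eta ** (s + 1) <= n_max:
        s += 1
    return s

r_fine = max_resource_ratio(100_000, 100)      # one unit = 100 iterations
r_coarse = max_resource_ratio(100_000, 1_000)  # one unit = 1k iterations
print(r_fine, r_coarse)
print(minimum_fraction(r_coarse))
print(most_exploratory_bracket(81, 3))  # eta = 3 and n_max = 81 are assumed values
```

Because R is the ratio between the maximum and minimum resource, the smallest allocation is 1/R of the largest, which is how the choice of unit changes the size of the smallest subset that is trained on.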



Regarding claims 4 and 14, Li discloses: The apparatus of claim 1, wherein the result represents an accuracy of the training of the neural network (“SuccessiveHalving’s budget scales like ∆^(−max{α,β}), which can be significantly smaller than the uniform allocation’s budget of ∆^(−(α+β)). However, because α and β are unknown in practice, neither method knows how to choose the optimal n or B to achieve this ∆ accuracy. In Section 4.3.3, we show how Hyperband addresses this issue” p22, please see also §4.3.3, “The issue arises when configurations with slower convergence rates give a better final model. While if time is a priority, it may make sense to optimize for speed and accuracy, in general, Hyperband should be able to handle differing convergence rates.” P29 §6.2 ¶1).

Regarding claims 5 and 15, Li discloses: The apparatus of claim 1, wherein the training controller is further to, in response to completion of the training of the neural network using the selected permutation and the labeled training data, validate the (“• run_then_return_val_loss(t, r) – a function that takes a hyperparameter configuration t and resource allocation r as input and returns the validation loss after training the configuration for the allocated resources.” P7 ¶1).

Regarding claims 6 and 16, Li discloses: The apparatus of claim 5, wherein the training controller is further to, in response to determining that the neural network is not accurate, cause the neural network trainer to train the neural network using a second permutation of hyperparameters and the labeled training data (“: (a) The heatmap shows the validation error over a two-dimensional search space with red corresponding to areas with lower validation error. Configuration selection methods adaptively choose new configurations to train, proceeding in a sequential manner as indicated by the numbers. (b) The plot shows the validation error as a function of the resources allocated to each configuration (i.e. each line in the plot). Configuration evaluation methods allocate more resources to promising configurations.” P2 Fig1).

Regarding claims 7 and 17, Li discloses: The apparatus of claim 1, wherein the plurality of permutations of hyperparameters in the design of experiments matrix represents all of the permutations of hyperparameters in the design of experiments matrix (“get_hyperparameter_configuration(n) – a function that returns a set of n i.i.d. samples from some distribution defined over the hyperparameter configuration space. In this work, we assume uniformly sampling of hyperparameters from a predefined space (i.e., hypercube with min and max bounds for each hyperparameter), which immediately yields consistency guarantees. However, the more aligned the distribution is towards high quality hyperparameters (i.e., a useful prior), the better Hyperband will perform (see Section 6 for further discussion).” P6 last ¶, Table 1, see also Fig 1).

Regarding claims 8 and 18, Li discloses: The apparatus of claim 1, wherein the plurality of permutations of hyperparameters in the design of experiments matrix represents less than all of the permutations of hyperparameters in the design of experiments matrix (“get_hyperparameter_configuration(n) – a function that returns a set of n i.i.d. samples from some distribution defined over the hyperparameter configuration space. In this work, we assume uniformly sampling of hyperparameters from a predefined space (i.e., hypercube with min and max bounds for each hyperparameter), which immediately yields consistency guarantees. However, the more aligned the distribution is towards high quality hyperparameters (i.e., a useful prior), the better Hyperband will perform (see Section 6 for further discussion).” P8 last ¶, Table 1, Fig 1).
Regarding claim 10, Li discloses: The apparatus of claim 9, wherein the result comparator is further to cause the estimated interaction effects to be displayed to a user (“The quality of a predictive model critically depends on its hyperparameter configuration, but it is poorly understood how these hyperparameters interact with each other to affect the resulting model.” P1 §1 ¶1, “Figure 4 shows an empirical comparison of the average test error across 70 trials of the individual brackets of Hyperband run separately as well as standard Hyperband. In practice, we do not know a priori which bracket s ∈ {0, . . . , 4} will be most effective in identifying good hyperparameters, and in this case neither the most (s = 4) nor least aggressive (s = 0) setting is optimal. However, we note that Hyperband does nearly as well as the optimal bracket (s = 3) and outperforms the baseline uniform allocation (i.e., random search), which is equivalent to bracket s = 0.” P7 §2.3 ¶3).

Alternatively, regarding claims 3 and 13, Li states in the passages cited above for those claims (P8 ¶2 and P8 §2.5 ¶3) that any value of the desired resource can be used and gives an example of 10% of the training data. It would have been obvious to one of ordinary skill in the art before the effective filing date to use any desired value around 10% as taught by Li, including the less than 10% recited in the claim, to yield predictable results. MPEP 2144.05.II.A. states that optimization of values or ranges through routine experimentation cannot produce inventive subject matter unless there is evidence indicating that the value is critical.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-8, 10-18 and 21 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246. The examiner can normally be reached M-F: 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ERIC NILSSON/
Primary Examiner, Art Unit 2122