Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Detailed Action
This office action is responsive to the Amendments and Request for Continued Examination (RCE) filed on 20 October 2021.  As directed by the Amendments, claims 6 and 17 have been amended.  Claims 6-8 and 11-17 are pending in the application.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 20 October 2021 has been entered.
 
Response to Arguments
The arguments presented on pages 7-9 of the Remarks filed on 20 October 2021 have been fully considered by the Examiner.  These arguments, while persuasive, are based either upon newly amended claim limitations in the independent claims or upon the dependent claims’ dependence from their respective base claims, and are moot in view of the new grounds for rejection presented below.
Claim Objections
Claims 6 and 17 are objected to because of the following informalities:  
In the second to last lines of claims 6 and 17, the phrase “notify the first sever” should instead read “notifying the first server.”
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 6-8 and 11-17 are rejected under 35 U.S.C. 103 as being unpatentable over Young et al., "Optimizing Deep Learning Hyper-parameters Through an Evolutionary Algorithm," Machine Learning in High Performance Computing Environments Workshop, Austin, Texas, 15-20 November, 2015, hereinafter “Young,” (previously cited) in view of Domhan et al., “Speeding Up Automatic Hyperparameter  “Domhan” (previously cited) and further in view of Snoek et al., “Practical Bayesian Optimization of Machine Learning Algorithms,” arXiv:1206.2944v2 [stat.ML] 29 Aug 2012, hereinafter “Snoek” (previously cited) and Drevo et al. (US 2016/0132787, hereinafter “Drevo”).

Regarding claim 6, Young discloses [a] method for use in a system configured to construct a neural network for performing deep learning operations, and to search for parameters defining the deep learning operations, (Young, § 3, ¶ 1, “In this work, a framework for optimizing the hyper-parameters of a deep network using an evolutionary algorithm is presented.”;
Young, § 1, ¶ 2, “This work proposes to address the model selection problem and ease the demands on data researchers using MENNDL, an evolutionary algorithm that leverages a large number of compute nodes. These nodes communicate over MPI [Message Passing Interface] to distribute the task of finding the optimal hyper-parameters across the nodes of a super computer.”;
Young, § 3 “The evolutionary algorithm is implemented as a master-slave process where the genes [corresponds to “parameters defining a learning operation”] are calculated on a single node (i.e. selection, crossover, mutation are handled by a single node), and many slaves are used to evaluate the fitness of the population.”)
a first server, a second server and third server included in the system (Ibid., the “master” node corresponding to the claimed “first server” and the “many slaves” corresponding to at least the claimed “second server” and “third server”)
the method comprising: configuring the first server to perform a first plurality of processes, the first plurality of processes comprises specifying, from a search range of the parameters, a combination of first parameters and a combination of second parameters, using a search method based on a uniform distribution, (Young, § 3.1 and Table 1, Deep Learning network hyper-parameters are called "genes," and a combination of one or more hyper-parameters to be evaluated is called an "individual"; “A range [corresponds to claimed “search range”] and resolution is defined for each gene in order to eliminate searching the areas of the hyperparameter space that are not of interest”, and “the initial population of individuals is created by sampling each gene from a uniform random distribution.”  In Table 1 there are 500 individuals (combinations of hyper-parameters) with 6 different genes (hyper-parameters) per individual.  The number of individuals and the number of genes per individual are configurable, so the method is operable to specify a first individual and a second individual corresponding to the claimed “combination of first parameters” and “combination of second parameters”.)
transmitting the combination of the first parameters to the second server, (Young, § 3.2 “Computing Framework”, “The gene for each individual is communicated between nodes using MPI in order for its fitness to be evaluated and its fitness is returned to the master node using MPI.” [The master supercomputer node transmits the “individual” corresponding to the claimed “combination of first parameters” to a slave node.])
transmitting the combination of the second parameters to the third server (Ibid.)
receiving, from the second server, a first learning result obtained by the deep learning operations based on the combination of the first parameters; (Ibid. [The “fitness” corresponding to the claimed “first learning result” is returned to the master node from the slave node using MPI.])
receiving, from the third server, a second learning result obtained by the deep learning operations based on the combination of the second parameters, (Ibid.)
Application No. 15/214380; Docket No.: 101986.0203P; Amendment dated 03/01/2021
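
For illustration only, the master/slave arrangement mapped above (uniform sampling of each gene on a single node, distribution of individuals to evaluation nodes, and return of fitness values) may be sketched as follows. This is a minimal sketch under stated assumptions and is not Young's implementation: the gene names and ranges in SEARCH_RANGE, the toy evaluate_fitness function, and the use of a Python thread pool in place of MPI are all hypothetical stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search range: (low, high, resolution) for each gene.
SEARCH_RANGE = {
    "num_layers": (1, 6, 1),
    "kernel_size": (1, 8, 1),
    "num_filters": (16, 128, 16),
}


def sample_individual(rng):
    """Draw each gene from a uniform distribution over its discretized range."""
    individual = {}
    for gene, (low, high, step) in SEARCH_RANGE.items():
        n_steps = (high - low) // step
        individual[gene] = low + step * rng.randint(0, n_steps)
    return individual


def evaluate_fitness(individual):
    """Toy placeholder for training a network on an evaluation node."""
    return -abs(individual["num_layers"] - 3) - abs(individual["kernel_size"] - 5)


def run_master(num_individuals=4, num_workers=2, seed=0):
    """Master node: sample individuals, dispatch them, collect fitness values."""
    rng = random.Random(seed)
    population = [sample_individual(rng) for _ in range(num_individuals)]
    # The thread pool stands in for MPI send/receive between master and slaves.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        fitnesses = list(pool.map(evaluate_fitness, population))
    return list(zip(population, fitnesses))


if __name__ == "__main__":
    for individual, fitness in run_master():
        print(individual, fitness)
```

In this sketch each element of the population plays the role of a claimed "combination of parameters," and each returned fitness plays the role of a claimed "learning result."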

[…]

transmitting the combination of the third parameters to the second server or the third server, (Young, § 3.2 “Computing Framework”, “The gene for each individual is communicated between nodes using MPI in order for its fitness to be evaluated and its fitness is returned to the master node using MPI.” [The master supercomputer node transmits the “individual” corresponding to the claimed “combination of third parameters” to a slave node.])
receiving, from the second server or the third server, a third learning result obtained by the deep learning operations based on the combination of the third parameters; (Ibid. [The “fitness” corresponding to the claimed “third learning result” is returned to the master node from a slave node using MPI.])
and configuring the second server and the third server to perform a second plurality of processes, the second plurality of processes comprises receiving the combination of the parameters transmitted from the first server, (Young, § 3.2 

Young at least contemplates combining initial random uniform hyperparameter selection with a Bayesian (i.e. probabilistic) optimization method trained on earlier hyperparameter combinations. (Young, § 6 “CONCLUSIONS AND FUTURE WORK,” ¶ 2 “Although the work presented here most commonly performs mutation by pulling new hyperparameters from a constrained uniform distribution, it would be interesting to explore layering the genetic algorithm’s mutation with other currently researched algorithms in hyper-parameter optimization. For example, a Bayesian optimization method could be learned from performance samples acquired in the course of the genetic algorithm’s evolution. Future generations can selectively sample from the distribution learned over the hyperparameters.”)

Young, however, does not explicitly disclose specifying, from the search range of the parameters, a combination of third parameters, based on the first and second learning results and using a search method based on a probability distribution.
Snoek teaches specifying, from the search range of the parameters, a combination of third parameters, based on the first and second learning results and using a search method based on a probability distribution (Snoek, pg. 2, § 2 “Bayesian Optimization with Gaussian Process Priors,” “What makes Bayesian optimization different from other procedures is that it constructs a probabilistic model for f(x) and then exploits this model to make decisions about where in X to next evaluate the function, while integrating out uncertainty. The essential philosophy is to use all of the information available from previous evaluations of f(x) and not simply rely on local gradient and Hessian approximations. This results in a procedure that can find the minimum of difficult non-convex functions with relatively few evaluations, at the cost of performing more computation to determine the next point to try. When evaluations of f(x) are expensive to perform -- as is the case when it requires training a machine learning algorithm -- it is easy to justify some extra computation to make better decisions.” [The Bayesian optimization method of Snoek can take the results of prior evaluations of hyperparameters that were chosen using a uniform distribution, and use those results in a Bayesian (probabilistic) optimization model to choose which hyperparameters to evaluate next.])
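
For illustration only, the step of choosing a next point from prior evaluations using a probabilistic model may be sketched as follows. This is a minimal one-dimensional sketch under stated assumptions and is not Snoek's implementation: the squared-exponential kernel, the fixed length scale, the unit prior variance, the noise term, and all function names are hypothetical simplifications, and expected improvement is used here as one common acquisition function.

```python
import math

import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Posterior mean and standard deviation of a GP at candidate points."""
    y_mean = y_obs.mean()  # center the observations around a constant mean
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oc = rbf_kernel(x_obs, x_cand)
    solve = np.linalg.solve(k_oo, k_oc)  # K_oo^{-1} K_oc
    mean = y_mean + solve.T @ (y_obs - y_mean)
    var = 1.0 - np.sum(k_oc * solve, axis=0)  # prior variance assumed to be 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the best (lowest) observed value."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def propose_next(x_obs, y_obs, x_cand):
    """Pick the candidate with the highest expected improvement."""
    mean, std = gp_posterior(x_obs, y_obs, x_cand)
    ei = expected_improvement(mean, std, y_obs.min())
    return float(x_cand[int(np.argmax(ei))])


if __name__ == "__main__":
    # Prior evaluations: normalized hyperparameter values and validation errors.
    x_obs = np.array([0.1, 0.5, 0.9])
    y_obs = np.array([0.30, 0.20, 0.50])
    print(propose_next(x_obs, y_obs, np.linspace(0.0, 1.0, 101)))
```

Here x_obs and y_obs stand in for the earlier parameter combinations and their learning results, and the returned value stands in for a next combination selected on the basis of those results.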

Snoek is analogous art, as it is directed to the task of choosing and evaluating neural network hyperparameters.
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to take the previous evaluations of hyperparameters chosen by Young’s uniform distribution method and provide them as inputs to the Bayesian optimization method of Snoek, the benefit being improvement in both the speed of determining hyperparameters to evaluate and the quality of the 

The combination of Young and Snoek does not disclose or teach performing the deep learning operations based on the received combination of the parameters, monitoring an index operating as a metric directed to a level of performance of learning being achieved during the deep learning operations in accordance with the received combination of the parameters and determining if the index satisfies a threshold, responsive to determining that the index satisfies the threshold during the deep learning operations, continuing performance of the deep learning operations, and responsive to determining that the index fails to satisfy the threshold during the deep learning operations, interrupting the deep learning operations and reporting results of the interrupted deep learning operations to the first server.

Domhan teaches performing the deep learning operations based on the received combination of the parameters, (Domhan, pg. 3465, Col. 1, first full paragraph, “Figure 4 shows the effect of our predictive termination criterion on the different DNN training runs [training runs corresponding to the claimed “deep learning operations”]: the predictive termination criterion successfully terminated runs that do not 
Domhan, pg. 3466, Fig 4(c) showing training runs extending for various numbers of epochs)
monitoring an index operating as a metric directed to a level of performance of learning being achieved during the deep learning operations in accordance with the received combination of the parameters and determining if the index satisfies a threshold, (Domhan, pg. 3463, “Speeding up Hyperparameter Optimization,” “We use our predictive models to speed up hyperparameter optimizers as follows. Firstly, while the hyperparameter optimizer is running we keep track of the best performance ŷ found so far (we initialize ŷ to −∞). Each time the optimizer queries the performance l(λ) of a hyperparameter setting λ we train a DNN using λ as usual, except that we terminate this run early if our extrapolation model predicts the network to eventually yield worse performance than ŷ.”)
responsive to determining that the index satisfies the threshold during the deep learning operations, continuing performance of the deep learning operations, (Domhan, pg. 3463, lines 4-9, “We then consider the predicted probability P(y_m ≥ ŷ | y_{1:n}) that the network, after training for m intervals, will exceed the performance ŷ. If this probability is above a threshold σ then training continues as usual for the next p epochs. Otherwise, training is terminated and we return the expected validation error”)
and responsive to determining that the index fails to satisfy the threshold during the deep learning operations, interrupting the deep learning operations (Ibid.) and reporting results of the interrupted deep learning operations to the first server. 
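
For illustration only, the continue-or-interrupt logic mapped above may be sketched as follows. This is a minimal sketch under stated assumptions and is not Domhan's implementation: the callbacks train_one_interval and prob_exceeds_best are hypothetical, and naive_predictor is a crude linear extrapolation standing in for Domhan's learning-curve extrapolation model.

```python
def train_with_early_termination(train_one_interval, prob_exceeds_best,
                                 best_so_far, threshold=0.05,
                                 max_intervals=10):
    """Train interval by interval; interrupt when the index fails the threshold.

    train_one_interval(i) returns the accuracy after interval i.
    prob_exceeds_best(history, best_so_far) returns an estimated probability
    that the run will eventually exceed best_so_far.
    """
    history = []
    for i in range(max_intervals):
        history.append(train_one_interval(i))
        index = prob_exceeds_best(history, best_so_far)
        if index < threshold:
            # Index fails the threshold: interrupt and report partial results.
            return {"interrupted": True, "intervals": i + 1, "history": history}
        # Index satisfies the threshold: training continues.
    return {"interrupted": False, "intervals": max_intervals, "history": history}


def naive_predictor(history, best_so_far, horizon=5):
    """Toy stand-in: extrapolate the last improvement and return 0.0 or 1.0."""
    if len(history) < 2:
        return 1.0  # not enough data to justify stopping
    slope = max(history[-1] - history[-2], 0.0)
    projected = history[-1] + slope * horizon
    return 1.0 if projected >= best_so_far else 0.0


if __name__ == "__main__":
    good = train_with_early_termination(lambda i: 0.5 + 0.05 * i,
                                        naive_predictor, best_so_far=0.7)
    poor = train_with_early_termination(lambda i: 0.3 + 0.001 * i,
                                        naive_predictor, best_so_far=0.7)
    print(good["interrupted"], good["intervals"])
    print(poor["interrupted"], poor["intervals"])
```

The dictionary returned on interruption stands in for the claimed reporting of results of the interrupted operations to the first server.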

Domhan is analogous art, as it is directed to the task of optimizing the selection of hyperparameters for deep learning neural networks.
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the hyperparameter selection and evaluation of Young with the predictive early termination of Domhan, the benefit being increased efficiency in hyperparameter selection and evaluation, as stated by Domhan in the Abstract: “Experiments with a broad range of neural network architectures on various prominent object recognition benchmarks show that our resulting approach speeds up state-of-the-art hyperparameter optimization methods for DNNs roughly twofold, enabling them to find DNN settings that yield better performance than those chosen by human experts.”

The combination of Young, Snoek, and Domhan does not teach wherein the second plurality of processes further comprises saving a model including a best performance during or after the deep learning calculation, and notify[ing] the first se[r]ver together with the learning result.

	Drevo teaches wherein the second plurality of processes further comprises saving a model including a best performance during or after the deep learning calculation, (Drevo, ¶ [0034] “Non-limiting examples of model methodologies include support vector machine (SVM), neural networks (NN), Bayesian networks (BN), deep neural networks (DNN), deep belief networks (DBN), stochastic gradient descent (SGD), and random forest (RF).” (underlining added)
Drevo, Fig. 6, element 608 “Perform hybrid model optimization to find best model for the dataset” and element 610 “Return best model”;
Drevo, ¶ [0153] “At block 610, the optimized (or best performing) model is returned. The model may be returned to the user via a UI 102 and/or via email. In some embodiments, a trained model may be returned from the repository 104c.”;
Drevo, ¶ [0043] “The trained model repository 104c stores models trained by the system 100, e.g., models trained as part of the model recommendation, training, and optimization techniques described below. The trained models may be stored temporarily (e.g., until provided to the user) or long-term.”; 
Drevo, ¶ [0050] “The performance table 106d stores performance data for models trained for given datasets [corresponds to claimed “including a best performance”]. A record of table 105d is associated with a methodology 106a, a dataset 106b, and a hyperpartition 106c, and includes a complete model parameterization along with evaluated performance information. In some embodiments, the processing cluster 108 use the performance table as an immutable log, appending and reading data, but not editing or deleting records.”)


and notify[ing] the first se[r]ver together with the learning result. (Drevo, ¶ [0114] “At block 410, the highest performing model k* is trained on the received dataset using, for example, the training process described below in conjunction with FIG. 7. The newly trained model may be evaluated for performance using the specified performance metric (e.g., the metric specified by attribute 204v of the data runs table 106b) and the results stored in the data hub (and, thus, within the performance matrix M.)”;
Drevo, ¶ [0115] “If the termination criteria is reached, the highest performing model k* is returned (or "recommended") at block 414.” [corresponds to claimed “notify[ing] the first se[r]ver together with the learning result”.])
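
For illustration only, saving a best-performing model during training and returning it together with the learning result may be sketched as follows. This is a minimal sketch under stated assumptions and is not Drevo's implementation: the (model_state, accuracy) pairs, the pickle-based checkpoint file, and the shape of the returned message are all hypothetical.

```python
import os
import pickle
import tempfile


def train_and_report(epoch_results, checkpoint_dir):
    """Save the best model seen so far and return it with the learning result.

    epoch_results is an iterable of (model_state, accuracy) pairs, one per
    epoch. If it is empty, no checkpoint file is written.
    """
    best_accuracy = float("-inf")
    best_path = os.path.join(checkpoint_dir, "best_model.pkl")
    for model_state, accuracy in epoch_results:
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            with open(best_path, "wb") as fh:
                pickle.dump(model_state, fh)  # overwrite with the new best
    # Message for the managing server: learning result plus model location.
    return {"learning_result": best_accuracy, "best_model_path": best_path}


if __name__ == "__main__":
    epochs = [({"w": 1}, 0.6), ({"w": 2}, 0.9), ({"w": 3}, 0.8)]
    with tempfile.TemporaryDirectory() as tmp:
        report = train_and_report(epochs, tmp)
        with open(report["best_model_path"], "rb") as fh:
            print(report["learning_result"], pickle.load(fh))
```

The returned dictionary stands in for the notification sent to the first server together with the learning result.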

	Drevo is analogous art, as it is in the field of distributed hyperparameter optimization and deep learning model evaluation.
	It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the trained model storage and “best model” identification of Drevo into the distributed hyperparameter optimization of Young, the benefit being that storing trained models and identifying a best model allow for the 
	
Claim 17 recites similar limitations as claim 6, and is rejected under the same rationale as applied to claim 6 above.

Regarding claim 7, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein the search method based on the uniform distribution is a random method; (Young, § 3.1, Deep Learning network hyper-parameters are called "genes," and a combination of one or more hyper-parameters to be evaluated is called an "individual"; a range and a resolution for each gene [hyper-parameter] is specified, and the initial population of individuals is created by sampling each gene from a uniform random distribution.) and the search method based on the probability distribution is a Bayesian method. (Domhan, § 2.1, “Hyperparameter Optimization Methods,” pg. 3461, ¶ 2, “The three most popular implementations of Bayesian optimization are Spearmint [Snoek et al., 2012], which uses a Gaussian process (GP) [Rasmussen and Williams, 2006] model for M; SMAC [Hutter et al., 2011], which uses random forests [Breiman, 2001] modified to 

Regarding claim 8, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches, wherein the first parameters of the first combination includes a first number of layers of the neural network; and the second parameters of the second combination includes a second number of layers of the neural network different from the first number of layers. (Domhan, pg. 3464, Table 1, “Network hyperparameters,” [The number of neural network layers to be used in a combination of hyperparameters under evaluation ranges from 1 to 6, so the system is operable to generate first and second combinations with differing numbers of layers.])

  
Regarding claim 11, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein at least the first server operates as a managing server to provide (i) the combination of first parameters to the second server to perform deep learning operations on the combination of first parameters and (ii) the combination of second parameters to the third server to perform deep learning operations on the combination of second parameters. (Young, pg. 2, § 3 “METHOD”, “In this work, a framework for 
Regarding claim 12, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein the combination of first parameters or the combination of second parameters includes a number of layers of the neural network.  (Domhan, pg. 3464, Table 1, “Network Hyperparameters,” the number of layers is selectable between 1 and 6, with a default of 1.)

Regarding claim 13, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein the combination of first parameters or the combination of second parameters includes a number of nodes within each layer of the neural network or a rate of learning.  (Domhan, pg. 3464, Table 1, “Fully connected layer hyperparameters,” the number of units [corresponds to claimed “nodes”] within a layer is selectable from 128 to 6144, with a default of 1024; Ibid., “Network hyperparameters,” the initial learning rate is selectable between 1 × 10^-7 and 0.5, with a default of 0.001)
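
For illustration only, the ranges and defaults cited above from Domhan, Table 1 can be written out as a search space from which parameter combinations are drawn. This is a minimal sketch under stated assumptions: the key names, the uniform sampling of integer parameters, and the log-uniform sampling of the learning rate are hypothetical choices made here, not details taken from Domhan.

```python
import math
import random

# Ranges and defaults as cited above from Domhan, Table 1.
SEARCH_SPACE = {
    "num_layers": {"low": 1, "high": 6, "default": 1},
    "units_per_layer": {"low": 128, "high": 6144, "default": 1024},
    "initial_learning_rate": {"low": 1e-7, "high": 0.5, "default": 0.001},
}


def default_combination():
    """Return the combination made of every parameter's default value."""
    return {name: spec["default"] for name, spec in SEARCH_SPACE.items()}


def sample_combination(rng):
    """Draw one combination of parameters from the search range.

    Assumption: integer parameters are sampled uniformly and the learning
    rate is sampled uniformly in log space.
    """
    layers = SEARCH_SPACE["num_layers"]
    units = SEARCH_SPACE["units_per_layer"]
    rate = SEARCH_SPACE["initial_learning_rate"]
    log_rate = rng.uniform(math.log(rate["low"]), math.log(rate["high"]))
    learning_rate = min(max(math.exp(log_rate), rate["low"]), rate["high"])
    return {
        "num_layers": rng.randint(layers["low"], layers["high"]),
        "units_per_layer": rng.randint(units["low"], units["high"]),
        "initial_learning_rate": learning_rate,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    print(default_combination())
    print(sample_combination(rng))
    print(sample_combination(rng))
```

Two draws from sample_combination may differ in num_layers, which corresponds to first and second combinations having different numbers of layers.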

Regarding claim 14, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein the index includes a recognition ratio where the index satisfies the threshold when the recognition ratio is equal to or exceeds the threshold.  (Domhan, pg. 3463, § 4.1 “Experimental Setup,” “We used three popular datasets concerning object recognition from small-sized images: the image recognition datasets CIFAR-10 and CIFAR-100 [Krizhevsky, 2009] and the well-known MNIST dataset [LeCun et al., 1989].”; Ibid., “The MNIST dataset is a classic object recognition dataset consisting of 60,000 training and 10,000 test images with 28 × 28 pixels depicting hand-written digits to be classified into 10 digit classes.” [i.e., the generated network models are used for recognition tasks]; Domhan, pg. 3464, Col. 1, ¶ 3, “For the predictive termination criterion we set the threshold to σ = 0.05 in all experiments, that is, we stopped training a network if our extrapolation model was 95% certain that it would not improve over the best known performance ŷ when fully trained.” [The stopping criteria is based on the future 

Regarding claim 15, the combination of references as applied to claim 6 above teaches [t]he method of claim 6.  Further, the combination teaches wherein the index includes an error ratio where the index satisfies the threshold when the error ratio is below the threshold.  (Domhan, pg. 3464, Col. 1, ¶ 3, “For the predictive termination criterion we set the threshold to σ = 0.05 in all experiments, that is, we stopped training a network if our extrapolation model was 95% certain that it would not improve over the best known performance ŷ when fully trained.” [The stopping criterion is based on the future predicted performance of the particular combination of hyperparameters]; Domhan, pg. 3465, Table 2, [Network performance is expressed as an error percentage (corresponds to claimed “error ratio”)]. [The training for a particular network is terminated early if the predicted future performance (e.g. error percentage) of the network is not better than the currently best known performance.])
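
For illustration only, the two threshold conventions recited in claims 14 and 15 can be stated as a pair of checks. This is a minimal sketch of the claim language as quoted above; the function names and the example values are hypothetical.

```python
def recognition_ratio_satisfies(recognition_ratio, threshold):
    """Claim 14 convention: satisfied when the ratio equals or exceeds the threshold."""
    return recognition_ratio >= threshold


def error_ratio_satisfies(error_ratio, threshold):
    """Claim 15 convention: satisfied when the ratio is below the threshold."""
    return error_ratio < threshold


if __name__ == "__main__":
    print(recognition_ratio_satisfies(0.92, 0.90))
    print(error_ratio_satisfies(0.08, 0.10))
    print(error_ratio_satisfies(0.10, 0.10))
```

Note that the boundary case differs between the two conventions: a recognition ratio equal to the threshold satisfies it, whereas an error ratio equal to the threshold does not.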

Regarding claim 16, the combination of references as applied to claim 11 above teaches [t]he method of claim 11.  Further, the combination teaches wherein a number of the second plurality of deep learning operations is equal to or less than one-half of a number of the first plurality of deep learning operations.  (Domhan, pg. 3465, Col. 1, first full paragraph, “Figure 4 shows the effect of our predictive termination criterion on the different DNN training runs: the predictive termination criterion successfully terminated runs that do not reach top performance but rather converge slowly to mediocre results. The figure also shows that it was possible to terminate many poor runs quite early.”; Domhan, Fig. 4(c) showing network configurations that were predictively terminated after fewer than ~100 epochs, while others continued to be evaluated for as many as ~300 epochs. [i.e., candidates whose training is terminated early may be trained for a number of operations less than one-half the number of operations that other candidates receive])

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SCOTT R GARDNER whose telephone number is (469)295-9128. The examiner can normally be reached 8:00am - 5:00pm M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J Lo can be reached on 571-272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/SCOTT R GARDNER/ Examiner, Art Unit 2126
/ANN J LO/ Supervisory Patent Examiner, Art Unit 2126