Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on January 12, 2022, in which claims 1, 2, 4, and 5 are amended. Claims 1-5 are currently pending.

Response to Arguments
The rejections of claims 1-5 under 35 U.S.C. § 112(b)/(f) are hereby withdrawn in view of applicant's amendments and remarks.
Applicant’s arguments with respect to the rejection of claims 1-5 under 35 U.S.C. 101, based on the amendment, have been considered and are not deemed to be persuasive.
Applicant’s arguments with respect to the rejection of claims 1-5 under 35 U.S.C. 103(a), based on the amendment, have been considered and are persuasive. However, the arguments are moot in view of the new ground of rejection set forth below.

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-5 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is a process, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer implemented method of processing neural networks, which, under its broadest reasonable interpretation is a series of mental processes.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following: 
calculating, by the processor, based on the measured data in which a first data size is associated with a prediction performance of a model generated by using training data of the first data size, a first parameter value which defines a first prediction performance curve that indicates a relationship between a data size and a prediction performance (mathematical calculation),
wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold (mathematical calculation),
calculating, by the processor, a plurality of second parameter values which defines a plurality of second prediction performance curves that represents the plurality of sample point sequences and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data (mathematical calculation),
generating variance information which indicates variation of a fourth prediction performance of a second data size estimated from the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights (mathematical calculation),
determining, by the processor, based on the variance information, whether to execute the machine learning algorithm by using training data of the second data size (evaluation and judgement)
Therefore, claim 1 recites an abstract idea which is a judicial exception.
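For illustration only, the calculation steps listed above can be sketched numerically as follows. The claim does not specify a curve form, a sampling distribution, or a weighting rule, so the power-law curve with a fixed exponent, the uniform sampling within the threshold, the Gaussian likelihood weights, and all names below are assumptions made for this sketch and are not taken from the application.

```python
import numpy as np

C = 0.5  # assumed fixed decay exponent of the curve (not from the claim)

def fit_curve(sizes, scores):
    """Least-squares fit of score = a - b * size**(-C); returns (a, b)."""
    x = np.asarray(sizes, dtype=float) ** (-C)
    slope, intercept = np.polyfit(x, np.asarray(scores, dtype=float), 1)
    return intercept, -slope

def predict(params, size):
    """Evaluate the assumed curve at one size or an array of sizes."""
    a, b = params
    return a - b * np.asarray(size, dtype=float) ** (-C)

def weighted_variance(sizes, scores, target_size,
                      n_samples=200, threshold=0.02, sigma=0.01, seed=0):
    """Weighted mean and variance of the predicted performance at target_size."""
    rng = np.random.default_rng(seed)
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    first = fit_curve(sizes, scores)        # first prediction performance curve
    on_curve = predict(first, sizes)        # performance on the first curve
    preds = np.empty(n_samples)
    log_w = np.empty(n_samples)
    for i in range(n_samples):
        # One sample point sequence: each point stays within `threshold`
        # of the first curve (assumed uniform sampling).
        seq = on_curve + rng.uniform(-threshold, threshold, size=sizes.shape)
        second = fit_curve(sizes, seq)      # one second curve
        # Assumed weight: Gaussian log-likelihood of the measured data.
        resid = scores - predict(second, sizes)
        log_w[i] = -0.5 * np.sum((resid / sigma) ** 2)
        preds[i] = predict(second, target_size)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    mean = float(np.sum(w * preds))
    var = float(np.sum(w * (preds - mean) ** 2))
    return mean, var
```

A call such as weighted_variance([100, 200, 400, 800], [0.70, 0.76, 0.80, 0.83], target_size=1600) returns the weighted mean and variance at the larger data size. The final determining step would then compare that variance against some cutoff, which the claim likewise leaves unspecified.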
Step 2A Prong Two Analysis:  Claim 1 recites the additional element “a processor”. However, this additional element is a computer component recited at a high level of generality, such that it amounts to no more than mere instructions to apply the judicial exception using a generic computer component.  An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application.  Claim 1 also recites the following additional insignificant extra-solution activity:
generating measured data in which the first data size is associated with a first prediction performance of the model 
sampling, by the processor, a second prediction performance within a predetermined range different from a third prediction performance on the first prediction performance curve a plurality of times for each of a plurality of different data sizes
to generate a plurality of sample point sequences, each of which is a sequence of combinations of a data size and a prediction performance
which amounts to gathering and outputting data. Claim 1 also recites additional elements “a machine learning algorithm to generate a model by using training data of a first data size” which amounts to generally linking the judicial exception to a particular technology or field of use.  Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 4 and 5, which recite a system and a computer program product, respectively, as well as to dependent claims 2-3. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional mental processes “the threshold is different depending on the plurality of data sizes.” which amounts to evaluation and judgement.
Dependent claim 3 recites additional mathematical calculations “calculating a plurality of first occurrence probabilities corresponding to the plurality of second parameter values by using the plurality of second parameter values and the measured data”, “converting the plurality of first occurrence probabilities into a plurality of second occurrence probabilities corresponding to the plurality of sample point sequences by using the plurality of sample point sequences and the plurality of second parameter values” as well as additional mental processes “determining the plurality of weights from the plurality of second occurrence probabilities” which amounts to evaluation and judgement.

Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1-5 are rejected under 35 U.S.C. § 101.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1 and 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Klein (“Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets”, 2017) and Shark (“Shark Machine Learning Library Documentation”, 2016), and further in view of Hara (US 2017/0228639 A1).

Regarding claim 1, Klein teaches executing, by a processor, a machine learning algorithm to generate a model by using training data of a first data size ([Abstract] "Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural network... To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size").
and generating measured data in which the first data size is associated with a first prediction performance of the model ([Abstract] "To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size" Klein explicitly teaches generating data associated with a first prediction performance that is further associated with the data size.).
calculating, by the processor, based on the measured data, ([Abstract] “To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size").
a first parameter value which defines a first prediction performance curve that indicates a relationship between a data size and a prediction performance, (See figure 6 on p. 14, also included below, for error as a function of dataset size.  See eqn. 6 for description of kernel.  First parameter represented by C. Second parameter represented by gamma. See also section 3.1).
sampling, by the processor, a second prediction performance different from a third prediction performance on the first prediction performance curve a plurality of times for each of a plurality of data sizes, ([p. 11 Section A.1] “after sampling K hyperparameter settings from the marginal loglikelihood for the GP using MCMC (line 1), for every hyperparameter setting.” [p. 13 Section B] “Scaling of Loss and Computational Cost With Dataset Size…Figure 6 shows these trends for ten random configurations, evaluated on subsets of different sizes” See figure 8 for range of noise (variance) detected at each dataset size).
to generate a plurality of sample point sequences, each of which is a sequence of combinations of a data size and a prediction performance ([p. 13 Section B] See Figure 6 and 7.  Each point is a combination of dataset size and model performance, each curve is a sequence of these points and there are a plurality of curves in each graph to compare performance.  “To show that our method, i.e. the kernel we use and our initial design, actually capture these trends, we sampled points from that data as our initial design and predicted loss and cost of unseen configurations”).
calculating, by the processor, a plurality of second parameter values which defines a plurality of second prediction performance curves that represents the plurality of sample point sequences (Figure 6 shows a plurality of curves that represent prediction performance each of which can be represented by a second parameter value gamma.  The sample point sequences are combinations of dataset size and error [p. 13 B] “Figure 6 shows these trends for ten random configurations, evaluated on subsets of different sizes.” Evaluated interpreted as synonymous with calculated, by the processor.).
variance information which indicates variation of a prediction performance of a second data size ([Klein 2.3] “(multi-task Bayesian optimization) The blackbox function f : X × ℝ → ℝ now takes another input representing the data subset size;” Data subset interpreted as second data “we will use relative sizes s = N_sub/N ∈ [0, 1], with s = 1 representing the entire dataset. While the eventual goal is to minimize the loss f(x, s = 1) for the entire dataset, evaluating f for smaller s is usually cheaper...We propose a principled rule for the automatic selection of the next (x, s) pair to evaluate...Based on these observations, we expect that relatively small fractions of the dataset yield representative performances and therefore vary our relative size parameter s on a logarithmic scale.” [Section C] “We repeated each run with a given subset size K = 10 times using different subsets, and estimate the observation noise variance at each point” See eqn. 9). However, Klein does not explicitly teach wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data,
generating, by the processor 
, variance information which indicates variation of a fourth prediction performance of a second data size on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights 
and determining, by the processor, based on the variance information, whether to execute the machine learning algorithm by using training data of the second data size.  
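For illustration only, the Klein passages quoted above (a relative subset size, defined as subset size divided by full dataset size and varied on a logarithmic scale, and an observation noise variance estimated from K = 10 repeated runs on different subsets) can be sketched as follows. This is not Klein's code. The function names, the smallest relative size of 1/64, and the toy loss function are assumptions made for this sketch.

```python
import numpy as np

def log_scale_sizes(s_min=1 / 64, num=4):
    """Relative sizes s = N_sub / N, log-spaced from s_min up to s = 1."""
    return np.logspace(np.log2(s_min), 0.0, num=num, base=2.0)

def noise_variance(data, s, loss_fn, k=10, seed=0):
    """Mean and sample variance of the loss over k different random subsets."""
    rng = np.random.default_rng(seed)
    n_sub = max(2, int(round(s * len(data))))
    losses = [
        loss_fn(rng.choice(data, size=n_sub, replace=False))
        for _ in range(k)
    ]
    return float(np.mean(losses)), float(np.var(losses, ddof=1))

# Toy stand-in for "train on the subset and measure validation error":
# the absolute error of the subset mean against the full-data mean.
data = np.arange(1000.0)

def toy_loss(subset):
    return abs(subset.mean() - data.mean())

for s in log_scale_sizes():
    mean_loss, var_loss = noise_variance(data, s, toy_loss)
    print(f"s={s:.4f}  mean loss={mean_loss:.3f}  noise variance={var_loss:.3f}")
```

In this toy setting the noise variance falls to zero at s = 1, because every "subset" is then the entire dataset.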

Shark, which teaches a related method of neural network optimization, teaches wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold ([p. 4] "Next we employ a stopping criterion that monitors progress on the training error E. The stopping criterion TrainingError takes in its constructor a window size (or number of time steps) T together with a threshold value ϵ. If the improvement over the last T timesteps does not exceed ϵ, that is, E(t−T)−E(t)<ϵ, the stopping criterion becomes active and tells the optimizer to stop" The sampled second and third prediction performances are interpreted as the values at timesteps t and t−T, respectively.).
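For illustration only, the stopping rule quoted above, which halts training once the improvement in training error over the last T timesteps falls below the threshold, can be sketched as follows. Shark is a C++ library and this is not its API. The Python function and its parameter names are hypothetical and implement only the inequality stated in the quotation.

```python
def training_error_stop(errors, window, epsilon):
    """Return True once E(t - T) - E(t) < epsilon at the latest timestep t.

    errors  -- training errors E(0), E(1), ..., E(t) recorded so far
    window  -- the window size T
    epsilon -- the improvement threshold
    """
    if len(errors) <= window:
        return False  # fewer than T earlier timesteps are available
    return errors[-1 - window] - errors[-1] < epsilon


history = [1.0, 0.5, 0.3, 0.29, 0.289]
print(training_error_stop(history[:3], window=2, epsilon=0.05))  # 1.0 - 0.3 is a large improvement
print(training_error_stop(history, window=2, epsilon=0.05))      # 0.3 - 0.289 is a small improvement
```

The first call leaves the criterion inactive and the second activates it, since the improvement over the window has dropped below the threshold.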

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network optimization techniques in Klein with the stopping criteria in Shark. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Shark that, in optimizing a neural network, it is common to use the change in the loss function value as a stopping criterion in neural network training.
Shark, however, does not explicitly teach and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data, 
generating, by the processor 
variance information which indicates variation of a fourth prediction performance of a second data size on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights 
and determining, by the processor, based on the variance information, whether to execute the machine learning algorithm by using training data of the second data size.  

Hara, which teaches a related method of optimizing machine learning hyperparameters, teaches determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data, ([¶0015] "The function is operable to estimate the evaluation value from differences between the tentative weight data of a first iteration of the plurality of iterations and the tentative weight data of a second iteration of the plurality of iterations. According to the tenth aspect, the apparatus may generate an accurate predictive model based on the tentative weight data." [¶0068] "At S170, the training section may generate a new setting used for training of second neural networks." [¶0042] "the apparatus 100 may improve prediction accuracy of the predictive model, and thereby may efficiently determine an optimized setting of the neural network").
generating, by the processor (Hara [¶0104] “The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.”).
variance information which indicates variation of a fourth prediction performance of a second data size on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights ([¶0050] "In the embodiment, the setting may include one or more hyper parameters relating to a local response normalization (or LRN) such as local size" [¶0041] "The selecting section 190 may select one setting based on performances of neural networks of which training is not terminated. For example, the selecting section 190 may select a setting that gives a neural network the best evaluation value among the first neural networks, and, the second neural networks of which training is not terminated by the terminating section 170" [¶0042] "As explained above, the apparatus 100 may improve prediction accuracy of the predictive model, and thereby may efficiently determine an optimized setting of the neural network by terminating at least part of the training of the neural networks by predicting the performance from  the tentative weight data." Local size is an explicitly determined parameter of model setting using both weights and model settings of first and second models.).
and determining, by the processor, based on the variance information, whether to execute the machine learning algorithm by using training data of the second data size. ([¶0050] "In the embodiment, the setting may include one or more hyper parameters relating to a local response normalization (or LRN) such as local size" [¶0041] "The selecting section 190 may select one setting based on performances of neural networks of which training is not terminated. For example, the selecting section 190 may select a setting that gives a neural network the best evaluation value among the first neural networks, and, the second neural networks of which training is not terminated by the terminating section 170" Hara explicitly teaches using a setting based on the second data size to start or stop training of an epoch of a machine learning algorithm.  The primary reference Klein explicitly teaches that variance to determine performance may be based on the data subset size.). 

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to generate, by a processor a prediction performance parameter relative to the model weights and indicative of the model prediction error. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Hara that the method ([¶0042] “may improve prediction accuracy of the predictive model, and thereby may efficiently determine an optimized setting of the neural network by terminating at least part of the training of the neural networks by predicting the performance from the tentative weight data”).

Regarding claim 3, the combination of Klein, Shark, and Hara teaches 
The estimation method according to claim 1, wherein the determining of a plurality of weights includes calculating a plurality of first occurrence probabilities corresponding to the plurality of second parameter values by using the plurality of second parameter values and the measured data (Hara [¶0015] "the function is operable to estimate the evaluation value from differences between the tentative weight data of a first iteration of the plurality of iterations and the tentative weight data of a second iteration of the plurality of iterations. According to the tenth aspect, the apparatus may generate an accurate predictive model based on the tentative weight data." [¶0045] "The training data may include at least one set of input data" [¶0065] "the generating section may normalize the tentative weight data of a first iteration of the plurality of iterations and the tentative weight data of a second iteration of the plurality of iteration" [¶0066] "The generating section may adopt calculation of entropy and/or basic statistics" [¶0067] "The generating section may generate a predictive model").
converting the plurality of first occurrence probabilities into a plurality of second occurrence probabilities corresponding to the plurality of sample point sequences by using the plurality of sample point sequences and the plurality of second parameter values (Hara [¶0015] “the function is operable to estimate the evaluation value from differences between the tentative weight data of a first iteration of the plurality of iterations and the tentative weight data of a second iteration of the plurality of iterations… the apparatus may generate an accurate predictive model based on the tentative weight data.” Converting and generating interpreted as synonymous).
determining the plurality of weights from the plurality of second occurrence probabilities (Hara [¶0065] “the generating section may normalize the tentative weight data of a first iteration of the plurality of iterations and the tentative weight data of a second iteration of the plurality of iteration"). 

Regarding claim 4, claim 4 effectively mirrors claim 1 and is therefore rejected under a similar interpretation.

Regarding claim 5, claim 5 effectively mirrors claim 1 and is therefore rejected under a similar interpretation.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Klein, Shark, and Hara, and further in view of Domhan (“Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves”, 2015).

Regarding claim 2, the combination of Klein, Shark, and Hara teaches The estimation method according to claim 1.  However, the combination of Klein, Shark, and Hara does not explicitly teach, wherein, the threshold is different depending on the plurality of data sizes.  

Domhan, which teaches a related art of neural network optimization, teaches the estimation method according to claim 1, wherein the threshold is different depending on the plurality of data sizes ([p. 3463] "We then consider the predicted probability P(...) that the network, after training for m intervals, will exceed the performance ŷ. If this probability is above a threshold δ then training continues as usual for the next p epochs. Otherwise, training is terminated and we return the expected validation error" See also Eqns. 9-11. Domhan explicitly teaches that the threshold is dependent on the probability which is a function of the data set size.).
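For illustration only, the termination rule in the Domhan quotation above can be sketched as follows: training continues when the predicted probability of exceeding the best performance observed so far is above a threshold, and is terminated otherwise. This is not Domhan's implementation. The sketch assumes that posterior samples of the final performance (for example, from MCMC) are already available, and the names y_best and delta are hypothetical labels for the best observed performance and the threshold.

```python
import numpy as np

def continue_training(predicted_final, y_best, delta):
    """Decide whether to keep training, based on posterior samples.

    predicted_final -- samples of the predicted performance after m intervals
    y_best          -- best performance observed so far
    delta           -- probability threshold
    Returns (keep_going, estimated probability of exceeding y_best).
    """
    samples = np.asarray(predicted_final, dtype=float)
    p_exceed = float(np.mean(samples > y_best))  # Monte Carlo estimate
    return p_exceed > delta, p_exceed


samples = [0.70, 0.72, 0.74, 0.76]
print(continue_training(samples, y_best=0.75, delta=0.05))  # one of four samples exceeds 0.75
print(continue_training(samples, y_best=0.80, delta=0.05))  # no sample exceeds 0.80
```

The estimate is simply the fraction of samples above the best observed value, so its resolution is limited by the number of samples drawn.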

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network optimization methods of Klein, Shark, and Hara with that of Domhan. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that all four methods optimize the neural network through a parametrized loss function.  Hara and Klein both explicitly teach that the loss function is dependent on data subset sizes, and Hara further teaches that the threshold used as a stopping criterion for training may be dependent on the data subset size.  Domhan further explains the motivation for using a probabilistic model for the stopping criterion ([p. 3462 Col. 1] “Given this model, a simple approach would be to find a maximum likelihood estimate for all parameters. However, this would not properly model the uncertainty in the model parameters. Since our predictive termination criterion aims at only terminating runs that are highly unlikely to improve on the best run observed so far we need to model uncertainty as truthfully as possible and will hence adopt a Bayesian perspective, predicting values y_m using Markov Chain Monte Carlo (MCMC) inference.”).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124