Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on September 21, 2022, in which claims 1, 4, and 5 are amended. Claims 1, 3-5 are currently pending.

Response to Arguments
With regard to Applicant's arguments directed towards the practical application of the claimed invention, the broad recitation of improving the predictive performance of a machine learning model is not seen as providing a detailed improvement over the existing art.  The claimed invention is directed almost entirely towards mathematical calculations and concepts which are broadly linked to the field of machine learning on a generic computer system.  Generating variance information is a well-known mathematical concept, and the mention of machine learning is seen as merely a high-level attempt to integrate the judicial exception into a practical application.  It is unclear from the claim language how or why the mathematical concepts rely on the generic computer system and how their application to machine learning reflects an improvement.  The claim language is seen as lacking details regarding an improvement which would integrate the judicial exception into a practical application.
Applicant’s arguments with respect to the rejection of claims 1-5 under 35 U.S.C. 103, based on the amendment, have been considered but have not been deemed persuasive.
With respect to Applicant's arguments that the threshold in Domhan does not depend on the dataset size, Examiner respectfully disagrees.  Examiner asserts that the threshold in Domhan is solely correlated to the probability, which is determined as a function of the dataset size.  Domhan teaches ([p. 3461 §2.2] "The term learning curve appears in the literature for describing two different phenomena...(2) the performance of a machine learning algorithm as a function of the size of the dataset it has available for training...we describe related work on modelling both types of learning curves").  Domhan then teaches in section 3.1 that the probability is based on the interval y, which is taught to commonly be either the training time or dataset size.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that any change in the threshold would be dependent on the probability, which is explicitly dependent on the dataset size, and changing the threshold as a function of the data size would lead to obvious and expected results.  Examiner further asserts that this interpretation is consistent with the support for a different threshold dependent on the dataset size in the instant specification.  There is no direct recitation in the instant specification of a threshold that is different depending on the data size; however, the Examiner has noted the apparent dependence between the performance curves and the threshold as providing support consistent with the interpretation of the threshold in Domhan.  For example, ¶0244 of the published instant specification states that the threshold is predetermined, but makes no mention of a different threshold for each dataset size.  Examiner asserts that other interpretations of the different threshold dependent on the dataset size may be well outside of the scope of the instant specification as drafted, and may be considered new matter.
For this reason, Examiner maintains that the prior art rejection is proper.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1, 3-5 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. Claims 1, 4, and 5 contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Regarding claims 1, 4, and 5, the limitation “using training data of the second data size when the evaluation value is higher than the other evaluation value” is not explicitly described in the specification.  While the specification shows a correlation between performance curves and data size, there is no indication that the training data is conditionally selected based on the evaluation value.  Therefore, this limitation is interpreted as introducing new matter.

Claim 3 is rejected based on its dependence on claim 1.

Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 3-5 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1:  Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis:  Claim 1 recites a computer implemented method of processing neural networks, which, under its broadest reasonable interpretation is a series of mental processes and mathematical calculations and relationships.  For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following: 
calculating, by the processor, based on the measured data, a first formula which includes a first parameter value and indicates a shape of a first prediction performance curve that indicates a relationship between a data size and a prediction performance (mathematical calculation)
 wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold (mathematical calculation)
sampling, by the processor, a second prediction performance within a predetermined range different from a third prediction performance on the first prediction performance curve a plurality of times for each of different a plurality of data sizes (mathematical relationship) 
to generate a plurality of sample point sequences, each of which is a sequence of combinations of a data size and a prediction performance (mathematical relationship)
calculating, by the processor, a plurality of second formulas which respectively include second parameter values and indicate shapes of a plurality of second prediction performance curves that represents the plurality of sample point sequences and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data (mathematical calculation)
generating variance information which indicates variation of a fourth prediction performance of a second data size estimated from on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights (mathematical calculation)
comparing, by the processor, an evaluation value calculated from the variance information and another evaluation value for another machine learning algorithm (mathematical calculations and relationships)
executing, by the processor, based on the variance information, whether to execute the machine learning algorithm by using training data of the second data size when the evaluation value is higher than the other evaluation value (observation, evaluation, and judgement)
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis:  Claim 1 recites additional elements “a processor”. However, these additional features are computer components recited at a high-level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component.  An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application.  Claim 1 also recites additional insignificant extra-solution activity 
generating measured data in which the first data size is associated with a first prediction performance of the model 
which amounts to gathering and outputting data. Claim 1 also recites additional elements “a machine learning algorithm to generate a model by using training data of a first data size” which amounts to generally linking the judicial exception to a particular technology or field of use.  Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis:  Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 4 and 5, which recite a system and a computer program product, respectively, as well as to dependent claim 3. The additional limitations of the dependent claim are addressed briefly below:
Dependent claim 3 recites additional mathematical calculations “calculating a plurality of first occurrence probabilities corresponding to the plurality of second parameter values by using the plurality of second parameter values and the measured data”, “converting the plurality of first occurrence probabilities into a plurality of second occurrence probabilities corresponding to the plurality of sample point sequences by using the plurality of sample point sequences and the plurality of second parameter values” as well as additional mental processes “determining the plurality of weights from the plurality of second occurrence probabilities” which amounts to evaluation and judgement.
Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1, 3-5 are rejected under 35 U.S.C. § 101.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

	Claims 1, 3-5 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Klein (“Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets”, 2017), Golovin (“Google Vizier: A Service for Black-Box Optimization”, 2017), and Shark (“Shark Machine Learning Library Documentation”, 2016), and in further view of Domhan (“Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves”, 2015).

	Regarding claim 1, Klein teaches executing, by a processor, a machine learning algorithm to generate a model by using training data of a first data size ([Abstract] "Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural network..." "To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size")
	and generating measured data in which the first data size is associated with a first prediction performance of the model ([Abstract] "To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size" Klein explicitly teaches generating data associated with a first prediction performance further associated with the data size.)
	calculating, by the processor, based on the measured data, ([Abstract] "To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size")
	a first formula which includes a first parameter value which defines and indicates a shape of a first prediction performance curve that indicates a relationship between a data size and a prediction performance ([p. 4 §2] "The numerator represents the information gain on the target task averaged over the possible outcomes of f(x, t). If the cost c(x, t) of a configuration x on task t is not known a priori it can be modelled the same way as the objective function" [p. 4 §3] "FABOLAS models loss and computational cost across dataset size and uses these models to carry out Bayesian optimization with an extra degree of freedom. The blackbox function f : X × R → R now takes another input representing the data subset size; we will use relative sizes s ∈ [0, 1], with s = 1 representing the entire dataset. While the eventual goal is to minimize the loss f(x, s = 1) for the entire dataset" See figure 6 on p. 14, also included below, for error as a function of dataset size.  See eqn. 6 for description of kernel.  First parameter represented by C. Second parameter represented by gamma. See also section 3.1. Klein explicitly teaches using the hyperparameters as part of a configuration x (which are shown in figure 6 to indicate the shape of the prediction performance curve) to optimize the black box function as a function of dataset size and said configuration, such that the function is interpreted as synonymous with a first formula taking a first configuration as input.)
	sampling, by the processor, a second prediction performance different from a third prediction performance on the first prediction performance curve a plurality of times for each of a plurality of data sizes, ([p. 11 Section A.1] “after sampling K hyperparameter settings from the marginal loglikelihood for the GP using MCMC (line 1), for every hyperparameter setting.” [p. 13 Section B] “Scaling of Loss and Computational Cost With Dataset Size…Figure 6 shows these trends for ten random configurations, evaluated on subsets of different sizes” See figure 8 for range of noise (variance) detected at each dataset size)
	to generate a plurality of sample point sequences, each of which is a sequence of combinations of a data size and a prediction performance([p. 13 Section B] See Figure 6 and 7.  Each point is a combination of dataset size and model performance, each curve is a sequence of these points and there are a plurality of curves in each graph to compare performance.  “To show that our method, i.e. the kernel we use and our initial design, actually capture these trends, we sampled points from that data as our initial design and predicted loss and cost of unseen configurations”)
	calculating, by the processor, a plurality of second formulas which respectively include second parameter values and indicate shapes of a plurality of second prediction performance curves that represents the plurality of sample point sequences (Figure 6 shows a plurality of curves that represent prediction performance, each of which can be represented by a second parameter value gamma.  The sample point sequences are combinations of dataset size and error. [p. 13 Section B] “Figure 6 shows these trends for ten random configurations, evaluated on subsets of different sizes.” Evaluated interpreted as synonymous with calculated, by the processor.)
	variance information which indicates variation of a prediction performance of a second data size ([Klein 2.3] "(multi-task Bayesian optimization) The blackbox function f : X × R → R now takes another input representing the data subset size;" Data subset interpreted as second data. "we will use relative sizes s = N_sub/N ∈ [0, 1], with s = 1 representing the entire dataset. While the eventual goal is to minimize the loss f(x, s = 1) for the entire dataset, evaluating f for smaller s is usually cheaper...We propose a principled rule for the automatic selection of the next (x, s) pair to evaluate...Based on these observations, we expect that relatively small fractions of the dataset yield representative performances and therefore vary our relative size parameter s on a logarithmic scale." [Section C] "We repeated each run with a given subset size K = 10 times using different subsets, and estimate the observation noise variance at each point" See eqn. 9)
	comparing, by the processor, an evaluation value calculated from the variance information and another evaluation value for another machine learning algorithm ([p. 6 §3.4] "we use hyper-priors to emphasize meaningful values for the parameters, chiefly adopting the choices of the SPEARMINT toolbox [5]: a uniform prior between ... for all length scales in log space, a lognormal prior ... for the covariance amplitude, and a horseshoe prior with length scale of 0.1 for the noise variance" [p. 8 §4.3] "The results in Figure 4 show that—compared to the SVM tasks—FABOLAS’ speedup was smaller because CNNs only scale linearly in the number of datapoints." CNN and SVM interpreted as two different machine learning algorithms.)
	However, Klein does not explicitly teach a first formula which includes a first parameter value which defines and indicates a shape of a first prediction performance curve that indicates a relationship between a data size and a prediction performance 
	wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold
	a threshold that is different depending on the plurality of data sizes.
	calculating, by the processor, a plurality of second formulas which respectively include second parameter values and indicate shapes of a plurality of second prediction performance curves that represents the plurality of sample point sequences 
	and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data, 
	generating, by the processor, variance information which indicates variation of a fourth prediction performance of a second data size on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights
	and executing, by the processor, the machine learning algorithm by using training data of the second data size when the evaluation value is higher than the other evaluation value.

	Golovin, in the same field of endeavor, teaches a first formula which includes a first parameter value which defines and indicates a shape of a first prediction performance curve that indicates a relationship between a data size and a prediction performance ([p. 1492 §3.3 Col. 2] "More formally, we have a sequence of studies {S_i}_{i=1}^k on unknown objective functions {f_i}_{i=1}^k, where the current study is S_k, and we build two sequences of regressors {R_i}_{i=1}^k and (R′_i)_{i=1}^k having posterior mean functions {µ_i}_{i=1}^k and (µ′_i)_{i=1}^k respectively, and posterior standard deviation functions {σ_i}_{i=1}^k and (σ′_i)_{i=1}^k, respectively. Our final predictions will be µ_k and σ_k. Let D_i = ((x_t^i, y_t^i))_t be the dataset for study S_i. Let R′_i be a regressor trained using data ((x_t^i, y_t^i − µ_{i−1}(x_t^i)))_t which computes µ′_i and σ′_i. Then we define as our posterior means at level i as µ_i(x) := µ′_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ′_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations.")
	calculating, by the processor, a plurality of second formulas which respectively include second parameter values and indicate shapes of a plurality of second prediction performance curves that represents the plurality of sample point sequences ([p. 1492 §3.3 Col. 2] "More formally, we have a sequence of studies {S_i}_{i=1}^k on unknown objective functions {f_i}_{i=1}^k, where the current study is S_k, and we build two sequences of regressors {R_i}_{i=1}^k and (R′_i)_{i=1}^k having posterior mean functions {µ_i}_{i=1}^k and (µ′_i)_{i=1}^k respectively, and posterior standard deviation functions {σ_i}_{i=1}^k and (σ′_i)_{i=1}^k, respectively. Our final predictions will be µ_k and σ_k. Let D_i = ((x_t^i, y_t^i))_t be the dataset for study S_i. Let R′_i be a regressor trained using data ((x_t^i, y_t^i − µ_{i−1}(x_t^i)))_t which computes µ′_i and σ′_i. Then we define as our posterior means at level i as µ_i(x) := µ′_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ′_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations.")
	and determining a plurality of weights associated with the plurality of second prediction performance curves by using the plurality of second parameter values and the measured data, ([p. 1492 §3.3 Col. 2] "Then we define as our posterior means at level i as µ_i(x) := µ′_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ′_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations")
	generating, by the processor, variance information which indicates variation of a fourth prediction performance of a second data size on the first prediction performance curve by using the plurality of second prediction performance curves and the plurality of weights ([p. 1492 §3.3 Col. 2] "Then we define as our posterior means at level i as µ_i(x) := µ′_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ′_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations" Standard deviations interpreted as a form of variance information.)
	and executing, by the processor, the machine learning algorithm by using training data of the second data size when the evaluation value is higher than the other evaluation value ([p. 1491 §3.2.2] "Median Stopping Rule. The median stopping rule stops a pending trial x_t at step s if the trial’s best objective value by step s is strictly worse than the median value of the running averages ô^τ_{1:s} of all completed trials’ objectives x_τ reported up to step s." Golovin explicitly teaches stopping the execution of a machine learning algorithm using training data of a second size based on the variance information satisfying a specific condition.  It would therefore be implicit that, when the stopping criterion is not satisfied, the algorithm would continue to be executed.  Golovin teaches the stopping criterion being the median value being higher than the objective value.).
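	For illustration only, the median stopping rule quoted above may be sketched as follows. This is a minimal Python sketch and not code taken from Golovin; the function name, the argument names, and the assumption that a higher objective value is better are all hypothetical choices made for the example.

```python
from statistics import median


def median_rule_should_stop(pending, completed, step):
    """Return True if the pending trial should be stopped at `step`.

    pending   -- objective values reported so far by the pending trial
    completed -- list of objective-value histories of completed trials
    step      -- number of reported steps to consider (1-based count)

    Assumption (not stated in the quoted passage): higher is better.
    """
    # Best objective value the pending trial has reached by `step`.
    best_so_far = max(pending[:step])

    # Running average over the first `step` values of each completed trial.
    running_averages = [
        sum(history[:step]) / step
        for history in completed
        if len(history) >= step
    ]
    if not running_averages:
        # Nothing to compare against, so the trial keeps running.
        return False

    # Stop only if strictly worse than the median running average.
    return best_so_far < median(running_averages)
```

	Under these assumptions, a trial that is not stopped simply continues to execute, which corresponds to the implicit continuation noted above.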

		Klein and Golovin are both directed towards black box methods of optimizing neural network training.  Therefore, Klein and Golovin are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Klein with the teachings of Golovin by performing regression over the curves shown in (for example) figures 6 or 7 in Klein and using the regression coefficients to further optimize the network training. It would be obvious to one of ordinary skill in the art that regression could be performed in this type of system for analysis, and since the system itself is seen as an analytical system for automated performance determination, it would be obvious to use regression which is well understood, routine, and conventional in the art.  This is reinforced by Golovin which teaches a related system which relies heavily on regression to optimize the training.  Golovin teaches as an additional motivation for combination ([p. 1494 §5.1] “Vizier is used across Google to optimize hyperparameters of machine learning models, both for research and production models. Our implementation scales to service the entire hyperparameter tuning workload across Alphabet, which is extensive. As one (admittedly extreme) example, Collins et al. [6] used Vizier to perform hyperparameter tuning studies that collectively contained millions of trials for a research project investigating the capacity of different recurrent neural network architectures.”).  While Golovin doesn’t explicitly teach a fourth prediction performance curve, Golovin does teach a stack, or plurality, of prediction performance curves, and the fourth prediction performance is reinforced with the combination of Klein who explicitly teaches ten prediction performance curves. 
	While the combination of Klein and Golovin explicitly teaches a stopping criterion (Golovin [p. 1491 §3.2.2] “Median Stopping Rule. The median stopping rule stops a pending trial x_t at step s if the trial’s best objective value by step s is strictly worse than the median value of the running averages ô^τ_{1:s} of all completed trials’ objectives x_τ reported up to step s.”), the combination of Klein and Golovin does not explicitly teach wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold
	a threshold that is different depending on the plurality of data sizes.

	Shark, in the same field of endeavor, teaches wherein a difference between the sampled second prediction performance and the third prediction performance is less than a threshold ([p. 4] "Next we employ a stopping criterion that monitors progress on the training error E. The stopping criterion TrainingError takes in its constructor a window size (or number of time steps) T together with a threshold value ϵ. If the improvement over the last T timesteps does not exceed ϵ, that is, E(t−T)−E(t)<ϵ, the stopping criterion becomes active and tells the optimizer to stop" The sampled second and third prediction performances are interpreted as the values at timesteps t and t−T, respectively.).
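	For illustration only, the condition E(t−T)−E(t)<ϵ quoted above may be sketched as follows. This is a minimal Python sketch of the inequality and is not the Shark library's C++ interface; the function and argument names are hypothetical.

```python
def training_error_should_stop(errors, window, epsilon):
    """Return True when the improvement over the last `window` steps is below `epsilon`.

    errors  -- training error recorded at each timestep, oldest first
    window  -- window size T from the quoted passage
    epsilon -- threshold value from the quoted passage
    """
    if len(errors) <= window:
        # Fewer than T+1 observations: E(t-T) is not available yet.
        return False

    error_now = errors[-1]             # E(t)
    error_before = errors[-1 - window]  # E(t-T)

    # The criterion becomes active when E(t-T) - E(t) < epsilon.
    return error_before - error_now < epsilon
```

	In this sketch the two compared values play the role of the two prediction performances whose difference is tested against the threshold.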

		Klein, Golovin, and Shark are all directed towards neural network training systems.  Therefore, Klein, Golovin, and Shark are all related art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Klein and Golovin with the teachings of Shark by substituting the stopping criterion in Golovin with a threshold timestep difference as a stopping criterion as recommended in Shark. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Shark that in optimizing a neural network it’s common to use the loss function delta value as stopping criteria in neural network training.
	However, the combination of Klein, Golovin, and Shark does not explicitly teach a threshold that is different depending on the plurality of data sizes.

	Domhan, in the same field of endeavor, teaches a threshold that is different depending on the plurality of data sizes ([p. 3461 §2.2] "The term learning curve appears in the literature for describing two different phenomena...(2) the performance of a machine learning algorithm as a function of the size of the dataset it has available for training...we describe related work on modelling both types of learning curves" [p. 3463] "We then consider the predicted probability P(...) that the network, after training for m intervals, will exceed the performance ŷ. If this probability is above a threshold δ then training continues as usual for the next p epochs. Otherwise, training is terminated and we return the expected validation error" See also Eqns. 9-11. Domhan explicitly teaches that the threshold is dependent on the probability which is a function of y.  It would have been obvious to one of ordinary skill in the art, based on the two known types of performance modeling in Domhan, that y would represent the data set size.  Therefore, the threshold is dependent on the dataset size.).
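	For illustration only, the probability test quoted above may be sketched as follows. This is a minimal Python sketch, not code from Domhan; it assumes that the predicted probability is approximated by the fraction of sampled predictions (for example, draws from a posterior) that exceed ŷ, and the function and argument names are hypothetical.

```python
def should_continue_training(predicted_samples, y_hat, delta):
    """Decide whether training continues under a probability threshold.

    predicted_samples -- sampled predictions of the final performance
                         (assumed here to come from a probabilistic model)
    y_hat             -- best performance observed so far
    delta             -- probability threshold

    Returns (continue_flag, probability).
    """
    if not predicted_samples:
        raise ValueError("at least one predicted sample is required")

    # Empirical estimate of the probability that the run exceeds y_hat.
    exceed_count = sum(1 for y in predicted_samples if y > y_hat)
    probability = exceed_count / len(predicted_samples)

    # Training continues only if the probability is above the threshold.
    return probability > delta, probability
```

	Because the estimated probability is computed from the predicted curve values, any quantity on which those predictions depend also influences whether the threshold test passes.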

	The combination of Klein, Golovin, and Shark as well as Domhan are directed towards black box optimization of neural network training.  Therefore, the combination of Klein, Golovin, and Shark as well as Domhan are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the neural network optimization methods of Klein, Shark, and Golovin with that of Domhan. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that all four methods optimize the neural network through a parametrized loss function.  Klein explicitly teaches that the loss function is dependent on data subset sizes, and Golovin further teaches that the threshold used as a stopping criteria for training may be dependent on the data subset size.  Domhan further explains the motivation for using a probabilistic model for stopping criterion ([p. 3462 Col. 1] “Given this model, a simple approach would be to find a maximum likelihood estimate for all parameters. However, this would not properly model the uncertainty in the model parameters. Since our predictive termination criterion aims at only terminating runs that are highly unlikely to improve on the best run observed so far we need to model uncertainty as truthfully as possible and will hence adopt a Bayesian perspective, predicting values ym using Markov Chain Monte Carlo (MCMC) inference.”).  This motivation for combination also applies to the remaining claims which depend on this combination.

	Regarding claim 3, the combination of Klein, Golovin, Shark, and Domhan teaches the estimation method according to claim 1, wherein the determining of a plurality of weights includes calculating a plurality of first occurrence probabilities corresponding to the plurality of second parameter values by using the plurality of second parameter values and the measured data (Golovin [p. 1492 §3.3 Col. 2] "More formally, we have a sequence of studies {S_i}_{i=1}^k on unknown objective functions {f_i}_{i=1}^k, where the current study is S_k, and we build two sequences of regressors {R_i}_{i=1}^k and (R'_i)_{i=1}^k having posterior mean functions {µ_i}_{i=1}^k and (µ'_i)_{i=1}^k respectively, and posterior standard deviation functions {σ_i}_{i=1}^k and (σ'_i)_{i=1}^k, respectively. Our final predictions will be µ_k and σ_k. Let D_i = ((x_t^i, y_t^i))_t be the dataset for study S_i. Let R'_i be a regressor trained using data ((x_t^i, y_t^i − µ_{i−1}(x_t^i)))_t which computes µ'_i and σ'_i. Then we define as our posterior means at level i as µ_i(x) := µ'_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ'_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations." Dataset D_i interpreted as synonymous with measured data.  Posterior mean interpreted as synonymous with occurrence probability corresponding to the parameter values by using the measured data.);
	converting the plurality of first occurrence probabilities into a plurality of second occurrence probabilities corresponding to the plurality of sample point sequences by using the plurality of sample point sequences and the plurality of second parameter values (Golovin [p. 1492 §3.3 Col. 2] "More formally, we have a sequence of studies {S_i}_{i=1}^k on unknown objective functions {f_i}_{i=1}^k, where the current study is S_k, and we build two sequences of regressors {R_i}_{i=1}^k and (R'_i)_{i=1}^k having posterior mean functions {µ_i}_{i=1}^k and (µ'_i)_{i=1}^k respectively, and posterior standard deviation functions {σ_i}_{i=1}^k and (σ'_i)_{i=1}^k, respectively. Our final predictions will be µ_k and σ_k. Let D_i = ((x_t^i, y_t^i))_t be the dataset for study S_i. Let R'_i be a regressor trained using data ((x_t^i, y_t^i − µ_{i−1}(x_t^i)))_t which computes µ'_i and σ'_i. Then we define as our posterior means at level i as µ_i(x) := µ'_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ'_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations...This approach has nice properties when the prior regressors are densely supported (i.e. has many well-spaced data points)" Mean at level i interpreted as synonymous with second occurrence probability, which depends on the mean at level i−1.); and
	determining the plurality of weights from the plurality of second occurrence probabilities (Golovin [p. 1492 §3.3 Col. 2] "More formally, we have a sequence of studies {S_i}_{i=1}^k on unknown objective functions {f_i}_{i=1}^k, where the current study is S_k, and we build two sequences of regressors {R_i}_{i=1}^k and (R'_i)_{i=1}^k having posterior mean functions {µ_i}_{i=1}^k and (µ'_i)_{i=1}^k respectively, and posterior standard deviation functions {σ_i}_{i=1}^k and (σ'_i)_{i=1}^k, respectively. Our final predictions will be µ_k and σ_k. Let D_i = ((x_t^i, y_t^i))_t be the dataset for study S_i. Let R'_i be a regressor trained using data ((x_t^i, y_t^i − µ_{i−1}(x_t^i)))_t which computes µ'_i and σ'_i. Then we define as our posterior means at level i as µ_i(x) := µ'_i(x) + µ_{i−1}(x). We take our posterior standard deviations at level i, σ_i(x), to be a weighted geometric mean of σ'_i(x) and σ_{i−1}(x), where the weights are a function of the amount of data (i.e., completed trials) in S_i and S_{i−1}. The exact weighting function depends on a constant α ≈ 1 sets the relative importance of old and new standard deviations." Golovin explicitly teaches determining the weights based on the posterior mean at level i, which is interpreted as synonymous with the second occurrence probability.).
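	For illustration only, the stacking of regressors described in the Golovin passage cited above can be sketched as follows. This is a minimal sketch under stated assumptions: the quoted passage says only that the weights are a function of the amount of data in the two studies and of a constant α ≈ 1, so the specific weight beta used below is an assumed form, and the names stacked_mean, stacked_std, n_current, and n_prev are hypothetical.

```python
import math


def stacked_mean(mu_residual: float, mu_prev: float) -> float:
    """Posterior mean at level i: residual-model mean plus the mean at level i-1."""
    return mu_residual + mu_prev


def stacked_std(sigma_residual: float, sigma_prev: float,
                n_current: int, n_prev: int, alpha: float = 1.0) -> float:
    """Weighted geometric mean of the new and previous standard deviations.

    Assumption (not stated in the quoted passage): the weight on the new
    standard deviation is beta = alpha * n_current / (alpha * n_current + n_prev),
    where n_current and n_prev count completed trials in the two studies.
    """
    total = alpha * n_current + n_prev
    if total <= 0:
        raise ValueError("at least one completed trial is required")
    beta = alpha * n_current / total
    return math.pow(sigma_residual, beta) * math.pow(sigma_prev, 1.0 - beta)
```

	With equal trial counts and alpha = 1, the sketch gives the plain geometric mean of the two standard deviations, for example stacked_std(4.0, 1.0, 10, 10) evaluates to 2.0.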
	
Regarding claims 4 and 5, claims 4 and 5 are directed towards an apparatus and a non-transitory machine readable medium, respectively, for performing the method of claim 1.  Therefore, the rejections applied to claim 1 also apply to claims 4 and 5.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/ Examiner, Art Unit 2124
/MIRANDA M HUANG/ Supervisory Patent Examiner, Art Unit 2124