DETAILED ACTION
1.		The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Application
2.		Claims 1-20 have been examined in this application.  
		This communication is the first action on the merits.

Priority
3.		The Examiner notes Applicant's claim of priority to Provisional Application 62/852,029, filed on 05/23/2019.

IDS Statement
4.		The information disclosure statement filed on 04/07/2021 complies with the provisions of 37 CFR 1.97, 1.98 and MPEP § 609 and is considered by the Examiner.

Claim Objections
5.		Claim 1 is objected to because of the following informalities:  
	(A).	Claim 1 recites the limitation: “[[if]] the cost function [[is]] successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output.” Examiner notes that the broadest reasonable interpretation of a method (or process) claim having contingent limitations requires only those steps that must be performed and does not include steps that are not required to be performed because the condition(s) precedent are not met. For example, assume a method claim requires step A if a first condition happens and step B if a second condition happens. If the claimed invention may be practiced without either the first or second condition happening, then neither step A or B is required by the broadest reasonable interpretation of the claim. If the claimed invention requires the first condition to occur, then the broadest reasonable interpretation of the claim requires step A. If the claimed invention requires both the first and second conditions to occur, then the broadest reasonable interpretation of the claim requires both steps A and B.
The broadest reasonable interpretation of a system (or apparatus or product) claim having structure that performs a function, which only needs to occur if a condition precedent is met, requires structure for performing the function should the condition occur. The system claim interpretation differs from a method claim interpretation because the claimed structure must be present in the system regardless of whether the condition is met and the function is actually performed.
See Ex parte Schulhauser, Appeal 2013-007847 (PTAB April 28, 2016) for an analysis of contingent claim limitations in the context of both method claims and system claims. In Schulhauser, both method claims and system claims recited the same contingent step. When analyzing the claimed method as a whole, the PTAB determined that giving the claim its broadest reasonable interpretation, "[i]f the condition for performing a contingent step is not satisfied, the performance recited by the step need not be carried out in order for the claimed method to be performed" (quotation omitted). Schulhauser at 10. When analyzing the claimed system as a whole, the PTAB determined that "[t]he broadest reasonable interpretation of a system claim having structure that performs a function, which only needs to occur if a condition precedent is met, still requires structure for performing the function should the condition occur." Schulhauser at 14. 
Examiner suggests that Applicant amend the limitation in Independent Claim 1 to read as follows: “in response to the cost function being successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output”.
		Appropriate correction is required.

Claim Rejections - 35 USC § 101
6.		In accordance with the 2019 Revised Patent Subject Matter Eligibility Guidance (2019 PEG), issued on January 7, 2019, as well as the October 2019 Update: Subject Matter Eligibility, Examiner provides the 35 U.S.C. 101 analysis for Claims 1-20 shown below.

7.		35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

8.		Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
Step 1: Claims 1-20 are each directed to a statutory category, namely a “method” or a “process” (Claims 1-14), a “system” or an “apparatus” (Claims 15-19) and a “non-transitory computer readable medium” or an “article of manufacture” (Claim 20).
Step 2A Prong One: Independent Claims 1, 15 and 20 recite limitations that set forth the abstract idea(s), namely (the additional elements, which are addressed under Step 2A Prong Two, are omitted from the quotations and marked “[...]”):
		“for each of one or more learning iterations” (see Independent Claim 1);
		“determining, [...] a plurality of feature vectors respectively associated with [...] wherein the feature vector for [...] describes a respective performance of [...] on each of a plurality of loss function components” (see Independent Claim 1);
		“determining, [...] a plurality of validation errors respectively for [...] wherein the validation error for [...] describes a performance of [...] relative to a validation metric” (see Independent Claim 1);
		“for each of one or more optimization iterations respectively associated with one or more of [...]” (see Independent Claim 1);
		“attempting to optimize, [...] a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, [...] of an absolute or squared error between a respective loss function for [...] and the validation error for [...] wherein the respective loss function [...] comprises the feature vector for [...] respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss [...] returns a [...] associated with a current optimization iteration” (see Independent Claim 1);
		“if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output” (see Independent Claim 1);
		“[...]” (see Independent Claim 15);
		“[...]” (see Independent Claim 15);
		“determining, [...] a plurality of feature vectors respectively associated with [...] wherein the feature vector for [...] describes a respective performance of [...] on each of a plurality of loss function components” (see Independent Claims 15 and 20);
		“determining, [...] a plurality of validation errors respectively for [...] wherein the validation error for [...] describes a performance of [...] relative to a validation metric” (see Independent Claims 15 and 20);
		“for each of one or more optimization iterations respectively associated with one or more of [...]” (see Independent Claims 15 and 20);
		“attempting to optimize, [...] a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, [...] of an absolute or squared error between a respective loss function for [...] and the validation error for [...] wherein the respective loss function for [...] comprises the feature vector for [...] respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for [...] returns a [...] associated with a current optimization iteration” (see Independent Claims 15 and 20);
		“if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output” (see Independent Claims 15 and 20).
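For illustration only, the optimization limitations quoted above can be summarized as a constrained minimization. The notation is an assumption introduced here and does not appear in the claims: $f_i$ denotes the feature vector for model $i$, $e_i$ its validation error, $\lambda$ the vector of variable hyperparameter values, $\theta$ a set of model parameters, $f(\theta)$ the feature vector of the model with parameters $\theta$, and $\theta_t$ the current model at optimization iteration $t$. The exponent $p$ is 1 for the absolute error and 2 for the squared error. The form of the constraint is one possible reading of the claim language, not a definitive construction.

```latex
\min_{\lambda} \; \sum_{i} \left| \lambda^{\top} f_i - e_i \right|^{p},
\quad p \in \{1, 2\}
\qquad \text{subject to} \qquad
\theta_t = \arg\min_{\theta} \; \lambda^{\top} f(\theta)
```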
		These abstract idea limitations (as identified above), under the broadest reasonable interpretation of the claims as a whole, fall within the “Mathematical Concepts” grouping, which pertains to (1) mathematical calculations and/or mathematical relationships and/or mathematical formulas or equations.
		Additionally and/or alternatively, these abstract idea limitations (as identified above), under the broadest reasonable interpretation of the claims as a whole, fall within the “Mental Processes” grouping, which pertains to (2) concepts performed in the human mind (including observation(s) and/or evaluation(s) and/or judgment(s)) and/or (3) concepts performed with physical aids such as pen and paper.
MPEP § 2106.04(a)(2) provides further explanation of the abstract idea groupings. It should be noted that these groupings are not mutually exclusive, i.e., some claims recite limitations that fall within more than one grouping or sub-grouping. Accordingly, examiners should identify at least one abstract idea grouping, but preferably identify all groupings to the extent possible, if a claim limitation(s) is determined to fall within multiple groupings and proceed with the analysis in Step 2A Prong Two.
		That is, other than reciting additional elements (e.g., “computing system”, “machine-learned model(s)”, “one or more processors”, “one or more non-transitory computer-readable media” & “one or more computing devices”), nothing in the claim elements precludes the steps from being performed as “Mathematical Concepts”, which pertain to (1) mathematical calculations and/or mathematical relationships and/or mathematical formulas or equations, and/or as “Mental Processes”, which pertain to (2) concepts performed in the human mind (including observation(s) and/or evaluation(s) and/or judgment(s)) and/or (3) concepts performed with physical aids such as pen and paper.
Moreover, the mere recitation of computer components (e.g., “computing system”, “one or more processors”, “one or more non-transitory computer-readable media” & “one or more computing devices”) does not take the claims out of the “Mathematical Concepts” and/or “Mental Processes” groupings.
Independent Claims 1, 15 and 20: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown above in Step 2A Prong One to the “Mathematical Concepts” and/or “Mental Processes” groupings concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations which are executed using a computer.
Dependent Claims 2-14 and 16-19:
The additional limitations of the claimed invention merely narrow the previously recited abstract idea limitations and are further directed to additional abstract ideas within the “Mathematical Concepts” and/or “Mental Processes” groupings as described for Claims 1, 15 and 20.
Dependent Claims 2-4, 6, 11-14 and 16-18: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown above in Step 2A Prong One to the “Mathematical Concepts” and/or “Mental Processes” groupings concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations which are executed using a computer.
Additional details are examined more granularly below. For now, Examiner submits that a preponderance of the evidence shows that the claims recite, describe or at least set forth the abstract idea.
[Step 2A Prong 1 = Yes, Claims 1-20 recite an abstract idea. Therefore, Examiner proceeds onto Step 2A Prong 2 of the 35 U.S.C. 101 analysis.]
Step 2A Prong Two: The claims recite the following combination of additional elements (shown in bold):
		“for each of one or more learning iterations” (see Independent Claim 1);
		“determining, by one or more computing devices, a plurality of feature vectors respectively associated with a plurality of machine-learned models, wherein the feature vector for each machine-learned model describes a respective performance of the machine-learned model on each of a plurality of loss function components” (see Independent Claim 1);
		“determining, by one or more computing devices, a plurality of validation errors respectively for the plurality of machine-learned models, wherein the validation error for each machine-learned model describes a performance of the machine-learned model relative to a validation metric” (see Independent Claim 1);
		“for each of one or more optimization iterations respectively associated with one or more of the machine-learned models” (see Independent Claim 1);
		“attempting to optimize, by the one or more computing devices, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration” (see Independent Claim 1);
		“if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output” (see Independent Claim 1);
		“one or more processors” (see Independent Claim 15);
		“one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising, for each of one or more learning iterations” (see Independent Claim 15);
		“determining, by the computing system, a plurality of feature vectors respectively associated with a plurality of machine-learned models, wherein the feature vector for each machine-learned model describes a respective performance of the machine-learned model on each of a plurality of loss function components” (see Independent Claims 15 and 20);
		“determining, by the computing system, a plurality of validation errors respectively for the plurality of machine-learned models, wherein the validation error for each machine-learned model describes a performance of the machine-learned model relative to a validation metric” (see Independent Claims 15 and 20);
		“for each of one or more optimization iterations respectively associated with one or more of the machine-learned models” (see Independent Claims 15 and 20);
		“attempting to optimize, by the computing system, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration” (see Independent Claims 15 and 20);
		“if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output” (see Independent Claims 15 and 20).
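For illustration only, the sum recited in the “attempting to optimize” limitation can be sketched in a few lines of Python. The sketch assumes that each model's loss is the dot product of its feature vector with the vector of hyperparameter values, as the quoted claim language describes. The function name `cost`, the argument names and the numeric values are hypothetical and are not taken from the application or from the cited references. The constraint recited in the claim is not modeled.

```python
from typing import Sequence


def cost(
    features: Sequence[Sequence[float]],
    validation_errors: Sequence[float],
    hyperparams: Sequence[float],
    squared: bool = True,
) -> float:
    """Sum, over all models, of the squared (or absolute) error between
    each model's loss (feature vector dotted with the hyperparameter
    vector) and that model's validation error."""
    total = 0.0
    for feature_vector, validation_error in zip(features, validation_errors):
        # Loss for this model: feature vector multiplied by the hyperparameters.
        loss = sum(f * h for f, h in zip(feature_vector, hyperparams))
        residual = loss - validation_error
        total += residual * residual if squared else abs(residual)
    return total


# Hypothetical values: two models, two loss function components each.
features = [[1.0, 2.0], [3.0, 1.0]]
validation_errors = [2.0, 2.5]
candidate = [0.5, 0.5]
print(cost(features, validation_errors, candidate))                 # squared error
print(cost(features, validation_errors, candidate, squared=False))  # absolute error
```

With these hypothetical values each model's loss differs from its validation error by 0.5, so the squared-error sum is 0.5 and the absolute-error sum is 1.0.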
Independent Claims 1, 15 and 20 recite additional elements that as a whole do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The additional elements (e.g., “one or more processors”, “one or more computing devices”, “computing system”, “one or more non-transitory computer-readable media” & “machine-learned model(s)”), in conjunction with the limitations, are no more than mere instructions to “apply” the exception using computer components (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
Independent Claims 1, 15 and 20: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown above in Step 2A Prong One concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations, via mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
		Dependent Claims 2-14 and 16-19 recite additional elements (e.g., “one or more computing devices”, “machine-learned model(s)”, “computing system”, “one or more processors”, “one or more non-transitory computer-readable media”) that, in conjunction with the limitations, are no more than mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
		Dependent Claims 2-4, 6, 11-14 and 16-18: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown above in Step 2A Prong One concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations, via mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
The underlying abstract considerations would remain the same regardless of the field of use or technological environment to which they are applied. Merely narrowing the abstract idea to such a computerized field of use or technological environment does not integrate the abstract idea into a practical application, regardless of the resource manipulation in the respective field.
Therefore, these claims as a whole do not amount to a “practical application” of the abstract idea because they neither (1) recite any improvements to another technology or technical field; (2) recite any improvements to the functioning of the computer itself; (3) apply the judicial exception with, or by use of, a particular machine; (4) effect a transformation or reduction of a particular article to a different state or thing; nor (5) provide other meaningful limitations beyond generally linking the use of the judicial exception to a particular technological environment.
[Step 2A Prong 2 = Yes, Claims 1-20 are directed to the abstract idea and do not recite additional elements that integrate the abstract idea into a practical application. Therefore, Examiner proceeds onto Step 2B of the 35 U.S.C. 101 analysis.]
		Step 2B: Claims 1-20 and their underlying limitations, steps, features and terms are further inspected by the Examiner under the current examining guidelines and found, both individually and as a whole, not to include additional elements that are sufficient to amount to significantly more than the judicial exception, because the additional elements or combination of elements in the claims amount to no more than a recitation of ubiquitous structure recited at a high level, such as: [(1) “one or more processors”, shown in Applicant’s Specification ¶ [0084-0085], denoting that “The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.” Also that “The loss learner 164 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the loss learner 164 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.” (2) “one or more computing devices”, shown in Applicant’s Specification ¶ [0069-0070]: “The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.” (3) “computing system”, shown in Applicant’s Specification ¶ [0075]: “The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.” (4) “one or more non-transitory computer-readable media”, shown in Applicant’s Specification ¶ [0079]: “The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 153 to cause the training computing system 150 to perform operations.”]. The “machine learning model(s)” aspects are utilized as mere instructions to apply the previously recited judicial exceptions by requiring the use of software to tailor information and providing the results to the user on a computer. These additional elements, in conjunction with the limitations, recite mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)).
These limitations are of the kind identified in MPEP § 2106.05(I)(A) as not enough to qualify as significantly more when recited in a claim with an abstract idea, including adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, and generally linking the use of the judicial exception to a particular technological environment or field of use.
		Independent Claims 1, 15 and 20 recite additional elements that are merely directed to the particulars of the abstract idea and likewise do not add significantly more to the above-identified judicial exceptions. The additional elements (e.g., “one or more processors”, “one or more computing devices”, “computing system”, “one or more non-transitory computer-readable media” & “machine-learned model(s)”), in conjunction with the limitations, are no more than mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)).
		Independent Claims 1, 15 and 20: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown in Step 2A Prong One concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations, via mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
		Dependent Claims 2-14 and 16-19 recite additional elements that are merely directed to the particulars of the abstract idea and likewise do not add significantly more to the above-identified judicial exceptions. The additional elements (e.g., “one or more computing devices”, “machine-learned model(s)”, “computing system”, “one or more processors”, “one or more non-transitory computer-readable media”), in conjunction with the limitations, are no more than mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
Dependent Claims 2-4, 6, 11-14 and 16-18: The additional element(s) concerning “machine-learned model(s) instances” merely narrow the abstract ideas shown above in Step 2A Prong One concerning efficiently learning loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations, via mere instructions to implement an abstract idea on a computer or using a computer as a tool to “apply” the recited judicial exceptions (see MPEP § 2106.05(f)). Additionally and/or alternatively, the claims as a whole are limited to a particular technological environment or field of use, namely learning effective loss functions and collectively storing a vector of variable hyperparameters generated by performance of operations in a machine-learned model environment (see MPEP § 2106.05(h)).
Moreover, these claims do not amount to “significantly more” than the abstract idea because they neither (1) recite any improvements to another technology or technical field; (2) recite any improvements to the functioning of the computer itself; (3) apply the judicial exception with, or by use of, a particular machine; (4) effect a transformation or reduction of a particular article to a different state or thing; (5) add a specific limitation other than what is well-understood, routine and conventional in the field; (6) add unconventional steps that confine the claim to a particular useful application; nor (7) provide other meaningful limitations beyond generally linking the use of the judicial exception to a particular technological environment.
Secondly, see the following activities and court decisions, in light of MPEP § 2106.05(d)(II), citing among others:
- Determining an estimated outcome: OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93.
- Receiving or transmitting data over a network, e.g., using the Internet to gather data: Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362; OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014).
- Performing repetitive calculations: Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values); Bancorp Services v. Sun Life, 687 F.3d 1266, 1278, 103 USPQ2d 1425, 1433 (Fed. Cir. 2012).
- Gathering statistics: OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93.
- Electronic recordkeeping: Alice Corp., 134 S. Ct. at 2359, 110 USPQ2d at 1984 (creating and maintaining “shadow accounts”); Ultramercial, 772 F.3d at 716, 112 USPQ2d at 1755 (updating an activity log).
Based on all of the above, Examiner finds that, when viewed either individually or in combination, these additional claim element(s) do not provide meaningful limitation(s) that rise to the level required to transform the abstract idea(s) into a patent-eligible application of the abstract idea(s) such that the claim(s) amount to significantly more than the abstract idea(s) itself. Accordingly, Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to a judicial exception (i.e., the abstract idea exception) without significantly more.
[Step 2B = No, Claims 1-20 do not provide an inventive concept amounting to significantly more than the abstract idea.]
The claims are ineligible.

Claim Rejections - 35 USC § 103
9.		The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

10.		Claims 1-7, 11 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Application Publication US 2018/0240041 A1 to Koch in view of US Patent Application Publication US 2017/0024642 A1 to Xiong.
		Regarding Independent Claims 1 and 20, Koch teaches a method / non-transitory computer-readable medium to learn an improved loss function, including the following:
- for each of one or more learning iterations (see at least Koch: Fig. 11 & ¶ [0130-0131]. Koch notes that the one or more tuning search methods may be indicated to run simultaneously and/or successively. When executed successively, objective function values from one or more previous iterations are used to determine a next iteration of a set of hyperparameter configurations to be evaluated. See also ¶ [0213] of Koch.)
	- determining, by the one or more computing devices (see at least Koch: Fig. 1 & ¶ [0032].) / by the computing system (see at least Koch: Fig. 1 & Fig. 18.), a plurality of feature vectors respectively associated with a plurality of machine-learned models (see at least Koch: ¶ [0062] & Figs. 10-13. Koch teaches that the plurality of rows may be referred to as observation vectors or records (observations), and the columns may be referred to as variables. The input dataset may be transposed. The input dataset may include supervised and/or unsupervised data. The plurality of variables may define multiple dimensions for each observation vector. An observation vector xi may include a value for each of the plurality of variables associated with the observation i. ¶ [0199]: “For illustration, if fraction value is 0.3 or 30%, 30% of the observation vectors in the portion of the input dataset at each session worker device 420 of the session is extracted to create validation dataset subset 436 and the remaining 70% of the observation vectors in the portion of the input dataset at each session worker device 420 of the session forms training dataset subset 434.” See also Table 1 shown on ¶ [0097].), wherein the feature vector for each machine-learned model describes a respective performance of the machine-learned model on each of a plurality of loss function components (see at least Koch: ¶ [0093-0094] & ¶ [0100]. Koch teaches that the eleventh indicator indicates a name of an objective function. The objective function specifies a measure of model error (performance) to be used to identify a best configuration of the hyperparameters among those evaluated. The eleventh indicator may be received by training application 122 after selection from a user interface window or after entry by a user into a user interface window. 
The neural network model type hyperparameters further may include an L1 norm regularization parameter (regL1) that is greater than or equal to zero with a default value of zero. The neural network model type hyperparameters further may include an L2 norm regularization parameter (regL2) that is greater than or equal to zero with a default value of zero. A value for each of these hyperparameters is defined in each hyperparameter configuration for the neural network model type. The methodology by which the values are determined is based on the tuning search method discussed further below and the values, if any, indicated in operation 516. See also Fig. 14)
	- determining, by one or more computing devices (see at least Koch: Fig. 1 & ¶ [0032].) / by the computing system (see at least Koch: Fig. 1 & Fig. 18.), a plurality of validation errors respectively for the plurality of machine-learned models (see at least Koch: ¶ [0076] & ¶ [0079]. Koch teaches that a tuneDecisionTree action selects different hyperparameter configurations to run a dtreeTrain action, optionally a dtreePrune action, and a dtreeScore action (an assess action may be run after each dtreeScore action) multiple times to train and validate a decision tree model as it searches for a model that has reduced validation error. Also a tuneForest action selects different hyperparameter configurations to run a forestTrain action and a forestScore action multiple times to train and validate the forest model as it searches for a model that has reduced validation error. See also Figs. 16-17 of Koch.), wherein the validation error for each machine-learned model describes a performance of the machine-learned model relative to a validation metric (see at least Koch: ¶ [0233] & ¶ [0239]. Koch teaches that some of the challenges of hyperparameter tuning discussed earlier can be seen referring to FIG. 10, which shows the error for the hyperparameter configurations evaluated in the first iteration of tuning that used LHS to obtain an initial sample of the space. The majority of the evaluated hyperparameter configurations produced a validation error larger than that of the default configuration and shown as default value 1000, which is 2.57%. Numerous different hyperparameter configurations produced very similar error rates.  A fourth curve 1606 shows results for each dataset using the forest model type. A fifth curve 1608 shows results for each dataset using the gradient boosting model type. For dataset A, the neural network and the support vector machine model types provided the worst results. The other four model types produced very similar errors of around 10%. 
See Figs. 16-17 of Koch.)
	- for each of one or more optimization iterations respectively associated with one or more of the machine-learned models (see at least Koch: ¶ [0077] & ¶ [0138]. Koch notes that the Forest model type creates a decision tree recursively by choosing an input variable and using it to create a rule to split the data into two or more subsets. The process is repeated in each subset, and again in each new subset, and so on until a constraint is met.  A GSS search method can provide a measure of local optimality that is very useful in performing multimodal optimization. The GSS search method may add additional “growth steps” to the GA search method whenever the hyperparameter is a continuous variable. These additional growth steps may be performed each iteration to permit selected hyperparameter configurations of the population (based on diversity and fitness) to benefit from local optimization over the continuous variables.)
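For illustration only, and not as a characterization of Koch's actual implementation, the holdout split quoted from Koch ¶ [0199] (a fraction value of 0.3 giving a 30% validation subset and a 70% training subset) and the resulting validation error can be sketched as follows. The function names `holdout_split` and `validation_error`, the fixed random seed, and the use of a misclassification rate as the error measure are all assumptions made for this sketch.

```python
import numpy as np

def holdout_split(n_obs, fraction, seed=0):
    """Return (train_idx, val_idx) for a random holdout split.

    `fraction` is the share of observation vectors placed in the
    validation subset (e.g. 0.3 for a 30% / 70% split).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_obs)          # shuffle observation indices
    n_val = int(round(fraction * n_obs))    # size of the validation subset
    return order[n_val:], order[:n_val]

def validation_error(y_true, y_pred):
    """Misclassification rate on the validation subset (assumed metric)."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

# Hypothetical example: 10 observation vectors, fraction value 0.3.
train_idx, val_idx = holdout_split(10, 0.3)
err = validation_error([0, 1, 1, 0], [0, 1, 0, 0])
```

Each hyperparameter configuration evaluated during tuning would be scored with such a validation error, which is the quantity the tuning search attempts to reduce.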
		The Koch method / non-transitory computer-readable medium to learn improved loss functions does not explicitly teach or suggest the following:
	- attempting to optimize, by the one or more computing devices / by the computing system, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration; 
	- if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output.
		However, the Xiong method / non-transitory computer-readable medium, in the analogous art of learning improved loss functions, teaches the following:
- attempting to optimize (see at least Xiong: ¶ [0026]. Xiong teaches that during the training stage, the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data.), by the one or more computing devices (see at least Xiong: ¶ [0015]. Xiong teaches one or more computing devices.) / by the computing system (see at least Xiong: Fig. 1.), a cost function to learn a vector of variable hyperparameter values subject to a constraint (see at least Xiong: ¶ [abstract] & ¶ [0002]. Xiong notes that each unit has a weight vector that is determined during learning, which can be referred to as the training stage. In the training stage, the feature vectors are first initialized by a pre-determined random or pseudo-random algorithm. After that, a training set of data (a training set of inputs each having a known output) is used by a learning algorithm to adjust the feature vectors in the neural network. ¶ [abstract]: “A hyper-parameter that controls the variance of the ensemble predictors is used to address overfitting. For larger values of the hyper-parameter, the predictions from the ensemble have more variance, so there is less overfitting. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing. A cost function is provided and a set of techniques for learning are described.”), wherein the cost function evaluates a sum, for all of the machine-learned models (see at least Xiong: ¶ [0046] & ¶ [0059]. Xiong notes that a new random predictor $\tilde{f}(x, m, w)$ is provided, which is the variance-adjusted version of f(x, m, w), and an adjusted cost function that is based on the new predictor is provided, as follows: $\tilde{f}(x, m, w) = \bar{f}(x, w) + \alpha\,(f(x, m, w) - \bar{f}(x, w))$ and $C = -\frac{1}{M}\sum_{m} l(\tilde{f}(x, m, w), t)$ (6). 
Also, models in the ensemble, each of which may comprise a neural network, do not have tied weights and may have different architectures. Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n}\sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. See also ¶ [abstract] of Xiong.), of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model (see at least Xiong: (Dependent Claim 19 of Xiong) & ¶ [0052]. Xiong notes that at block (314) f(x, w) is added, providing the new predictor $\tilde{f}(x, m, w)$. At block (316), $\tilde{f}(x, m, w)$ is then compared to the target (referred to as a desired output) and the error is computed using $\tilde{f}(x, m, w)$. The difference between the variance-adjusted output and the desired output may be computed using squared error, absolute error, log-likelihood or cross-entropy. The error is then back-propagated at block (318) through the two networks according to equation 7. After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch. See also Dependent Claim 19 of Xiong: “computing the sum of the squared errors for the further outputs for the validation data items.”), wherein the respective loss function for each machine-learned model (see at least Xiong: ¶ [0058]. Xiong notes that what are referred to as neural networks include embodiments of neural networks with different depths, structures, activation functions, loss functions, parameter regularization, parameter sharing and optimization methods. 
These embodiments include linear regression, ridge regression, lasso, polynomial regression, logistic regression, multinomial regression, convolutional networks and recurrent neural networks.) comprises the feature vector for the machine-learned model (see at least Xiong: ¶ [0002]. Xiong notes that “after that, a training set of data (a training set of inputs each having a known output) is used by a learning algorithm to adjust the feature vectors in the neural network. It is intended that the neural network learn how to provide an output for new input data by generalizing the information it learns in the training stage from the training data. Generally, during different stages of training, a validation set is processed by the neural network to validate the results of training and to select hyper-parameters used for training.”) respectively multiplied by the vector of variable hyperparameter values (see at least Xiong: ¶ [0033] & Fig. 4. Xiong notes that “once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights (204). Normalization comprises reducing the outgoing weights of each feature detector or input by multiplying them by the probability that the feature detector or input was not disabled. In an example, if the feature detectors of each hidden layer were selectively disabled with probability 0.5 in the training stage, the outgoing weights are halved for the test case since approximately twice as many feature detectors will be enabled.” See Fig. 4 of Xiong denoting the multiplication by the vector of variable hyperparameter values at Fig. 
4 item 312 where the hyperparameters are represented by the “α” values.), and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration (see at least Xiong: ¶ [0019-0020] & ¶ [0042]. Xiong notes that the standard training technique is to minimize the discrepancy between the prediction of each model to the target, without regard to the predictions of other members in the ensemble. This variance is instrumental in preventing the network from overfitting to training data, by encouraging parameter settings that minimize the distance of all f(x, m, w) to the target, as opposed to just the distance of f(x, w) to the target. Also Xiong teaches “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behavior of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models.” See also ¶ [0047] of Xiong.)
	- if the cost function is successfully optimized subject to the constraint (see at least Xiong: ¶ [0026] & ¶ [0058-0059]. Xiong notes that the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data.), providing the vector of variable hyperparameter values as an output (see at least Xiong: ¶ [0046-0047] & ¶ [0059]. Xiong notes that models in the ensemble, each of which may comprise a neural network, do not have tied weights and may have different architectures. Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n}\sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. Also, a new random predictor $\tilde{f}(x, m, w)$ is provided, which is the variance-adjusted version of f(x, m, w), and an adjusted cost function that is based on the new predictor is provided, as follows: $\tilde{f}(x, m, w) = \bar{f}(x, w) + \alpha\,(f(x, m, w) - \bar{f}(x, w))$ and $C = -\frac{1}{M}\sum_{m} l(\tilde{f}(x, m, w), t)$ (6). By using the mean network approximation described above in equation 5, the new predictor $\tilde{f}(x, m, w)$ adjusts the variance of f(x, m, w) by a factor of $\alpha^2$. Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for α may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting, using a grid search, random search, Bayesian hyper-parameter optimization, or a combination of these. See also Fig. 3 of Xiong.)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of the Koch method / non-transitory computer-readable medium to learn improved loss functions with the aforementioned teachings of Xiong regarding (i) attempting to optimize, by the one or more computing devices / by the computing system, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration; and (ii) if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, a technique that can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.
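For orientation only, the cost function recited in the limitation at issue can be sketched as follows: each model's loss is the dot product of its feature vector with the vector of variable hyperparameter values, and the cost sums the squared (or absolute) error between that loss and the model's validation error over all models. This is a minimal sketch under stated assumptions and is not taken from Koch, Xiong, or the application; the names `cost`, `features`, `val_errors` and `lam` and all numeric values are hypothetical, and the recited constraint is deliberately not modeled.

```python
import numpy as np

def cost(lam, features, val_errors, squared=True):
    """Sum over models of the error between loss_i and val_errors[i].

    features   : (n_models, n_components) array, one feature vector per model
    lam        : (n_components,) vector of variable hyperparameter values
    val_errors : (n_models,) validation error of each model
    The per-model loss is the feature vector multiplied by lam.
    """
    residuals = features @ lam - val_errors
    if squared:
        return float(np.sum(residuals ** 2))
    return float(np.sum(np.abs(residuals)))

# Hypothetical data: three models, two loss function components each.
features = np.array([[1.0, 2.0],
                     [0.5, 1.0],
                     [2.0, 0.0]])
val_errors = np.array([0.5, 0.25, 0.4])

# A hyperparameter vector that reproduces every validation error has zero cost.
lam = np.array([0.2, 0.15])
print(cost(lam, features, val_errors))
```

An optimizer would search over `lam` to reduce this cost; the claim additionally requires that the search respect the recited constraint, which this sketch omits.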

Regarding Independent Claim 15, Koch system to learn improved loss function teaches the following:
- one or more processors (see at least Koch: Fig. 1 & ¶ [0032].)
- one or more non-transitory computer-readable media that collectively store instructions that (see at least Koch: Fig. 1 & ¶ [0032-0033].), when executed by the one or more processors (see at least Koch: Fig. 1 & ¶ [0032].), cause the computing system to perform operations, the operations comprising, for each of one or more learning iterations (see at least Koch: Fig. 11 & ¶ [0130-0131]. Koch notes that the one or more tuning search methods may be indicated to run simultaneously and/or successively. When executed successively, objective function values from one or more previous iterations are used to determine a next iteration of a set of hyperparameter configurations to be evaluated. See also ¶ [0213] of Koch.)
	- determining, by the computing system (see at least Koch: Fig. 1 & Fig. 18.), a plurality of feature vectors respectively associated with a plurality of machine-learned models (see at least Koch: ¶ [0062] & Figs. 10-13. Koch teaches that the plurality of rows may be referred to as observation vectors or records (observations), and the columns may be referred to as variables. The input dataset may be transposed. The input dataset may include supervised and/or unsupervised data. The plurality of variables may define multiple dimensions for each observation vector. An observation vector xi may include a value for each of the plurality of variables associated with the observation i. ¶ [0199]: “For illustration, if fraction value is 0.3 or 30%, 30% of the observation vectors in the portion of the input dataset at each session worker device 420 of the session is extracted to create validation dataset subset 436 and the remaining 70% of the observation vectors in the portion of the input dataset at each session worker device 420 of the session forms training dataset subset 434.” See also Table 1 shown on ¶ [0097].), wherein the feature vector for each machine-learned model describes a respective performance of the machine-learned model on each of a plurality of loss function components (see at least Koch: ¶ [0093-0094] & ¶ [0100]. Koch teaches that the eleventh indicator indicates a name of an objective function. The objective function specifies a measure of model error (performance) to be used to identify a best configuration of the hyperparameters among those evaluated. The eleventh indicator may be received by training application 122 after selection from a user interface window or after entry by a user into a user interface window. The neural network model type hyperparameters further may include an L1 norm regularization parameter (regL1) that is greater than or equal to zero with a default value of zero. 
The neural network model type hyperparameters further may include an L2 norm regularization parameter (regL2) that is greater than or equal to zero with a default value of zero. A value for each of these hyperparameters is defined in each hyperparameter configuration for the neural network model type. The methodology by which the values are determined is based on the tuning search method discussed further below and the values, if any, indicated in operation 516. See also Fig. 14)
	- determining, by the computing system (see at least Koch: Fig. 1 & Fig. 18.), a plurality of validation errors respectively for the plurality of machine-learned models (see at least Koch: ¶ [0076] & ¶ [0079]. Koch teaches that a tuneDecisionTree action selects different hyperparameter configurations to run a dtreeTrain action, optionally a dtreePrune action, and a dtreeScore action (an assess action may be run after each dtreeScore action) multiple times to train and validate a decision tree model as it searches for a model that has reduced validation error. Also a tuneForest action selects different hyperparameter configurations to run a forestTrain action and a forestScore action multiple times to train and validate the forest model as it searches for a model that has reduced validation error. See also Figs. 16-17 of Koch.), wherein the validation error for each machine-learned model describes a performance of the machine-learned model relative to a validation metric (see at least Koch: ¶ [0233] & ¶ [0239]. Koch teaches that some of the challenges of hyperparameter tuning discussed earlier can be seen referring to FIG. 10, which shows the error for the hyperparameter configurations evaluated in the first iteration of tuning that used LHS to obtain an initial sample of the space. The majority of the evaluated hyperparameter configurations produced a validation error larger than that of the default configuration and shown as default value 1000, which is 2.57%. Numerous different hyperparameter configurations produced very similar error rates.  A fourth curve 1606 shows results for each dataset using the forest model type. A fifth curve 1608 shows results for each dataset using the gradient boosting model type. For dataset A, the neural network and the support vector machine model types provided the worst results. The other four model types produced very similar errors of around 10%. See Figs. 16-17 of Koch.)
	- for each of one or more optimization iterations respectively associated with one or more of the machine-learned models (see at least Koch: ¶ [0077] & ¶ [0138]. Koch notes that the Forest model type creates a decision tree recursively by choosing an input variable and using it to create a rule to split the data into two or more subsets. The process is repeated in each subset, and again in each new subset, and so on until a constraint is met.  A GSS search method can provide a measure of local optimality that is very useful in performing multimodal optimization. The GSS search method may add additional “growth steps” to the GA search method whenever the hyperparameter is a continuous variable. These additional growth steps may be performed each iteration to permit selected hyperparameter configurations of the population (based on diversity and fitness) to benefit from local optimization over the continuous variables.)
		The Koch system to learn improved loss functions does not explicitly teach or suggest the following:
	- attempting to optimize, by the computing system, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration; 
	- if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output.
		However, the Xiong method / non-transitory computer-readable medium, in the analogous art of learning improved loss functions, teaches the following:
- attempting to optimize (see at least Xiong: ¶ [0026]. Xiong teaches that during the training stage, the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data.), by the computing system (see at least Xiong: Fig. 1.), a cost function to learn a vector of variable hyperparameter values subject to a constraint (see at least Xiong: ¶ [abstract] & ¶ [0002]. Xiong notes that each unit has a weight vector that is determined during learning, which can be referred to as the training stage. In the training stage, the feature vectors are first initialized by a pre-determined random or pseudo-random algorithm. After that, a training set of data (a training set of inputs each having a known output) is used by a learning algorithm to adjust the feature vectors in the neural network. ¶ [abstract]: “A hyper-parameter that controls the variance of the ensemble predictors is used to address overfitting. For larger values of the hyper-parameter, the predictions from the ensemble have more variance, so there is less overfitting. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing. A cost function is provided and a set of techniques for learning are described.”), wherein the cost function evaluates a sum, for all of the machine-learned models (see at least Xiong: ¶ [0046] & ¶ [0059]. Xiong notes that a new random predictor $\tilde{f}(x, m, w)$ is provided, which is the variance-adjusted version of f(x, m, w), and an adjusted cost function that is based on the new predictor is provided, as follows: $\tilde{f}(x, m, w) = \bar{f}(x, w) + \alpha\,(f(x, m, w) - \bar{f}(x, w))$ and $C = -\frac{1}{M}\sum_{m} l(\tilde{f}(x, m, w), t)$ (6). Also, models in the ensemble, each of which may comprise a neural network, do not have tied weights and may have different architectures. 
Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n}\sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. See also ¶ [abstract] of Xiong.), of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model (see at least Xiong: (Dependent Claim 19 of Xiong) & ¶ [0052]. Xiong notes that at block (314) f(x, w) is added, providing the new predictor $\tilde{f}(x, m, w)$. At block (316), $\tilde{f}(x, m, w)$ is then compared to the target (referred to as a desired output) and the error is computed using $\tilde{f}(x, m, w)$. The difference between the variance-adjusted output and the desired output may be computed using squared error, absolute error, log-likelihood or cross-entropy. The error is then back-propagated at block (318) through the two networks according to equation 7. After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch. See also Dependent Claim 19 of Xiong: “computing the sum of the squared errors for the further outputs for the validation data items.”), wherein the respective loss function for each machine-learned model (see at least Xiong: ¶ [0058]. Xiong notes that what are referred to as neural networks include embodiments of neural networks with different depths, structures, activation functions, loss functions, parameter regularization, parameter sharing and optimization methods. These embodiments include linear regression, ridge regression, lasso, polynomial regression, logistic regression, multinomial regression, convolutional networks and recurrent neural networks.) 
comprises the feature vector for the machine-learned model (see at least Xiong: ¶ [0002]. Xiong notes that “after that, a training set of data (a training set of inputs each having a known output) is used by a learning algorithm to adjust the feature vectors in the neural network. It is intended that the neural network learn how to provide an output for new input data by generalizing the information it learns in the training stage from the training data. Generally, during different stages of training, a validation set is processed by the neural network to validate the results of training and to select hyper-parameters used for training.”) respectively multiplied by the vector of variable hyperparameter values (see at least Xiong: ¶ [0033] & Fig. 4. Xiong notes that “once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights (204). Normalization comprises reducing the outgoing weights of each feature detector or input by multiplying them by the probability that the feature detector or input was not disabled. In an example, if the feature detectors of each hidden layer were selectively disabled with probability 0.5 in the training stage, the outgoing weights are halved for the test case since approximately twice as many feature detectors will be enabled.” See Fig. 4 of Xiong denoting the multiplication by the vector of variable hyperparameter values at Fig. 4 item 312 where the hyperparameters are represented by the “α” values.), and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration (see at least Xiong: ¶ [0019-0020] & ¶ [0042]. 
Xiong notes that the standard training technique is to minimize the discrepancy between the prediction of each model to the target, without regard to the predictions of other members in the ensemble. This variance is instrumental in preventing the network from overfitting to training data, by encouraging parameter settings that minimize the distance of all f(x, m, w) to the target, as opposed to just the distance of f(x, w) to the target. Also Xiong teaches “a feedforward neural network training system comprising an extra hyper-parameter that controls the variance of the ensemble predictors and generalizes ensemble learning. The hyper-parameter can be smoothly adjusted to vary the behavior of the method from a single model learning to a family of ensemble learning comprising a plurality of interacting models.” See also ¶ [0047] of Xiong.)
	- if the cost function is successfully optimized subject to the constraint (see at least Xiong: ¶ [0026] & ¶ [0058-0059]. Xiong notes that the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data.), providing the vector of variable hyperparameter values as an output (see at least Xiong: ¶ [0046-0047] & ¶ [0059]. Xiong notes that models in the ensemble, each of which may comprise a neural network, do not have tied weights and may have different architectures. Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n}\sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. Also, a new random predictor $\tilde{f}(x, m, w)$ is provided, which is the variance-adjusted version of f(x, m, w), and an adjusted cost function that is based on the new predictor is provided, as follows: $\tilde{f}(x, m, w) = \bar{f}(x, w) + \alpha\,(f(x, m, w) - \bar{f}(x, w))$ and $C = -\frac{1}{M}\sum_{m} l(\tilde{f}(x, m, w), t)$ (6). By using the mean network approximation described above in equation 5, the new predictor $\tilde{f}(x, m, w)$ adjusts the variance of f(x, m, w) by a factor of $\alpha^2$. Variance-adjustable dropout training thus makes use of an additional hyper-parameter α, which may be set using cross-validation. The search for α may be combined with the search for other hyper-parameters, to produce a jointly optimal hyper-parameter setting, using a grid search, random search, Bayesian hyper-parameter optimization, or a combination of these. See also Fig. 3 of Xiong.)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of the Koch system to learn improved loss functions with the aforementioned teachings of Xiong regarding (i) attempting to optimize, by the one or more computing devices / by the computing system, a cost function to learn a vector of variable hyperparameter values subject to a constraint, wherein the cost function evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a respective loss function for each machine-learned model and the validation error for such machine-learned model, wherein the respective loss function for each machine-learned model comprises the feature vector for the machine-learned model respectively multiplied by the vector of variable hyperparameter values, and wherein the constraint requires that the vector of variable hyperparameter values be such that minimization of the respective loss for each machine-learned model returns a current machine-learned model associated with a current optimization iteration; and (ii) if the cost function is successfully optimized subject to the constraint, providing the vector of variable hyperparameter values as an output, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, a technique that can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.
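For illustration only, the variance-adjusted predictor quoted above from Xiong's equation (6) may be sketched as follows. The function name, the sample values, and the use of the sample mean as a stand-in for the mean network output $\bar{f}(x, w)$ are assumptions made for this sketch and do not appear in Xiong or Koch.

```python
from statistics import mean, pvariance


def variance_adjusted(predictions, alpha):
    """Return f_bar + alpha * (f - f_bar) for each sampled prediction f.

    Assumption: the sample mean of the predictions stands in for the
    mean network output f_bar(x, w) of equation (6).
    """
    f_bar = mean(predictions)
    return [f_bar + alpha * (f - f_bar) for f in predictions]


# Hypothetical outputs f(x, m, w) of one network under four dropout masks m.
sampled = [0.2, 0.4, 0.6, 0.8]
adjusted = variance_adjusted(sampled, alpha=2.0)

# The mean is unchanged while the variance scales by alpha squared.
print(mean(adjusted), pvariance(adjusted) / pvariance(sampled))
```

Under these assumptions the adjustment leaves the mean prediction in place and scales the spread of the predictions, which is consistent with the factor of $\alpha^2$ stated in the quoted passage.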

Regarding Dependent Claims 2 and 16, Koch / Xiong method / system to learn improved loss functions teaches the limitations of Independent Claims 1 and 15 above, and Xiong further teaches the method / system to learn improved loss functions comprising:
	- wherein the cost function further evaluates a sum, for all of the machine-learned models (see at least Xiong: ¶ [0005] & ¶ [0034-0035]. Xiong notes computing a variance-adjusted output for each selected neural network by summing the aggregate output with a fixed multiple (α) of a difference between the output of the selected neural network and the aggregate output for the at least one training data item. Dropout training may then proceed by stochastically obtaining terms in the gradient: $\frac{dC}{dw} = -\frac{1}{M} \sum_{m} \frac{d}{df} l(f(x, m, w), t) \, \frac{d}{dw} f(x, m, w)$ (2), where $\frac{d}{df} l(f(x, m, w), t)$ is the gradient of the log-likelihood function with respect to the prediction $f(x, m, w)$ and $\frac{d}{dw} f(x, m, w)$ is the gradient of the prediction of the neural network with mask m with respect to the parameters. During dropout training, a joint setting of m is sampled for each presentation of a training case and this corresponds to a randomly sampled element in the sum in the above equation. See Formula (2) shown at ¶ [0034-0035] and Formula (7) shown at ¶ [0050].), of an absolute or squared error between a gradient of the respective feature vector for each machine-learned model and a gradient of the validation error for such machine-learned model (see at least Xiong: ¶ [0052] & Fig. 4. Xiong notes that the difference between the variance-adjusted output and the desired output may be computed using squared error, absolute error, log-likelihood or cross-entropy. The error is then back-propagated at block (318) through the two networks according to equation 7. After that, the network parameters are updated at block (320) and the training procedure may move on to the next mini-batch. The parameters may be adjusted using the gradient or the Hessian of a log-likelihood function or a squared error function, evaluated using the desired output and the variance-adjusted output. The parameters may further be adjusted using gradient descent, stochastic gradient descent, momentum, Nesterov's accelerated momentum, AdaGrad, RMSProp, conjugate gradient, or a combination of these. See Fig. 4 showing computing gradient of cost function w.r.t. f vector at 316. See Formula (2) shown at ¶ [0034-0035] and Formula (7) shown at ¶ [0050].)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch method to learn improved loss functions with the aforementioned teachings regarding  wherein the cost function further evaluates a sum, for all of the machine-learned models, of an absolute or squared error between a gradient of the respective feature vector for each machine-learned model and a gradient of the validation error for such machine-learned model in view of Xiong, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, and can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.

Regarding Dependent Claims 3 and 17, Koch / Xiong method / system to learn improved loss functions teaches the limitations of Independent Claims 1 and 15 above, and Koch further teaches the method / system to learn improved loss functions comprising:
- further comprising, for each learning iteration (see at least Koch: Fig. 11 & ¶ [0130-0131]. Koch notes that the one or more tuning search methods may be indicated to run simultaneously and/or successively. When executed successively, objective function values from one or more previous iterations are used to determine a next iteration of a set of hyperparameter configurations to be evaluated. See also ¶ [0213] of Koch.)
	- ordering, by the one or more computing devices (see at least Koch: Fig. 1 & ¶ [0040]. Koch notes one or more computing devices shown in Fig. 1.), the machine-learned models into a sequence with ascending order of validation error (see at least Koch: Fig. 12 & ¶ [0235]. Koch notes Fig. 12 shows an ordering or ranking of machine learning models with various numbers of trees. Furthermore, a tuner results table 1200 summarizes a comparison between the default hyperparameter configuration and the ten best hyperparameter configurations of the 2,555 unique model configurations evaluated as measured by the misclassification error percentage (MISC) objective function. Evaluation number 2551 included the hyperparameter configuration with the best final error value of 1.74%. A review of tuner results table 1200 provides alternative hyperparameter configurations that have comparable objective function performance. For example, if fewer trees were desired for the number of trees hyperparameter, evaluation number 2540 may be selected by the user for selected model data 320 because the hyperparameter value selected for the number of trees hyperparameter is 136 instead of 142. See also ¶ [0074]: “A fixed, predefined value may be used for the fraction value unless the number of folds F is defined by the user. In cross validation, each model evaluation requires F−1 number of training executions and scoring executions with different training subsets as discussed previously. Thus, the evaluation time is increased by approximately a factor of F−1. For small to medium sized input datasets or for unbalanced input datasets, cross validation provides on average a better representation of error across the entire input dataset.”)
		Koch method / system to learn improved loss functions does not teach or suggest the following:
	- wherein the one or more optimization iterations proceed through the sequence until the cost function is successfully optimized subject to the constraint
		However, Xiong method / system to learn improved loss functions teaches the following:
	- wherein the one or more optimization iterations proceed through the sequence until the cost function is successfully optimized subject to the constraint (see at least Xiong: ¶ [0056] & ¶ [0058-0059]. Xiong teaches that the cost function and the gradient may not be able to be evaluated mathematically. However, the normal back-propagation technique can still be applied to compute a surrogate of derivatives used for learning, even though the cost function is not well defined. In particular, for the softmax activation function, the error of the softmax inputs may be the target minus the adjusted prediction. Alternatively, variance of the pre-nonlinearity value, which is usually unbounded, can be adjusted. Also, what are referred to as neural networks include embodiments of neural networks with different depths, structures, activation functions, loss functions, parameter regularization, parameter sharing and optimization methods. Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n} \sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. See Figs. 2 and 4 of Xiong reference.)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch method to learn improved loss functions with the aforementioned teachings regarding wherein the one or more optimization iterations proceed through the sequence until the cost function is successfully optimized subject to the constraint in view of Xiong, wherein the neural network optimizes weights for each feature detector. After learning, the optimized weight configuration can then be applied to test data (see at least Xiong: ¶ [0026]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.
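For illustration only, the limitation of Claims 3 and 17 discussed above (ordering the models by ascending validation error and proceeding through that sequence until the optimization succeeds) may be sketched as follows. The model records, the error values, and the placeholder optimizer are hypothetical and are taken from neither Koch nor Xiong.

```python
def first_feasible(models, try_optimize):
    """Visit models in ascending validation error; stop at the first success.

    try_optimize is a caller-supplied function (hypothetical here) that
    returns a hyperparameter vector on success or None on failure.
    """
    for model in sorted(models, key=lambda m: m["validation_error"]):
        result = try_optimize(model)
        if result is not None:
            return model["name"], result
    return None


# Hypothetical models with hypothetical validation errors.
models = [
    {"name": "a", "validation_error": 0.03},
    {"name": "b", "validation_error": 0.01},
    {"name": "c", "validation_error": 0.02},
]


def toy_optimize(model):
    # Placeholder: pretend the constraint cannot be met for model "b".
    return None if model["name"] == "b" else [1.0]


chosen = first_feasible(models, toy_optimize)
```

In this sketch model "b" is tried first because it has the lowest validation error, and the loop moves on to model "c" only because the placeholder optimizer reports failure for "b".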

Regarding Dependent Claims 4 and 18, Koch / Xiong method / system to learn improved loss functions teaches the limitations of Independent Claims 1 and 15 above, and Xiong further teaches the method / system to learn improved loss functions comprising:
	- training (see at least Xiong: ¶ [0032-0033]. Xiong notes that each training case is then processed by the neural network, one or a mini-batch at a time (202). For each such training case, the switch may reconfigure the neural network by selectively disabling each linked feature detector. Once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights (204).), by the one or more computing devices (see at least Xiong: ¶ [0015]. Xiong teaches one or more computing devices.), an additional machine-learned model via optimization of an inferred loss function (see at least Xiong: ¶ [0058] & ¶ [0060]. Xiong notes that what are referred to as neural networks include embodiments of neural networks with different depths, structures, activation functions, loss functions, parameter regularization, parameter sharing and optimization methods. From both the modelling and the optimization aspects, regular ensemble learning using neural networks may be considered to be a very large neural network that encompasses the individual networks of the ensemble, i.e. of a plurality of neural networks. These individual networks may be connected to the inputs in parallel and produce n “pre-outputs”, $f_i(x, w_i)$, i = 1, ..., n.), the inferred loss function comprises the vector of variable hyperparameter values multiplied with a vector of the plurality of loss function components (see at least Xiong: Abstract & ¶ [0033]. Xiong teaches that once the training set has been learned by the neural network, the switch may enable all feature detectors and normalize their outgoing weights (204). Normalization comprises reducing the outgoing weights of each feature detector or input by multiplying them by the probability that the feature detector or input was not disabled. In an example, if the feature detectors of each hidden layer were selectively disabled with probability 0.5 in the training stage, the outgoing weights are halved for the test case since approximately twice as many feature detectors will be enabled. A similar approach is applied to the input layers. The test set may then be processed by the neural network (206). A hyper-parameter that controls the variance of the ensemble predictors is used to address overfitting. For larger values of the hyper-parameter, the predictions from the ensemble have more variance, so there is less overfitting. This technique can be applied to ensemble learning with various cost functions, structures and parameter sharing. See also Fig. 4 of Xiong.)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch method to learn improved loss functions with the aforementioned teachings regarding training, by the one or more computing devices, an additional machine-learned model via optimization of an inferred loss function, the inferred loss function comprises the vector of variable hyperparameter values multiplied with a vector of the plurality of loss function components in view of Xiong, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, and can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.
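For illustration only, the weight normalization that the cited passage of Xiong (¶ [0033]) describes, in which outgoing weights are multiplied by the probability that a feature detector was not disabled, may be sketched as follows. The function name and the weight values are hypothetical.

```python
def normalize_outgoing_weights(weights, drop_probability):
    """Scale each outgoing weight by the probability that its unit was kept."""
    keep_probability = 1.0 - drop_probability
    return [w * keep_probability for w in weights]


# Hypothetical outgoing weights of one feature detector trained with
# a disabling probability of 0.5, as in the example quoted from Xiong.
test_weights = normalize_outgoing_weights([0.8, -1.2, 0.5], drop_probability=0.5)
print(test_weights)  # each weight is halved
```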

Regarding Dependent Claims 5 and 19, Koch / Xiong method / system to learn improved loss functions teaches the limitations of Independent Claims 1 and 15 above, and Xiong further teaches the method / system to learn improved loss functions comprising:
	- wherein attempting to optimize, by the one or more computing devices (see at least Xiong: ¶ [0015] & Fig. 1.), the cost function comprises solving, by the one or more computing devices, a quadratic program (see at least Xiong: ¶ [0059] & Fig. 3. Xiong notes that models in the ensemble, each of which may comprise a neural network, do not have tied weights and may have different architectures. Each model in the ensemble is generally optimized independently. This may be considered to be roughly equivalent to optimizing the average of the cost function of models in the ensemble: $C = -\frac{1}{n} \sum_{i=1}^{n} l(f_i(x, w_i), t)$ (8), where $f_i(x, w_i)$ is the i-th model with parameters $w_i$. When testing the ensemble, predictions from the models in the ensemble are averaged. Fig. 3 shows a Gaussian distribution function which is a squared polynomial and hence quadratic. Also at ¶ [0052]: “The parameters may be adjusted using the gradient or the Hessian of a log-likelihood function or a squared error function, evaluated using the desired output and the variance-adjusted output.”)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch method to learn improved loss functions with the aforementioned teachings regarding wherein attempting to optimize, by the one or more computing devices, the cost function comprises solving, by the one or more computing devices, a quadratic program in view of Xiong, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, and can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.
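For illustration only, the squared-error form of the cost function recited in Claims 1 and 15 (a sum, over the models, of the squared difference between each model's feature vector multiplied by the hyperparameter vector and that model's validation error) may be sketched as a small least-squares problem, which is a quadratic program without constraints. The sketch omits the claimed constraint, is limited to two hyperparameters, and uses hypothetical names and data; it is not drawn from Koch or Xiong.

```python
def fit_two_hyperparameters(features, validation_errors):
    """Minimize sum_i (features[i] . lam - validation_errors[i])**2.

    Solves the 2x2 normal equations by Cramer's rule, so each feature
    vector must have exactly two entries.
    """
    a11 = sum(f[0] * f[0] for f in features)
    a12 = sum(f[0] * f[1] for f in features)
    a22 = sum(f[1] * f[1] for f in features)
    b1 = sum(f[0] * e for f, e in zip(features, validation_errors))
    b2 = sum(f[1] * e for f, e in zip(features, validation_errors))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("feature vectors do not determine a unique solution")
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Hypothetical feature vectors and validation errors for three models,
# constructed so that the exact minimizer is [0.5, 2.0].
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
validation_errors = [0.5, 2.0, 2.5]
lam = fit_two_hyperparameters(features, validation_errors)
```

A constrained version, as the claims recite, would generally call for a dedicated quadratic programming solver rather than the closed-form solution used in this sketch.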

Regarding Dependent Claim 6, Koch / Xiong method to learn improved loss functions teaches the limitations of Independent Claim 1 above, and Koch further teaches the method to learn improved loss functions comprising:
	- wherein at least a first machine-learned model of the plurality of machine-learned models was trained using a first training loss function that differs from a second training loss function that was used to train at least a second machine-learned model of the plurality of machine-learned models (see at least Koch: ¶ [0093-0094] & ¶ [0097]. Koch's Table 1 at ¶ [0097] shows training machine-learned models with one machine learning model using a Regularization L1 loss function and a second machine learning model using a Regularization L2 loss function. The Regularization L1 loss function is a different type of loss function when compared to the Regularization L2 loss function shown in the Koch reference. See also ¶ [0093-0094] of Koch.)

Regarding Dependent Claim 7, Koch / Xiong method to learn improved loss functions teaches the limitations of Independent Claim 1 above, and Koch further teaches the method to learn improved loss functions comprising:
- wherein the plurality of loss function components comprise: a logloss loss function component (see at least Koch: ¶ [0101]. Koch teaches that MCLL uses a multiclass log loss as the objective function (nominal type only); MISC uses a misclassification error percentage as the objective function (nominal type only).); an L1 regularization loss function component and an L2 regularization loss function component (see at least Koch: ¶ [0093-0094]. Koch notes that the gradient boosting tree model type hyperparameters may include an L1 norm regularization parameter (lasso) that is greater than or equal to zero with a default value of zero. The gradient boosting tree model type hyperparameters further may include a learning rate (learningRate) that is between zero and one, inclusive, with a default value of 0.1. The gradient boosting tree model type hyperparameters further may include a number of trees (nTree) to grow with a default value of 100. The gradient boosting tree model type hyperparameters further may include an L2 norm regularization parameter (ridge) that is greater than or equal to zero with a default value of zero. The neural network model type hyperparameters further may include an L1 norm regularization parameter (regL1) that is greater than or equal to zero with a default value of zero. The neural network model type hyperparameters further may include an L2 norm regularization parameter (regL2) that is greater than or equal to zero with a default value of zero.  See also Table 1 on ¶ [0096] showing RegL1 and RegL2 neural network (PROC_NNET).)
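For illustration only, the three loss function components named in Claim 7 (log loss, L1 regularization, and L2 regularization) may be combined in the manner recited for the inferred loss function of Claims 4 and 18, namely as a vector of hyperparameter values multiplied with a vector of loss function components. The function names, the binary form of the log loss, and all numeric values are assumptions made for this sketch and are not taken from Koch or Xiong.

```python
import math


def log_loss(label, probability):
    """Binary log loss for one example (label is 0 or 1)."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1 - probability))


def loss_components(weights, label, probability):
    """Return [log loss, L1 norm of weights, squared L2 norm of weights]."""
    return [
        log_loss(label, probability),
        sum(abs(w) for w in weights),
        sum(w * w for w in weights),
    ]


def inferred_loss(hyperparameters, components):
    """Dot product of the hyperparameter vector with the component vector."""
    return sum(h * c for h, c in zip(hyperparameters, components))


# Hypothetical model weights, one labeled example, and hyperparameter values.
components = loss_components([1.0, -2.0], label=1, probability=0.5)
value = inferred_loss([1.0, 0.1, 0.01], components)
```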

Regarding Dependent Claim 11, Koch / Xiong method to learn improved loss functions teaches the limitations of Independent Claim 1 above, and Xiong further teaches the method to learn improved loss functions comprising:
- the one or more learning iterations comprise a plurality of learning iterations (see at least Xiong: ¶ [0006]. Xiong notes computing a plurality of training outputs by repeatedly applying the neural network to the at least one training data item while disabling at least one of the hidden units or input units randomly with the predetermined probability, or pseudo-randomly with the predetermined probability, or using a predetermined set of binary masks that use the predetermined probability and where each mask is used only once, or according to a fixed pattern that uses the predetermined probability.)
- each learning iteration further (see at least Xiong: ¶ [0006]. Xiong notes computing a plurality of training outputs by repeatedly applying the neural network to the at least one training data item while disabling at least one of the hidden units or input units randomly with the predetermined probability, or pseudo-randomly with the predetermined probability, or using a predetermined set of binary masks that use the predetermined probability and where each mask is used only once, or according to a fixed pattern that uses the predetermined probability.) comprises:
- training, by the one or more computing devices (see at least Xiong: Fig. 1 & ¶ [0015].), an additional machine-learned model via optimization of an inferred loss function, the inferred loss function comprising the vector of variable hyperparameter values multiplied with a vector of the plurality of loss function components (see at least Xiong: ¶ [0033] & Figs. 3-4. Xiong notes that normalization comprises reducing the outgoing weights of each feature detector or input by multiplying them by the probability that the feature detector or input was not disabled. In an example, if the feature detectors of each hidden layer were selectively disabled with probability 0.5 in the training stage, the outgoing weights are halved for the test case since approximately twice as many feature detectors will be enabled. A similar approach is applied to the input layers. The test set may then be processed by the neural network (206). See also Figs. 3-4 of Xiong reference.)
- evaluating, by the one or more computing devices (see at least Xiong: Fig. 1 & ¶ [0015].), an additional validation error for the additional machine-learned model (see at least Xiong: ¶ [0047] & Fig. 5.), wherein the validation error for the additional machine-learned model is supplied as input for a next sequential learning iteration of the plurality of learning iterations (see at least Xiong: ¶ [0065] & Fig. 5. Xiong notes that Fig. 5 shows possible results for an error rate on held-out test data (10,000 cases) when variance-adjustable dropout is used to train networks on handwritten MNIST digits. Such results would indicate that setting α>1 leads to better classification performance.)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch method to learn improved loss functions with the aforementioned teachings regarding training, by the one or more computing devices, an additional machine-learned model via optimization of an inferred loss function, the inferred loss function comprising the vector of variable hyperparameter values multiplied with a vector of the plurality of loss function components & the one or more learning iterations comprise a plurality of learning iterations & evaluating, by the one or more computing devices, an additional validation error for the additional machine-learned model, wherein the validation error for the additional machine-learned model is supplied as input for a next sequential learning iteration of the plurality of learning iterations in view of Xiong, in order to improve the performance of dropout training by adjusting the variance of predictions introduced by dropout during training, and can be referred to for convenience as variance-adjustable dropout. Dropout training of a single neural network has the effect of creating an exponentially large ensemble of neural networks with different structures, but with shared parameters, which may provide improved computational efficiency at training time and at testing time (see at least Xiong: ¶ [0017-0018]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Xiong, the results of the combination were predictable.

11. 		Claims 8-9 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Koch / Xiong, and in further view of Foreign Patent Application (WO 2018/212710 A1) to He.
		Regarding Dependent Claims 8 and 13, Koch / Xiong method to learn improved loss functions does not teach or suggest the following:
	- wherein the plurality of loss function components respectively correspond to a plurality of different augmentation operations performed on a set of training data (see Dependent Claim 8);
	- wherein the validation error for each machine-learned model approximates a test error for the machine-learned model (see Dependent Claim 13)
		He however in the analogous art to learn improved loss functions teaches the following:
- wherein the plurality of loss function components (see at least He: Page 11, Lns. 14-23 & Page 12, Lns. 15-25.) respectively correspond to a plurality of different augmentation operations performed on a set of training data (see at least He: Page 1, Lns. 29-34; Page 2, Lns. 1-4 & Page 23, Lns. 10-15. He teaches that “to leverage the interactions between features, one common solution is to explicitly augment a feature vector with products of features (also known as cross features), as in polynomial regression (PR) where a weight for each cross feature is also learned.” Also “The attention score*interaction score of each feature interaction of three test examples. We can see that among all three interactions, the item-tag interaction is the most important. However, FM assigns the same importance score for all interactions, resulting in a large prediction error. By augmenting FM with the attention network (cf. rows FM+A), the item-tag interaction is assigned a higher importance score, and the prediction error is reduced.”) (see Dependent Claim 8)
- wherein the validation error for each machine-learned model approximates a test error for the machine-learned model (see at least He: Page 22, Lns. 16-23 & Figs. 8A-8B. He teaches that Figures 8A and 8B compare the training and test error of AFM and FM of each epoch for the Frappe and MovieLens datasets respectively. We observe that AFM converges faster than FM. On Frappe, both the training and test error of AFM are much lower than that of FM, indicating that AFM can better fit the data and lead to more accurate prediction. On MovieLens, although AFM achieves a slightly higher training error than FM, the lower test error shows that AFM generalizes better to unseen data. Also Table 2: “Test error and number of parameters of different methods on embedding size 256. We have the following observations: It can be seen that AFM achieves the best performance among all methods. Specifically, AFM betters LibFM with an 8.6% relative improvement by using less than 0.1 M additional parameters; and AFM outperforms the second best method Wide&Deep with 4.3%, while using much fewer model parameters. This demonstrates the effectiveness of AFM, which, although is a shallow model, achieves better performance than deep learning methods.”) (see Dependent Claim 13)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch / Xiong method to learn improved loss functions with the aforementioned teachings regarding wherein the plurality of loss function components respectively correspond to a plurality of different augmentation operations performed on a set of training data and wherein the validation error for each machine-learned model approximates a test error for the machine-learned model in view of He, whereas dropout is disabled during testing and the whole network is used for prediction, dropout has another side effect of performing model averaging with smaller neural networks, which may potentially improve the performance (see at least He: [last ¶ of page 12]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by He, the results of the combination were predictable.

Regarding Dependent Claim 9, Koch / Xiong / He method to learn improved loss functions teaches the limitations of Independent Claims 1 and 15 above, and Xiong further teaches the method to learn improved loss functions comprising:
	- determining, by the one or more computing devices (see at least Xiong: Fig. 1 & ¶ [0015].), an optimal probability distribution for the plurality of different augmentation operations based at least in part on the vector of variable hyperparameter values (see at least Xiong: Fig. 3 & ¶ [0042]. Xiong teaches that variance-adjustable dropout, when implemented using suitable parameters, can be considered as improving generalization performance by introducing a distribution of networks with different structures during training. Networks from this distribution produce different predictions f(x, m, w) for the same x and w. During training, these predictions are randomly sampled to produce errors that are then back propagated to generate gradients for updating the network parameters. The prediction f(x, m, w) of a training example can be considered as a random variable with a distribution centered around f(x, w), as illustrated in FIG. 3.)

12. 		Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Koch / Xiong, and in further view of NPL Document Yu, Jiaqian, and Matthew Blaschko, "A convex surrogate operator for general non-modular loss functions," Artificial Intelligence and Statistics (hereinafter Yu).
		Regarding Dependent Claim 10, Koch / Xiong method to learn improved loss functions does not teach or suggest the following:
	- wherein the plurality of loss function components comprise a plurality of convex, non-negative, piecewise linear functions
		Yu however in the analogous art to learn improved loss functions teaches the following:
	- wherein the plurality of loss function components comprise a plurality of convex, non-negative, piecewise linear functions (see at least Yu:  Tables 2-3 on Page 1038. Yu teaches on Table 2: For the face classification task, the cross comparison of average loss values (with standard error) using different surrogate operator and losses as in Equation (33) to Equation (36) during training, respectively. For the cases that the submodular component is non-negative, i.e. using ∆1 and ∆2, the lowest empirical error is achieved when using BD. The plot of the four loss functions used in our experiments as in Equations (33) to (36). The x axis is the number of mispredictions for each track (we show here the loss functions corresponding to track length equal to 10 as an example), and the y axis is the value of loss function. The original losses are drawn in red; the supermodular components are drawn in green, and the submodular components in blue.)
		It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Koch / Xiong method to learn improved loss functions with the aforementioned teachings regarding wherein the plurality of loss function components comprise a plurality of convex, non-negative, piecewise linear functions in further view of Yu, wherein a convex surrogate operator BD can achieve a comparable convergence rate to an SVM, demonstrating that optimization is very fast in practice and the method scales well to large datasets (see at least Yu: Fig. 5 caption on page 1039.)
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Yu, the results of the combination were predictable.
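For illustration only, the claim term "convex, non-negative, piecewise linear functions" can be sketched as follows. This sketch is not taken from Koch, Xiong, Yu, or the application. It assumes that each loss function component is the maximum of zero and a finite set of affine pieces, which is one standard way to construct such a function, and the names max_affine, HINGE, ABS_VALUE, and is_midpoint_convex are hypothetical.

```python
from typing import List, Sequence, Tuple

# Each piece (a, b) stands for the affine function a * z + b.
Pieces = List[Tuple[float, float]]


def max_affine(pieces: Pieces, z: float) -> float:
    """Evaluate f(z) = max(0, max_i(a_i * z + b_i)).

    A pointwise maximum of affine functions is convex and piecewise
    linear; including 0 in the maximum makes it non-negative.
    """
    return max(0.0, max(a * z + b for a, b in pieces))


# Hinge loss on a margin z: max(0, 1 - z).
HINGE: Pieces = [(-1.0, 1.0)]
# Absolute value: max(-z, z).
ABS_VALUE: Pieces = [(-1.0, 0.0), (1.0, 0.0)]


def is_midpoint_convex(pieces: Pieces, grid: Sequence[float], tol: float = 1e-12) -> bool:
    """Numerically check f((u + v) / 2) <= (f(u) + f(v)) / 2 on a grid."""
    return all(
        max_affine(pieces, (u + v) / 2.0)
        <= (max_affine(pieces, u) + max_affine(pieces, v)) / 2.0 + tol
        for u in grid
        for v in grid
    )


if __name__ == "__main__":
    grid = [i / 4.0 for i in range(-12, 13)]
    print(max_affine(HINGE, 0.0), is_midpoint_convex(HINGE, grid))
```

The grid check is only a numerical sanity test of the convexity inequality at sampled points and is not a proof.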

13.		Claims 12 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Koch / Xiong, and in further view of US Patent Application (US 2016/0247089 A1) to Zhao.
		Regarding Dependent Claims 12 and 14, the Koch / Xiong method to learn improved loss functions does not teach or suggest the following:
	- wherein the training, by the one or more computing devices, the additional machine-learned model via optimization of the inferred loss function comprises warm start training, by the one or more computing devices, the additional machine-learned model via optimization of the inferred loss function (see Dependent Claim 12);
	- wherein the plurality of machine-learned models comprise a plurality of different versions of a same machine-learned model respectively saved at a plurality of training checkpoints of a training process (see Dependent Claim 14)
		Zhao however in the analogous art to learn improved loss functions teaches the following:
	- wherein the training, by the one or more computing devices (see at least Zhao: Fig. 2.), the additional machine-learned model via optimization of the inferred loss function comprises warm start training (see at least Zhao: ¶ [0167] & Table 3. Zhao teaches that Table 3 shows an example of the total run time of the L1L2-SVM with no acceleration, with acceleration using the proposed technique (e.g., exclusion of inactive features), and with acceleration using the proposed technique and a warm start technique. The warm start technique involves computing w*_k by using the w*_(k−1) obtained in the (k−1)th iteration as the initial point for searching the optimal solution. The total run times shown in Table 3 are shown in seconds.), by the one or more computing devices (see at least Zhao: Fig. 2.), the additional machine-learned model via optimization of the inferred loss function (see at least Zhao: ¶ [0124] & ¶ [0132]. Zhao teaches accelerating an SPSVM by precisely identifying inactive features in the optimal solution of a L1 norm regularized SPSVM, such as an L1 norm regularized L2 loss support vector machine (L1L2-SVM), and removing those inactive features before training the L1L2-SVM. Model selection for an SPSVM involves solving a series of models with a series of differing regularization parameters before selecting the best model to use. Variational inequality for constructing a tight convex set can be used to compute bounds for safely screening inactive features in different situations, such that every feature indicated as inactive is guaranteed to be inactive in the optimal solution for a particular model. In other words, the results of one model can be used to predict which inactive features to exclude when solving the subsequent model. An L1-regularized sparse predictive modeling algorithm can be formulated as min_w loss(w) + λ‖w‖_1, where w ∈ ℝ^m contains the model coefficients, loss(w) is a loss function, and λ ≥ 0 is the regularization parameter that balances between the loss function and the regularizer. When the hinge loss or its square form is used as the loss function, the resulting sparse model is the L1-regularized support vector machine (L1-SVM).) (see Dependent Claim 12)
	- wherein the plurality of machine-learned models comprise a plurality of different versions (see at least Zhao: ¶ [0124] & ¶ [0158]. Zhao notes that a dataset can be accessed. The dataset can track a number of features across a number of samples. A series of values for a regularization parameter of a sparse support vector machine model can be determined. The sparse support vector machine model can be a L1 norm regularized L2 loss support vector machine model. The values can be predefined values. At ¶ [0124]: “Model selection for an SPSVM involves solving a series of models with a series of differing regularization parameters before selecting the best model to use. Variational inequality for constructing a tight convex set can be used to compute bounds for safely screening inactive features in different situations, such that every feature indicated as inactive is guaranteed to be inactive in the optimal solution for a particular model.”) of a same machine-learned model respectively saved at a plurality of training checkpoints of a training process (see at least Zhao: ¶ [0084] & ¶ [0126]. Zhao notes at ¶ [0126]: “A series of models must be solved in order to determine an optimal solution for the SPSVM. The SPSVM training can be greatly accelerated by using a solution for a model (e.g., the kth model) of the series of models to determine what features must be inactive in the next of the series of models (e.g., the kth+1 model). By determining the inactive features, the SPSVM training (e.g., model solving) can exclude those inactive features when attempting to solve the next model in the series of model.” Also at ¶ [0154]: “By identifying inactive features, the SPSVM training process can leverage the ability to filter out inactive features during model selection. Thus, an algorithm can be used to remove inactive features before each step of the model selection process to reduce the search space for fitting or training an L1L2-SVM model.”) (see Dependent Claim 14)
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of the Koch / Xiong method to learn improved loss functions with the aforementioned teachings of Zhao, wherein the training, by the one or more computing devices, the additional machine-learned model via optimization of the inferred loss function comprises warm start training, by the one or more computing devices, the additional machine-learned model via optimization of the inferred loss function, and wherein the plurality of machine-learned models comprise a plurality of different versions of a same machine-learned model respectively saved at a plurality of training checkpoints of a training process, because improvements to the model selection process can be made to enable more efficient model selection, thus allowing SPSVMs to be used for certain datasets where SPSVMs would have otherwise been computationally prohibitive due to computational cost, time, or other reasons (see at least Zhao: ¶ [0131]).
Further, the claimed invention is merely a combination of old elements in a similar field to learn improved loss functions, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Zhao, the results of the combination were predictable.
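For illustration only, the warm start technique cited from Zhao ¶ [0167], in which the solution w*_(k−1) of the previous model initializes the search for w*_k, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Zhao's implementation: it assumes a squared hinge loss with an L1 penalty, substitutes a plain proximal gradient solver, and omits Zhao's screening of inactive features entirely. The names soft_threshold, objective, fit, and regularization_path, and the toy data, are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t * |v| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def objective(w: Vector, X: List[Vector], y: Vector, lam: float) -> float:
    """Squared hinge loss plus lam * ||w||_1."""
    loss = sum(max(0.0, 1.0 - yi * dot(xi, w)) ** 2 for xi, yi in zip(X, y))
    return loss + lam * sum(abs(c) for c in w)


def fit(X: List[Vector], y: Vector, lam: float, w0: Vector, iters: int = 200) -> Vector:
    """Proximal gradient descent started from the initial point w0."""
    # Step size 1 / L, where L = 2 * sum ||x_i||^2 upper-bounds the
    # Lipschitz constant of the squared hinge gradient.
    step = 1.0 / (2.0 * sum(a * a for xi in X for a in xi))
    w = list(w0)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            slack = max(0.0, 1.0 - yi * dot(xi, w))
            for j, a in enumerate(xi):
                grad[j] -= 2.0 * slack * yi * a
        w = [soft_threshold(c - step * g, step * lam) for c, g in zip(w, grad)]
    return w


def regularization_path(X: List[Vector], y: Vector, lams: Sequence[float]) -> List[Vector]:
    """Solve one model per regularization value, warm starting each solve."""
    w = [0.0] * len(X[0])  # only the first model starts from zero
    path = []
    for lam in lams:
        w = fit(X, y, lam, w0=w)  # warm start: reuse the previous solution
        path.append(w)
    return path


if __name__ == "__main__":
    X = [[1.0, 0.5], [0.8, -0.2], [-1.0, 0.3], [-0.7, -0.4]]
    y = [1.0, 1.0, -1.0, -1.0]
    for lam, w in zip([2.0, 1.0, 0.5], regularization_path(X, y, [2.0, 1.0, 0.5])):
        print(lam, w, objective(w, X, y, lam))
```

A cold start would pass a zero vector as w0 for every value in lams; the warm start differs only in reusing the previous solution as the initial point.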

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DERICK HOLZMACHER whose telephone number is (571) 270-7853. The examiner can normally be reached on Monday-Friday 9:00 AM – 6:30 PM EST. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Brian Epstein, can be reached at 571-270-5389. The fax phone number for the organization where this application or proceeding is assigned is 571-270-8853.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

		/DERICK J HOLZMACHER/		Patent Examiner, Art Unit 3623
		/BRIAN M EPSTEIN/		Supervisory Patent Examiner, Art Unit 3683