DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Information Disclosure Statement
The information disclosure statement (IDS) submitted on 03/17/2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.


Specification
The disclosure is objected to because of the following informalities: 
The present specification recites “Amijo line search”; it should be spelled “Armijo line search”. Because “Amijo” is not a term of art, Examiner suggests correcting the spelling throughout the specification. Appropriate correction is required.

Claim Objections
Claims 1-16 are objected to because of the following informalities:  
Claim 1 recites “matrix-free CG solver to obtain…”. Because this is the first occurrence of “CG” in independent claim 1, Examiner suggests spelling out the abbreviation on first use, i.e., “Conjugate Gradient (CG)”, as in paragraph [0006] of the specification. Appropriate correction is required.
Claim 3 recites “Amijo line search”; it should be spelled “Armijo line search”. Appropriate correction is required.
Claims 2-7 are objected to for their dependency on independent claim 1.
Claim 8 recites “SteihaugCG solver”. Examiner suggests adding a space between “Steihaug” and “CG”. In addition, Examiner suggests spelling out the abbreviation “CG” on first use, i.e., “Conjugate Gradient (CG)”, as in paragraph [0006] of the specification. Appropriate correction is required.
Claims 9-16 are objected to for their dependency on independent claim 8.



Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-7 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation “obtain an inexact solution to a linear system…”. It is unclear what qualifies as an inexact solution, as opposed to an exact one, and how the degree of inexactness (imprecision or approximation) is to be measured. For the purpose of examination, the limitation has been interpreted as “obtain a solution to a linear system…”.
Claims 2-7 are rejected for their dependency on independent claim 1.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-16 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims follows the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50-57 (January 7, 2019) (“2019 PEG”).
Regarding claim 1 
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
 “…the method comprising: defining a loss function corresponding to the deep neural network; …perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, wherein each iteration comprises: calculating a steepest direction of the loss function by determining the gradient of the loss function at the current parameter values,”
Under its broadest reasonable interpretation in light of the specification, this limitation recites a mathematical concept. See MPEP 2106.04.
“selecting a batch of samples included in the plurality of training samples, apply a matrix-free CG solver to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples,”
Under its broadest reasonable interpretation in light of the specification, this limitation covers steps that a human could perform using pen and paper: selecting a batch of samples and solving a linear system defined by the steepest direction of a loss function and a stochastic Hessian matrix.
“determining a descent direction based on the inexact solution to the linear system and the steepest direction of the loss function,” 
Under its broadest reasonable interpretation in light of the specification, this limitation recites a mathematical concept. See MPEP 2106.04.
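For illustration only, the kind of computation recited in the quoted limitations can be sketched as below. The sketch is not taken from the application. It assumes the linear system has the Newton form H p = -g, where g is the gradient and H is a batch Hessian that is accessed only through a hypothetical Hessian-vector product callback `hvp`. The names `dot`, `conjugate_gradient`, and `example_hvp`, the iteration cap, and the tolerance are likewise assumptions. Stopping after a small number of iterations is one way a solution could be "inexact".

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp: Callable[[Vector], Vector], g: Vector,
                       max_iters: int = 10, tol: float = 1e-8) -> Vector:
    """Approximately solve H p = -g using only Hessian-vector products.

    The matrix H is never formed; `hvp(v)` is assumed to return H v.
    """
    p = [0.0] * len(g)
    r = [-gi for gi in g]      # residual of H p = -g at p = 0
    d = list(r)                # first search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs <= tol * tol:    # residual norm is small enough
            break
        Hd = hvp(d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:   # non-positive curvature: stop early
            break
        alpha = rs / curvature
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rs_new = dot(r, r)
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return p


def example_hvp(v: Vector) -> Vector:
    """Hypothetical 2x2 stand-in for a batch Hessian [[4, 1], [1, 3]]."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]
```

Calling `conjugate_gradient(example_hvp, [-1.0, -2.0], max_iters=1)` returns a one-iteration approximation, while the default iteration cap lets the two-dimensional example converge.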

Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements that are mere instructions to implement an abstract idea on a computer, or that merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). The additional element of “computer-implemented… deep neural network”, as drafted, recites generic computer components. The generic computer components in these steps are recited at a high level of generality (i.e., as a generic computer component performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component. In addition, the claim recites the additional elements “receiving a training dataset comprising a plurality of training samples; setting current parameter values to initial parameter values;… and updating the current parameter values based on the descent direction; and following the optimization method, storing the current parameter values in relationship to the deep neural network.” These elements are acts of data manipulation that add insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The remaining limitations of the claim, “receiving a training dataset comprising a plurality of training samples; …storing the current parameter values in relationship to the deep neural network.”, constitute storing and retrieving information in memory, which the courts have found to be well-understood, routine, and conventional. See MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). The additional limitations “setting current parameter values to initial parameter values;… and updating the current parameter values based on the descent direction; and following the optimization method,” are insignificant extra-solution activity. As explained by the Supreme Court, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood or conventional. See MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.

Regarding claim 2
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the current parameter values are updated based on the descent direction and a learning rate calculated using the steepest direction of the loss function and the descent direction.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers a step that a human could perform using pen and paper: updating the parameter values based on the descent direction and a learning rate calculated from the steepest direction of the loss function and the descent direction.
Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 3
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the learning rate is calculated using an Amijo line search method.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers an Armijo line search, which a human could perform using pen and paper.
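For illustration only, an Armijo (sufficient-decrease) line search of the general kind named in the claim can be sketched as below. The sketch is not taken from the application. The function name `armijo_step`, the constant `c`, the shrink factor, and the cap on the number of trials are assumptions, and the backtracking strategy shown is just one common way to find a step satisfying the Armijo condition.

```python
from typing import Callable, List

Vector = List[float]


def armijo_step(f: Callable[[Vector], float], x: Vector, g: Vector, d: Vector,
                alpha0: float = 1.0, c: float = 1e-4, shrink: float = 0.5,
                max_trials: int = 30) -> float:
    """Return a step size alpha with f(x + alpha*d) <= f(x) + c*alpha*(g . d).

    g is the gradient at x and d is a descent direction (g . d < 0).
    If no trial step passes, the last (smallest) step tried is returned.
    """
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))  # directional derivative
    alpha = alpha0
    for _ in range(max_trials):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * alpha * slope:    # Armijo condition holds
            return alpha
        alpha *= shrink                           # backtrack
    return alpha
```

For the one-dimensional loss f(x) = x^2 at x = 1 with d equal to the negative gradient, the full step overshoots to x = -1 and is rejected, and the halved step is accepted.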
Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 4
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the learning rate is calculated using a Goldstein line- search method.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers a Goldstein line search, which a human could perform using pen and paper.
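For illustration only, the Goldstein conditions that such a line search tests can be sketched as below. The sketch is not taken from the application. The function name `satisfies_goldstein` and the constant `c` (assumed to lie strictly between 0 and 1/2) are assumptions. The upper bound is the sufficient-decrease test and the lower bound rejects steps that are too short.

```python
from typing import Callable, List

Vector = List[float]


def satisfies_goldstein(f: Callable[[Vector], float], x: Vector, g: Vector,
                        d: Vector, alpha: float, c: float = 0.25) -> bool:
    """Check f(x) + (1-c)*alpha*s <= f(x + alpha*d) <= f(x) + c*alpha*s,

    where s = g . d is the directional derivative (negative for descent).
    """
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))
    f_trial = f([xi + alpha * di for xi, di in zip(x, d)])
    lower = fx + (1.0 - c) * alpha * slope   # guards against tiny steps
    upper = fx + c * alpha * slope           # sufficient decrease
    return lower <= f_trial <= upper
```

A line search built on this check would typically adjust `alpha` up or down until the function returns `True`; that outer loop is omitted here.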
Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 


Regarding claim 5
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the batch of samples comprises a random sampling of the plurality of training samples”
Under its broadest reasonable interpretation in light of the specification, this limitation covers random sampling of the plurality of training samples, which a human could perform in the mind or using pen and paper.

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 6
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers resampling the training samples, which a human could perform by repeating the training steps.
Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 7
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the optimization method is performed using a parallel computing platform and computing operations associated with the optimization method are performed in parallel across a plurality of … included in the parallel …”
This limitation merely restricts the optimization method to being performed on a parallel computing platform and does not change the fact that the underlying manipulations could be performed mentally.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “computing platform… processors”, which amount to data manipulation that adds insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 8
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“…the method comprising: defining a loss function corresponding to the deep neural network; …perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, wherein each iteration comprises: calculating a steepest direction of the loss function by determining the gradient of the loss function at the current parameter values,”
Under its broadest reasonable interpretation in light of the specification, this limitation recites a mathematical concept. See MPEP 2106.04.
“selecting a batch of samples included in the plurality of training samples, apply a matrix-free CG solver to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples,”
Under its broadest reasonable interpretation in light of the specification, this limitation covers steps that a human could perform using pen and paper: selecting a batch of samples and solving a linear system defined by the steepest direction of a loss function and a stochastic Hessian matrix.
“determining a descent direction based on the inexact solution to the linear system and the steepest direction of the loss function,” 
Under its broadest reasonable interpretation in light of the specification, this limitation recites a mathematical concept. See MPEP 2106.04.
“and conditionally updating the current parameter values and the trust region radius based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction;”
Under its broadest reasonable interpretation in light of the specification, this limitation covers a step that a human could perform through observation and evaluation: conditionally updating the parameter values and the trust region radius based on the comparison.
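For illustration only, a conditional update driven by the comparison of true and predicted reduction can be sketched as below. The sketch is not taken from the application. The function name `trust_region_step`, the ratio thresholds (0.1, 0.25, 0.75), and the shrink and grow factors are assumptions drawn from common textbook trust-region schemes, and the predicted reduction is assumed to be supplied by the caller.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def trust_region_step(f: Callable[[Vector], float], x: Vector, p: Vector,
                      predicted_reduction: float, radius: float,
                      accept_ratio: float = 0.1, low: float = 0.25,
                      high: float = 0.75, shrink: float = 0.25,
                      grow: float = 2.0) -> Tuple[Vector, float]:
    """Compare true and predicted reduction, then update x and the radius."""
    trial = [xi + pi for xi, pi in zip(x, p)]
    true_reduction = f(x) - f(trial)
    # Ratio near 1 means the model predicted the loss change well.
    if predicted_reduction > 0.0:
        ratio = true_reduction / predicted_reduction
    else:
        ratio = 0.0
    if ratio < low:
        radius *= shrink       # poor agreement: shrink the region
    elif ratio > high:
        radius *= grow         # good agreement: enlarge the region
    # Accept the trial point only when the true reduction is adequate.
    new_x = trial if ratio > accept_ratio else list(x)
    return new_x, radius
```

In this sketch a rejected step leaves the parameters unchanged and only shrinks the radius, so the next subproblem is solved over a smaller region.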

Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements that are mere instructions to implement an abstract idea on a computer, or that merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). The additional element of “computer-implemented, deep neural network”, as drafted, recites generic computer components. The generic computer components in these steps are recited at a high level of generality (i.e., as a generic computer component performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component. In addition, the claim recites the additional elements “receiving a training dataset comprising a plurality of training samples; setting current parameter values to initial parameter values;… and updating the current parameter values based on the descent direction; and following the optimization method, storing the current parameter values in relationship to the deep neural network.” These elements are acts of data manipulation that add insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The remaining limitations of the claim, “receiving a training dataset comprising a plurality of training samples; …storing the current parameter values in relationship to the deep neural network.”, constitute storing and retrieving information in memory, which the courts have found to be well-understood, routine, and conventional. See MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). The additional limitations “setting current parameter values to initial parameter values;… and updating the current parameter values based on the descent direction; and following the optimization method,” are insignificant extra-solution activity. As explained by the Supreme Court, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood or conventional. See MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.

Regarding claim 9
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the batch of samples comprising a random sampling of the plurality of training samples.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers random sampling of the plurality of training samples, which a human could perform using pen and paper or through observation and evaluation.
Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 10
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers resampling the plurality of training samples, which a human could perform by repeating the training steps.

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 
Regarding claim 11
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the trust region radius corresponds as a spherical area in which the trust region subproblem lies.”
This limitation merely restricts the trust region radius to corresponding to a spherical area and does not change the fact that the underlying manipulations could be performed mentally.

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 12
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
 “wherein the trust region subproblem is a bounded quadratic minimization problem.”
This limitation merely restricts the trust region subproblem to a bounded quadratic minimization problem and does not change the fact that the underlying manipulations could be performed mentally.

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 13
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
 “wherein the current parameter values are updated by: selecting a learning rate for the descent direction; determining a first set of parameters based on the product of the descent direction and the learning rate; determining a momentum descent direction at the first set of parameters;”
Under its broadest reasonable interpretation in light of the specification, this limitation covers steps that a human could perform through observation and evaluation: selecting a learning rate, determining a first set of parameters based on the product of the descent direction and the learning rate, and determining a momentum descent direction at the first set of parameters.
“selecting a momentum rate for the momentum descent direction; and updating the current parameter values based on the first set of parameters and the product of the momentum descent direction and the momentum rate.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers steps that a human could perform through observation and evaluation: selecting a momentum rate and updating the current parameter values based on the first set of parameters and the product of the momentum descent direction and the momentum rate.
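For illustration only, the two-stage update recited in the quoted limitations can be sketched as below. The sketch is not taken from the application. The function name `momentum_update` and the callback `direction_at`, which stands in for however the momentum descent direction is determined at the first set of parameters, are assumptions, as is forming the first set by adding the scaled descent direction to the current parameters. The rates are taken as given rather than selected by any particular rule.

```python
from typing import Callable, List

Vector = List[float]


def momentum_update(x: Vector, d: Vector, learning_rate: float,
                    direction_at: Callable[[Vector], Vector],
                    momentum_rate: float) -> Vector:
    """Step along d, then step along a direction computed at that point."""
    # First set of parameters: current values plus learning_rate * d.
    first = [xi + learning_rate * di for xi, di in zip(x, d)]
    # Momentum descent direction evaluated at the first set of parameters.
    m = direction_at(first)
    # Final update: first set plus momentum_rate * momentum direction.
    return [fi + momentum_rate * mi for fi, mi in zip(first, m)]
```

With the loss f(x) = x^2 and `direction_at` returning the negative gradient, starting from x = 1 with d = -2 and both rates equal to 0.25 moves the parameter first to 0.5 and then to 0.25.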

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined if the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 14
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites a mental process, as explained below.  The claim recites, inter alia:
“wherein the learning rate is determined using a backtracking line search based on the loss function, the current parameter values, and the descent direction.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers a step that a human could perform using observation and pen and paper: identifying a learning rate using a backtracking line search based on the loss function. 
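For illustration only, a backtracking line search of the kind recited above can be sketched in Python as follows. This is a minimal sketch assuming an Armijo-style sufficient-decrease test; the function name, the constants `shrink` and `c`, and the toy loss are hypothetical choices and are not taken from the application or the cited references.

```python
import numpy as np

def backtracking_line_search(loss, params, grad, direction,
                             alpha0=1.0, shrink=0.5, c=1e-4, max_steps=50):
    """Shrink the step size until the loss decreases sufficiently (assumed Armijo test)."""
    f0 = loss(params)
    slope = float(np.dot(grad, direction))  # directional derivative; negative for descent
    alpha = alpha0
    for _ in range(max_steps):
        # Sufficient-decrease condition based on the loss, the parameters, and the direction.
        if loss(params + alpha * direction) <= f0 + c * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha

# Toy example (assumption): loss x^T x, with gradient 2x.
loss = lambda x: float(np.dot(x, x))
x = np.array([1.0, 1.0])
g = 2.0 * x
rate = backtracking_line_search(loss, x, g, -g)
```

In this toy case the initial trial step overshoots and is halved once before the test is satisfied.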

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined whether the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 15
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
 “wherein the momentum rate is determined using a backtracking line search based on the loss function, the first set of parameters, and the momentum descent direction.”
Under its broadest reasonable interpretation in light of the specification, this limitation covers a step that a human could perform using observation and pen and paper: identifying a momentum rate using a backtracking line search based on the loss function. 

Step 2A Prong 2: The claim does not appear to recite additional elements that might integrate the judicial exception into a practical application.
	Based on the determination in Step 2A of the analysis that the claims are directed to a judicial exception, it must be determined whether the claim contains any element or combination of elements sufficient to ensure that the claim amounts to significantly more than the judicial exception. In this case, after considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 

Regarding claim 16
Step 1:  The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1:  The claim recites multiple mental processes, as explained below.  The claim recites, inter alia:
“wherein the optimization method is performed using a parallel computing platform and computing operations associated with the optimization method are performed in parallel across a plurality of … included in the parallel …”
This limitation merely restricts the optimization method to being performed using a parallel computing platform and does not change the fact that the underlying manipulations could be performed mentally.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements “computing platform… processors”, which amount to data manipulation that adds insignificant extra-solution activity to the judicial exception; see MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Thus, the claim is not patent eligible. 



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-3 and 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (US 2015/0161987 A1) in view of Byrd et al. (“On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning”, hereinafter: Byrd).

Regarding claim 1
Horesh teaches a computer-implemented method for training a deep neural network, (para [0023] “Exemplary embodiments of the invention will now be discussed in further detail with regard to systems and methods for training a deep neural network and, in particular, systems and methods for accelerating Hessian-free optimization of deep neural networks using implicit preconditioning and sampling.”)
the method comprising: defining a loss function corresponding to the deep neural network; (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function, Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation).”)
receiving a training dataset comprising a plurality of training samples; (para [0074] “A geometrically increasing sample size can be used, which is adopted, in an embodiment of the present invention, for the gradient and CG iteration samples in each iteration.” Also see para [0078] “The speech data is collected through the speech data collector 403, which may be a storage repository for the speech being processed by the system 400. The speech data collector 403 sends the speech data to an input/formatting component 412 of a training component 410.”)
setting current parameter values to initial parameter values; (FIG. 1 algorithm 1 shows initializing theta parameter values)
perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function,  Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation)… and to minimize this approximation using Krylov subspace methods, such as, for example, conjugated gradient (CG), which access the curvature matrix implicitly through matrix-vector products”)
wherein each iteration comprises: calculating a steepest direction of the loss function by determining the gradient of the loss function at the current parameter values, (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function,  Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation)… and to minimize this approximation using Krylov subspace methods, such as, for example, conjugated gradient (CG), which access the curvature matrix implicitly through matrix-vector products”)
selecting a batch of samples included in the plurality of training samples, (para [0110] “In accordance with an embodiment of the present invention, computer processor based fast training of a DNN without significant accuracy degradation may comprise the steps of a) pre-training initial weights to make initial weights closer to optimal weights, b) selecting an initial batch of training data having an initial batch size for training, c) performing training on the initial batch of training data in parallel fashion, d) increasing sample size for a subsequent batch of training data,”)
apply a matrix-free CG solver… (para [0040] “Embodiments of the present invention use the quasi-Newton L-BFGS method as a preconditioner to a CG solver. While both quasi-Newton approaches and CG exploit an underlying structure of the linear(ized) system, the postulated structural assumptions of a low rank approximation and CG are complementary. Therefore, a combination of a quasi-Newton method as a preconditioner to a CG solver is more effective than dependence upon each one solely.”)
…
and updating the current parameter values based on the descent direction; (para [0048] “Referring to FIG. 1, the implementation of HF optimization, in accordance with an embodiment of the present invention, is illustrated as pseudo-code in Algorithm 1 (100). Gradients are computed over all the training data. Gauss Newton matrix-vector products are computed over a sample (for example, about 1% of the training data) that is taken each time CG-Minimize is called.”)
and following the optimization method, storing the current parameter values in relationship to the deep neural network. (Para [0061] “Using these statistics, the vectors… are stored for m iterations of CG and/or PCG, where m is specified by the user. Once m statistics are saved, an L-BFGS matrix H can be defined using the steps in Algorithm 2.”)
Horesh does not explicitly teach …to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples,
determining a descent direction based on the inexact solution to the linear system and the steepest direction of the loss function.
Byrd teaches apply a matrix-free CG solver to obtain an inexact solution (abstract “We follow a batch approach, also known in the stochastic optimization literature as a sample average approximation (SAA) approach. Curvature information is incorporated in two sub-sampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration”) to a linear system defined by the steepest direction of the loss function (pg. 3 second paragraph “In the Newton-CG method, we incorporate sampled (or stochastic) curvature information through a matrix-free conjugate gradient (CG) iteration applied to the Newton equations. We implement this idea by using a subsample that is much smaller than that used for the evaluation of the objective function JX and its gradient ∇JX , in the computation of the Hessian-vector products required by the CG iteration. By coordinating the size of the subsample and the number of CG iterations, the computational cost of this Newton-like iteration is comparable to the cost of a steepest descent step – but the resulting iteration is much more rapidly convergent.”)
and a stochastic Hessian matrix with respect to the batch of samples, (pg. 3 third paragraph “we incorporate stochastic Hessian information through the so-called “initial matrix” employed in limited memory BFGS updating. In the standard L-BFGS method [6], this initial matrix is chosen at every iteration as a multiple of the identity matrix. In the proposed algorithm, the initial matrix is defined implicitly via a conjugate gradient solve of a linear system whose coefficient matrix is given by the stochastic Hessian. We call this technique the stochastically initialized L-BFGS method, and similarly to the approach described above, it is crucial that the stochastic curvature information provided to the algorithm uses a much smaller sample than that used for the evaluation of the objective function and its gradient.”)
determining a descent direction based on the inexact solution to the linear system and the steepest direction of the loss function. (pg. 4 section 2 “This method is quite flexible: by controlling the number of CG iterations, it can resemble the steepest descent method, at one extreme, or the classical (exact) Newton method at the other extreme.” And see pg. 3 second paragraph “We implement this idea by using a subsample that is much smaller than that used for the evaluation of the objective function JX and its gradient ∇JX , in the computation of the Hessian-vector products required by the CG iteration. By coordinating the size of the subsample and the number of CG iterations, the computational cost of this Newton-like iteration is comparable to the cost of a steepest descent step – but the resulting iteration is much more rapidly convergent.”)
Horesh and Byrd are analogous art because they are both directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh with the teaching of applying a matrix-free CG solver to obtain an inexact solution to a linear system defined by the steepest direction of the loss function of Byrd, because a “matrix-free conjugate gradient iteration is an effective way of accelerating optimization methods for machine learning” as disclosed by Byrd (pg. 19 section 5 “We have proposed in this paper that Hessian sub-sampling via a matrix-free conjugate gradient iteration is an effective way of accelerating optimization methods for machine learning. Our method avoids sampling the second derivatives directly since this can lead to very noisy estimators, see [15]. We described two methods that can benefit from this approach, one is a variant of Newton-CG and the other of L-BFGS.”).
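For illustration only, the combination discussed above (a matrix-free CG solve of the Newton equations using a Hessian estimated on a batch of samples) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the least-squares loss, the synthetic data, the batch size, and all function and variable names are hypothetical and are not taken from Horesh, Byrd, or the application.

```python
import numpy as np

def cg_solve(hvp, b, max_iters=10, tol=1e-8):
    """Approximately solve H d = b using only Hessian-vector products hvp(v)."""
    d = np.zeros_like(b)
    r = b.copy()              # residual b - H d for the starting point d = 0
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iters):
        if np.sqrt(rs) < tol:
            break
        Hp = hvp(p)
        curvature = float(p @ Hp)
        if curvature <= 0.0:  # stop if the sampled Hessian is not positive along p
            break
        alpha = rs / curvature
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Assumed toy problem: least-squares loss 0.5 * mean((X w - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)

# Gradient over all training samples (the negative gradient is the steepest direction).
grad = X.T @ (X @ w - y) / len(X)

# Stochastic Hessian on a batch, accessed only through matrix-vector products.
batch = rng.choice(len(X), size=50, replace=False)
Xb = X[batch]
hvp = lambda v: Xb.T @ (Xb @ v) / len(Xb)

# Inexact solution of H d = -grad after a few CG iterations gives the descent direction.
direction = cg_solve(hvp, -grad, max_iters=3)
```

Capping `max_iters` below the problem dimension is what makes the solution inexact; the Hessian matrix itself is never formed.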

Regarding claim 2
Horesh in view of Byrd teaches the method of claim 1.
Horesh further teaches wherein the current parameter values are updated based on the descent direction and a learning rate calculated using the steepest direction of the loss function (para [0027] “A Newton optimization method is an example of second order optimization. Second order methods typically converge much faster to a local minimum than their super-linear, and linear (first order) counterparts. A first order optimization method may include, for example, steepest descent; a super-linear method may include, for example, a quasi-Newton method.”)
and the descent direction. (Para [0068] “Denoting the gradients of the full and subset losses as ∇J(w) and ∇JS(w) respectively, the algorithm ensures that descent made in JS at every iteration must admit a descent direction for the true objective function J. The magnitude (2-norm) of the difference between the sample gradient and the actual gradient is expressed by Equation 4.”)

Regarding claim 3
Horesh in view of Byrd teaches the method of claim 2.
Horesh further teaches wherein the learning rate is calculated using an Amijo line search method. (Para [0048] “The loss, C (θ), is computed over a held-out set. CG-Minimize(qθ(d), d0) uses CG to minimize qθ(d), starting with search direction d0. This function returns a series of steps {d1, d2, ..., dN} that are then used in a line search procedure. The parameter update, θ ← θ + αdi, is based on an Armijo rule backtracking line search. Distributed computation to compute gradients and curvature matrix-vector products is done using a master/worker architecture.”)

Regarding claim 5
Horesh in view of Byrd teaches the method of claim 1.
Horesh further teaches wherein the batch of samples comprises a random sampling of the plurality of training samples. (Para [0004] “Because sequence training uses information from time-sequential lattices corresponding to utterances, sequence training is performed using utterance randomization rather than frame randomization. For mini-batch stochastic gradient descent (SGD), which is often used for CE training, frame randomization in some cases, has been shown to perform better than utterance randomization.”)

Regarding claim 6
Horesh in view of Byrd teaches the method of claim 5.
Byrd further teaches wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations. (Pg. 2 third paragraph “In order to reduce the computational cost of the optimization, and given that the training set is often highly redundant, it is common to consider only a random sample of the training points, i.e., to include only a subset of the summation terms in (1.1) in the optimization process, thus following a sample average approximation (SAA) framework. If we define D = {1, 2, · · · , m} and let X ⊆ D be a random sample consisting of |X | training instances (yi , xi)i∈X , we can define a stochastic approximation of the true objective”) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh with the teaching of applying a matrix-free CG solver to obtain an inexact solution to a linear system defined by the steepest direction of the loss function of Byrd, because a “matrix-free conjugate gradient iteration is an effective way of accelerating optimization methods for machine learning” as disclosed by Byrd (pg. 19 section 5 “We have proposed in this paper that Hessian sub-sampling via a matrix-free conjugate gradient iteration is an effective way of accelerating optimization methods for machine learning. Our method avoids sampling the second derivatives directly since this can lead to very noisy estimators, see [15]. We described two methods that can benefit from this approach, one is a variant of Newton-CG and the other of L-BFGS.”).

Regarding claim 7
Horesh in view of Byrd teaches the method of claim 1.
Horesh further teaches wherein the optimization method is performed using a parallel computing platform (para [0005] “HF optimization techniques for sequence training can be slow, requiring, for example, about 3 weeks for training a 300-hour Switchboard task using 64 parallel machines. There are at least two reasons why training is slow.”)
and computing operations associated with the optimization method are performed in parallel across a plurality of processors included in the parallel computing platform. (Para [0108] “More parallel machines (e.g., 64) were used for SWB compared to BN. As a result, it was not possible to exclusively reserve machines for timing calculations. Therefore, training time is estimated by calculating a total number of accessed data points for training, which is correlated to timing. Table 3 shows the total accessed data points for the baseline and speedup techniques.”)

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (US 2015/0161987 A1) in view of Byrd et al. (“On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning”, hereinafter: Byrd) and further in view of Bortoletti et al. (“A New Class of Quasi-Newtonian Methods for Optimal Learning in MLP-Networks”, hereinafter: Bortoletti).
Regarding claim 4
Horesh in view of Byrd teaches the method of claim 2.
Horesh in view of Byrd does not teach wherein the learning rate is calculated using a Goldstein line-search method.  
Bortoletti teaches wherein the learning rate is calculated using a Goldstein line-search method. (Pg. 269 “We have implemented -BFGS, with the Armijo–Goldstein line search technique, for several values of. Recall that the time and space complexity of -BFGS depends on, i.e., the number of vector pairs”)
Horesh, Byrd and Bortoletti are analogous art because they are all directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh in view of Byrd with calculating the learning rate using a Goldstein line-search method of Bortoletti in order to provide “quasi-Newton methods for the effective learning in large multilayer perceptron (MLP)-networks” as disclosed by Bortoletti (abstract “In this paper, we present a new class of quasi-Newton methods for the effective learning in large multilayer perceptron (MLP)-networks. The algorithms introduced in this work, named QN, utilize an iterative scheme of a generalized BFGS-type method, involving a suitable family of matrix algebras . The main advantages of these innovative methods are based upon the fact that they have an ( log ) complexity per step and that they require ( ) memory allocations.”)


Claim(s) 8-10, 12 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (US 2015/0161987 A1) in view of Xu et al. (“Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study”, hereinafter: Xu) and further in view of Chen et al. (“Stochastic optimization using a trust-region method and random models”, hereinafter: Chen).
Regarding claim 8
Horesh teaches a computer-implemented method for training a deep neural network, (para [0023] “Exemplary embodiments of the invention will now be discussed in further detail with regard to systems and methods for training a deep neural network and, in particular, systems and methods for accelerating Hessian-free optimization of deep neural networks using implicit preconditioning and sampling.”)
the method comprising: defining a loss function corresponding to the deep neural network; (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function, Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation).”)
receiving a training dataset comprising a plurality of training samples; (para [0074] “A geometrically increasing sample size can be used, which is adopted, in an embodiment of the present invention, for the gradient and CG iteration samples in each iteration.” Also see para [0078] “The speech data is collected through the speech data collector 403, which may be a storage repository for the speech being processed by the system 400. The speech data collector 403 sends the speech data to an input/formatting component 412 of a training component 410.”)
setting current parameter values to initial parameter values; (FIG. 1 algorithm 1 shows initializing theta parameter values)
using a computing platform to perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function,  Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation).”)
wherein each iteration comprises: calculating a gradient for the loss function at the current parameter values; (para [0047] “Let θ denote the network parameters, C (θ) denote a loss function, Δ (θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a matrix characterizing the curvature of the loss around θ (i.e., a Hessian approximation).”)
selecting a batch of samples included in the plurality of training samples, (para [0110] “In accordance with an embodiment of the present invention, computer processor based fast training of a DNN without significant accuracy degradation may comprise the steps of a) pre-training initial weights to make initial weights closer to optimal weights, b) selecting an initial batch of training data having an initial batch size for training, c) performing training on the initial batch of training data in parallel fashion, d) increasing sample size for a subsequent batch of training data,”)
…
and following the optimization method, storing the current parameter values in relationship to the deep neural network. (Para [0061] “Using these statistics, the vectors… are stored for m iterations of CG and/or PCG, where m is specified by the user. Once m statistics are saved, an L-BFGS matrix H can be defined using the steps in Algorithm 2.”)
Horesh does not teach constructing a trust region subproblem that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples, 
determining a descent direction by applying a SteihaugCG solver to the trust region subproblem given a trust region radius, 
and conditionally updating the current parameter values and the trust region radius based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction.
Xu teaches constructing a trust region subproblem that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples, (pg. 3 “In particular, in the context of image classification (Section 3.2.1) and deep auto-encoder (Section 3.2.2), we study the efficiency of sub-sampled TR method, which incorporates inexactness in both Hessian and the sub-problem solver, as compared with hand-tuned SGD with momentum.”  Section 3.2.2 “we consider the deep auto-encoder problem [13] and use the same model architectures as well as loss functions as in Martens [24]. The dataset and network architectures are given in Table 3. The experiments in Figures 2 and 3 are each done with initialization to a vector drawn from standard normal distribution as well as the all-zeros vector.”.  Examiner notes that FIG.2 shows constructing trust region)
determining a descent direction by applying a SteihaugCG solver to the trust region subproblem given a trust region radius, (pg. 12 Re Q.5 “In particular, CG-Steihaug used for the sub-problem (1a) of Algorithm 1 typically terminates in a handful of iterations whereas the generalized Lanczos method for solving the sub-problem (1b) of Algorithm 2 usually exhausts the allotted 250 iterations.”)
Horesh and Xu are analogous art because they are both directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh with constructing a trust region subproblem that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function of Xu in order to provide better performance with robust hyperparameter settings which “allows them to seamlessly escape flat regions and saddle point” as disclosed by Xu (abstract “In doing so, we demonstrate that these methods not only can be computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but also they are highly robust to hyper-parameter settings. Further, in contrast to SGD with momentum, we show that the manner in which these Newton-type methods employ curvature information allows them to seamlessly escape flat regions and saddle point”).
Horesh in view of Xu does not teach and conditionally updating the current parameter values and the trust region radius based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction.
Chen teaches and conditionally updating the current parameter values and the trust region radius based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, (Algorithm 1, step 6 shows “(Trust-region radius update):”, which compares whether ρk ≥ η1, and the trust-region model uses an objective function f (x), as evidenced by the abstract “Our framework utilizes random models of an objective function f (x), obtained from stochastic observations of the function or its gradient. Our method also utilizes estimates of function values to gauge progress that is being made. The convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with high enough, but fixed, probability.”)
and (ii) a predicted reduction value provided by the descent direction. (pg. 476 “In this case, for each value of x, the classifier is obtained by solving an optimization problem, up to a certain accuracy, on a given training set. Then the classifier is evaluated on the testing set. If a randomized coordinate descent or a stochastic gradient descent is applied to train the classifier for a given vector x, then the resulting classifier is sufficiently close to the optimal classifier with some known probability.”)
 Horesh, Xu and Chen are analogous art because they are all directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh in view of Xu with updating the current parameter values and the trust region radius of Chen in order to solve unconstrained stochastic optimization problems effectively as disclosed by Chen (abstract “Our framework utilizes random models of an objective function f (x), obtained from stochastic observations of the function or its gradient. Our method also utilizes estimates of function values to gauge progress that is being made. The convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with high enough, but fixed, probability.”).
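For illustration only, the trust-region iteration discussed above (a Steihaug CG solve of the subproblem, followed by a conditional update that compares the true reduction with the predicted reduction) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the acceptance threshold `eta`, the doubling and halving of the radius, the toy quadratic loss, and all function names are hypothetical and are not taken from Horesh, Xu, Chen, or the application.

```python
import numpy as np

def _to_boundary(d, p, radius):
    # Positive root tau of ||d + tau * p|| = radius.
    a, b, c = float(p @ p), 2.0 * float(d @ p), float(d @ d) - radius ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return d + tau * p

def steihaug_cg(hvp, grad, radius, max_iters=20, tol=1e-8):
    """Approximately minimize grad^T d + 0.5 d^T H d subject to ||d|| <= radius."""
    d = np.zeros_like(grad)
    r = -grad.copy()
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iters):
        if np.sqrt(rs) < tol:
            break
        Hp = hvp(p)
        curvature = float(p @ Hp)
        if curvature <= 0.0:                      # negative curvature: go to the boundary
            return _to_boundary(d, p, radius)
        alpha = rs / curvature
        if np.linalg.norm(d + alpha * p) >= radius:  # step would leave the trust region
            return _to_boundary(d, p, radius)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

def trust_region_step(loss, params, grad, hvp, radius, eta=0.1):
    d = steihaug_cg(hvp, grad, radius)
    predicted = -(float(grad @ d) + 0.5 * float(d @ hvp(d)))  # reduction promised by the model
    actual = loss(params) - loss(params + d)                  # true reduction of the loss
    rho = actual / predicted if predicted > 0.0 else -1.0
    if rho >= eta:
        return params + d, 2.0 * radius   # accept the step and enlarge the radius (assumed rule)
    return params, 0.5 * radius           # reject the step and shrink the radius (assumed rule)

# Assumed toy problem: loss 0.5 * x^T A x with A = diag(1, 4).
A = np.diag([1.0, 4.0])
loss = lambda x: 0.5 * float(x @ A @ x)
params = np.array([2.0, 1.0])
new_params, new_radius = trust_region_step(
    loss, params, grad=A @ params, hvp=lambda v: A @ v, radius=0.5)
```

In this toy case the first CG step would leave the trust region, so the solver returns a step on the boundary; because the loss is exactly quadratic, the predicted and true reductions agree and the step is accepted.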

Regarding claim 9
Horesh in view of Xu with Chen teaches the method of claim 8.
Horesh further teaches wherein the batch of samples comprising a random sampling of the plurality of training samples. (Para [0004] “Because sequence training uses information from time-sequential lattices corresponding to utterances, sequence training is performed using utterance randomization rather than frame randomization. For mini-batch stochastic gradient descent (SGD), which is often used for CE training, frame randomization in some cases, has been shown to perform better than utterance randomization.”)

Regarding claim 10
Horesh in view of Xu with Chen teaches the method of claim 9.
Horesh further teaches wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations. (Para [0042] “With respect to the other regime, sample approximation techniques compute the gradient on a large sample of data. While this computation can be expensive, the gradient estimates are more reliable than stochastic approximation methods, and the objective function progresses relatively well during later training iterations. Embodiments of the present invention use a hybrid method that captures the benefits of both stochastic and sample approximation methods, by increasing the amount of sampled data used for gradient and CG calculations.” Also see para [0050] “A possible issue with this HF technique is that CG algorithms used to obtain an approximate solution to the Hessian require many iterations. FIG. 2 indicates that as HF training iterations increase, training time per iteration is dominated by CG iterations. FIG. 2 is a graph 200 plotting time (minutes) versus number of iterations.”)
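For illustration only, drawing a fresh random batch of training-sample indices at every iteration, as recited in claims 9 and 10, might look like the sketch below. The generator name `resampled_batches` and its parameters are hypothetical; the sketch is not taken from Horesh or from the application.

```python
import random


def resampled_batches(num_samples, batch_size, num_iterations, seed=0):
    """Yield one freshly drawn batch of sample indices per iteration.

    All names and the fixed seed are assumptions made for this example.
    """
    rng = random.Random(seed)
    for _ in range(num_iterations):
        # Sampling without replacement from the full training set,
        # repeated independently at each iteration.
        yield rng.sample(range(num_samples), batch_size)


batches = list(resampled_batches(num_samples=100, batch_size=4, num_iterations=3))
```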

Regarding claim 12
Horesh in view of Xu with Chen teaches the method of claim 8.
Chen further teaches wherein the trust region subproblem is a bounded quadratic minimization problem. (pg. 454 “We consider quadratic models for simplicity of the presentation and because they are the most common. The model mk (x) is minimized (approximately) in B(xk , δk ) to produce a step sk and (random) estimates of f (xk ) and f (xk+sk ) are obtained” and also see pg. 481 “We will illustrate this idea through a simple example. Consider the minimization of the simple quadratic function… The minimizer uniquely occurs at the vector of all 1s. Now consider the minimization of this function under our setting of computation failure where we vary the probability parameter”)
Horesh, Xu and Chen are analogous art because they are all directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh in view of Xu with the updating of the current parameter values and the trust region radius of Chen in order to solve unconstrained stochastic optimization problems effectively, as disclosed by Chen (abstract “Our framework utilizes random models of an objective function f (x), obtained from stochastic observations of the function or its gradient. Our method also utilizes estimates of function values to gauge progress that is being made. The convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with high enough, but fixed, probability.”).
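For illustration only, a bounded quadratic minimization of the kind recited in claim 12 (minimize g·p + ½ p·Hp subject to ‖p‖ ≤ radius) can be approximately solved with a textbook Steihaug-style conjugate gradient iteration that needs only Hessian-vector products. This is a minimal sketch under stated assumptions: the helper names (`dot`, `axpy`, `to_boundary`, `steihaug_cg`), the tolerance, and the example matrix are hypothetical and are not drawn from Chen, Xu, Horesh, or the application.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def to_boundary(p, d, radius):
    """Return p + tau * d, tau >= 0, so that the result has norm == radius."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - radius ** 2
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return axpy(tau, d, p)


def steihaug_cg(hess_vec, grad, radius, tol=1e-10, max_iter=50):
    """Approximately minimize g.p + 0.5 p.H.p subject to ||p|| <= radius."""
    p = [0.0] * len(grad)
    r = list(grad)                      # residual H p + g at p = 0
    if math.sqrt(dot(r, r)) < tol:
        return p
    d = [-ri for ri in r]
    for _ in range(max_iter):
        Hd = hess_vec(d)                # matrix-free: only H times a vector
        curvature = dot(d, Hd)
        if curvature <= 0.0:            # negative curvature: go to boundary
            return to_boundary(p, d, radius)
        alpha = dot(r, r) / curvature
        p_next = axpy(alpha, d, p)
        if math.sqrt(dot(p_next, p_next)) >= radius:
            return to_boundary(p, d, radius)   # step leaves the region
        r_next = axpy(alpha, Hd, r)
        if math.sqrt(dot(r_next, r_next)) < tol:
            return p_next
        beta = dot(r_next, r_next) / dot(r, r)
        d = axpy(beta, d, [-ri for ri in r_next])
        p, r = p_next, r_next
    return p


# Example with H = diag(2, 4) and g = (-2, -4); the unconstrained
# minimizer is (1, 1), whose norm is about 1.414.
hess_vec = lambda v: [2.0 * v[0], 4.0 * v[1]]
interior = steihaug_cg(hess_vec, [-2.0, -4.0], radius=10.0)
clipped = steihaug_cg(hess_vec, [-2.0, -4.0], radius=0.5)
```

With the large radius the iteration returns the interior minimizer; with the small radius the step is stopped on the boundary of the ball, which is the sense in which the quadratic problem is bounded.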

Regarding claim 16
Horesh in view of Xu with Chen teaches the method of claim 8.
Horesh further teaches wherein optimization method is performed using a parallel computing platform (para [0005] “HF optimization techniques for sequence training can be slow, requiring, for example, about 3 weeks for training a 300-hour Switchboard task using 64 parallel machines. There are at least two reasons why training is slow.”)
and computing operations associated with the optimization method are performed in parallel across a plurality of processors included in the parallel computing platform. (Para [0108] “More parallel machines (e.g., 64) were used for SWB compared to BN. As a result, it was not possible to exclusively reserve machines for timing calculations. Therefore, training time is estimated by calculating a total number of accessed data points for training, which is correlated to timing. Table 3 shows the total accessed data points for the baseline and speedup techniques.”)

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (US 2015/0161987 A1) in view of Xu et al. (“Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study”, hereinafter: Xu) in view of Chen et al. (“Stochastic optimization using a trust-region method and random models”) and further in view of Sun et al. (“Complete Dictionary Recovery Using Nonconvex Optimization”).
Regarding claim 11
Horesh in view of Xu with Chen teaches the method of claim 8.
Horesh in view of Xu with Chen does not teach wherein the trust region radius corresponds as a spherical area in which the trust region subproblem lies.  
Sun teaches wherein the trust region radius corresponds as a spherical area in which the trust region subproblem lies. (Abstract “Our algorithm is based on nonconvex optimization with a spherical constraint, and hence is naturally phrased in the language of manifold optimization. Our proofs give a geometric characterization of the high-dimensional objective landscape, which shows that with high probability there are no spurious local minima.”)
Horesh, Xu, Chen and Sun are analogous art because they are all directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh in view of Xu with Chen with the trust region radius as a spherical area of Sun to provide “nonconvex optimization with a spherical constraint” in order to solve the problem of recovering a complete square and invertible dictionary, as disclosed by Sun (abstract “We give the first efficient algorithm that provably recovers A0 when X0 has O(n) nonzeros per column, under suitable probability model for X0. Prior results provide recovery guarantees when X0 has only O(√n) nonzeros per column. Our algorithm is based on nonconvex optimization with a spherical constraint, and hence is naturally phrased in the language of manifold optimization. Our proofs give a geometric characterization of the high-dimensional objective landscape, which shows that with high probability there are no spurious local minima.”).

Claims 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (US 2015/0161987 A1) in view of Xu et al. (“Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study”, hereinafter: Xu) in view of Chen et al. (“Stochastic optimization using a trust-region method and random models”) and further in view of Zeiler et al. (“ADADELTA: AN ADAPTIVE LEARNING RATE METHOD”).
Regarding claim 13
Horesh in view of Xu with Chen teaches the method of claim 8.
Xu further teaches wherein the current parameter values are updated by: selecting a learning rate for the descent direction; (pg. 4 “The main hyper-parameter is understood as the one for which, in practice, there is no “typical” value. For SGD with momentum, the learning rate is considered as the main parameter, since the momentum parameter is typically set to ≈ 0.9. For trust region, the main hyper-parameter is the initial trust region, as there are typical values for other parameters of the algorithm.”)
determining a first set of parameters based on the product of the descent direction and the learning rate; (pg. 12 “The benefits of non-uniform sampling over uniform alternative are far more pronounced in the performance of Algorithm 2 than Algorithm 1. This can be attributed mainly to their respective sub-problem solvers in terms of total number of performed Hessian-vector products. In particular, CG-Steihaug used for the sub-problem (1a) of Algorithm 1 typically terminates in a handful of iterations whereas the generalized Lanczos method for solving the sub-problem (1b) of Algorithm 2 usually exhausts the allotted 250 iterations.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh with constructing a trust region subproblem that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function of Xu in order to provide better performance with robust hyperparameter settings, which “allows them to seamlessly escape flat regions and saddle point”, as disclosed by Xu (abstract “In doing so, we demonstrate that these methods not only can be computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but also they are highly robust to hyper-parameter settings. Further, in contrast to SGD with momentum, we show that the manner in which these Newton-type methods employ curvature information allows them to seamlessly escape flat regions and saddle point”).
Horesh in view of Xu with Chen does not teach determining a momentum descent direction at the first set of parameters; selecting a momentum rate for the momentum descent direction; and updating the current parameter values based on the first set of parameters and the product of the momentum descent direction and the momentum rate.
Zeiler teaches determining a momentum descent direction at the first set of parameters; (pg. 2 section 2.2.1 “The main idea behind momentum is to accelerate progress along dimensions in which gradient consistently point in the same direction and to slow progress along dimensions where the sign of the gradient continues to change… The gradients along the valley, despite being much smaller than the gradients across the valley, are typically in the same direction and thus the momentum term accumulates to speed up progress. In SGD the progress along the valley would be slow since the gradient magnitude is small and the fixed global learning rate shared by all dimensions cannot speed up progress. Choosing a higher learning rate for SGD may help but the dimension across the valley would then also make larger parameter updates which could lead to oscillations back as forth across the valley. These oscillations are mitigated when using momentum because the sign of the gradient changes and thus the momentum term damps down these updates to slow progress across the valley”)
selecting a momentum rate for the momentum descent direction; (pg. 2 “In SGD the progress along the valley would be slow since the gradient magnitude is small and the fixed global learning rate shared by all dimensions cannot speed up progress. Choosing a higher learning rate for SGD may help but the dimension across the valley would then also make larger parameter updates which could lead to oscillations back as forth across the valley. These oscillations are mitigated when using momentum because the sign of the gradient changes and thus the momentum term damps down these updates to slow progress across the valley”)
and updating the current parameter values based on the first set of parameters and the product of the momentum descent direction and the momentum rate. (Pg. 2 section 2.2.1 “The main idea behind momentum is to accelerate progress along dimensions in which gradient consistently point in the same direction and to slow progress along dimensions where the sign of the gradient continues to change. This is done by keeping track of past parameter updates with an exponential decay… where ρ is a constant controlling the decay of the previous parameter updates. This gives a nice intuitive improvement over SGD when optimizing difficult cost surfaces such as a long narrow valley. The gradients along the valley, despite being much smaller than the gradients across the valley, are typically in the same direction and thus the momentum term accumulates to speed up progress.”) 
Horesh, Xu, Chen and Zeiler are analogous art because they are all directed to machine learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the method for accelerating Hessian-free optimization for deep neural networks of Horesh in view of Xu with Chen to incorporate the teaching of Zeiler in order to provide a method or system that “dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent”, as disclosed by Zeiler (abstract “We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters”).
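For illustration only, the momentum idea quoted from Zeiler section 2.2.1 (tracking past parameter updates with an exponential decay controlled by a constant ρ) can be sketched in its standard classical-momentum form. The exact update formula is elided in the quotation above, so the form used here, the function name `momentum_step`, and the values of `lr` and `rho` are assumptions made for the example rather than a restatement of the reference or of claim 13.

```python
def momentum_step(x, grad, prev_update, lr=0.1, rho=0.9):
    """One classical-momentum update (assumed standard form).

    update_t = rho * update_{t-1} - lr * grad_t
    x_{t+1}  = x_t + update_t
    lr (learning rate) and rho (decay of past updates) are illustrative.
    """
    update = [rho * u - lr * g for u, g in zip(prev_update, grad)]
    return [xi + ui for xi, ui in zip(x, update)], update


# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x, update = [1.0], [0.0]
for _ in range(200):
    x, update = momentum_step(x, [2.0 * x[0]], update)
```

Because past updates are accumulated, consecutive gradients pointing the same way build up speed, while sign changes in the gradient partially cancel in the accumulated term.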

Regarding claim 14
Horesh in view of Xu with Chen and Zeiler teaches the method of claim 13.
Horesh further teaches wherein the learning rate is determined using a backtracking line search based on the loss function, (Para [0068] “Denoting the gradients of the full and subset losses as ∇J(w) and ∇J_S(w) respectively, the algorithm ensures that descent made in J_S at every iteration must admit a descent direction for the true objective function J. The magnitude (2-norm) of the difference between the sample gradient and the actual gradient is expressed by Equation 4.”)
the current parameter values, and the descent direction. (Para [0048] “The loss, L(θ), is computed over a held-out set. CG-Minimize(q_θ(d), d_0) uses CG to minimize q_θ(d), starting with search direction d_0. This function returns a series of steps {d_1, d_2, ..., d_N} that are then used in a line search procedure. The parameter update, θ ← θ + αd_i, is based on an Armijo rule backtracking line search. Distributed computation to compute gradients and curvature matrix-vector products is done using a master/worker architecture.”)
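For illustration only, an Armijo rule backtracking line search of the kind named in the quoted paragraph can be sketched as below. This is a generic textbook version that takes the loss function, the current parameter values, and a descent direction as inputs; the function name `armijo_backtracking` and the constants `c`, `shrink` and `max_steps` are hypothetical and are not taken from Horesh.

```python
def armijo_backtracking(loss, x, grad, direction,
                        alpha=1.0, c=1e-4, shrink=0.5, max_steps=50):
    """Return a step size satisfying the Armijo sufficient-decrease rule.

    c, shrink and max_steps are illustrative constants (assumptions).
    """
    f0 = loss(x)
    # Directional derivative; negative when `direction` is a descent direction.
    slope = sum(g * d for g, d in zip(grad, direction))
    for _ in range(max_steps):
        candidate = [xi + alpha * di for xi, di in zip(x, direction)]
        if loss(candidate) <= f0 + c * alpha * slope:
            return alpha                # sufficient decrease achieved
        alpha *= shrink                 # otherwise backtrack
    return 0.0                          # no acceptable step found


# Example: f(x) = x^2 at x = 1 (gradient 2) with an overly long
# descent direction of -4, so the full step overshoots the minimum.
loss = lambda x: x[0] ** 2
step_size = armijo_backtracking(loss, [1.0], [2.0], [-4.0])
```

In the example the trial steps 1.0 and 0.5 fail the sufficient-decrease test and 0.25 is accepted, landing exactly on the minimizer.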

Regarding claim 15
Horesh in view of Xu with Chen and Zeiler teaches the method of claim 13.
Horesh further teaches wherein the momentum rate is determined using a backtracking line search based on the loss function, the first set of parameters, (Para [0048] “The loss, L(θ), is computed over a held-out set. CG-Minimize(q_θ(d), d_0) uses CG to minimize q_θ(d), starting with search direction d_0. This function returns a series of steps {d_1, d_2, ..., d_N} that are then used in a line search procedure. The parameter update, θ ← θ + αd_i, is based on an Armijo rule backtracking line search. Distributed computation to compute gradients and curvature matrix-vector products is done using a master/worker architecture.”)
and the momentum descent direction. (Para [0068] “Denoting the gradients of the full and subset losses as ∇J(w) and ∇J_S(w) respectively, the algorithm ensures that descent made in J_S at every iteration must admit a descent direction for the true objective function J. The magnitude (2-norm) of the difference between the sample gradient and the actual gradient is expressed by Equation 4.”)

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN C MANG whose telephone number is (571)270-7598. The examiner can normally be reached Mon - Fri 8:00-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/V.M./Examiner, Art Unit 2126
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126