Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

2.	Claims 1, 5, 8, 9, 13, 16, 17, 21 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Baydin ("Automatic Differentiation in Machine Learning: a Survey") (hereinafter, "Baydin") in view of Leventhal ("Randomized Hessian Estimation and Directional Search") (hereinafter, "Leventhal").
3.	As per claim 1, Baydin teaches a device for neural network training, the device comprising:
an interface (Baydin, page 21, Section 5.2, “These implementations provide extensions to programming languages that automate the decomposition of equations into AD-enabled elementary operations. They are typically executed as preprocessors to transform the input in the extended language into the original language” discloses interaction, through inputs, with a programming language used for automatic differentiation (AD) operations. The programming language acts as an interface with which one may interact to run AD operations) to receive a training set for a neural network (Baydin, page 3, Figure 1, “Training inputs xi are fed forward, generating corresponding activations yi” discloses training inputs that are fed into the neural network to generate activations), the neural network comprising a set of nodes arranged in layers and a set of inter-node weights between nodes in the set of nodes (Baydin, page 3, Figure 1 illustrates a set of nodes arranged in layers. Between the Hidden Layer and the Output Layer is a set of inter-node weights depicted by w1, w2, and w3, respectively);
a set of hardware processing nodes (Baydin, page 20, section 5, “naively allocating data structures holding dual numbers will involve memory access and allocation for every arithmetic operation, which are usually more expensive than arithmetic operations on modern computers” discloses computers, which are composed of processing circuitry to execute instructions, being used to hold dual numbers for arithmetic operations. Additionally, Baydin, page 5, Figure 2, “The range of approaches for differentiating mathematical expressions and computer code…automatic differentiation (lower left) is as accurate as symbolic differentiation with only a constant factor of overhead” discloses computer code for automatic differentiation, which must be implemented on processing hardware such as a computer) to train the neural network to create a trained neural network via iterations (Baydin, page 17, “Training of neural networks is an optimization problem with respect to a set of weights, which can in principle be addressed via any method including gradient descent, stochastic gradient descent” discloses using gradient descent techniques for training a neural network. Baydin, page 14, “gradient descent comes with asymptotic rate of convergence, where the method increasingly ‘zigzags’ towards the minimum in a slowing down fashion. The convergence rate is usually improved by adaptive step size techniques that adjust the step size n on every iteration” discloses a gradient descent technique, which is an iterative process that can be used to train a neural network) at each of the hardware processing nodes (Baydin, page 20, section 5, “naively allocating data structures holding dual numbers will involve memory access and allocation for every arithmetic operation, which are usually more expensive than arithmetic operations on modern computers” discloses computers, which are composed of processing circuitry to execute instructions, being used to hold dual numbers for arithmetic operations. Thus, computers can be used to hold dual numbers for arithmetic operations in the training process of neural networks), an iteration in the iterations performed (Baydin, page 14, “gradient descent comes with asymptotic rate of convergence, where the method increasingly ‘zigzags’ towards the minimum in a slowing down fashion. The convergence rate is usually improved by adaptive step size techniques that adjust the step size n on every iteration” discloses a gradient descent technique, which is an iterative process that can be used to train a neural network) on each of the hardware processing nodes (Baydin, page 20, section 5, “naively allocating data structures holding dual numbers will involve memory access and allocation for every arithmetic operation, which are usually more expensive than arithmetic operations on modern computers” discloses computers, which are composed of processing circuitry to execute instructions, being used to hold dual numbers for arithmetic operations, which can be involved in the training process of neural networks),
the estimated gradient represented by a dual number (Baydin, page 10, “Forward mode AD is … forward mode AD would require n evaluations to compute the gradient… forward mode AD (represented by the left and right hand sides in Table 2) can be viewed as using dual numbers” discloses using forward-mode automatic differentiation (AD) to compute a gradient, in which dual numbers are used);

the updating of the inter-node weight (Baydin, page 3, “error propagates back, through updates where a ratio of the gradient…is subtracted from weight” discloses updating the weights).
Baydin fails to teach the iteration including:
	generation of a random unit vector;
	creation of an update vector by calculating a magnitude for the random unit vector based on a degree that the random unit vector coincides with an estimated gradient between a member of the training set and an objective function for the neural network;
	and update of a parameter vector for an inter-node weight by subtracting the update vector from a previous parameter vector.
Leventhal, however, teaches generation of a random unit vector (Leventhal, page 14, “when using the fixed step size mentioned above, each iteration takes exactly five function evaluations…where v_k and d_k are the search direction and the random unit vector, respectively” discloses the random unit vector d_k, which is used in an iterative search technique); creation of an update vector (Leventhal, page 14, “Since B_k is our Hessian approximation at the current iterate x_k, a reasonable initial step size is given by x_k+1 = x_k – t_k * v_k…The advantage to this approach is that each iteration requires only directional derivatives and, being highly iterative, this interpolates nicely with the Hessian update derived in Section 2” discloses update of the Hessian through iteratively adjusting the step size of x_k+1) by calculating a magnitude for the random unit vector based on a degree that the random unit vector coincides with an estimated gradient between a member of the training set and an objective function for the neural network (Leventhal, page 14, “a reasonable initial step size is given by x_k+1 = x_k – t_k * v_k, where t_k…corresponding to an exact line search in the direction v_k of the quadratic model. The advantage to this approach is that each iteration requires only directional derivatives” discloses t_k corresponding to a line search along the direction of v_k through directional derivatives. A directional derivative is the dot product of the gradient of a function with a vector and thus reflects the angle between the gradient and the vector. The angle corresponds to how closely the vector and the gradient coincide with each other. Additionally, B_k, which is involved in determining t_k, is determined using the random unit vector d_k, as disclosed in Leventhal, page 14, Algorithm 3.1, “3. Compute B_k+1 according to Equation 2.2, letting d_k be uniformly distributed on the unit sphere and computing…”), and update of a parameter vector for an inter-node weight by subtracting the update vector from a previous parameter vector (Leventhal, page 14, “Since B_k is our Hessian approximation at the current iterate x_k, a reasonable initial step size is given by x_k+1 = x_k – t_k * v_k…The advantage to this approach is that each iteration requires only directional derivatives and, being highly iterative, this interpolates nicely with the Hessian update derived in Section 2” discloses calculating the next vector through subtraction from the previous vector).
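For illustration only, the iteration recited in claim 1 (a random unit vector, a magnitude derived from how closely that vector coincides with the gradient, and a subtraction from the previous parameter vector) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Baydin, Leventhal, or the application: the names Dual, directional_derivative, step, and quadratic_loss, the learning rate lr, and the toy objective are all hypothetical. The dual-number arithmetic stands in for the forward mode AD that Baydin describes.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number real + eps*e with e**2 == 0 (illustrative, minimal)."""
    real: float
    eps: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.real - other.real, self.eps - other.eps)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a + b e)(c + d e) = ac + (ad + bc) e
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)
    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero derivative part."""
    return x if isinstance(x, Dual) else Dual(float(x))


def random_unit_vector(n, rng):
    """Draw a direction uniformly on the unit sphere by normalizing Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def directional_derivative(loss, w, d):
    """One forward pass on w + e*d; the eps part equals grad(loss)(w) . d."""
    return loss([Dual(wi, di) for wi, di in zip(w, d)]).eps


def step(loss, w, lr, rng):
    """One iteration: random direction, scale by directional derivative, subtract."""
    d = random_unit_vector(len(w), rng)
    magnitude = lr * directional_derivative(loss, w, d)  # lr is an assumed step size
    update = [magnitude * di for di in d]
    return [wi - ui for wi, ui in zip(w, update)]


def quadratic_loss(w):
    """Hypothetical objective sum((w_i - 1)^2), used only for the example."""
    total = Dual(0.0)
    for wi in w:
        diff = wi - 1.0
        total = total + diff * diff
    return total
```

A caller would seed a generator with random.Random and apply step repeatedly to a weight list. Each call costs a single forward evaluation of the loss, regardless of the number of weights.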

4.	As per claim 5, the combination of Baydin and Leventhal teaches the device of claim 1, wherein the iterations include an analytical gradient iteration, the analytical gradient iteration occurring every T iterations of the iterations (Baydin, page 7, section 2.2, “This interleaving idea forms the basis of AD and provides an account of its simplest form: apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function” discloses automatic differentiation (AD) as having elements of analytical differentiation, denoted as symbolic differentiation, in elementary operations. Baydin, page 8, section 3, “AD can differentiate not only mathematical expressions in the classical sense, but also algorithms making use of control flow such as…loops” discloses AD being used in loops. It is then possible to use automatic differentiation for elementary symbolic operations in iterations of a loop).
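For illustration only, the schedule recited in claim 5 (an analytical gradient iteration occurring every T iterations) may be sketched as a loop that takes a full-gradient step when the iteration index is a multiple of T and a random-direction step otherwise. This is a minimal sketch under stated assumptions, not the method of either reference: the function names, the step size lr, and the toy objective are hypothetical, and the directional derivative is computed from the closed-form gradient only for brevity.

```python
import math
import random


def loss(w):
    """Hypothetical objective sum((w_i - 1)^2), used only for the example."""
    return sum((wi - 1.0) ** 2 for wi in w)


def grad(w):
    """Closed-form (analytical) gradient of the hypothetical objective."""
    return [2.0 * (wi - 1.0) for wi in w]


def random_unit_vector(n, rng):
    """Draw a direction uniformly on the unit sphere by normalizing Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def train(w, iterations, T, lr, rng):
    """Run the loop; every T-th iteration uses the analytical gradient."""
    kinds = []
    for k in range(iterations):
        g = grad(w)
        if k % T == 0:
            # Analytical gradient iteration: step along the full gradient.
            update = [lr * gi for gi in g]
            kinds.append("analytical")
        else:
            # Random-direction iteration: scale d by its alignment with g.
            d = random_unit_vector(len(w), rng)
            slope = sum(gi * di for gi, di in zip(g, d))
            update = [lr * slope * di for di in d]
            kinds.append("random")
        w = [wi - ui for wi, ui in zip(w, update)]
    return w, kinds
```

The returned kinds list records which type of iteration ran at each index, which makes the every-T pattern easy to inspect.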

5.	As per claim 8, the combination of Baydin and Leventhal teaches the device of claim 1, wherein the training set is a subset of a complete training set (Baydin, page 3, Figure 1, “(a) Training inputs xi are fed forward, generating corresponding activations yi. Error between the actual output y3 and the target output t is computed. (b) The error is propagated backward, giving the gradient…which is subsequently used in a gradient descent procedure” discloses training inputs that are used to generate outputs. These outputs are used to calculate an error, and the error is used to further train the neural network through the gradient descent procedure. It should be noted that a complete set is a subset of itself and thus the training input used is a subset of the complete training set).

6.	Claim 9 is a method claim in correspondence with claim 1 with the additional limitation of classifying data using the trained neural network (Baydin, page 18, section 4.3, “Pock et al. (2007) introduce AD to computer vision…and noting the usefulness of AD in identifying sparsity patterns” discloses AD being useful in identifying patterns in computer vision. Thus, AD can be used to classify objects by identifying patterns unique to an object). Therefore, claim 9 is rejected for the same reasons as claim 1.

7.	Claim 13 is a method claim in correspondence with claim 5. Therefore, claim 13 is rejected for the same reasons as claim 5. 
8.	Claim 16 is a method claim in correspondence with claim 8. Therefore, claim 16 is rejected for the same reasons as claim 8.

9.	Claim 17 is a non-transitory machine readable medium claim in correspondence with claim 1 with the additional limitations of the machine readable medium including instructions that, when executed by processing circuitry, cause the processing circuitry to perform operations (Baydin, page 20, section 5, “naively allocating data structures holding dual numbers will involve memory access and allocation for every arithmetic operation, which are usually more expensive than arithmetic operations on modern computers” discloses computers, which are composed of processing circuitry to execute instructions, being used to hold dual numbers for arithmetic operations) and classify data using the trained neural network (Baydin, page 18, section 4.3, “Pock et al. (2007) introduce AD to computer vision…and noting the usefulness of AD in identifying sparsity patterns” discloses AD being useful in identifying patterns in computer vision. Thus, AD can be used to classify objects by identifying patterns unique to an object). Therefore, claim 17 is rejected for the same reasons as claim 1.

10.	Claim 21 is a medium claim in correspondence with claim 5. Therefore, claim 21 is rejected for the same reasons as claim 5.
11.	Claim 24 is a medium claim in correspondence with claim 8. Therefore, claim 24 is rejected for the same reasons as claim 8.

12.	Claims 6, 14, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Baydin in view of Leventhal as shown above, further in view of U.S. Pub. No. US 2019/0171949 A1 to Chiang (hereinafter, "Chiang").
As per claim 6, the combination of Baydin and Leventhal as shown above teaches the device of claim 5. 
The combination of Baydin and Leventhal fails to explicitly teach the device wherein Adaptive Moment Estimation (Adam) is modified by the update vector to create the trained neural network. 
However, Chiang teaches a device wherein Adaptive Moment Estimation (Adam) is modified by the update vector to create the trained neural network (Chiang, paragraph [0031], “in step S160, the model modifying unit 160 updates the second model…via a deep learning algorithm. For example, the deep learning algorithm may be…an Adaptive Moment Estimation (ADAM) algorithm” discloses a modifying unit that updates a model through a deep learning algorithm such as Adaptive Moment Estimation, where the model receives data as disclosed in Chiang, paragraph [0023], “The inputting data X0 may be transmitted to…the second model”, and is used for neural network training as disclosed in Chiang, paragraph [0041], “the model training systems 1000, 2000, 3000 adopt the deep learning technology to the portable electronic device for personalized application through…the second model”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Baydin and Leventhal’s technique of updating vectors by incorporating Chiang’s model, which is updated through Adaptive Moment Estimation (ADAM), to result in vectors that are updated through ADAM. One of ordinary skill in the art would have been motivated to do so in order to have a robust method of updating the parameter vector, which would reduce the number of computations and, as a result, reduce computer memory and power usage.
Claim 14 is a method claim in correspondence with claim 6. Therefore, claim 14 is rejected for the same reasons as claim 6. 
15.	Claim 22 is a medium claim in correspondence with claim 6. Therefore, claim 22 is rejected for the same reasons as claim 6.
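For illustration only, the Adaptive Moment Estimation (Adam) update referenced in the rejection of claims 6, 14, and 22 may be sketched as follows. This is a minimal sketch of the standard Adam rule and is not taken from Chiang or from the application. How the claimed update vector "modifies" Adam is not specified in this Office action; the sketch assumes, as one possible reading, that a gradient-like vector g (which could be a random-direction estimate) is supplied where Adam ordinarily takes the gradient. The function name adam_step and the default hyperparameters are hypothetical choices.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on weights w given a gradient-like vector g.

    m and v are the running first and second moment estimates, and t is the
    1-based step count. Returns the new (w, m, v).
    """
    # Exponential moving averages of g and of g squared.
    m = [b1 * mi + (1.0 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1.0 - b2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1.0 - b1 ** t) for mi in m]
    v_hat = [vi / (1.0 - b2 ** t) for vi in v]
    # Subtract the scaled step from the previous parameter vector.
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

On the first step the bias-corrected moments reduce to g and g squared, so each weight moves by approximately lr in the direction opposite the sign of its component of g.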

Response to Amendment
16.	Applicant’s arguments filed on March 31, 2021 have been fully considered and are persuasive.
17.	The previous rejection is moot, and the claims are rejected on new grounds based on newly cited references.
Allowable Subject Matter
18. 	Claims 2-4, 7, 10-12, 15, 18-20 and 23 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Atta A. Boateng Sr whose telephone number is 571-272-8267. The examiner can normally be reached on Monday-Thursday from 8:00 AM to 4:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen, can be reached at telephone number 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
/A.A.B./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121