DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Amendments
Claims 1-2, 4-5, 7, 10, 12-15, and 17-18 are amended. Claims 19-20 are new. Claim 11 is canceled. Claims 1-10 and 12-20 are pending and have been examined. Claim 4 has the status identifier “Previously Presented” but it contains amendments. The claim will be interpreted as if the status identifier was “Currently Amended”. See MPEP 714, subsection II, part C.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/01/2021 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 1, 18 and 19 are objected to because the last word ends in a semicolon.
Claims 4 and 15 are objected to because the mathematical symbols and the mathematical expressions are illegible. For purposes of examination, Examiner interprets the mathematical expressions in claims 4 and 15 to correspond to those recited in instant specification paragraphs [0008] and [0019], respectively.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function.  Such claim limitation(s) is/are: “time step” in Claims 1-2, 4, 7, 10, and 13-15.
Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.
If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to remove the structure, materials, or acts that performs the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 9, 10 and 15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 9 recites the limitation “parameters” in line 2. It is unclear to Examiner whether this limitation refers to the model parameters or the RNN parameters as recited by claim 1. For purposes of examination, Examiner interprets this limitation as referring to the model parameters.
Claim 10 recites the limitation “across different coordinates” in the last 2 lines. It is unclear to Examiner what this limitation means. For purposes of examination, Examiner interprets this limitation to mean across different coordinates of the objective function.
Claim 15 is indefinite because the function $E_f$ in line 2 is undefined in the disclosure. For purposes of examination, Examiner is interpreting the function $E_f$ to mean an expected value.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

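Claims 1-4, 6-8, 13-14, 17-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed with this office action), hereinafter Hochreiter II.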

Regarding CLAIM 1, Hochreiter I teaches: A method implemented by one or more computers, comprising: obtaining a machine learning model (recurrent neural net in Fig. 1 on p. 88), wherein (i) the machine learning model has one or more model parameters, and (ii) the machine learning model is trained using gradient descent techniques to optimize an objective function; (RNN biases and weights are taught by p. 91, § 3, lines 3-4; and training is taught by p. 88, eighth to sixth lines from the end)
for each time step in a plurality of time steps: (p. 89, § 2.1, line 4 teaches “time step j”; and Fig. 1 on p. 88 teaches time steps j and j-1)
determining an update rule for the model parameters for the time step using a supervisory system having a plurality of supervisory system parameters, wherein the supervisory system is different from the machine learning model and the supervisory system parameters are different from the model parameters, and wherein the determining comprises: (p. 87, § 1, ¶ 2, continuing onto p. 88. In Fig. 1 on p. 88, a supervisory system is a fixed learning algorithm such as BPTT or RTRL. Parameters include target y(j).)
processing, using the supervisory system and in accordance with values of the supervisory system parameters for the time step, an input for the time step that comprises a gradient of the objective function with respect to the model parameters for the time step to generate a respective supervisory system output for the time step that specifies the update rule for the model parameters for the time step; (In Fig. 1 on p. 88, the RNN output and the target y(j) comprise a gradient as claimed, which the supervisory system receives and uses to review and modify the training algorithm (p. 87, § 1, ¶ 2, lines 2-5).)
applying the update rule for the time step generated by the supervisory system to values of the model parameters for the time step to update the values of the model parameters; and (p. 87, § 1, ¶ 2, lines 2-5)
training the supervisory system on a supervisory system objective function that depends on respective values of the model parameters at the time step and at each of one or more preceding time steps in the plurality of time steps, comprising determining an update to the supervisory system at the time step that minimizes the supervisory system objective function for the time step using gradient descent techniques; (“Objective function” is taught on p. 88, lines 1-3 and p. 91, § 3, line 1. Time steps are taught at p. 89, § 2.1, line 4. The system is updated according to p. 89, bottom paragraph, lines 1-4. The fixed learning algorithms BPTT and RTRL use gradient descent.)
	Although Hochreiter I teaches a meta-learning system comprising a supervisory system having its own parameters different from those of the machine learning model, Hochreiter I does not specifically disclose that the supervisory system is an RNN.
	But Hochreiter II teaches: using a recurrent neural network (RNN) and RNN parameters (Hochreiter II, p. 1745, Fig. 2 teaches an RNN being a long short-term memory (LSTM) neural network, which is further described in lines 1-2 of the caption. A plurality of RNN parameters are weights, denoted “w” in the memory cell in Fig. 1 on p. 1744.)
processing, using the RNN and in accordance with values of the RNN parameters for the time step, an input to generate a respective RNN output (Each LSTM cell outputs $y^{out_j}(t)$ as taught on p. 1743, § 4.3, by the first equation, and shown by the output layer in Fig. 2 on p. 1745.)
training the RNN on a RNN objective function, comprising determining an update to the values of the RNN parameters at the time step that minimizes the RNN objective function (Hochreiter II, p. 
	Hochreiter II is in the same field of endeavor as the claimed invention, namely, machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hochreiter I’s supervisory system by replacing the fixed learning algorithms of back propagation through time (BPTT) and real-time recurrent learning (RTRL) with Hochreiter II’s LSTM. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)

	Regarding CLAIM 2, the combination of Hochreiter I and II teaches: The method of claim 1, 
Hochreiter I teaches: wherein applying the update rule for a final time step in the plurality of time steps to the model parameters generates trained values of the model parameters. (The BRI of “final time step” includes the final time step for a given sequence in the set of sequences $\{s_k\}$ as taught by p. 89, § 2.1, lines 2-3. The limitation is taught by p. 89, last paragraph, last sentence.)

Regarding CLAIM 3, the combination of Hochreiter I and II teaches: The method of claim 1, 
Hochreiter I teaches: wherein the machine learning model comprises a neural network. (P. 88, eighth to sixth lines from the end and Fig. 1 caption, line 2 teaches a recurrent neural network)

Regarding CLAIM 4, the combination of Hochreiter I and II teaches: The method of claim 1, 
Hochreiter I teaches: wherein $\theta_t$ represents values of the model parameters at time $t$, (P. 91, § 3, lines 3-4 teach biases and weights. $\theta_t$ corresponds to the weights and biases at time step j-1 in Fig. 1 and its caption on p. 88.)
$\nabla f(\theta_t)$ represents the gradient of objective function $f$, (The BRI of a “gradient of the objective function” includes the supervisory system in Fig. 1 on p. 88 receiving both the RNN output and the target y(j). Also taught by p. 88, eighth to sixth lines from the end.)
Further, Hochreiter I teaches $\theta_{t+1}$ (This corresponds to the weights and biases at time step j in Fig. 1 and its caption on p. 88.)
However, Hochreiter I does not explicitly teach: wherein the determined update rule for the model parameters that minimizes the objective function is given by
$$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$$
wherein $\phi$ represents RNN parameters and $g_t$ represents the RNN output for a time step $t$.
But Hochreiter II teaches: wherein the determined update rule for the model parameters that minimizes the objective function is given by
$$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$$
wherein $\phi$ represents RNN parameters and $g_t$ represents the RNN output for a time step $t$. (“RNN parameters” are LSTM weights, denoted “w” in the memory cell in Fig. 1 on p. 1744, and the LSTM output is equation $y^{c_j}(t)$. The broadest reasonable interpretation of the claimed formula is that Hochreiter II’s LSTM updates model parameters at each time step by applying an LSTM function to LSTM inputs (denoted $net_{in_j}(t)$ on p. 1743, § 4.1, first line of equations and p. 1744, Fig. 1) and LSTM weights.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have updated the weights and biases of Hochreiter I’s RNN at each time step by inputting Hochreiter I’s gradient (i.e., RNN output and the target y(j)) and Hochreiter II’s LSTM weights into an LSTM function. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)
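For illustration only, the claimed update rule $\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$ can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the applicant's or either reference's implementation: the objective `quadratic`, the function `toy_optimizer_step` (a simple stateful map standing in for the RNN), and the two-element `phi` are all hypothetical names chosen for the example.

```python
from typing import List, Tuple

Vector = List[float]


def quadratic(theta: Vector) -> float:
    """Toy objective f(theta) = sum of squares (assumed for illustration)."""
    return sum(x * x for x in theta)


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the toy objective with respect to theta."""
    return [2.0 * x for x in theta]


def toy_optimizer_step(
    grad: Vector, hidden: Vector, phi: Tuple[float, float]
) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the RNN output g_t.

    phi = (step_size, decay) plays the role of the RNN parameters, and
    `hidden` plays the role of the recurrent state carried between steps.
    """
    step_size, decay = phi
    new_hidden = [decay * h + g for h, g in zip(hidden, grad)]
    update = [-step_size * h for h in new_hidden]
    return update, new_hidden


def run(theta: Vector, phi: Tuple[float, float], steps: int) -> Vector:
    """Apply theta_{t+1} = theta_t + g_t(grad f(theta_t), phi) for `steps` steps."""
    hidden = [0.0] * len(theta)
    for _ in range(steps):
        grad = quadratic_grad(theta)
        g_t, hidden = toy_optimizer_step(grad, hidden, phi)
        theta = [x + u for x, u in zip(theta, g_t)]
    return theta


if __name__ == "__main__":
    final = run([1.0, -2.0], (0.1, 0.5), 50)
    print(quadratic(final))
```

In the sketch the update is added to the parameters, matching the plus sign in the claimed formula; the descent direction comes from the sign of the stand-in's output rather than from the rule itself.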

	Regarding CLAIM 6, the combination of Hochreiter I and II teaches: The method of claim 1, 
	Hochreiter I teaches: each model parameter (P. 91, § 3, lines 3-4 teaches biases and weights)
	However, Hochreiter I does not explicitly teach: wherein the RNN implements separate activations for each input.
	But Hochreiter II teaches: wherein the RNN implements separate activations for each input. (The BRI of this limitation includes an LSTM having multiple gates. Hochreiter II teaches at least input gate $in_j$ and output gate $out_j$ in the Fig. 1 caption on p. 1744.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used at least input and output gates in the LSTM. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)

	Regarding CLAIM 7, the combination of Hochreiter I and II teaches: The method of claim 1, 
	However, Hochreiter I does not explicitly teach: wherein the RNN is a long short-term memory (LSTM) neural network.
	But Hochreiter II teaches: wherein the RNN is a long short-term memory (LSTM) neural network. (Hochreiter II, p. 1745, Fig. 2 and the caption, lines 1-2.)

	Regarding CLAIM 8, the combination of Hochreiter I and II teaches: The method of claim 7, 
	However, Hochreiter I does not explicitly teach: wherein the LSTM neural network comprises two LSTM layers.
	But Hochreiter II teaches: wherein the LSTM neural network comprises two LSTM layers. (Two layers are taught by the two pairs of cell blocks in the hidden layer of Fig. 2 on p. 1745, and by the caption: “two memory cell blocks of size 2”)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have included two pairs of cell blocks to form two LSTM layers because “It is up to the user to define the network topology”. (Hochreiter II, p. 1744, end of first full paragraph)

	Regarding CLAIM 13, the combination of Hochreiter I and II teaches: The method of claim 1, 
	However, Hochreiter I does not explicitly teach: further comprising providing a previous hidden state of the RNN as input to the RNN at each time step.
	But Hochreiter II teaches: further comprising providing a previous hidden state of the RNN as input to the RNN at each time step. (P. 1744, Fig. 1 caption teaches: “The self-recurrent connection (with weight 1.0) indicates feedback with a delay of one time step.” Also, the last equation on p. 1744 teaches the internal state at time t is a function of the internal state at time t-1 for t > 0. )
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used an RNN with a self-recurrent connection because it builds the basis of the constant error carrousel (CEC), which is the LSTM’s central feature. (P. 1744, Fig. 1 caption, line 3 and P. 1742, § 3.2.2)
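The self-recurrent connection relied on above can be illustrated with a short Python sketch in which the internal state at time t is the internal state at time t-1 plus a gated input. This is a hedged illustration, not code from Hochreiter II: the function names are hypothetical, the gate activations are supplied directly as numbers rather than computed from learned weights, and `math.tanh` is an assumed stand-in for the reference's squashing functions.

```python
import math
from typing import List


def cell_states(input_gate: List[float], cell_inputs: List[float]) -> List[float]:
    """Return the internal state after each time step.

    The state starts at 0 and is carried forward with weight 1.0, so each
    step adds the gated, squashed cell input to the previous state.
    """
    state = 0.0
    states = []
    for y_in, net_c in zip(input_gate, cell_inputs):
        state = state + y_in * math.tanh(net_c)  # feedback with a one-step delay
        states.append(state)
    return states


def cell_output(output_gate: float, state: float) -> float:
    """Cell output: the output gate activation scales the squashed internal state."""
    return output_gate * math.tanh(state)


if __name__ == "__main__":
    states = cell_states([1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
    print(states, cell_output(1.0, states[-1]))
```

With the input gate closed (activation 0.0) at the second step, the state is carried over unchanged, which is the behavior the weight-1.0 self-connection is meant to provide.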

	Regarding CLAIM 14, the combination of Hochreiter I and II teaches: The method of claim 1, 
Hochreiter I teaches: wherein, at each time step, the update rule for the time step (Fig. 1 on p. 88 teaches a supervisory system (attendant learning algorithm) for determining update rule for the RNN at each time step j.)
However, Hochreiter I does not explicitly teach: the update rule depends on a hidden state of the RNN for the time step.
But Hochreiter II teaches: the update rule depends on a hidden state of the RNN for the time step. (P. 1744, equation $y^{c_j}(t)$, which depends on the internal state. The update rule is an output of the LSTM. Note: The LSTM taught by Hochreiter II is the supervisory system while the RNN in Hochreiter I is the subordinate system.)

CLAIM 17 recites: A system comprising one or more computers and one or more storage devices storing instructions that are operable to cause the computer to perform operations from the method of claim 1. Hochreiter I teaches experiments performed on a computer in § 3, pages 91-93. Claim 17 is rejected under 35 U.S.C. 103 for the reasons set forth in the rejection of claim 1.

	CLAIM 18 recites: One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations from the method of claim 1. Hochreiter I teaches experiments performed on a computer in § 3, pages 91-93. Claim 18 is rejected under 35 U.S.C. 103 for the reasons set forth in the rejection of claim 1.

	CLAIM 20 is rejected under 35 U.S.C. 103 for the reasons set forth in the rejection of claim 6.

Claims 5, 9, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed with this office action), hereinafter Hochreiter II, and Wright (“Coordinate Descent Algorithms”). 

Regarding CLAIM 5, the combination of Hochreiter I and II teaches: The method of claim 1, 
	Hochreiter I teaches: wherein the supervisory system operates… on the model parameters. (P. 89, lines 1-4 of last paragraph, and where “model parameters” are subordinate RNN weights and biases as taught by p. 91, § 3, lines 3-4.) 
However, Hochreiter I does not explicitly teach: the RNN 
	But Hochreiter II teaches: the RNN (Hochreiter II, p. 1745, Fig. 2 teaches an RNN being a LSTM neural network, which is further described in lines 1-2 of the caption.)
	However, Hochreiter I and II do not explicitly teach: coordinate-wise
	But Wright teaches: coordinate-wise (Pages 3-4, § 1, first paragraph)
	Wright is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have performed iterative methods in which each iterate is obtained by fixing most components of the variable model parameters at their values from the current iteration, and approximately minimizing the objective with respect to the remaining components. A motivation for the combination is that breaking the full optimization problem into subproblems yields lower-dimensional (even scalar) minimization problems, each of which can typically be solved more easily than the full problem (Wright, p. 4, lines 3-4). Another motivation for the combination is that coordinate descent algorithms have the advantage over general stochastic gradient methods in that descent in f can be guaranteed at every iteration. (Wright, p. 9, first sentence under the first equation)
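The iterative scheme described above (fix all but one component, minimize over the remaining one) can be sketched as cyclic coordinate descent on a small quadratic. This is a minimal sketch, assuming the objective f(x) = 0.5 x^T A x - b^T x with a symmetric positive definite matrix A; the function name and the example matrix are hypothetical and do not come from Wright or from the application.

```python
from typing import List


def coordinate_descent(
    a: List[List[float]], b: List[float], sweeps: int
) -> List[float]:
    """Cyclic coordinate descent for f(x) = 0.5 * x^T A x - b^T x.

    At each inner step every coordinate except x[i] is held at its current
    value and f is minimized exactly over the scalar x[i].
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Setting df/dx_i = 0 with the other coordinates fixed gives:
            rest = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - rest) / a[i][i]
    return x


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(coordinate_descent(A, b, 50))  # approaches the solution of A x = b
```

Each inner step solves a one-dimensional problem in closed form, which is the sense in which the subproblems are easier than the full problem.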

Regarding CLAIM 9, the combination of Hochreiter I and II teaches: The method of claim 7, 
	Hochreiter I teaches: parameters (P. 91, § 3, lines 3-4 teaches biases and weights)
	However, Hochreiter I does not explicitly teach: wherein the LSTM neural network shares variables across different coordinates of the objective function.
	But Hochreiter II teaches: LSTM neural network (Hochreiter II, p. 1745, Fig. 2 and lines 1-2 of the caption.)
	However, Hochreiter I and II do not explicitly teach: wherein the model shares variables across different coordinates of the objective function.
	But Wright teaches: wherein the model shares variables across different coordinates of the objective function. (The BRI of this limitation includes performing a coordinate descent algorithm, which is taught by Wright, pp. 3-4, § 1, first paragraph.)
Wright is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have performed iterative methods in which each iterate is obtained by fixing most components of the variable model parameters at their values from the current iteration, and approximately minimizing the objective with respect to the remaining components. A motivation for the combination is to break the full optimization problem into lower-dimensional (even scalar) minimization subproblems, each of which can typically be solved more easily than the full problem (Wright, p. 4, lines 3-4). Another motivation for the combination is that coordinate descent algorithms have the advantage over general stochastic gradient methods in that descent in f can be guaranteed at every iteration. (Wright, p. 9, first sentence under the first equation)

	Regarding CLAIM 12, the combination of Hochreiter I and II teaches: The method of claim 1,
	 Hochreiter I teaches: the model parameters (P. 91, § 3, lines 3-4 teaches biases and weights)
	However, Hochreiter I does not explicitly teach: wherein the RNN is invariant to an order of the variables.
	But Hochreiter II teaches: the RNN (Hochreiter II, p. 1745, Fig. 2 and lines 1-2 of the caption.)
	However, Hochreiter I and II do not explicitly teach: wherein the model is invariant to an order of the variables.
	But Wright teaches: wherein the model is invariant to an order of the variables. (P. 15, § 3.3, top of the page to “We prove a convergence result for the randomized algorithm”.)
Wright is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied Hochreiter II’s LSTM to perform randomized coordinate descent on Hochreiter I’s weights and biases. A motivation for the combination is to break the full optimization problem into lower-dimensional (even scalar) minimization subproblems, each of which can typically be solved more easily than the full problem (Wright, p. 4, lines 3-4). Another motivation for the combination is that coordinate descent algorithms have the advantage over general stochastic gradient methods in that descent in f can be guaranteed at every iteration. (Wright, p. 9, first sentence under the first equation)

	CLAIM 19 is rejected for the reasons set forth in claim 5.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed with this office action), hereinafter Hochreiter II, Baldazzi et al. (“Strongly-Typed Recurrent Neural Networks”, see PTO-892 filed 08/11/2021), and Wright (“Coordinate Descent Algorithms”).

Regarding CLAIM 10, the combination of Hochreiter I and II teaches: The method of claim 7, 
	However, Hochreiter I and II do not explicitly teach: wherein a subset of cells in each of one or more LSTM layers of the LSTM neural network comprise global average units, wherein a global average unit is a unit whose update includes averaging activations of the global average units globally at each time step across different coordinates.
	But Baldazzi teaches: wherein a subset of cells in each of one or more LSTM layers of the LSTM neural network comprise global average units, wherein a global average unit is a unit whose update includes averaging activations of the global average units globally at each time step (P. 5, col. 2, § “Dynamic Temporal Convolutions”, lines 1-3, and also the end of col. 2 which states “T-LSTMs store expectations in private memory cells that are reweighted by the output gate when publicly broadcast.”)
	Baldazzi is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Baldazzi’s system into the combination of Hochreiter I and II’s system by performing average-pooled convolution, with a motivation to stop gradients from exploding. (P. 6 col. 1: “It follows immediately that gradients will not explode for T-RNNs or LSTMs.”)
	However, neither Hochreiter I, Hochreiter II, nor Baldazzi explicitly teaches: across different coordinates.
	But Wright teaches: across different coordinates. (The BRI of this limitation includes performing a coordinate descent algorithm, which is taught by Wright, pp. 3-4, § 1, first paragraph.)
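Wright is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have performed iterative methods in which each iterate is obtained by fixing most components of the variable model parameters at their values from the current iteration, and approximately minimizing the objective with respect to the remaining components. A motivation for the combination is to break the full optimization problem into lower-dimensional (even scalar) minimization subproblems, each of which can typically be solved more easily than the full problem (Wright, p. 4, lines 3-4). Another motivation for the combination is that coordinate descent algorithms have the advantage over general stochastic gradient methods in that descent in f can be guaranteed at every iteration. (Wright, p. 9, first sentence under the first equation)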

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed with this office action), hereinafter Hochreiter II, and Miranda et al. (“Multi-Objective Optimization for Self-Adjusting Weighted Gradient in Machine Learning Tasks”).

Regarding CLAIM 15, the combination of Hochreiter I and II teaches: The method of claim 1, 
	Hochreiter I teaches: where $\theta_{t+1} = \theta_t + g_t$, … $f(\theta_t)$ represents the objective function that depends on the model parameters $\theta$ at time $t$, (P. 91, § 3, lines 3-4 teaches biases and weights. This corresponds to $\theta_t$, the weights and biases at time step j-1 in Fig. 1 and its caption on p. 88. Updating is taught broadly by lines 1-5 in the last paragraph on p. 89, and more specifically in Table 1, col. “update”, rows “Rec.” The objective function is a subordinate routine or learning algorithm, taught on p. 88, eighth to sixth lines from the end)
$w_t \in \mathbb{R}_{\ge 0}$ represents weights associated with each time step $t$, (P. 91, § 3, lines 3-4 teaches weights)
$g_t$ represents a supervisory system output for time $t$ (P. 87, ¶ 2, lines 2-7)
and $\nabla_t = \nabla_\theta f(\theta_t)$. (The BRI of a gradient of the objective function includes the supervisory system in Fig. 1 on p. 88 receiving both the RNN output and the target y(j). Also taught by p. 88, eighth to sixth lines from the end.)
	However, Hochreiter I does not explicitly teach: wherein the RNN objective function is given by
$L(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$
where $\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi)$, $\phi$ represents the RNN parameters, $g_t$ represents a RNN output for time $t$, $h_t$ represents a hidden state of the RNN at time $t$, $m$ represents the RNN
	But Hochreiter II teaches: where $\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi)$, $\phi$ represents the RNN parameters, $g_t$ represents a RNN output for time $t$, $h_t$ represents a hidden state of the RNN at time $t$, $m$ represents the RNN (The BRI of this limitation includes a LSTM having weights, a hidden state, and whose output and next hidden state are a function of the input (interpreted here as $\nabla_t$), the hidden state, and the weights of the LSTM. The cell in Fig. 1 and the network in Fig. 2 on pp. 1744-5 teach these limitations.)
	However, neither Hochreiter I nor Hochreiter II explicitly teaches: wherein the RNN objective function is given by
$L(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$
But Miranda teaches this limitation by equation 1 on p. 2 and the sentence surrounding it. The BRI of $\mathbb{E}_f$ is interpreted as the expected value over functions (See 112(b) rejection). Since the claim allows for a single objective function, the limitation has been met.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Miranda's system into the combination of Hochreiter I and II’s system by solving for the optimization problem using Miranda’s equation 1, with a motivation to efficiently solve the optimization problem when the Pareto frontier is convex. (Miranda, p. 3, lines 1-2.)
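For illustration only, the recited relationships (the update θ_{t+1} = θ_t + g_t, the recurrence [g_t; h_{t+1}] = m(∇_t, h_t, ϕ), and the weighted sum of f(θ_t) over t = 1..T) can be sketched for a single objective function, so that the expectation over f is omitted. The scalar objective, the stand-in `toy_m` (which is not an LSTM), and all names and values below are assumptions made for this sketch and do not come from any cited reference.

```python
# Minimal sketch, under stated assumptions, of the unrolled objective
#   L(phi) = sum_{t=1..T} w_t * f(theta_t),  theta_{t+1} = theta_t + g_t,
#   (g_t, h_{t+1}) = m(grad_t, h_t, phi),    grad_t = df/dtheta at theta_t.
# A single scalar objective is assumed, so no expectation over f is taken.

def f(theta):
    """Assumed toy objective: f(theta) = theta^2."""
    return theta * theta

def grad_f(theta):
    """Gradient of the assumed toy objective."""
    return 2.0 * theta

def toy_m(grad_t, h_t, phi):
    """Hypothetical stand-in for the recurrent model m (not an LSTM).

    phi = (lr, decay); the hidden state is a decayed running sum of gradients.
    """
    lr, decay = phi
    h_next = decay * h_t + grad_t
    g_t = -lr * h_next
    return g_t, h_next

def unrolled_objective(m, phi, f, grad_f, theta_1, weights):
    """Accumulate w_t * f(theta_t) while applying theta_{t+1} = theta_t + g_t."""
    theta, h, loss = theta_1, 0.0, 0.0
    for w_t in weights:
        loss += w_t * f(theta)
        g_t, h = m(grad_f(theta), h, phi)
        theta = theta + g_t
    return loss, theta
```

With `phi = (0.25, 0.0)`, a starting value of 4.0, and three unit weights, each step halves θ, so the accumulated value is 16 + 4 + 1 = 21.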

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in . 

Regarding CLAIM 16, the combination of Hochreiter I and II teaches: The method of claim 1,
	Hochreiter I teaches: gradients (The BRI of “gradients” includes the supervisory system in Fig. 1 on p. 88 receiving both the RNN output and the target y(j))
	However, Hochreiter I does not explicitly teach: further comprising preprocessing the input to the RNN to disregard inputs that are smaller than a predetermined threshold.
	But Hochreiter II teaches: RNN (Hochreiter II, p. 1745, Fig. 2 teaches an RNN being a long short-term memory (LSTM) neural network, which is further described in lines 1-2 of the caption.)
	However, neither Hochreiter I nor Hochreiter II explicitly teaches: further comprising preprocessing the input to the model to disregard inputs that are smaller than a predetermined threshold. 
	But Shih teaches: further comprising preprocessing the input to the model to disregard inputs that are smaller than a predetermined threshold. (Shih ¶ [0077], lines 7-10 and ¶ [0082], lines 5-9.)
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Shih’s system into the combination of Hochreiter I and II’s system by ignoring data smaller than a predetermined threshold, with a motivation to optimize a cost and/or a completion time of a task. (¶ [0089], last 7 lines).
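For illustration only, one possible reading of "disregard inputs that are smaller than a predetermined threshold" is sketched below. Two assumptions are made that neither the claim language nor Shih is asserted to require: "smaller" is read as smaller in magnitude, and a disregarded input is replaced by zero rather than removed. The name `drop_small_inputs` is hypothetical.

```python
# Minimal sketch, under stated assumptions, of threshold preprocessing.
# Assumption 1: "smaller" means smaller in absolute value.
# Assumption 2: disregarded inputs are zeroed so the input length is unchanged.

def drop_small_inputs(values, threshold):
    """Return a copy of values with entries below the threshold set to 0.0."""
    return [v if abs(v) >= threshold else 0.0 for v in values]
```

Under these assumptions, `drop_small_inputs([0.5, -1e-4, 2e-3, -3.0], 1e-3)` keeps every entry except the second, which becomes 0.0.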

Response to Arguments
	Examiner will respond to Applicant’s remarks, claim amendments, specification amendment, and replacement abstract filed 11/11/2021.

Objections to the Drawings: The objection to the drawings has been withdrawn due to the amended specification paragraph [0040].
Objections to the Abstract: The objection to the abstract has been withdrawn due to the replacement abstract.
Objections to the Claims: Applicant did not respond to the objections to claims 4 and 15 on page 4 of the non-final rejection. Applicant is required to respond to every objection and rejection made in an action. The objections to claims 4 and 15 are maintained. The objection to claim 10 has been withdrawn due to the claim amendments.
Claim Rejections Under 35 U.S.C. § 112: The rejections of claims 4, 10, and 12-14 and the rejection of claim 15 regarding “the machine learning objective function that depends on the machine learning model parameters $\theta$” have been withdrawn due to the claim amendments. The rejection of claim 15 regarding the function $\mathbb{E}_f$ is maintained. The rejection of claim 11 is moot because the claim is canceled.
Claim Rejections Under 35 U.S.C. § 102 and 103: Applicant’s arguments with respect to claims 1-10 and 12-18 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant’s arguments with respect to claim 11 are moot because the claim is canceled.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ASHER H. JABLON/Examiner, Art Unit 2127                                                                                                                                                                                                        
/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127