Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 05/17/2022 has been entered.

Amendments
Claims 1, 9, 10, 15, and 17-18 are amended. Claims 21-22 are new. Claims 5 and 19 are canceled. Claims 1-4, 6-10, 12-18, and 20-22 are pending and have been considered.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-4, 6-9, 12-14, 17-18 and 20-22 are rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed 02/14/2022), hereinafter Hochreiter II, and Gregor et al. (“Learning Fast Approximations of Sparse Coding”).

Regarding CLAIM 1, Hochreiter I teaches the bolded limitations: A method implemented by one or more computers, comprising: obtaining a machine learning model, wherein (i) the machine learning model has a plurality of model parameters, and (ii) the machine learning model is trained using gradient descent techniques to optimize an objective function; and (P. 88, sentence in lines 6-9 and Fig. 1. A “machine learning model” as claimed corresponds to Hochreiter I’s subordinate system which is a recurrent neural network (RNN). RNN biases and weights are taught by p. 91, § 3, lines 3-4; and training using gradient descent methods is taught by p. 88, eighth to sixth lines from the end)
for each time step in a plurality of time steps: (p. 89, § 2.1, line 4 teaches “time step j” and lines 7-10 teach a previous time step j-1; Fig. 1 on p. 88 depicts time steps j and j-1)
determining an update rule for the model parameters for the time step using a supervisory system having a plurality of supervisory system parameters, wherein the supervisory system is different from the machine learning model and the supervisory system parameters are different from the model parameters, … and wherein the determining comprises: (Refer to the second paragraph on p. 87, § 1 which continues onto p. 88. This paragraph and the left side of Fig. 1 teach a meta-learning system. A supervisory system supervises a subordinate system. The supervisory system has parameters “target y(j)” which are different from the subordinate system’s parameters. The supervisory system may be the learning algorithms of back propagation through time (BPTT) or real-time recurrent learning (RTRL), according to p. 89, second paragraph, lines 1-4.)
… processing, using the supervisory system and in accordance with values of the supervisory system parameters for the time step, a… input for the time step that comprises a gradient of the objective function with respect to the… model parameter for the time step to generate a respective supervisory system output for the time step that specifies the update rule for the… model parameter for the time step; (Refer to p. 87, § 1, ¶ 2, lines 2-5 and the left side of Fig. 1 on p. 88. The supervisory system receives a target output y(j), which comprises a gradient. The supervisory system outputs an adjustment to the subordinate system, as taught on p. 88, lines 1-3.)
applying the update rule for the time step generated by the supervisory system to values of the model parameters for the time step to update the values of the model parameters; and (p. 87, § 1, ¶ 2, lines 2-5; p. 88, lines 1-3)
training the supervisory system on a supervisory system objective function that depends on respective values of the model parameters at the time step and at each of one or more preceding time steps in the plurality of time steps, comprising determining an update to the values of supervisory system parameters at the time step that minimizes the supervisory system objective function for the time step using gradient descent techniques. (“Objective function” is taught on p. 88, lines 1-3 and P. 91, § 3, line 1. Time steps are taught at p. 89, § 2.1, line 4. The system is updated according to p. 89, bottom paragraph, lines 1-4. The fixed learning algorithms BPTT and RTRL use gradient descent.)
	Although Hochreiter I teaches a meta-learning system comprising a supervisory system having its own parameters different from the machine learning model, Hochreiter I does not specifically disclose that the supervisory system is an RNN. Hochreiter I does not explicitly teach the following bolded limitations:
a recurrent neural network (RNN) having a plurality of RNN parameters
	wherein the RNN is configured to operate coordinate-wise with respect to the model parameters
	for each particular model parameter of the plurality of model parameters, processing, using the RNN and in accordance with values of the RNN parameters for the time step, a parameter-specific input for the time step that comprises a gradient of the objective function with respect to the particular model parameter for the time step to generate a respective RNN output
	training the RNN on an RNN objective function… comprising determining an update to the values of the RNN parameters at the time step that minimizes the RNN objective function
But Hochreiter II teaches: using a recurrent neural network (RNN) having a plurality of RNN parameters (P. 1736, ¶ 2, lines 1-3; on p. 1743, § 4 is titled “The concept of Long Short-Term Memory” and § 4.1 teaches the architecture of an LSTM. A plurality of RNN parameters are weights, denoted “w” in the memory cell in Fig. 1 on p. 1744. According to p. 1739, lines 1-2, $w_{ij}$ is the weight on the connection from unit $j$ to $i$.)
using the RNN and in accordance with values of the RNN parameters for the time step, a… input… to generate a respective RNN output (P. 1744 teaches an equation for the LSTM cell’s output $y^{c_j}(t)$. This output is a function of input weights, as taught by the equations in § 4.1 on pp. 1743-1744. These inputs and outputs are depicted in Fig. 1 on p. 1744.)
training the RNN on an RNN objective function… comprising determining an update to the values of the RNN parameters at the time step that minimizes the RNN objective function (P. 1746, § 4.5 teaches training the RNN. P. 1772, § A.1.3, ¶ 1 and equations A. 14-16 teach a truncated gradient version of the LSTM algorithm.)
	Hochreiter II is in the same field of endeavor as the claimed invention, namely, machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hochreiter I’s supervisory system by replacing its fixed learning algorithms of BPTT and RTRL with Hochreiter II’s RNN-LSTM. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)
	Hochreiter II teaches an RNN. However, neither Hochreiter I nor Hochreiter II explicitly teaches: wherein the RNN is configured to operate coordinate-wise with respect to the model parameters
for each particular model parameter of the plurality of model parameters, processing… a parameter-specific input… ;
	But Gregor teaches: wherein the RNN is configured to operate coordinate-wise with respect to the model parameters (P. 2, col. 2, § 2, lines 4-7 starting at “the more efficient”. See also P. 3, § 2.2, whole first paragraph, particularly col. 2, lines 1-3 and 7-10. Each column of S is a coordinate.)
for each particular model parameter of the plurality of model parameters, processing… a parameter-specific input… ; (P. 3, col. 2, lines 1-3 and 7-10. Minimizing the energy with respect to one code component means gradients for a particular code component are the inputs.)
Gregor also teaches an RNN on p. 5, col. 2, first sentence. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used Gregor’s Coordinate Descent method to update the components one at a time in Hochreiter I/Hochreiter II’s system. A motivation for the combination is that Coordinate Descent produces a better approximation than updating all the coordinates at the same time for the same amount of computation. (Gregor, P. 3, § 2.2 to the end of col. 1)
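For orientation only, the coordinate-wise operation recited in claim 1 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation and not code from Hochreiter I, Hochreiter II, or Gregor: the names `rnn_step`, `coordinatewise_update`, `loss`, `loss_gradient`, and `run`, the three-scalar parameter tuple `phi`, and the sum-of-squares objective are all hypothetical, and a single tanh recurrent unit stands in for an LSTM. The same `phi` is applied to every model parameter, while each parameter keeps its own hidden state and receives only its own gradient as input.

```python
import math


def rnn_step(grad, hidden, phi):
    """Apply one step of a tiny recurrent unit to a single coordinate.

    phi = (w_grad, w_hidden, w_out) is a hypothetical parameter tuple that is
    shared by all coordinates; a tanh unit stands in for an LSTM here.
    """
    w_grad, w_hidden, w_out = phi
    new_hidden = math.tanh(w_grad * grad + w_hidden * hidden)
    update = w_out * new_hidden
    return update, new_hidden


def coordinatewise_update(theta, grads, hiddens, phi):
    """Update every model parameter independently with the shared unit."""
    new_theta, new_hiddens = [], []
    for value, grad, hidden in zip(theta, grads, hiddens):
        update, new_hidden = rnn_step(grad, hidden, phi)
        new_theta.append(value + update)  # theta_{t+1} = theta_t + g_t
        new_hiddens.append(new_hidden)
    return new_theta, new_hiddens


def loss(theta):
    """Toy objective f(theta) = sum of squares (assumed for illustration)."""
    return sum(value * value for value in theta)


def loss_gradient(theta):
    """Gradient of the toy objective with respect to each coordinate."""
    return [2.0 * value for value in theta]


def run(theta, phi, steps):
    """Apply the update rule for a number of time steps."""
    hiddens = [0.0] * len(theta)
    for _ in range(steps):
        theta, hiddens = coordinatewise_update(
            theta, loss_gradient(theta), hiddens, phi
        )
    return theta
```

With a hand-picked value such as `phi = (1.0, 0.0, -0.1)`, calling `run([1.0, -3.0, 0.5], phi, 50)` lowers the toy loss, and reordering the input parameters simply reorders the result. Training `phi` itself is omitted from the sketch.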

	Regarding CLAIM 2, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: 
The method of claim 1, 
Hochreiter I teaches: wherein applying the update rule for a final time step in the plurality of time steps to the model parameters generates trained values of the model parameters. (The BRI of “final time step” includes the final time step for a given sequence in the set of sequences $s_k$ disclosed on p. 89, § 2.1, lines 2-3. P. 89, last paragraph, last sentence teaches generating trained values of the model parameters.)

Regarding CLAIM 3, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: 
The method of claim 1, 
Hochreiter I teaches: wherein the machine learning model comprises a neural network. (P. 88, lines 4-6 from the end)

Regarding CLAIM 4, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
Hochreiter I teaches: wherein $\theta_t$ represents values of the model parameters at time $t$, (P. 91, § 3, lines 3-4 teach biases and weights. $\theta_t$ corresponds to the weights and biases at time step j-1, as taught by p. 89, § 2.1, lines 7-10, and on p. 88 in Fig. 1 and its caption, line 5)
$\nabla f(\theta_t)$ represents the gradient of objective function $f$, (P. 87, § 1, ¶ 2, lines 2-5 and the left side of Fig. 1 on p. 88 teach the supervisory system receives a target output y(j), which comprises a gradient.)
$\theta_{t+1}$ (This corresponds to the weights and biases at time step j, as taught by p. 89, § 2.1, line 4 and on p. 88, Fig. 1 and its caption)
However, Hochreiter I does not explicitly teach: wherein the determined update rule for the model parameters that minimizes the objective function is given by
$$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$$
wherein $\phi$ represents RNN parameters and $g_t$ represents the RNN output for a time step $t$.
But Hochreiter II teaches: wherein the determined update rule for the model parameters that minimizes the objective function is given by
$$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$$
wherein $\phi$ represents RNN parameters and $g_t$ represents the RNN output for a time step $t$. (“RNN parameters” are RNN-LSTM weights, denoted “w” in the memory cell in Fig. 1 on p. 1744, and the RNN-LSTM output is equation $y^{c_j}(t)$. According to p. 1739, lines 1-2, $w_{ij}$ is the weight on the connection from unit $j$ to $i$. The broadest reasonable interpretation of the claimed formula is that Hochreiter II’s LSTM updates model parameters at each time step by applying LSTM functions to current weights and input activations. Input gate activations are taught by p. 1743, § 4.1, ¶ 2, last 2 lines before the equations.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have updated the weights and biases of Hochreiter I’s RNN at each time step by inputting Hochreiter I’s target y(j) representing a gradient and Hochreiter II’s LSTM weights into Hochreiter II’s LSTM function. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)
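As a point of reference for reading the claimed formula, and not as part of the cited teachings, ordinary gradient descent with a fixed step size $\alpha$ (a symbol assumed here; it does not appear in the claim) has the same additive form when the update term is a fixed scaling of the gradient rather than a learned RNN output:

```latex
\theta_{t+1} = \theta_t + g_t, \qquad g_t = -\alpha \, \nabla f(\theta_t)
```

In the claimed rule, $g_t$ is instead produced by the RNN from $\nabla f(\theta_t)$ and the RNN parameters $\phi$.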

	Regarding CLAIM 6, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
	Hochreiter I teaches: … each model parameter (P. 91, § 3, lines 3-4 teaches biases and weights are model parameters)
	However, Hochreiter I does not explicitly teach: wherein the RNN implements separate activations for each model parameter.
	But Hochreiter II teaches: wherein the RNN implements separate activations for each model parameter. (P. 1743, § 4.1, last sentence before the equations teaches input gate activation and output gate activation for each time step.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used at least input and output gate activations in the LSTM. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines) 

	Regarding CLAIM 7, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
	However, neither Hochreiter I nor Gregor explicitly teaches: wherein the RNN is a long short-term memory (LSTM) neural network.
	But Hochreiter II teaches: wherein the RNN is a long short-term memory (LSTM) neural network. (P. 1736, ¶ 2, lines 1-3.)
	
	Regarding CLAIM 8, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 7, 
	However, neither Hochreiter I nor Gregor explicitly teaches: wherein the LSTM neural network comprises two LSTM layers.
But Hochreiter II teaches: wherein the LSTM neural network comprises two LSTM layers. (Two layers are taught by the two pairs of cell blocks in the hidden layer of Fig. 2 on p. 1745, and by the caption – “two memory cell blocks of size 2”)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have included two pairs of cell blocks to form two LSTM layers because “Different types of units may convey useful information about the current state of the net” and because “It is up to the user to define the network topology”. (Hochreiter II, p. 1744, middle paragraph)

	Regarding CLAIM 9, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 7, 
Hochreiter I teaches: wherein the supervisory system shares one or more of the plurality of supervisory system parameters…  (P. 91, § 3, lines 3-4 teaches biases and weights)
However, Hochreiter I does not explicitly teach: wherein the LSTM neural network shares one or more of the plurality of RNN parameters across different coordinates of the objective function.
But Hochreiter II teaches: wherein the LSTM neural network shares one or more of the plurality of RNN parameters (P. 1736, ¶ 2, lines 1-3; § 4 is titled “The concept of Long Short-Term Memory” and § 4.1 teaches the architecture of an LSTM. A plurality of RNN parameters are weights, denoted “w” in the memory cell in Fig. 1 on p. 1744. According to p. 1739, lines 1-2, $w_{ij}$ is the weight on the connection from unit $j$ to $i$.)
However, neither Hochreiter I nor Hochreiter II explicitly teaches: shares… across different coordinates of the objective function.
But Gregor teaches: shares… across different coordinates of the objective function. (Gregor teaches Coordinate Descent method at P. 2, col. 2, § 2 and P. 3, § 2.2, first paragraph. Although only one component updates while the other components remain constant, the constant components still participate in the optimization calculations.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used Gregor’s Coordinate Descent method to update the components one at a time. A motivation for the combination is that Coordinate Descent produces a better approximation than updating all the coordinates at the same time for the same amount of computation. (Gregor, P. 3, § 2.2 col. 1)
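The remark that the held-constant components still participate in the calculation can be seen in a generic coordinate descent sketch. This Python example is an assumption for illustration only: it minimizes a small quadratic rather than Gregor’s sparse coding energy, and the names `coordinate_descent`, `A`, and `b` are hypothetical.

```python
def coordinate_descent(A, b, sweeps=50):
    """Minimize 0.5 * x^T A x - b^T x one coordinate at a time.

    A is assumed symmetric positive definite (list of rows); b is a list.
    Each inner step changes only x[k], yet reads every other x[j].
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for k in range(n):
            # Contribution of the coordinates that are held constant.
            others = sum(A[k][j] * x[j] for j in range(n) if j != k)
            # Exact minimizer along coordinate k with the rest fixed.
            x[k] = (b[k] - others) / A[k][k]
    return x
```

For example, `coordinate_descent([[2.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` approaches the solution of the linear system `A x = b`, which is the minimizer of the quadratic.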

	Regarding CLAIM 12, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
	Hochreiter I teaches: the model parameters. (P. 91, § 3, lines 3-4 teaches biases and weights are model parameters.)
	However, Hochreiter I does not explicitly teach: wherein the RNN is invariant to an order of the model parameters.
	But Hochreiter II teaches: wherein the RNN is invariant to an order of the model parameters (P. 1736, ¶ 2, lines 1-3 teaches an RNN. P. 1738, § 3.1.1 teaches $y^i(t) = f_i(net_i(t))$ is the activation of noninput unit i with differentiable activation function $f_i$, $net_i(t) = \sum_j w_{ij} y^j(t-1)$ is unit i’s current net input, and $w_{ij}$ is the weight on the connection from unit j to i. Each connection has a specific corresponding weight.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date to have applied Hochreiter II’s mathematical descriptions of net input to Hochreiter I’s subordinate RNN. 
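For illustration only, the net-input relation cited from Hochreiter II (unit i’s net input at time t is the weighted sum of the unit activations at time t-1) can be sketched as follows. The function name `net_input` and the numeric values are hypothetical and are not taken from either reference.

```python
def net_input(weights, prev_activations):
    """Return net_i(t) = sum_j w_ij * y_j(t-1) for every unit i.

    weights[i][j] is the weight on the connection from unit j to unit i.
    prev_activations[j] is unit j's activation at the previous time step.
    """
    return [
        sum(w_ij * y_j for w_ij, y_j in zip(row, prev_activations))
        for row in weights
    ]


# Hypothetical two-unit example (values chosen only for illustration).
w = [[0.5, -1.0],
     [2.0, 0.25]]
y_prev = [1.0, 2.0]
net = net_input(w, y_prev)
```

Each row of `w` holds the incoming weights of one unit, so each connection has its own corresponding weight, as the citation states.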

	Regarding CLAIM 13, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
	However, Hochreiter I does not explicitly teach: further comprising providing a previous hidden state of the RNN as input to the RNN at each time step.
	But Hochreiter II teaches: further comprising providing a previous hidden state of the RNN as input to the RNN at each time step. (P. 1744, Fig. 1 caption teaches: “The self-recurrent connection (with weight 1.0) indicates feedback with a delay of one time step.” Also see p. 1744, middle paragraph, sentence in lines 6-7. Lastly, the last equation on p. 1744 teaches the internal state at time t is a function of the internal state at time t-1 for t > 0. )
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used an RNN with a self-recurrent connection because it forms the basis of the constant error carrousel (CEC), which is the LSTM’s central feature. (P. 1744, Fig. 1 caption, line 3 and P. 1742, § 3.2.2)
	Gregor also teaches providing a previous hidden state of the RNN as input to the RNN at each time step by the RNN back-propagation algorithm disclosed at p. 4, col. 2 (“time-unfolded recurrent neural network”), Fig. 1b, and p. 5, col. 2 to § 3.4.

	Regarding CLAIM 14, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
Hochreiter I teaches: wherein, at each time step, the update rule for the time step depends on a hidden state of the supervisory system for the time step (The update rule for the supervisory system depends on the BPTT or RTRL. Refer to P. 89, second paragraph, lines 1-4; §2.1, lines 4-7 and ¶ 2, lines 1-4.)
However, neither Hochreiter I nor Gregor explicitly teaches: a hidden state of the RNN for the time step
But Hochreiter II teaches: a hidden state of the RNN for the time step (P. 1744, last 2 equations teach that the output of an RNN is $y^{c_j}(t)$ and that it depends on a summation $\mathit{net}_{c_j}(t)$. The first 4 lines of p. 1477 teach this summation may depend on conventional hidden units.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated a hidden state of the RNN for each time step because a hidden unit “may convey useful information about the current state of the net.” (Hochreiter II, P. 1744, middle paragraph, lines 2-4)

	Claim 17 recites the same features as claim 1. Claim 17 also recites a system comprising one or more computers and one or more storage devices storing instructions that are operable to cause the computer to perform operations. Hochreiter I teaches experiments performed on a computer in § 3, pages 91-93. Hochreiter I’s computer inherently teaches a system comprising one or more computers and one or more storage devices storing instructions that are operable to cause the computer to perform operations. Claim 17 is rejected for the reasons set forth in the rejection of claim 1.

	Claim 18 recites the same features as claim 1. Claim 18 also recites one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations. Hochreiter I teaches experiments performed on a computer in § 3, pages 91-93. Hochreiter I’s computer inherently teaches one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations. 
Claim 18 is rejected for the reasons set forth in the rejection of claim 1.
	
	Claim 20 recites the same features as claim 6. Claim 20 is dependent on claim 18, which also recites one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations. Hochreiter I teaches these limitations; please see the rejection of claim 18 for citations.
	Claim 20 is rejected under 35 U.S.C. 103 for the reasons set forth in the rejection of claim 6.

Regarding CLAIM 21, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
Hochreiter I teaches: the plurality of model parameters (P. 88, sentence in lines 6-9 and Fig. 1 teach “model parameters” are parameters of Hochreiter I’s subordinate system, which is a recurrent neural network (RNN). The network has biases and weights as taught by p. 91, § 3, lines 3-4.)
However, neither Hochreiter I nor Gregor explicitly teaches: wherein the RNN shares one or more of the plurality of RNN parameters across the plurality of model parameters.
	But Hochreiter II teaches: wherein the RNN shares one or more of the plurality of RNN parameters across the plurality of model parameters. (A plurality of RNN parameters are weights, denoted “w” in the memory cell in Fig. 1 on p. 1744. According to p. 1739, lines 1-2, $w_{ij}$ is the weight on the connection from unit $j$ to $i$. The equations throughout § 4.1 disclose applying RNN weights to inputs to generate outputs.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied Hochreiter II’s LSTM weights to Hochreiter I’s subordinate system, where Hochreiter II’s LSTM is Hochreiter I’s supervisory system. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)

Regarding CLAIM 22, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
Hochreiter I teaches: the plurality of model parameters (P. 88, sentence in lines 6-9 and Fig. 1 teach “model parameters” are parameters of Hochreiter I’s subordinate system, which is a recurrent neural network (RNN). The network has biases and weights as taught by p. 91, § 3, lines 3-4.)
However, Hochreiter I does not explicitly teach: wherein the RNN shares one or more of the plurality of RNN parameters across the plurality of model parameters and maintains a separate hidden state for each of the plurality of model parameters.
	But Hochreiter II teaches: wherein the RNN shares one or more of the plurality of RNN parameters across the plurality of model parameters. (A plurality of RNN parameters are weights, denoted “w” in the memory cell in Fig. 1 on p. 1744. According to p. 1739, lines 1-2, $w_{ij}$ is the weight on the connection from unit $j$ to $i$. The equations throughout § 4.1 disclose applying RNN weights to inputs to generate outputs.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied Hochreiter II’s LSTM weights to Hochreiter I’s subordinate system, where Hochreiter II’s LSTM is Hochreiter I’s supervisory system. A motivation for the combination is that LSTM leads to many more successful runs and learns much faster than RTRL and BPTT, and LSTM solves complex, artificial long-time-lag tasks which the other algorithms cannot solve. (Hochreiter II, p. 1735, Abstract, last 6 lines)
However, neither Hochreiter I nor Hochreiter II explicitly teaches: wherein the RNN… maintains a separate hidden state for each of the plurality of model parameters.
	But Gregor teaches: wherein the RNN… maintains a separate hidden state for each of the plurality of model parameters. (P. 3, col. 2, lines 1-3 and 7-10 teaches updating one component at a time, which implies the hidden state for only the given component is updated.)
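For illustration only, the limitation of claim 22 (RNN parameters shared across the model parameters, with a separate hidden state maintained for each model parameter) can be sketched as follows. This is a minimal sketch under stated assumptions: the toy cell `shared_cell`, the parameter tuple `phi`, and all numeric values are hypothetical and are not taken from the claims or from any cited reference.

```python
import math


def shared_cell(grad, hidden, phi):
    """One step of a toy recurrent cell (hypothetical, not from any reference).

    phi = (w_g, w_h, w_out) is the single parameter set shared by all coordinates.
    """
    w_g, w_h, w_out = phi
    new_hidden = math.tanh(w_g * grad + w_h * hidden)
    update = w_out * new_hidden
    return update, new_hidden


def coordinatewise_step(grads, hiddens, phi):
    """Apply the same cell to every coordinate, each with its own hidden state."""
    updates, new_hiddens = [], []
    for grad, hidden in zip(grads, hiddens):
        update, new_hidden = shared_cell(grad, hidden, phi)
        updates.append(update)
        new_hiddens.append(new_hidden)
    return updates, new_hiddens


# Two model parameters (coordinates), one shared phi, two separate hidden states.
phi = (1.0, 0.5, -0.1)
updates, hiddens = coordinatewise_step([0.0, 1.0], [0.0, 0.0], phi)
```

The same `phi` is passed to every call of `shared_cell`, while the list `hiddens` keeps one entry per coordinate.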

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed 02/14/2022), hereinafter Hochreiter II, Gregor et al. (“Learning Fast Approximations of Sparse Coding”), and Balduzzi et al. (“Strongly-Typed Recurrent Neural Networks”, see PTO-892 filed 08/11/2021).

Regarding CLAIM 10, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 7, 
Hochreiter II teaches multiple cells in LSTM layers in Fig. 2 on p. 1745. However, neither Hochreiter I nor Hochreiter II explicitly teaches: wherein a subset of cells in each of one or more LSTM layers of the LSTM neural network comprise global average units, wherein a global average unit is a unit whose update includes averaging activations of the global average units globally at each time step across different coordinates of the objective function.
	But Gregor teaches: … at each time step across different coordinates of the objective function. (P. 3, col. 2, lines 1-3 and 7-10. Each column of S is a coordinate.)
	However, neither Hochreiter I, Hochreiter II, nor Gregor explicitly teaches: wherein a subset of cells in each of one or more LSTM layers of the LSTM neural network comprise global average units, wherein a global average unit is a unit whose update includes averaging activations of the global average units globally
But Balduzzi teaches: wherein a subset of cells in each of one or more LSTM layers of the LSTM neural network comprise global average units, wherein a global average unit is a unit whose update includes averaging activations of the global average units globally (P. 5, col. 1, last 3 lines, and also the end of col. 2 which states “T-LSTMs store expectations in private memory cells that are reweighted by the output gate when publicly broadcast.”)
	Balduzzi is in the same field of endeavor as the claimed invention, namely machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Balduzzi’s system into the combination of Hochreiter I, Hochreiter II, and Gregor's system by performing average-pooled convolution, with a motivation to stop gradients from exploding. (P. 6 col. 1: “It follows immediately that gradients will not explode for T-RNNs or LSTMs.”)

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed 02/14/2022), hereinafter Hochreiter II, Gregor et al. (“Learning Fast Approximations of Sparse Coding”), and Miranda et al. (“Multi-Objective Optimization for Self-Adjusting Weighted Gradient in Machine Learning Tasks”, see PTO-892 filed 02/14/2022).

Regarding CLAIM 15, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 
	Hochreiter I teaches: where $\theta_{t+1} = \theta_t + g_t$, … $f(\theta_t)$ represents the objective function that depends on the model parameters $\theta$ at time $t$, (P. 91, § 3, lines 3-4 teaches biases and weights. This citation corresponds to $\theta_t$, the weights and biases at one previous time step j-1 (p. 89, § 2., line 7). Updating is taught by p. 89, last paragraph, lines 1-5. The objective function is a subordinate routine or learning algorithm, taught on p. 88, lines 4-6 from the end.)
	$w_t \in \mathbb{R}_{\geq 0}$ represents weights associated with each time step $t$, (P. 91, § 3, lines 3-4 teaches weights)
	$g_t$ represents a supervisory system output for time $t$, (P. 87, ¶ 2, lines 2-7 teaches a supervisory system output over time.)
	$\nabla_t = \nabla_\theta f(\theta_t)$, (P. 87, § 1, ¶ 2, lines 2-5 and the left side of Fig. 1 on p. 88 teach the supervisory system receives a target output y(j), which comprises a gradient.)
	However, Hochreiter I does not explicitly teach: wherein the RNN objective function is given by
$$L(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$$
where $\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi)$, $\phi$ represents the RNN parameters, $g_t$ represents a RNN output for time $t$, $h_t$ represents a hidden state of the RNN at time $t$, $m$ represents the RNN… and $\mathbb{E}_f$[.] represents an expected value.
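For illustration only, the recited relations (the update $\theta_{t+1} = \theta_t + g_t$, the RNN step producing $g_t$ and $h_{t+1}$ from $\nabla_t$, $h_t$, and $\phi$, and the weighted sum of $f(\theta_t)$ over $T$ time steps) can be sketched as follows. This is a minimal sketch under stated assumptions: a scalar model parameter, an initial value `theta0` before the first update, and an expectation approximated by averaging over a list of sampled objectives. The names `unrolled_loss`, `expected_loss`, and `toy_m` are hypothetical; `toy_m` is a stateless stand-in for the RNN $m$ and is not taken from the claims or from any cited reference.

```python
def unrolled_loss(f, grad_f, m, theta0, h0, phi, step_weights):
    """Return sum over t of w_t * f(theta_t) for one objective f."""
    theta, h, total = theta0, h0, 0.0
    for w_t in step_weights:               # one iteration per time step
        g, h = m(grad_f(theta), h, phi)    # RNN maps (gradient, hidden, phi) to (g, next hidden)
        theta = theta + g                  # additive update of the model parameter
        total += w_t * f(theta)            # weighted objective value at this step
    return total


def expected_loss(sampled_objectives, m, theta0, h0, phi, step_weights):
    """Average the unrolled loss over sampled (f, grad_f) pairs."""
    losses = [
        unrolled_loss(f, grad_f, m, theta0, h0, phi, step_weights)
        for f, grad_f in sampled_objectives
    ]
    return sum(losses) / len(losses)


def toy_m(grad, h, phi):
    """Hypothetical stand-in for the RNN: scaled negative gradient, hidden state unchanged."""
    return -phi * grad, h


# Hypothetical objective f(theta) = theta^2 with gradient 2 * theta.
quadratic = (lambda th: th * th, lambda th: 2.0 * th)
loss = expected_loss([quadratic], toy_m, theta0=1.0, h0=0.0,
                     phi=0.25, step_weights=[1.0, 1.0])
```

With these hypothetical values the model parameter moves from 1.0 to 0.5 to 0.25, so the weighted sum is 0.25 + 0.0625 = 0.3125.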
	But Hochreiter II teaches: where $\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi)$, $\phi$ represents the RNN parameters, $g_t$ represents an RNN output for time $t$, $h_t$ represents a hidden state of the RNN at time $t$, $m$ represents the RNN (§ 4.1 on pp. 1743-4 and Fig. 1 disclose an LSTM network having weight parameters, an output, a hidden state, and whose output and next hidden state are a function of the input (corresponding to $\nabla_t$ in the claim), the hidden state, and the weight parameters of the LSTM.)
	However, neither Hochreiter I nor Hochreiter II explicitly teaches: wherein the RNN objective function is given by
$$L(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$$
and $\mathbb{E}_f$[.] represents an expected value.
But Miranda teaches: wherein the RNN objective function is given by
$$L(\phi) = E_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$$
and $E_f[\cdot]$ represents an expected value. (Refer to Equation 1 on p. 2 and the sentence surrounding it. Since the claim allows for a single objective function, the limitation has been met.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Miranda's system into the combination of Hochreiter I and II’s system by solving the optimization problem using Miranda’s Equation 1, with a motivation to efficiently solve the optimization problem when the Pareto frontier is convex. (Miranda, p. 3, lines 1-2.)
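
For illustration only, the objective function attributed to Miranda above (the expected value, over functions f, of the weighted sum of f(θ_t) for t = 1 to T) can be approximated by sampling. The sketch below is not taken from any cited reference. The names `sample_quadratic` and `halving_optimizer` are hypothetical stand-ins for a distribution over functions and for the optimizer that produces the iterates θ_t.

```python
import random


def weighted_trajectory_loss(f, thetas, weights):
    """Return the sum over t of w_t * f(theta_t) for one trajectory."""
    if len(thetas) != len(weights):
        raise ValueError("thetas and weights must have the same length T")
    return sum(w * f(theta) for w, theta in zip(weights, thetas))


def estimate_objective(sample_f, run_optimizer, weights, num_samples=1000, seed=0):
    """Monte Carlo estimate of E_f[ sum_t w_t * f(theta_t) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        f = sample_f(rng)                        # draw one function f
        thetas = run_optimizer(f, len(weights))  # iterates theta_1 .. theta_T
        total += weighted_trajectory_loss(f, thetas, weights)
    return total / num_samples


def sample_quadratic(rng):
    """Hypothetical distribution over f: f(theta) = a * theta**2, a in {1, 3}."""
    a = rng.choice([1.0, 3.0])
    return lambda theta: a * theta * theta


def halving_optimizer(f, num_steps):
    """Hypothetical optimizer: ignores f and returns 1, 0.5, 0.25, ..."""
    return [0.5 ** t for t in range(num_steps)]


if __name__ == "__main__":
    weights = [1.0, 1.0, 1.0]  # w_t = 1 for every step, an assumption
    print(estimate_objective(sample_quadratic, halving_optimizer, weights))
```

Setting every w_t to 1 weights all steps of the trajectory equally; other weightings (for example, nonzero only at t = T) fit the same form.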

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Hochreiter et al. (“Learning to Learn Using Gradient Descent”, see IDS filed 04/28/2021, NPL doc. 2), hereinafter Hochreiter I, in view of Hochreiter et al. (“Long Short-Term Memory”, see PTO-892 filed 02/14/2022), hereinafter Hochreiter II, Gregor et al. (“Learning Fast Approximations of Sparse Coding”), and Dursun et al. (US 20180025269 A1). 

Regarding CLAIM 16, the combination of Hochreiter I, Hochreiter II, and Gregor teaches: The method of claim 1, 		
Hochreiter I teaches: gradients (P. 87, § 1, ¶ 2, lines 2-5 and the left side of Fig. 1 on p. 88 teach the supervisory system receives a target output y(j), which comprises a gradient.)
	However, Hochreiter I does not explicitly teach: further comprising preprocessing the input to the RNN to disregard gradients that are smaller than a predetermined threshold.
	But Hochreiter II teaches: the RNN (P. 1736, ¶ 2, lines 1-3)
	However, neither Hochreiter I, Hochreiter II, nor Gregor explicitly teaches: further comprising preprocessing the input to the RNN to disregard gradients that are smaller than a predetermined threshold. 
	But Dursun teaches: further comprising preprocessing the input to the RNN to disregard gradients that are smaller than a predetermined threshold. (¶ [0041]-[0042] teach a pre-processing step that includes removing input data that falls below a predetermined threshold value.)
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have removed Hochreiter I’s gradient inputs based on Dursun’s predetermined threshold, with a motivation to eliminate noisy data. (Dursun ¶ [0041], lines 1-3)
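
For illustration only, the thresholding limitation discussed above can be sketched as a preprocessing filter applied to gradient values before they are passed on as input. The sketch is not taken from any cited reference. It assumes that "smaller" refers to absolute magnitude and that disregarded gradients are zeroed rather than dropped, so that the input keeps its length. The function name and threshold value are hypothetical.

```python
def disregard_small_gradients(gradients, threshold):
    """Zero out gradient values whose magnitude is below the threshold.

    Assumption: comparison is on absolute value, and disregarded entries
    are replaced by 0.0 so the output has the same length as the input.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [g if abs(g) >= threshold else 0.0 for g in gradients]


if __name__ == "__main__":
    raw = [0.5, -0.001, 0.02, -0.3]
    print(disregard_small_gradients(raw, threshold=0.01))
```

Removing the small entries entirely, instead of zeroing them, would be an equally plausible reading; that variant changes the input length and is not shown.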

Response to Arguments
Applicant’s arguments filed 05/17/2022 with respect to the 35 U.S.C. 103 rejections of claims 1-4, 6-10, 12-18, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant’s arguments with respect to claims 5 and 19 are moot because the claims have been canceled.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Wen et al. (“Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition”) in Fig. 3 teaches a first convolution unit having a first hidden state and a second frozen convolution unit having a second hidden state.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127                                                                                                                                                                                                        

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127