DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	This communication is in response to Applicant’s submission filed 20 October 2021 [hereinafter Response], where:
Claims 11, 18, 21, 25-27, 31, and 34 have been amended.
Claims 1-10, 12, 14-17, 22-24, 28-30, and 32 are cancelled.
New claims 35-37 are presented for consideration.
Claims 11, 13, 18-21, 25-27, 31, and 33-37 are pending.
Claims 11, 13, 18-21, 25-27, 31, and 33-37 are allowed.
Claim Rejections - 35 U.S.C. § 112
3.	The rejections under Section 112(b) to claims 11, 21, 27, and 30 are withdrawn in view of Applicant’s amendments to these claims. Claims 15, 17, 22, 24, and 28 have been cancelled by Applicant. Accordingly, the rejections to these claims are now moot.
The rejections under Section 112(b) to claims 13, 18, 19, 20, 29, 31, 33, and 34 are also withdrawn in that the amendments to claims 1, 9, and 16 cure the deficiencies therein.
Allowable Subject Matter
4.	Claims 11, 13, 18-21, 25-27, 31, and 33-37 are allowed.
Reasons for Allowance
5.	The following is the Examiner’s statement of reasons for allowance:
Instant claim 11, used as an exemplar claim, recites, inter alia, an “advantage estimate” for a training action as follows:
A method of training a policy neural network of a reinforcement learning system . . . , the method comprising:
* * *
generating an advantage estimate for the training action that was performed by the agent in response to the training observation from the determined distance in the continuous domain between i) the output action in the set of actions that lie on the continuous domain that is obtained as output from the policy neural network by processing the training observation and ii) the training action that was performed by the agent in response to the training observation, comprising,
processing, by a function parameter neural network of the reinforcement learning system, the training observation to generate an output that defines values of the state-dependent parameters; and
applying a function having state-dependent parameters to the distance between the output action in the set of actions that lie on the continuous domain and the training action that was performed in response to the training observation, wherein the advantage estimate satisfies:
$$A(x, u \mid \theta^{A}) = -\frac{1}{2}\,\bigl(u - \mu(x \mid \theta^{\mu})\bigr)^{T}\, P(x \mid \theta^{P})\,\bigl(u - \mu(x \mid \theta^{\mu})\bigr)$$
where $\bigl(u - \mu(x \mid \theta^{\mu})\bigr)^{T}$ is a transpose of the distance between the output action in the set of actions that lie on the continuous domain and the training action that was performed in response to the training observation, P is a state-dependent parameter matrix that has entries defined by the values of the set of state-dependent parameters, and $\bigl(u - \mu(x \mid \theta^{\mu})\bigr)$ is the distance between the output action in the set of actions that lie on the continuous domain and the training action that was performed in response to the training observation; and
generating a Q value for the training action performed in response to the training observation by combining the advantage estimate for the training action performed in response to the training observation and the first value estimate that is an estimate of an expected return resulting from the environment being in the training state characterized by the training observation irrespective of which action is performed in response to the training observation;
processing the subsequent observation using the value neural network to generate a new value estimate for the subsequent state, the new value estimate being an estimate of an expected return resulting from the environment being in the subsequent state; 
* * *
(Instant claim 11 (emphasis added); see also instant claims 21 & 27). Also, in support thereof, the specification explains that the “advantage estimate” is directed to the problem of excessive computational overhead in continuous action spaces, where the set of all possible actions in the subsequent state is uncountable. The specification recites:
In continuous spaces of actions, the set of all possible actions in the subsequent state are uncountable. This often results in identifying the argmax being computationally infeasible or, at the least, very computationally intensive. To address this problem, the reinforcement learning system 100 can calculate the Q value for an action in response to a particular observation based on the value estimate of the particular state. In particular, . . . because of the way the advantage estimates are determined, the advantage estimate for the argmax action is always zero and the reinforcement learning system can determine the target output using only the value estimate, which depends only on the observation and does not require processing multiple actions from the continuous action space. Thus, the reinforcement learning system can effectively train the function parameter subnetwork 110, the value subnetwork 111, and the policy subnetwork 112 using a deep Q learning technique even though the action space is continuous.
(PGPUB¹ ¶ 0031 (emphasis added)). That is, the Q value (i.e., the expected return) is generated by “combining the advantage estimate for the particular action and the value estimate of the current state. In some implementations, the system adds the advantage estimate for a particular action and the value estimate for a particular state to generate the Q value for the particular action in the particular state.” (PGPUB ¶ 0044).
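For illustration only, the relationship recited above can be sketched in a few lines of Python. This is a minimal sketch and not Applicant's implementation: the function names `advantage` and `q_value`, the example numbers, and the assumption that P is positive definite are all hypothetical and are not taken from the claims or the specification. The sketch assumes the outputs μ(x|θ^μ), P(x|θ^P), and the value estimate have already been computed for a given observation x.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def advantage(u: Vector, mu: Vector, P: Matrix) -> float:
    """Quadratic advantage A = -1/2 (u - mu)^T P (u - mu).

    mu and P stand for the policy output and the state-dependent
    parameter matrix, both already evaluated at the observation x.
    """
    # Distance between the training action u and the output action mu.
    d = [ui - mi for ui, mi in zip(u, mu)]
    # Matrix-vector product P d.
    Pd = [sum(p * dj for p, dj in zip(row, d)) for row in P]
    # Inner product d^T (P d), scaled by -1/2.
    return -0.5 * sum(di * pdi for di, pdi in zip(d, Pd))


def q_value(u: Vector, mu: Vector, P: Matrix, v: float) -> float:
    """Q value formed by adding the advantage estimate to the value estimate v."""
    return v + advantage(u, mu, P)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
mu = [1.0, -2.0]                  # output action from the policy network
P = [[2.0, 0.0], [0.0, 4.0]]      # assumed positive definite
v = 3.0                           # value estimate for the state

q_off_policy = q_value([2.0, -1.0], mu, P, v)  # action differs from mu
q_at_mu = q_value(mu, mu, P, v)                # action equals mu
```

Under the positive-definiteness assumption, the advantage is zero when u equals μ and negative otherwise, so `q_at_mu` equals the value estimate `v`. This is consistent with the quoted statement that the target output can be determined using only the value estimate, without searching the continuous action space.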
The closest art of record, Das, teaches reward-based learning methodologies, including well-known Reinforcement Learning (RL) techniques, to generate effective policies for management of a system. Specifically, Das teaches the use of distance metrics in (Ns + Na)-dimensional metric space, where Ns is the dimensionality of states, and Na is the dimensionality of actions. (Das ¶ 0018). The distance metric is used to compute a Non-Linear Dimensionality Reduction (NLDR) mapping of (state, action) pairs into a lower-dimensional representation. (Das ¶ 0005).
In an embodiment, Das teaches a Mahalanobis distance $\bigl((x_i - x_j)\, M\, (x_i - x_j)^{T}\bigr)^{1/2}$ as the distance metric, where M is a positive semi-definite matrix and $(x_i - x_j)^{T}$ denotes the transpose of the difference vector $(x_i - x_j)$. (Das ¶ 0020). However, Das does not teach the “advantage estimate” of Applicant’s claims.
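For comparison, the Mahalanobis distance attributed to Das above can be sketched as follows. This is an illustrative sketch only: the function name `mahalanobis` and the sample vectors are hypothetical and do not appear in Das. The function returns a non-negative distance between two (state, action) points. It does not involve a policy output, state-dependent parameters, or a value estimate.

```python
import math
from typing import Sequence


def mahalanobis(xi: Sequence[float], xj: Sequence[float],
                M: Sequence[Sequence[float]]) -> float:
    """Distance ((xi - xj) M (xi - xj)^T)^(1/2) for positive semi-definite M."""
    # Difference vector between the two points.
    d = [a - b for a, b in zip(xi, xj)]
    # Matrix-vector product M d.
    Md = [sum(m * dk for m, dk in zip(row, d)) for row in M]
    # Square root of the quadratic form d M d^T.
    return math.sqrt(sum(dk * mdk for dk, mdk in zip(d, Md)))


# With M set to the identity matrix, the metric reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
dist = mahalanobis([3.0, 4.0], [0.0, 0.0], identity)
```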
Any comments considered necessary by Applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
6.	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI, can be reached at 571-272-3719. The fax number for the organization to which this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.L.S./
Examiner, Art Unit 2122


/KAKALI CHAKI/
Supervisory Patent Examiner, Art Unit 2122

¹ US Published Application 20170228662 to Gu et al., entitled “Reinforcement learning using advantage estimates,” filed 09 February 2017 [hereinafter PGPUB].