DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is responsive to the original application filed on 6/19/2018 and the claims and AFCP 2.0 request filed on 1/27/2022.

EXAMINER'S AMENDMENT

An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.

Authorization for this examiner’s amendment was given in an interview with Gaspare Randazzo (Reg. No. 41528) on 2/9/2022.

The application has been amended as follows: 




Please amend the claim set, filed on 8/18/2021, with the following amendments:
 
1.  (Currently amended)  A computer-implemented method for learning an action policy, comprising:
obtaining, by a processor performing model-free inverse reinforcement learning, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state;
training, by the processor using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and
learning, by the processor performing behavioral cloning, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information,
wherein weight parameters of the dynamics model are fixed in the learning of the action policy.
 
2.  (Previously presented)  The computer-implemented method of claim 1, wherein said obtaining step comprises:
learning a predictor which predicts a next state using the trajectories of the expert states; and
performing the model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
 
3.  (Original)  The computer-implemented method of claim 2, wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method.
 
4.  (Previously presented)  The computer-implemented method of claim 2, wherein the predictor is learned using a machine learning mechanism selected from a group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM).
 
5.  (Original)  The computer-implemented method of claim 4, wherein the machine learning mechanism comprises a plurality of machine learning mechanisms that, in turn, form a time-series predictive model for predicting the next state using the trajectories of the expert states.
 
6.  (Cancelled)
 
7.  (Original)  The computer-implemented method of claim 1, wherein said training step uses closed-loop training to train the dynamics model.
 
8.  (Original)  The computer-implemented method of claim 1, wherein said obtaining step is performed during a model-free exploration stage of the method.
 
9.  (Original)  The computer-implemented method of claim 1, wherein the error gradients comprise policy gradients with respect to a corresponding action to the policy gradients and in an absence of an expert action corresponding to the policy gradients.
 
10.  (Original)  The computer-implemented method of claim 1, further comprising performing one obstacle avoidance using the trained dynamics model.
 
11.  (Original)  The computer-implemented method of claim 1, further comprising performing transfer learning between at least two agents using the trained dynamic model.
 
12.  (Original)  The computer-implemented method of claim 1, wherein said learning step is performed in an absence of expert actions corresponding to the expert states.
 
13.  (Original)  The computer-implemented method of claim 1, wherein the pair of the state and the action is obtained as the input to the dynamics model from a model-based policy map.
 
14.  (Original)  The computer-implemented method of claim 1, further comprising controlling a hardware object to perform an action involving movement responsive to the learned action policy.
 
15.  (Currently amended)  A computer program product for learning an action policy, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:
obtaining, by a processor of the computer performing model-free inverse reinforcement learning, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state;
training, by the processor using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and
learning, by the processor performing behavioral cloning, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information,
wherein weight parameters of the dynamics model are fixed in the learning of the action policy.
 
16.  (Previously presented)  The computer program product of claim 15, wherein said obtaining step comprises:
learning a predictor which predicts a next state using the trajectories of the expert states; and
performing the model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
 
17.  (Cancelled)
 
18.  (Currently amended)  A computer processing system for learning an action policy, comprising:
            a memory for storing program code; and
            a processor, operatively coupled to the memory, for running the program code to
obtain, by performing model-free inverse reinforcement learning, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state;
train, using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and
learn, by performing behavioral cloning, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information,
wherein weight parameters of the dynamics model are fixed in the learning of the action policy.
 
19.  (Previously presented)  The computer processing system of claim 18, wherein the environment dynamics are obtained by learning a predictor which predicts a next state using the trajectories of the expert states, and performing the model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
 
20.  (Cancelled)
 
21.  (Currently amended)  A computer-implemented method for learning an action policy, comprising:
learning, by a processor, a predictor which predicts a next state using trajectories of expert states; and
performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor to sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state;
training, by the processor using the sampled environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities to provide a trained dynamics model[[,]];
learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information,
wherein weight parameters of the dynamics model are fixed in the learning of the action policy.
 
22.  (Cancelled)
 
23.  (Cancelled)
 
24.  (Original)  A non-transitory article of manufacture tangibly embodying a computer readable program which when executed causes a computer to perform the steps of claim 21.
 
25.  (Currently amended)  A computer processing system for learning an action policy, comprising:
            a memory for storing program code; and
            a processor, operatively coupled to the memory, for running the program code to
learn a predictor which predicts a next state using trajectories of expert states; and
perform model-free inverse reinforcement learning using rewards estimated by using the predictor to sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state;
train, using the sampled environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities to provide a trained dynamics model[[,]];
learn the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information,
wherein weight parameters of the dynamics model are fixed in the learning of the action policy.

Reasons for Allowance

Claims 1-5, 7-16, 18-19, 21, and 24-25, in view of the Examiner’s Amendment above, are allowed.

The following is an examiner’s statement of reasons for allowance: None of the prior art, either alone or in combination, teaches the limitations of the claims, particularly:

Claims 1, 15, 18, 21, and 25

The combination of performing model-free inverse reinforcement learning to sample environment dynamics, training a dynamics model, using an encoder-decoder network to learn the trajectories of expert states without action information, and fixing weight parameters of the dynamics model in the learning of the action policy, when taken in the context of the remaining elements of the independent claims and considered as a whole, is not taught by the prior art.
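
For illustration only, and not as a characterization of the claims or of the Applicant's disclosure, the following Python sketch shows one way an action policy might be learned from expert states alone by back-propagating an error gradient through a dynamics model whose parameters are held fixed. Everything in it is an assumption made for the example: the four-state chain environment, the function names make_chain_dynamics and learn_policy, the tabular softmax policy, and the cross-entropy loss. The model-free inverse reinforcement learning stage and the encoder-decoder network recited in the claims are omitted, and the transition table P merely stands in for an already-trained dynamics model.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def make_chain_dynamics(n_states=4, slip=0.1):
    """Hypothetical stand-in for a trained dynamics model.

    P[s, a, s2] is the probability of reaching state s2 from state s
    under action a (0 = left, 1 = right) on a small chain.
    """
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        left, right = max(s - 1, 0), min(s + 1, n_states - 1)
        P[s, 0, left] += 1.0 - slip
        P[s, 0, s] += slip
        P[s, 1, right] += 1.0 - slip
        P[s, 1, s] += slip
    return P

def learn_policy(P, expert_states, steps=500, lr=0.5):
    """Fit policy logits from expert states only (no expert actions).

    The loss is the cross-entropy between the expert's next state and the
    next-state distribution predicted by the dynamics model under the
    policy. Gradients flow through P, but P itself is never updated.
    """
    n_states, n_actions, _ = P.shape
    theta = np.zeros((n_states, n_actions))  # policy logits (assumed tabular)
    pairs = list(zip(expert_states[:-1], expert_states[1:]))
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for s, s_next in pairs:
            pi = softmax(theta[s])            # action probabilities in state s
            q = pi @ P[s]                     # predicted next-state distribution
            g = -P[s, :, s_next] / q[s_next]  # d(loss)/d(pi), through the fixed model
            grad[s] += pi * (g - pi @ g)      # chain rule through the softmax
        theta -= lr * grad / len(pairs)
    return softmax(theta)

P = make_chain_dynamics()
policy = learn_policy(P, expert_states=[0, 1, 2, 3])
print(policy.round(3))
```

In this sketch the expert trajectory only ever moves to the right, so the learned policy should come to favor action 1 in the states the expert visited, even though no expert action was supplied. The state that never appears as a source state keeps its initial uniform policy.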



Accordingly, the 35 USC § 103 rejection of the claims is withdrawn.

Note that an interview was conducted on 2/9/2022 to clarify that the Applicant wished to amend claims 21 and 25 in the same manner as independent claims 1, 15, and 18 were amended in the AFCP 2.0 submission, pursuant to the interview conducted on 1/21/2022, namely to include “learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model using an encoder-decoder network to learn the trajectories of expert states without action information”.  As a result, the claim set filed on 1/27/2022 was not entered, and the Applicant authorized an Examiner’s Amendment to include these features in independent claims 21 and 25 and to cancel claim 23, as fully addressed in this office action.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
                                                                                                                                                                                                
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Brent Hoover whose telephone number is (303)297-4403. The examiner can normally be reached Monday - Friday 9-5 MST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached at 571-270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRENT JOHNSTON HOOVER/Examiner, Art Unit 2127