DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 19 October 2022, in response to the Office Action mailed 11 July 2022.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12 September 2022 has been entered.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs, March 2017, pgs. 1-11) in view of Liu et al. (Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation, July 2017, pgs. 1-21) or, alternatively/additionally, over Liu in view of Li, both as described below; and further in view of Choi et al. (Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, Dec 2012, pgs. 1-9).

As per claim 1, Li teaches a computer-implemented method for estimating a reward in reinforcement learning, the method comprising: preparing a state prediction model trained to predict a state for an input using visited states in expert demonstrations performed by an expert [an apprenticeship learning system that learns a policy by estimating a reward based upon observations of an expert (pgs. 2-3 and 5, abstract and sections 2.1-2.3, 4.1; etc.)] and to estimate optimal rewards based on transition probabilities predicted using state trajectories of the expert demonstrations [the agent imitates the behavior of an expert policy by matching the generated state-action distribution πE with the expert distribution, minimizing the divergence between them, while the discriminator tries to distinguish state-action pairs from the trajectories (pg. 3, section 2.3) using the transition probability distribution (pg. 3, section 2.1)]; inputting an actual state observed by an agent in reinforcement learning into the state prediction model to calculate a predicted state [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory (pg. 3, sections 2.2-2.3, etc.) which can be augmented with reward to estimate the total future reward (pg. 7, section 5.2, etc.)]; and estimating a reward in the reinforcement learning based, at least in part, on similarity between the predicted state and an actual state observed by the agent, the reward being defined as a function of a similarity measure of a generative model (g(s)), with the similarity between the predicted state and the actual state [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.) where for the policy network, input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (pg. 7, section 5.2, etc.)], and additional state information including a derived state different from the predicted state and the actual state is input into the prediction model [for the policy network, input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector (derived state) and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (pg. 7, section 5.2, etc.)].
While Li teaches defining the reward based on a similarity measure (see above), it has not been relied upon for teaching the reward being defined as a function of a similarity measure of a generative model (g(s)) or a temporal sequence prediction model (h(s)), with the similarity measure being a difference between the predicted state and the actual state ||s – g(s)|| or ||s – h(s)||, wherein the reward associated with the temporal sequence prediction model (h(s)) includes a derived state different from any state values input into the temporal sequence prediction model (h(s)).  Furthermore, while Li teaches using a derived state as an additional input (see above), it is not relied upon for teaching determining a variant reward signal for estimating the reward.
Liu teaches a computer-implemented method for estimating a reward in reinforcement learning, the method comprising: preparing a state prediction model trained to predict a state for an input using visited states in expert demonstrations performed by an expert [a reinforcement learning algorithm is used to learn control policies in a selected environment using distance to observed demonstration states to determine a reward function (section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)] and to estimate optimal rewards predicted using states of the expert demonstrations [a reinforcement learning algorithm is used to learn control policies in a selected environment using distance to observed demonstration states to determine a reward function (section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)]; inputting an actual state observed by an agent in reinforcement learning into the state prediction model to calculate a predicted state [the model predicts what future states will look like in a context by looking at an actual state (section 4, etc.)]; and estimating a reward in the reinforcement learning based, at least in part, on similarity between the predicted state and an actual state observed by the agent [a reinforcement learning algorithm is used to learn control policies in a selected environment using distance to observed demonstration states to determine a reward function (section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)], the reward being defined as a function of a similarity measure of a generative model (g(s)) or a temporal sequence prediction model (h(s)) [the determination of the observed states includes utilizing generative models aligned in time (temporal sequence) (section 4, etc.) and calculating a distance (similarity measure) to observed demonstration states (section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)], with the similarity measure being a difference between the predicted state and the actual state ||s – g(s)|| or ||s – h(s)||, and additional state information including a derived state different from the predicted state and the actual state is input into the temporal sequence prediction model (h(s)) to determine a reward associated with the temporal sequence prediction model [the determination of the observed states includes utilizing generative models aligned in time (temporal sequence) (section 4, etc.) and calculating a distance (similarity measure) to observed demonstration states (section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.); and the final reward is a weighted combination of a reward based on features and an image tracking reward, where the image tracking reward directly penalizes the policy from producing observations that differ from the translated observations (section 5.1, etc.)].
Li and Liu are analogous art, as they are within the same field of endeavor, namely machine learning using imitation.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize the models for imitation learning and context translation, including generative and time series models, as taught by Liu, in the imitation learning and total reward estimation of the system taught by Li.
Liu provides motivation as [utilizing a context model is complementary to a GAN/GAIL used for imitation learning (sections 2 and 6.2) and allows the system to utilize demonstrations which may not be in the same context as that of the learner, to compensate for changes in viewpoint, surrounding, etc. (abstract, etc.)].
Alternatively/additionally, while Liu teaches estimating the optimal reward function based on expert states (see above), it has not been relied upon for teaching optimal rewards based on transition probabilities predicted using state trajectories of the expert demonstrations.
Li teaches to estimate optimal rewards based on transition probabilities predicted using state trajectories of the expert demonstrations [the agent imitates the behavior of an expert policy by matching the generated state-action distribution πE with the expert distribution, minimizing the divergence between them, while the discriminator tries to distinguish state-action pairs from the trajectories (pg. 3, section 2.3) using the transition probability distribution (pg. 3, section 2.1)].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to estimate the rewards based on transition probabilities using state trajectories of the expert demonstrations, as taught by Li, for the estimation of rewards based on the states of the expert demonstrations in the system taught by Liu.
Li provides motivation as [the system can perform well using only raw state inputs using expert trajectories without any further supervision (pg. 1, section 1, etc.)].
While Li/Liu teaches using a derived state as an additional input (see above), it is not relied upon for teaching determining a variant reward signal for estimating the reward.
Choi teaches determining a variant reward signal for estimating a reward [an IRL algorithm is used to estimate multiple reward functions (i.e., at least one variant) (pgs. 1-2, abstract and section 1; etc.)].
Li/Liu and Choi are analogous art, as they are within the same field of endeavor, namely imitation learning.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to estimate multiple reward functions as taught by Choi in the estimation of rewards in the system taught by Li/Liu.
Choi provides motivation as [often IRL algorithms assume that the expert/agent behavior is from a single agent and following a single reward function, but this is not always the case; and by estimating multiple reward functions the system can more accurately learn the correct behaviors for some cases (pgs. 1-2, abstract and section 1; etc.)].
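For illustration only, the similarity measure recited in claim 1, a difference ||s – g(s)|| between an actual state and the state predicted from it, can be sketched as below. The sketch is not taken from the application or from Li, Liu, or Choi. The function names and the toy model standing in for a trained generative model g are hypothetical, and the choice of the Euclidean norm is an assumption.

```python
import math
from typing import Callable, List, Sequence

State = Sequence[float]


def l2_distance(a: State, b: State) -> float:
    """Euclidean norm ||a - b|| of the difference between two state vectors."""
    if len(a) != len(b):
        raise ValueError("states must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prediction_error(s: State, model: Callable[[State], State]) -> float:
    """||s - g(s)||: distance between an actual state and the model's prediction for it."""
    return l2_distance(s, model(s))


def toy_g(s: State) -> List[float]:
    """Hypothetical stand-in for a trained generative model g (shrinks each component by 10%)."""
    return [0.9 * x for x in s]


# The actual state (3, 4) is reconstructed as roughly (2.7, 3.6).
error = prediction_error([3.0, 4.0], toy_g)
```

A small error indicates that the observed state resembles states the model reproduces well; how that error is turned into a reward is a separate design choice.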

As per claim 2, Li/Liu/Choi teaches training the state prediction model using the visited states in the expert demonstrations without actions executed by the expert in relation to the visited states [the network may be trained with visual inputs (images) of the expert demonstration (i.e., only state information and not the action) (Li: pg. 4, section 3.2, etc.); the method operates on raw observations (states) and does not require actions in the demonstrations (Liu: section 2, etc.)].

As per claim 3, Li/Liu/Choi teaches wherein the state prediction model is a generative model, and both of the actual state defining the similarity and the actual state inputted into the generative model are observed at the same time step [the system utilizes a generative adversarial imitation learning model (Li: pg. 2, abstract, etc.)], the method further comprising: training the generative model so as to minimize an error between a visited state in the expert demonstrations and a reconstructed state from the visited state [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by minimizing the distance between the trajectories (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.); and calculating a distance (similarity measure) to observed demonstration states (Liu: section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.); the model is supervised with a squared error loss on the output and trained (Liu: section 4; etc.)].

As per claim 4, Li/Liu/Choi teaches wherein the generative model is an autoencoder that reconstructs a state as the predicted state from an actual state, the similarity being defined between the state reconstructed by the autoencoder and the actual state [we jointly train the translation model encoder Enc1 and decoder Dec as an autoencoder to create the translated observations (Liu: sections 4, 5.1, etc.)].

As per claim 5, Li/Liu/Choi teaches wherein the state prediction model is a temporal sequence prediction model, and the actual state inputted into the temporal sequence prediction model precedes the actual state defining the similarity, the method further comprising: training the temporal sequence prediction model so as to minimize an error between a visited state in the expert demonstrations and an inferred state from one or more preceding visited states in the expert demonstrations [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.) where for the policy network, input visual features are passed through multiple convolutional layers (a temporal sequence prediction model), and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (Li: pg. 7, section 5.2, etc.); the determination of the observed states includes utilizing generative models aligned in time (temporal sequence) (Liu: section 4, etc.) and calculating a distance (similarity measure) to observed demonstration states (Liu: section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)].

As per claim 6, Li/Liu/Choi teaches wherein the temporal sequence prediction model is a next state model that infers a next state as the predicted state from an actual current state, the similarity being defined between the next state inferred by the next state model and an actual next state [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.) where for the policy network, input visual features are passed through multiple convolutional layers (a temporal sequence prediction model), and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (Li: pg. 7, section 5.2, etc.); the model predicts what future states will look like in a context by looking at an actual state (Liu: section 4, etc.)].
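For illustration only, the time offset recited in claims 5 and 6, where the state fed to the model precedes the state against which the prediction is compared, can be sketched as below. The sketch is not taken from the application or the cited references. The toy next state model h, the function names, and the Euclidean norm are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

State = Sequence[float]


def l2_distance(a: State, b: State) -> float:
    """Euclidean norm ||a - b|| of the difference between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def next_state_errors(
    trajectory: Sequence[State], h: Callable[[State], State]
) -> List[float]:
    """For each step t, compare h(s_t) with the actual next state s_{t+1}."""
    return [
        l2_distance(s_next, h(s_t))
        for s_t, s_next in zip(trajectory, trajectory[1:])
    ]


def toy_h(s: State) -> List[float]:
    """Hypothetical stand-in for a trained next state model (adds 1 to each component)."""
    return [x + 1.0 for x in s]


# The first transition matches the model's prediction; the second does not.
trajectory = [[0.0, 0.0], [1.0, 1.0], [2.0, 5.0]]
errors = next_state_errors(trajectory, toy_h)
```

A trajectory of n states yields n - 1 errors, because the first state has no predecessor from which to predict it.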

As per claim 7, Li/Liu/Choi teaches wherein the temporal sequence prediction model is a long short term memory (LSTM) based model that infers a next state as the predicted state from an actual state history or an actual current state, the similarity being defined between the next state inferred by the LSTM based model and an actual next state [the model includes certain auxiliary information as internal input to serve as a short-term memory (Li: pg. 6, section 5.2, etc.) where the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.)].

As per claim 8, Li/Liu/Choi teaches wherein the temporal sequence prediction model is a 3-dimensional convolutional neural network (3D-CNN) model that infers a next state as the predicted state from an actual state history or an actual current state, the similarity being defined between the next state inferred by the 3D-CNN based model and an actual next state [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.) where for the policy network, input visual features are passed through multiple convolutional layers (a temporal sequence prediction model), and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward; utilizing three-dimensional image information for the CNN (Li: pg. 7, section 5.2, etc.); the model includes a CNN (Liu: appendix A; etc.)].

As per claim 9, Li/Liu/Choi teaches wherein the expert demonstration represents optimal behavior and the reward is estimated as a higher value as the similarity becomes high [the GAN is used to produce a trajectory based upon the input state to attempt to imitate the expert trajectory by determining the distance between the trajectories (similarity) (Li: pg. 3, sections 2.2-2.3; pg. 5, sections 4.1-4.2; etc.) where for the policy network, input visual features are passed through multiple convolutional layers (a temporal sequence prediction model), and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (Li: pg. 7, section 5.2, etc.); the determination of the observed states includes utilizing generative models aligned in time (temporal sequence) (Liu: section 4, etc.) and calculating a distance (similarity measure) to observed demonstration states (Liu: section 3 describes the distance, sections 5.1-5.2 describe training the model including utilizing the distance, etc.)].

As per claim 10, Li/Liu/Choi teaches wherein the reward is based further on a cost for an action executed by the agent in the reinforcement learning in addition to the similarity [an additional penalty (cost) term may be added to the agent in the reinforcement learning (Li: pg. 5, section 4.1, etc.); the cost function for GPS is then a squared Euclidean distance in state space, this cost function is also weighted by a quadratic ramp function weighting squared Euclidean distances at later time steps higher than initial ones (Liu: section 5.2, etc.)].
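For illustration only, a reward that depends on both the similarity and a cost for the executed action, as recited in claim 10, can be sketched as below. The sketch is not taken from the application or the cited references. The quadratic form of the action cost, the weight `cost_weight`, and the use of a negated hyperbolic tangent for the similarity term are all assumptions made for the example.

```python
import math
from typing import Sequence


def reward_with_action_cost(
    similarity_error: float,
    action: Sequence[float],
    cost_weight: float = 0.1,
) -> float:
    """Combine a similarity-based term with a penalty on the executed action.

    similarity_error is a non-negative distance between a predicted state and
    an actual state; a smaller distance yields a larger similarity term.
    """
    # Assumed shaping: -tanh(d) is 0 at a perfect match and approaches -1 for large d.
    similarity_term = -math.tanh(similarity_error)
    # Assumed quadratic cost on the action magnitude.
    action_cost = cost_weight * sum(a * a for a in action)
    return similarity_term - action_cost


# Same state match, but the second action is larger and therefore penalized more.
r_small_action = reward_with_action_cost(0.0, [0.0, 0.0])
r_large_action = reward_with_action_cost(0.0, [1.0, 2.0])
```

With these assumed settings, two agents that reach equally similar states are distinguished by how costly their actions were.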

As per claim 11, Li/Liu/Choi teaches wherein the reward is defined function of the similarity, the function is a hyperbolic tangent function, a Gaussian function, or a sigmoid function [the posterior approximation network adopts the same architecture as the discriminator except that the output is a softmax over the discrete latent variables, or factored Gaussian over continuous latent variables (Li: pg. 7, section 5.2, etc.)].
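For illustration only, the three function families named in claim 11 (hyperbolic tangent, Gaussian, and sigmoid) can each map a non-negative prediction error d to a reward that increases as similarity increases. The sketch is not taken from the application or the cited references. The sign conventions, the width `sigma`, and the function names are assumptions made for the example.

```python
import math


def tanh_reward(d: float) -> float:
    """Assumed form -tanh(d): 0 at a perfect match, approaching -1 as d grows."""
    return -math.tanh(d)


def gaussian_reward(d: float, sigma: float = 1.0) -> float:
    """Assumed form exp(-d^2 / (2 sigma^2)): 1 at a perfect match, approaching 0."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def sigmoid_reward(d: float) -> float:
    """Assumed form sigmoid(-d): 0.5 at a perfect match, approaching 0.

    Written with exp(-d) so that a large non-negative d underflows to 0
    instead of overflowing.
    """
    e = math.exp(-d)
    return e / (1.0 + e)


# Evaluate each shaping function at increasing prediction errors.
distances = [0.0, 0.5, 2.0]
table = {
    "tanh": [tanh_reward(d) for d in distances],
    "gaussian": [gaussian_reward(d) for d in distances],
    "sigmoid": [sigmoid_reward(d) for d in distances],
}
```

All three are monotonically decreasing in d over non-negative inputs; they differ in output range and in how sharply the reward falls off near a perfect match.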

As per claim 12, Li/Liu/Choi teaches updating parameters in the reinforcement learning by using the reward estimated [the networks are updated and optimized (Li: pg. 4, sections 3.1-3.2, etc.); we use the trajectory-centric RL method used for local policy optimization in guided policy search (GPS) which is based on fitting locally linear dynamics and performing LQR-based updates (Liu: section 5.2, etc.)].

As per claim 13, see the rejection of claim 1, above, wherein Li/Liu/Choi also teaches a computer system comprising: a memory storing program instructions; a processing circuitry in communications with the memory for executing the program instructions, wherein the processing circuitry is configured to perform the steps [the system may be implemented in a client-server framework utilizing several available APIs (Li: pg. 6, section 5.1), which inherently requires a memory storing instructions to be executed by a processor of some kind].

As per claim 14, see the rejection of claim 2, above.

As per claim 15, see the rejection of claim 3, above.

As per claim 16, see the rejection of claim 5, above.

As per claim 17, see the rejection of claim 1, above, wherein Li/Liu/Choi also teaches a computer program product for estimating a reward in reinforcement learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method [the system may be implemented in a client-server framework utilizing several available APIs (Li: pg. 6, section 5.1), which inherently requires a memory storing instructions to be executed by a computer of some kind].

As per claim 18, see the rejection of claim 2, above.

As per claim 19, see the rejection of claim 3, above.

As per claim 20, see the rejection of claim 5, above.


Response to Arguments
Applicant's arguments filed 12 September 2022 have been fully considered but they are not persuasive.

Applicant argues that the cited art does not teach estimating optimal rewards based on transition probabilities predicted using state trajectories of the expert demonstrations.
However, while Li does teach using image states as inputs to a CNN, it also teaches that the agent imitates the behavior of an expert policy by matching the generated state-action distribution with the expert distribution, minimizing the divergence between them, while the discriminator tries to distinguish state-action pairs from the trajectories (pg. 3, section 2.3) and where for the policy network, input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector and (in the case of Info-GAIL) the latent code to produce the expected accumulated future reward (pg. 7, section 5.2, etc.).

Applicant also argues that Liu does not teach utilizing an observed state by an agent in determining rewards.
However, Liu teaches that the final reward is a weighted combination of a reward based on features and an image tracking reward, where the image tracking reward directly penalizes the policy from producing observations that differ from the translated observations (section 5.1, etc.).

Applicant also argues that the cited art does not teach the amendment language added to the independent claims.  This is addressed in the updated rejections, above, including the newly cited reference to Choi.


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Mukadam (US 10,739,776) – discloses a system training a network only with state information identified by masking.
Hausman et al. (Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets, Nov 2017, pgs. 1-11) – discloses imitation learning using GANs.
Abbeel et al. (Apprenticeship Learning via Inverse Reinforcement Learning, July 2004, pgs. 1-8) – discloses an apprenticeship learning system for estimating reward based upon expert demonstrations.
Englert et al. (Model-based Imitation Learning by Probabilistic Trajectory Matching, May 2013, pgs. 1922-1927) – discloses imitation learning by finding policies such that predicted trajectories match observed ones.
Gupta (US 2018/0293721) – discloses utilizing autoencoders in a GAN.
Amer (US 2016/0071024) – discloses temporal generative and discriminative models.
Lett (US 2003/0018457) – discloses updating a model based on the difference between acquired states and predicted states.
Levine et al. (End-to-End Training of Deep Visuomotor Policies, Jan 2015, pgs. 1-40) – discloses using supervised learning for a policy search.
Nguyen et al. (Inverse reinforcement learning with locally consistent reward functions, Dec 2015, pgs. 1-9) – discloses generating expert trajectories with multiple locally consistent reward functions.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/Primary Examiner, Art Unit 2128