DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is responsive to the original application filed on 6/19/2018 and the Remarks and Amendments filed on 8/18/2021.  	


Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-5, 7-10, 12-16, 18, and 19 are rejected under 35 U.S.C. § 103 as being obvious over Torabi et al. (Torabi et al., “Behavioral Cloning from Observation”, May 11, 2018, Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1-8, hereinafter “Torabi”) in view of Kimura et al. (Kimura et al., “Reward Estimation via State Prediction”, Feb. 15, 2018, ICLR 2018 Conference Blind Submission, pp. 1-14, hereinafter “Kimura”)1 and Eleftheriadis et al. (US 20200218999 A1, hereinafter “Eleftheriadis”).

Regarding claim 1, Torabi discloses [a] computer-implemented method for learning an action policy, comprising: (Page 2, Column 1; “In this paper, we propose a new imitation learning algorithm called behavioral cloning from observation (BCO). BCO simultaneously addresses both of the issues discussed above, i.e., it provides reasonable imitation policies almost immediately upon observing state-trajectory-only demonstrations . . . Then, upon observation of a demonstration without action information, BCO uses the learned model to infer the missing actions. Finally, BCO uses the demonstration and the inferred actions to find a policy via behavioral cloning”, which discloses a computer-implemented method for learning an action policy using behavioral cloning. Note that the method is inherently performed on a computer in the Torabi reference as demonstrated in the experimental results section on page 5 of the reference)
obtaining, by a processor, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state; (Page 3, Column 1; “We consider agents acting within the broad framework of Markov decision processes (MDPs). We denote a MDP using the 5-tuple M = {S, A, T, r, γ}, where S is the agent’s state space, A is its action space, $T_{s_i a}^{s_{i+1}} = P(s_{i+1} \mid s_i, a)$ is a function denoting the probability of the agent transitioning from state $s_i$ to $s_{i+1}$ after taking action a, r : S × A → R is a function specifying the immediate reward that the agent receives for taking a specific action in a given state, and γ is a discount factor . . . We denote the set of state transitions experienced by an agent during a particular execution of a policy π by $T_\pi = \{(s_i, s_{i+1})\}$.”, which discloses the tuple of state “S”, action “A”, and a next state “$s_{i+1}$”; and Page 3, Column 1; “The learning problem is for an agent to determine an imitation policy, π : S → A that the agent may use in order to behave like the expert, using a provided set of expert demonstrations {ξ1, ξ2, ...} in which each ξ is a demonstrated state-action trajectory {(s0, a0), (s1, a1), ..., (sN, aN)}.”, the expert demonstrations {ξ1, ξ2, ...} including expert states {(s0, a0), (s1, a1), ..., (sN, aN)}. Note again that the computer processor is inherent to Torabi as discussed above)
training, by the processor using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and (Page 4, Algorithm 1; the algorithm discloses, under a broadest reasonable interpretation of the claim language, training or learning (in step 9) by the inherently disclosed processor using the environment dynamics as training data, a dynamics model which obtains a pair of state and action as inputs (steps 2, 3, and 6) and outputs, for each state, state-transition probabilities (represented as T in steps 7, 9, and 10, but most specific to step 10); and Page 4, Column 1; “In order to use the learned agent-specific inverse dynamics model, we first extract the agent-specific part of the demonstrated state sequences and then form the set of demonstrated agent specific state transitions $T^{a}_{demo}$ (Algorithm 1, Line 10)”; and Page 3, Column 1; “$T_{s_i a}^{s_{i+1}} = P(s_{i+1} \mid s_i, a)$ is a function denoting the probability of the agent transitioning from state $s_i$ to $s_{i+1}$ after taking action a”; and Page 5, §5.1)
learning, by the processor performing behavioral cloning, the action policy using trajectories of expert states according to a supervised learning technique (Page 4, Algorithm 1; the algorithm discloses, under a broadest reasonable interpretation of the claim language, learning, through behavioral cloning, the action policy πφ using trajectories of expert states according to supervised learning or expert demonstrations; and Page 3, Column 1; “Imitation learning is typically defined in the context of a MDP without an explicitly-defined reward function, i.e., M \ r. The learning problem is for an agent to determine an imitation policy, π : S → A that the agent may use in order to behave like the expert, using a provided set of expert demonstrations {ξ1, ξ2, ...} in which each ξ is a demonstrated state-action trajectory {(s0, a0), (s1, a1), ..., (sN, aN)}. Therefore, in this setting, the agent must have access to the demonstrator’s actions”; and Page 3, Column 2; “learning an imitation policy from a set of demonstration trajectories”; and Page 5, Column 1; “Using a nonzero α, the model is able to leverage post-demonstration environment interaction in order to more accurately estimate the actions taken by the demonstrator, and therefore improve its learned imitation policy”, the use of the demonstrator being the supervised learning technique)
wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 5, Column 1; “Using a nonzero α, the model is able to leverage post-demonstration environment interaction in order to more accurately estimate the actions taken by the demonstrator, and therefore improve its learned imitation policy”, the value α being, under a BRI, the fixed parameter of the dynamics model that is used in learning the action policy or learned imitation policy).
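For illustration of the claimed dynamics model, which takes a (state, action) pair and outputs a probability for each next state, the following is a minimal sketch. It assumes small discrete state and action spaces and a count-based estimator; the function name `fit_transition_model` and the toy triplets are hypothetical and are not taken from Torabi, Kimura, or the application.

```python
from collections import Counter, defaultdict


def fit_transition_model(triplets):
    """Estimate P(next_state | state, action) by counting observed triplets.

    Assumption: states and actions are hashable, discrete values.
    """
    counts = defaultdict(Counter)
    for state, action, next_state in triplets:
        counts[(state, action)][next_state] += 1

    model = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize the counts so each (state, action) row sums to 1.
        model[key] = {s_next: n / total for s_next, n in next_counts.items()}
    return model


# Hypothetical toy data: (state, action, next_state) triplets.
triplets = [
    ("s0", "right", "s1"),
    ("s0", "right", "s1"),
    ("s0", "right", "s0"),
    ("s1", "left", "s0"),
]
model = fit_transition_model(triplets)
probs = model[("s0", "right")]
```

Here `probs` holds the estimated state-transition probabilities for the pair ("s0", "right"), one entry per observed next state.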
Torabi fails to explicitly disclose performing model-free inverse reinforcement learning; back-propagating error gradients through the trained dynamics model.
Kimura discloses obtaining, by a processor performing model-free inverse reinforcement learning, environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know apriori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
Torabi and Kimura are analogous art because both are concerned with reinforcement learning. Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the model-free inverse reinforcement learning of Kimura with the computer-implemented method for learning an action policy of Torabi to yield the predictable result of obtaining, by a processor performing model-free inverse reinforcement learning, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state. The motivation for doing so would be to guide an agent to mimic expert behavior (Kimura; Abstract).
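For illustration of the reward signal described in the Kimura abstract quoted above (a function of the difference between the actual next state and the predicted next state), the following is a minimal sketch. The specific functional form used here, a negative tanh of the Euclidean distance, is an assumption chosen for the example, and the function name `estimated_reward` is hypothetical.

```python
import math


def estimated_reward(predicted_next_state, actual_next_state):
    """Reward that shrinks as the actual next state departs from the prediction.

    Assumption: states are equal-length sequences of floats, and the reward is
    -tanh(Euclidean distance), which lies in (-1, 0].
    """
    distance = math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted_next_state, actual_next_state))
    )
    return -math.tanh(distance)


# The agent lands exactly where the predictor (trained on expert states) expected.
r_match = estimated_reward([1.0, 2.0], [1.0, 2.0])
# The agent lands far from the predicted expert-like next state.
r_far = estimated_reward([1.0, 2.0], [4.0, 6.0])
```

Under this assumed form, an agent whose transitions match the predictor receives the maximum reward of zero, and the reward decreases toward -1 as the mismatch grows.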
Eleftheriadis discloses back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions $\hat{v}(s, w)$ or $\hat{q}(s, a, w)$. The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model; and Figure 8; the figure discloses the processor).
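For illustration of the claim limitation at issue (learning the action policy by back-propagating error gradients through a trained dynamics model whose parameters are fixed), the following is a minimal sketch. It assumes a scalar linear dynamics model and a scalar linear policy, with the chain rule written out by hand; all names and numeric values are hypothetical and are not taken from any cited reference.

```python
# Frozen, already-trained dynamics model (hypothetical scalar linear form):
#   next_state = A * state + B * action
A, B = 0.9, 0.5


def policy(w, state):
    """Linear policy with a single learnable weight w."""
    return w * state


def predict_next(state, action):
    """Forward pass of the fixed dynamics model; A and B are never updated."""
    return A * state + B * action


def train_policy(expert_pairs, w=0.0, lr=0.1, steps=500):
    """Fit w from (expert_state, expert_next_state) pairs, with no expert actions."""
    for _ in range(steps):
        grad = 0.0
        for s, s_next in expert_pairs:
            error = predict_next(s, policy(w, s)) - s_next
            # Chain rule through the frozen model:
            # dL/dw = dL/ds' * ds'/da * da/dw = (2 * error) * B * s
            grad += 2.0 * error * B * s
        w -= lr * grad / len(expert_pairs)
    return w


# Hypothetical expert transitions generated by an unobserved expert weight.
expert_w = -0.5
states = [1.0, -2.0, 0.5, 1.5]
expert_pairs = [(s, predict_next(s, expert_w * s)) for s in states]
learned_w = train_policy(expert_pairs)
```

In this sketch only the policy weight changes during learning; the error on the predicted next state reaches the policy solely through the factor B contributed by the fixed dynamics model.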
Torabi, Kimura, and Eleftheriadis are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the backpropagation of Eleftheriadis with the computer-implemented method for learning an action policy of Torabi and Kimura to yield the predictable result of back-propagating error gradients through the trained dynamics model. The motivation for doing so would 

Regarding claim 2, the rejection of claim 1 is incorporated and Torabi further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 3, Column 1; “We consider agents acting within the broad framework of Markov decision processes (MDPs). We denote a MDP using the 5-tuple M = {S, A, T, r, γ}, where S is the agent’s state space, A is its action space, $T_{s_i a}^{s_{i+1}} = P(s_{i+1} \mid s_i, a)$ is a function denoting the probability of the agent transitioning from state $s_i$ to $s_{i+1}$ after taking action a, r : S × A → R is a function specifying the immediate reward that the agent receives for taking a specific action in a given state, and γ is a discount factor”, the predictor being T; and Page 4, Column 1; “Our overarching problem is that of finding a good imitation policy from a set of state-only demonstration trajectories, Ddemo = {ζ1, ζ2, . . .} where each ζ is a trajectory {s0, s1, . . . , sN}”; and Page 2, Column 1; “Then, upon observation of a demonstration without action information, BCO uses the learned model to infer the missing actions. Finally, BCO uses the demonstration and the inferred actions to find a policy via behavioral cloning. If post-demonstration environment interaction is allowed”, the behavioral cloning predicts the next state using expert demonstrations of states).
Torabi fails to explicitly disclose performing the model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing the model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know apriori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
Torabi and Kimura are analogous art because both are concerned with reinforcement learning. Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the model-free inverse reinforcement learning of Kimura with the learning of a predictor to predict next states and method of Torabi to yield the predictable result of wherein said obtaining step comprises: learning a predictor which predicts a next state using the trajectories of the expert states; and performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics. The motivation for doing so would be to guide an agent to mimic expert behavior (Kimura; Abstract).

Regarding claim 3, the rejections of claims 1 and 2 are incorporated but Torabi fails to explicitly disclose wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method.
Kimura discloses wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method (Page 3, Last paragraph; “This method constrains exploration to the states that have been demonstrated by an expert and enables learning a policy that closely matches the expert”, which discloses, under a broadest reasonable interpretation of the claim language, performing the model-free reinforcement learning, as taught in section 1 of Kimura, during an exploration stage where the method constrains exploration to the states that have been demonstrated by an expert; and Page 13, §6.1; “The exploration policy is Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein (1930)) (θ = 0.15, µ = 0, σ = 0.01), size of reply memory is 1M, and optimizer is Adam (Kingma & Ba (2014))”, further disclosing the exploration stage).
Torabi and Kimura are analogous art because both are concerned with reinforcement learning. Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the exploration stage of Kimura with the method of Torabi to yield the predictable result of wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method. The motivation for doing so would be to enable learning a policy that closely matches an expert (Kimura; Page 3, Last paragraph).
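For illustration of the exploration policy cited from Kimura §6.1 (an Ornstein-Uhlenbeck process with θ = 0.15, µ = 0, σ = 0.01), the following is a minimal sketch of generating such exploration noise. The Euler discretization, the unit time step, the zero initial value, and the function name `ou_noise` are assumptions made for the example and are not stated in the cited passage.

```python
import random


def ou_noise(steps, theta=0.15, mu=0.0, sigma=0.01, x0=0.0, dt=1.0, seed=0):
    """Sample a discretized Ornstein-Uhlenbeck process.

    Assumption: Euler update x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(steps):
        x += theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


# Temporally correlated noise that would be added to the agent's actions.
noise = ou_noise(5)
# With sigma = 0 the process simply decays toward mu, showing the mean reversion.
decay = ou_noise(3, sigma=0.0, x0=1.0)
```

The mean-reverting term is what distinguishes this process from independent Gaussian noise: successive samples are correlated, which tends to produce smoother exploratory actions.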

Regarding claim 4, the rejections of claims 1 and 2 are incorporated but Torabi fails to explicitly disclose wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM).
Kimura discloses wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM) (Page 9, §4.3; “Hence, LSTM is trained for predicting absolute position of bird location given images”, which discloses learning or training a predictor using a LSTM machine learning mechanism; and Page 11, Conclusion; “temporal sequence prediction using LSTM”; and Page 8, Last paragraph; “The LSTM based prediction method learns to reach the target faster than the dense reward, while LSTM (s′) has the best overall performance by learning with human-guided demonstration data”).
Torabi and Kimura are analogous art because both are concerned with reinforcement learning. Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the LSTM of Kimura with the method of Torabi to yield the predictable result of wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM). The motivation for doing so would be to estimate a reward via state prediction by using state-only trajectories of the expert (Kimura; Conclusion).

Regarding claim 5, the rejections of claims 1, 2, and 4 are incorporated and Torabi further discloses wherein the machine learning mechanism comprises a plurality of machine learning mechanisms that, in turn, form a time-series predictive model for predicting the next state using the trajectories of the expert states (Page 4, Algorithm 1; the algorithm discloses the use of ML mechanisms to form the time-series predictive model for predicting next states from demonstrations of expert states).
In the alternative, Kimura further discloses wherein the machine learning mechanism comprises a plurality of machine learning mechanisms that, in turn, form a time-series predictive model for predicting the next state using the trajectories of the expert states (Page 3, §3.2; “As such, the next approach we take is to consider a temporal sequence prediction model that can be trained to predict the next state value given current state, based on the expert trajectories”, the time-series predictive model being the temporal sequence prediction model; and Page 4, §3.2.2).
The motivation to combine Torabi and Kimura is the same as discussed above with respect to claim 4.

Regarding claim 7, the rejection of claim 1 is incorporated and Torabi further discloses wherein said training step uses closed-loop training to train the dynamics model (Page 4, Algorithm 1; the algorithm discloses performing the training (line 9 of the algorithm) repeatedly or in a closed-loop fashion until a task is learned (line 14)).

Regarding claim 8, the rejection of claim 1 is incorporated but Torabi fails to explicitly disclose wherein said obtaining step is performed during a model-free exploration stage of the method.
Kimura discloses wherein said obtaining step is performed during a model-free exploration stage of the method (Page 3, Last paragraph; “This method constrains exploration to the states that have been demonstrated by an expert and enables learning a policy that closely matches the expert”, which discloses, under a broadest reasonable interpretation of the claim language, performing the obtaining step, or model-free reinforcement learning as taught in section 1 of Kimura, during a model-free exploration stage where the method constrains exploration to the states that have been demonstrated by an expert; and Page 13, §6.1; “The exploration policy is Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein (1930)) (θ = 0.15, µ = 0, σ = 0.01), size of reply memory is 1M, and optimizer is Adam (Kingma & Ba (2014))”, further disclosing the model-free exploration stage).
The motivation to combine Torabi and Kimura is the same as discussed above with respect to claim 3.


Regarding claim 9, the rejection of claim 1 is incorporated and Torabi further discloses wherein the error gradients comprise policy gradients with respect to a corresponding action to the policy gradients and in an absence of an expert action corresponding to the policy gradients (Page 4, Algorithm 1; the algorithm discloses, under a broadest reasonable interpretation of the claim language, the computation of policy gradients with respect to a corresponding action in absence of an expert action corresponding to the policy gradients).

Regarding claim 10, the rejection of claim 1 is incorporated but Torabi fails to explicitly disclose performing one obstacle avoidance using the trained dynamics model.
Kimura discloses performing one obstacle avoidance using the trained dynamics model (Page 8, §4.2; the section discloses performing obstacle avoidance using the trained dynamics model. “The agent’s goal is to reach the target while avoiding the obstacle in this case”; and Figure 4).
Torabi and Kimura are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would 


Regarding claim 12, the rejection of claim 1 is incorporated and Torabi further discloses wherein said learning step is performed in an absence of expert actions corresponding to the expert states (Page 3, Column 2; “In this context, we are concerned here with the following specific goal: given a set of state-only demonstration trajectories, D, find a good imitation policy using a minimal number of post-demonstration environment interactions, i.e., |Ipost|.”, which discloses learning in absence of expert actions or demonstrations, using only expert or demonstration states).

Regarding claim 13, the rejection of claim 1 is incorporated and Torabi further discloses wherein the pair of the state and the action is obtained as the input to the dynamics model from a model-based policy map (Page 3, Column 1; “We consider agents acting within the broad framework of Markov decision processes (MDPs). We denote a MDP using the 5-tuple M = {S, A, T, r, γ}, where S is the agent’s state space, A is its action space, $T_{s_i a}^{s_{i+1}} = P(s_{i+1} \mid s_i, a)$ is a function denoting the probability of the agent transitioning from state $s_i$ to $s_{i+1}$ after taking action a, r : S × A → R is a function specifying the immediate reward that the agent receives for taking a specific action in a given state, and γ is a discount factor”, which discloses the state and action pair as inputs; and Page 4, Algorithm 1; the algorithm receives the random policy which includes a state and action pair as inputs to the dynamics model).

Regarding claim 14, the rejection of claim 1 is incorporated and Torabi further discloses controlling a hardware object to perform an action involving movement responsive to the learned action policy (Page 5, §5, “Implementation and Experimental Results”; the section discloses controlling a hardware object to perform an action such as “keep the pole vertically upward as long as possible” or “to have the car reach the target point”).

Regarding claim 15, it is a computer program product claim corresponding to the steps of claim 1 and is rejected for the same reasons as claim 1.

Regarding claim 16, the rejection of claim 15 is incorporated and Torabi further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 4, Algorithm 1).
Torabi fails to explicitly disclose performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know apriori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).


Regarding claim 18, it is a computer processing system claim corresponding to the steps of claim 1 and is rejected for the same reasons as claim 1.

Regarding claim 19, the rejection of claim 18 is incorporated and Torabi further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 4, Algorithm 1).
Torabi fails to explicitly disclose performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know apriori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
The motivation to combine Torabi and Kimura is the same as discussed above with respect to claim 2.

Claim 11 is rejected under 35 U.S.C. § 103 as being obvious over Torabi in view of Kimura and Eleftheriadis and further in view of Van Seijen et al. (US 20180165603 A1, hereinafter “Van Seijen”).

Regarding claim 11, the rejection of claim 1 is incorporated but Torabi fails to explicitly disclose performing transfer learning between at least two agents using the trained dynamic model.
Van Seijen discloses performing transfer learning between at least two agents using the trained dynamic model ([0139]; “The agents were trained in parallel with off-policy learning using Q-learning. An aggregator function summed the Q-values for each action: $a \in A_{\text{flat}}: Q^{\text{sum}}(a, X_t^{\text{flat}}) := \sum_i Q^i(a, X_t^i)$, and used ε-greedy action selection with respect to these summed values. The Q-table of both ghost-agents where the same, so benefit was gained from intra-task transfer learning by sharing the Q-table between the two ghost agents, which resulted in the ghost-agents learning twice as fast” (emphasis added), which discloses the transfer learning between two agents using the trained dynamic model; and [0006]; “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Torabi, Kimura, Eleftheriadis, and Van Seijen are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the transfer learning of Van Seijen with the method of Torabi, Kimura, and Eleftheriadis to yield the predictable result of performing transfer learning between at least two agents using the trained dynamic model. The motivation for doing so would be to achieve faster learning in two agents (Van Seijen; [0139]).

Claims 21, 24, and 25 are rejected under 35 U.S.C. § 103 as being obvious over Torabi in view of Kimura.

Regarding claim 21, Torabi discloses [a] computer-implemented method for learning an action policy, comprising: (Page 2, Column 1; “In this paper, we propose a new imitation learning algorithm called behavioral cloning from observation (BCO). BCO simultaneously addresses both of the issues discussed above, i.e., it provides reasonable imitation policies almost immediately upon observing state-trajectory-only demonstrations . . . Then, upon observation of a demonstration without action information, BCO uses the learned model to infer the missing actions. Finally, BCO uses the demonstration and the inferred actions to find a policy via behavioral cloning”, which discloses a computer-implemented method for learning an action policy using behavioral cloning. Note that the method is inherently performed on a computer in the Torabi reference as demonstrated in the experimental results section on page 5 of the reference)
learning, by a processor, a predictor which predicts a next state using trajectories of expert states (Page 3, Column 1; “We consider agents acting within the broad framework of Markov decision processes (MDPs). We denote a MDP using the 5-tuple M = {S, A, T, r, γ}, where S is the agent’s state space, A is its action space, T_{s_i a}^{s_{i+1}} = P(s_{i+1} | s_i, a) is a function denoting the probability of the agent transitioning from state s_i to s_{i+1} after taking action a, r : S × A → R is a function specifying the immediate reward that the agent receives for taking a specific action in a given state, and γ is a discount factor”, the predictor being T; and Page 4, Column 1; “Our overarching problem is that of finding a good imitation policy from a set of state-only demonstration trajectories, D_demo = {ζ_1, ζ_2, . . .} where each ζ is a trajectory {s_0, s_1, . . . , s_N}”; and Page 2, Column 1; “Then, upon observation of a demonstration without action information, BCO uses the learned model to infer the missing actions. Finally, BCO uses the demonstration and the inferred actions to find a policy via behavioral cloning. If post-demonstration environment interaction is allowed”, the behavioral cloning predicts the next state using expert demonstrations of states; and Page 3, Column 1; “The learning problem is for an agent to determine an imitation policy, π : S → A that the agent may use in order to behave like the expert, using a provided set of expert demonstrations {ξ_1, ξ_2, ...} in which each ξ is a demonstrated state-action trajectory {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}.”, the expert demonstrations {ξ_1, ξ_2, ...} including expert states {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}. Note again that the computer processor is inherent to Torabi as discussed above)
sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state; (Page 3, Column 1; “We consider agents acting within the broad framework of Markov decision processes (MDPs). We denote a MDP using the 5-tuple M = {S, A, T, r, γ}, where S is the agent’s state space, A is its action space, T_{s_i a}^{s_{i+1}} = P(s_{i+1} | s_i, a) is a function denoting the probability of the agent transitioning from state s_i to s_{i+1} after taking action a, r : S × A → R is a function specifying the immediate reward that the agent receives for taking a specific action in a given state, and γ is a discount factor . . . We denote the set of state transitions experienced by an agent during a particular execution of a policy π by T_π = {(s_i, s_{i+1})}.”, which discloses the tuple of state “S”, action “A”, and a next state “s_{i+1}”; and Page 3, Column 1; “The learning problem is for an agent to determine an imitation policy, π : S → A that the agent may use in order to behave like the expert, using a provided set of expert demonstrations {ξ_1, ξ_2, ...} in which each ξ is a demonstrated state-action trajectory {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}.”, the expert demonstrations {ξ_1, ξ_2, ...} including expert states {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}. Note again that the computer processor is inherent to Torabi as discussed above)
training, by the processor using the sampled environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities to provide a trained dynamics model (Page 4, Algorithm 1; the algorithm discloses, under a broadest reasonable interpretation of the claim language, training or learning (in step 9) by the inherently disclosed processor using the environment dynamics as training data, a dynamics model which obtains a pair of state and action as inputs (steps 2, 3, and 6) and outputs, for each state, state-transition probabilities (represented as T in steps 7, 9, and 10, but most specific to step 10); and Page 4, Column 1; “In order to use the learned agent-specific inverse dynamics model, we first extract the agent-specific part of the demonstrated state sequences and then form the set of demonstrated agent-specific state transitions T^a_demo (Algorithm 1, Line 10)”; and Page 3, Column 1; “T_{s_i a}^{s_{i+1}} = P(s_{i+1} | s_i, a) is a function denoting the probability of the agent transitioning from state s_i to s_{i+1} after taking action a”; and Page 5, §5.1)
wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 5, Column 1; “Using a nonzero α, the model is able to leverage post-demonstration environment interaction in order to more accurately estimate the actions taken by the demonstrator, and therefore improve its learned imitation policy”, the value α being, under a BRI, the fixed parameter of the dynamics model that is used in learning the action policy or learned imitation policy).
Torabi fails to explicitly disclose performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor.
Kimura discloses performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know apriori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inference or estimation of a reward given expert demonstrations, which is what the paper is making use of).
Torabi and Kimura are analogous art because both are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the model-free inverse reinforcement learning of Kimura with the sampled environment dynamics and method of Torabi to yield the predictable result of performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor to sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state. The motivation for doing so would be to guide an agent to mimic expert behavior (Kimura; Abstract).

Regarding claim 24, Torabi discloses [a] non-transitory article of manufacture tangibly embodying a computer readable program which when executed causes a computer to perform (the method is inherently implemented on a computer with a processor and memory or non-transitory article of manufacture embodying a program, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks)
the steps of claim 21 (see the rejection of claim 21 above, where both Torabi and Kimura disclose the steps of claim 21).
The motivation to combine Torabi and Kimura is the same as discussed above with respect to claim 21.

Regarding claim 25, it is a computer processing system corresponding to the steps of claim 21 and is rejected for the same reasons as claim 21.

Claim 22 is rejected under 35 U.S.C. § 103 as being obvious over Torabi in view of Kimura and Eleftheriadis.

Regarding claim 22, the rejection of claim 21 is incorporated and Torabi further discloses the expert states (Page 3, Column 1; “The learning problem is for an agent to determine an imitation policy, π : S → A that the agent may use in order to behave like the expert, using a provided set of expert demonstrations {ξ_1, ξ_2, ...} in which each ξ is a demonstrated state-action trajectory {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}.”, the expert demonstrations {ξ_1, ξ_2, ...} including expert states {(s_0, a_0), (s_1, a_1), ..., (s_N, a_N)}) and
	the trained dynamics model (Page 4, Algorithm 1;  the algorithm produces the trained dynamics model)
Torabi fails to explicitly disclose learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model.
Eleftheriadis discloses learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions v̂(s, w) or q̂(s, a, w). The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model; and Figure 8; the figure discloses the processor).
Torabi, Kimura, and Eleftheriadis are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the supervised learning and backpropagation of Eleftheriadis with the expert states, dynamics model, and method of Torabi and Kimura to yield the predictable result of 

Response to Arguments

Applicant’s arguments and amendments, filed on 8/18/2021, with respect to the 35 USC § 112(b) rejection of claim 4 have been fully considered and are persuasive.  The 35 USC § 112(b) rejection of claim 4 has been withdrawn.

Applicant’s arguments and amendments, filed on 8/18/2021, with respect to the 35 USC § 103 rejection of claims 1-25 have been fully considered but are moot because the arguments do not apply to any of the references being used in the current rejection to reject independent claims 1, 15, 18, 21, 24, and 25.  Torabi, Kimura, and Eleftheriadis are now being used to render claims 1, 15, 18 obvious under 35 USC § 103 and Torabi and Kimura are now being used to render claims 21, 24, and 25 obvious under 35 USC § 103.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Brent Hoover whose telephone number is (303)297-4403. The examiner can normally be reached Monday - Friday 9-5 MST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached at 571-270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/BRENT JOHNSTON HOOVER/Examiner, Art Unit 2127                                                                                                                                                                                                        
/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127                                                                                                                                                                                                        

1 Note that this reference qualifies as prior art under MPEP §2153.01(a) because the application names fewer joint inventors than the Kimura reference used in the rejection (“If, however, the application names fewer joint inventors than a publication (e.g., the application names as joint inventors A and B, and the publication names as authors A, B and C), it would not be readily apparent from the publication that it is by the inventor (i.e., the inventive entity) or a joint inventor and the publication would be treated as prior art under AIA 35 U.S.C. 102(a)(1).”). Further note that the Kimura reference is labeled as having an anonymous source of inventorship or authorship, but upon further inspection, it appears that the Kimura reference names “Daiki Kimura, Subhajit Chaudhury, Ryuki Tachibana, and Sakyasingha Dasgupta” as the authors of this publication. See https://openreview.net/forum?id=HktXuGb0-. Because Sakyasingha Dasgupta is not listed as an inventor in the present application, the Kimura reference qualifies as prior art as discussed above with respect to MPEP §2153.01(a).