DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is responsive to the original application filed on 6/19/2018.

Information Disclosure Statement

The information disclosure statement submitted on 6/19/2018 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement was considered by the examiner.


Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 4 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.

Claim 4 recites the limitation “the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM)” (emphasis added). There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 6, 7, 9, 12-15, 17, 18, and 20 are rejected under 35 U.S.C. § 103 as being obvious over Englert et al. (Englert et al., “Model-based Imitation Learning by Probabilistic Trajectory Matching”, May 10, 2013, 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 1922-1927, hereinafter “Englert”) in view of Eleftheriadis et al. (US 20200218999 A1, hereinafter “Eleftheriadis”).

Regarding claim 1, Englert discloses [a] computer-implemented method for learning an action policy, comprising: (Abstract; “We present an imitation-learning approach to efficiently learn a task from expert demonstrations. Instead of finding policies indirectly, either via state-action mappings (behavioral cloning), or cost function learning (inverse reinforcement learning), our goal is to find policies directly such that predicted trajectories match observed ones”, which discloses a method for learning an action policy.  Note that the method is inherently implemented on a computer with a processor and memory, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks; and Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”)
obtaining, by a processor, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state; (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states)
training, by the processor using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and outputs, for each next state (next state xt = f(xt−1,ut−1)), state-transition probabilities (probability distribution), and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper; and Page 1924, Column 2; “As training inputs to the GP we used state-action pairs (xt−1,ut−1)”, which discloses the training data).
Englert fails to explicitly disclose learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model.
Eleftheriadis discloses learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions [circumflex over (v)](s, w) or [circumflex over (q)](s, a, w). The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model; and Figure 8; the figure discloses the processor).
Englert and Eleftheriadis are analogous art because both are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the supervised learning and backpropagation of Eleftheriadis with the expert states, dynamics model, and method of Englert to yield the predictable result of learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model. The motivation for doing so would be to learn approximate value functions for a policy learning system (Eleftheriadis; [0055]).
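
For illustration only, the claim 1 limitations discussed above can be sketched as follows. The sketch is not taken from Englert, Eleftheriadis, or the application. It assumes a one-dimensional linear dynamics model and a linear policy, and the names fit_dynamics and learn_policy are hypothetical. A dynamics model is first fit to (state, action, next state) triplets. The policy gain is then adjusted by propagating the next-state error gradient through that model while the model parameters stay fixed, using only a trajectory of expert states and no expert actions.

```python
def fit_dynamics(triplets):
    """Least-squares fit of s_next ~ a*s + b*u (assumed linear model)."""
    # Sums needed for the two-unknown normal equations.
    sss = sum(s * s for s, u, n in triplets)
    ssu = sum(s * u for s, u, n in triplets)
    suu = sum(u * u for s, u, n in triplets)
    ssn = sum(s * n for s, u, n in triplets)
    sun = sum(u * n for s, u, n in triplets)
    det = sss * suu - ssu * ssu
    a = (ssn * suu - sun * ssu) / det
    b = (sun * sss - ssn * ssu) / det
    return a, b


def learn_policy(model, expert_states, lr=0.5, epochs=500):
    """Learn gain k of the assumed policy u = k*s from expert states only."""
    a, b = model  # dynamics parameters are held fixed during policy learning
    k = 0.0
    pairs = list(zip(expert_states[:-1], expert_states[1:]))
    for _ in range(epochs):
        grad = 0.0
        for s, s_next in pairs:
            u = k * s                # policy output
            pred = a * s + b * u     # predicted next state from the model
            err = pred - s_next      # error against the expert next state
            # Chain rule through the model: dL/dk = dL/dpred * dpred/du * du/dk
            grad += 2.0 * err * b * s
        k -= lr * grad / len(pairs)
    return k


if __name__ == "__main__":
    # Hypothetical data generated from s_next = 0.9*s + 0.5*u.
    data = [(1.0, 0.0, 0.9), (0.0, 1.0, 0.5), (1.0, 1.0, 1.4), (2.0, -1.0, 1.3)]
    model = fit_dynamics(data)
    print(model, learn_policy(model, [1.0, 0.5, 0.25, 0.125]))
```

The squared-error loss on the predicted next state is what makes the policy update a supervised learning step in this sketch; the choice of loss is an assumption.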

Regarding claim 6, the rejection of claim 1 is incorporated and Englert further discloses wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 1923, Column 1; “Throughout this paper, we use the following notation: States are denoted by x ∈ R^D and actions as u ∈ R^E, respectively. Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u”, which discloses, under a broadest reasonable interpretation of the claim language, that the states and actions are fixed for a fixed time horizon in learning the action policy, the learning of the action policy being described in detail on page 1924, section III of the paper).

Regarding claim 7, the rejection of claim 1 is incorporated and Englert further discloses wherein said training step uses closed-loop training to train the dynamics model (Page 1924, Algorithm 1; the algorithm discloses performing the training (line 3 of the algorithm) repeatedly, or in a closed-loop fashion, until a task is learned (line 12)).

Regarding claim 9, the rejection of claim 1 is incorporated and Englert further discloses wherein the error gradients comprise policy gradients with respect to a corresponding action to the policy gradients and in an absence of an expert action corresponding to the policy gradients (Page 1924, Algorithm 1; the algorithm discloses the computation of policy gradients in line 7 of the algorithm; and Page 1924, §III; the section discloses, under a broadest reasonable interpretation of the claim language, the computation of policy gradients with respect to a corresponding action (see eqn. 10) and in an absence of an expert action corresponding to the policy gradients).

Regarding claim 12, the rejection of claim 1 is incorporated and Englert further discloses wherein said learning step is performed in an absence of expert actions corresponding to the expert states (Page 1923, Column 1; “In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp. We use probability distributions over trajectories for representing both the demonstrated trajectories and the predicted trajectory”, the trajectories only include a sequence of states (and therefore no expert actions), and the trajectories of the expert are used in the learning step, which is discussed in more detail in sections II, III, and algorithm 1 of the paper).

Regarding claim 13, the rejection of claim 1 is incorporated and Englert further discloses wherein the pair of the state and the action is obtained as the input to the dynamics model from a model-based policy map (Page 1924, Column 2; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot”, which discloses, under a broadest reasonable interpretation of the claim language, a model-based policy map as the inputs are mapped by the forward model; and Page 1923, Column 1; “The policy π maps a state x to a corresponding action u”; and Figures 3, 5, and 6).

Regarding claim 14, the rejection of claim 1 is incorporated and Englert further discloses controlling a hardware object to perform an action involving movement responsive to the learned action policy (Page 1926, §B, “Learning a Ball Hitting Task with the BioRob”; the section discloses controlling a hardware object (BioRob) to perform an action, hitting a ball, involving movement responsive to the learned policy discussed in the previous sections of the paper).

Regarding claim 15, Englert discloses [a] computer program product for learning an action policy, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising (Abstract; “We present an imitation-learning approach to efficiently learn a task from expert demonstrations. Instead of finding policies indirectly, either via state-action mappings (behavioral cloning), or cost function learning (inverse reinforcement learning), our goal is to find policies directly such that predicted trajectories match observed ones”, which discloses a method for learning an action policy.  Note that the method is inherently implemented on a computer with a processor and memory that contains a computer program product, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks; and Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”)
obtaining, by a processor, environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state; (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states)
training, by the processor using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and outputs, for each next state (next state xt = f(xt−1,ut−1)), state-transition probabilities (probability distribution), and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper; and Page 1924, Column 2; “As training inputs to the GP we used state-action pairs (xt−1,ut−1)”, which discloses the training data).
Englert fails to explicitly disclose learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model.
Eleftheriadis discloses learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions [circumflex over (v)](s, w) or [circumflex over (q)](s, a, w). The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model; and Figure 8; the figure discloses the processor).
The motivation to combine Englert and Eleftheriadis is the same as discussed above with respect to claim 1.

Regarding claim 17, the rejection of claim 15 is incorporated and Englert further discloses wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 1923, Column 1; “Throughout this paper, we use the following notation: States are denoted by x ∈ R^D and actions as u ∈ R^E, respectively. Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u”, which discloses, under a broadest reasonable interpretation of the claim language, that the states and actions are fixed for a fixed time horizon in learning the action policy, the learning of the action policy being described in detail on page 1924, section III of the paper).

Regarding claim 18, Englert discloses [a] computer processing system for learning an action policy, comprising: a memory for storing program code; and a processor, operatively coupled to the memory, for running the program code to (Abstract; “We present an imitation-learning approach to efficiently learn a task from expert demonstrations. Instead of finding policies indirectly, either via state-action mappings (behavioral cloning), or cost function learning (inverse reinforcement learning), our goal is to find policies directly such that predicted trajectories match observed ones”, which discloses a method for learning an action policy.  Note that the method, implemented through a system, is inherently implemented on a computer with a processor and memory that contains code, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks; and Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”)
obtain environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states)
train, using the environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities; and (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and outputs, for each next state (next state xt = f(xt−1,ut−1)), state-transition probabilities (probability distribution); and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper; and Page 1924, Column 2; “As training inputs to the GP we used state-action pairs (xt−1,ut−1)”, which discloses the training data).
Englert fails to explicitly disclose learn the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model.
Eleftheriadis discloses learn the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions [circumflex over (v)](s, w) or [circumflex over (q)](s, a, w). The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model).
The motivation to combine Englert and Eleftheriadis is the same as discussed above with respect to claim 1.

Regarding claim 20, the rejection of claim 18 is incorporated and Englert further discloses wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 1923, Column 1; “Throughout this paper, we use the following notation: States are denoted by x ∈ R^D and actions as u ∈ R^E, respectively. Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u”, which discloses, under a broadest reasonable interpretation of the claim language, that the states and actions are fixed for a fixed time horizon in learning the action policy, the learning of the action policy being described in detail on page 1924, section III of the paper).

Claims 2-5, 8, 10, 16, and 19 are rejected under 35 U.S.C. § 103 as being obvious over Englert in view of Eleftheriadis and further in view of Kimura et al. (Kimura et al., “Reward Estimation via State Prediction”, Feb. 15, 2018, ICLR 2018 Conference Blind Submission, pp. 1-14, hereinafter “Kimura”).

Regarding claim 2, the rejection of claim 1 is incorporated and Englert further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”, which discloses learning a predictor for predicted robot trajectories; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory)
Englert fails to explicitly disclose performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as ‘model-free’ RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know a priori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
Englert, Eleftheriadis, and Kimura are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the model-free inverse reinforcement learning of Kimura with the learning of a predictor to predict next states and method of Englert and Eleftheriadis to yield the predictable result of wherein said obtaining step comprises: learning a predictor which predicts a next state using the trajectories of the expert states; and performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics. The motivation for doing so would be to guide an agent to mimic expert behavior (Kimura; Abstract).
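
As a hedged illustration of the reward estimation relied upon from Kimura, the sketch below computes a reward from the difference between the agent's actual next state and the next state predicted by a model learned from expert state trajectories. Kimura's Abstract states only that the reward is “a function of the difference”; the negative squared distance, the one-dimensional linear predictor, and the names fit_predictor and estimated_reward are assumptions made for illustration and are not taken from Kimura or the application.

```python
def fit_predictor(expert_states):
    """Fit s_next ~ c*s to consecutive expert states (assumed 1-D linear form)."""
    num = sum(s * n for s, n in zip(expert_states[:-1], expert_states[1:]))
    den = sum(s * s for s in expert_states[:-1])
    c = num / den
    return lambda s: c * s


def estimated_reward(predictor, state, actual_next):
    """Reward is higher when the agent's next state is close to the prediction."""
    diff = actual_next - predictor(state)
    # Assumed reward shape: negative squared prediction error.
    return -(diff * diff)


if __name__ == "__main__":
    predictor = fit_predictor([1.0, 0.5, 0.25, 0.125])
    # A transition that follows the expert pattern versus one that does not.
    print(estimated_reward(predictor, 1.0, 0.5))
    print(estimated_reward(predictor, 1.0, 1.0))
```

No reward function is supplied by the environment in this sketch; the reward is derived entirely from expert states, which is the property the rejection relies on.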

Regarding claim 3, the rejections of claims 1 and 2 are incorporated, but Englert fails to explicitly disclose wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method.
Kimura discloses wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method (Page 3, Last paragraph; “This method constrains exploration to the states that have been demonstrated by an expert and enables learning a policy that closely matches the expert”, which discloses, under a broadest reasonable interpretation of the claim language, performing the model-free reinforcement learning, as taught in section 1 of Kimura, during an exploration stage where the method constrains exploration to the states that have been demonstrated by an expert; and Page 13, §6.1; “The exploration policy is Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein (1930)) (θ = 0.15, µ = 0, σ = 0.01), size of reply memory is 1M, and optimizer is Adam (Kingma & Ba (2014))”, further disclosing the exploration stage).
Englert, Eleftheriadis, and Kimura are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the exploration stage of Kimura with the method of Englert and Eleftheriadis to yield the predictable result of wherein the model-free inverse reinforcement learning is performed during an exploration stage of the method. The motivation for doing so would be to enable learning a policy that closely matches an expert (Kimura; Page 3, Last paragraph).
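
The exploration policy quoted from Kimura §6.1 is an Ornstein-Uhlenbeck process with θ = 0.15, µ = 0, σ = 0.01. A minimal sketch of such a noise source follows. The Euler-Maruyama discretization with unit time step, the class name OUNoise, and the seed handling are assumptions and are not details taken from Kimura.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (assumed Euler-Maruyama scheme)."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.01, dt=1.0, seed=None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.x = mu  # start at the long-run mean

    def sample(self):
        # Mean-reverting drift pulls x back toward mu at rate theta.
        drift = self.theta * (self.mu - self.x) * self.dt
        # Gaussian diffusion term scaled by sigma * sqrt(dt).
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


if __name__ == "__main__":
    noise = OUNoise(seed=0)
    # Exploration noise that would be added to a policy's action at each step.
    print([noise.sample() for _ in range(3)])
```

Because successive samples are correlated, noise of this kind perturbs actions smoothly over time rather than independently at each step.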

Regarding claim 4, the rejections of claims 1 and 2 are incorporated, but Englert fails to explicitly disclose wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM).
Kimura discloses wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM) (Page 9, §4.3; “Hence, LSTM is trained for predicting absolute position of bird location given images”, which discloses learning or training a predictor using an LSTM machine learning mechanism; and Page 11, Conclusion; “temporal sequence prediction using LSTM”; and Page 8, Last paragraph; “The LSTM based prediction method learns to reach the target faster than the dense reward, while LSTM (s 0 ) has the best overall performance by learning with human-guided demonstration data”).
Englert, Eleftheriadis, and Kimura are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the LSTM of Kimura with the method of Englert and Eleftheriadis to yield the predictable result of wherein the predictor is learned using a machine learning mechanism selected from the group consisting of a Long Short-Term Memory (LSTM) and a Dynamic Boltzmann Machine (DyBM). The motivation for doing so would be to estimate a reward via state prediction by using state-only trajectories of the expert (Kimura; Conclusion).

Regarding claim 5, the rejections of claims 1, 2, and 4 are incorporated and Englert further discloses wherein the machine learning mechanism comprises a plurality of machine learning mechanisms that, in turn, form a time-series predictive model for predicting the next state using the trajectories of the expert states (Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory in a time-series using a time-series predictive model; and Abstract; “In this paper, we propose to learn probabilistic forward models to compute a probability distribution over trajectories”; and Page 1924, Column 1; “Since the KL divergence between trajectory distributions in Eq. (7) corresponds to a RL long-term cost function, see Eq. (9), we can apply RL algorithms to find optimal policies. In principle, any algorithm that can approximate trajectories is suitable. For instance, model-free methods based on trajectory sampling [27], [20] or model-based RL algorithms that learn forward models of the robot, and, subsequently, use them for predictions [12], [5], [23], [11], are suitable”, the RL algorithm being, under a broadest reasonable interpretation of the claim language, the plurality of time-series based machine learning mechanisms; see also §III for a further discussion of the time-series predictive model which is a plurality of machine learning mechanisms applied in a time-series based fashion).
In the alternative, Kimura further discloses wherein the machine learning mechanism comprises a plurality of machine learning mechanisms that, in turn, form a time-series predictive model for predicting the next state using the trajectories of the expert states (Page 3, §3.2; “As such, the next approach we take is to consider a temporal sequence prediction model that can be trained to predict the next state value given current state, based on the expert trajectories”, the time-series predictive model being the temporal sequence prediction model; and Page 4, §3.2.2).
The motivation to combine Englert, Eleftheriadis, and Kimura is the same as discussed above with respect to claim 4.

Regarding claim 8, the rejection of claim 1 is incorporated but Englert fails to explicitly disclose wherein said obtaining step is performed during a model-free exploration stage of the method.
Kimura discloses wherein said obtaining step is performed during a model-free exploration stage of the method (Page 3, Last paragraph; “This method constrains exploration to the states that have been demonstrated by an expert and enables learning a policy that closely matches the expert”, which discloses, under a broadest reasonable interpretation of the claim language, performing the obtaining step, or model-free reinforcement learning as taught in section 1 of Kimura, during a model-free exploration stage where the method constrains exploration to the states that have been demonstrated by an expert; and Page 13, §6.1; “The exploration policy is Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein (1930)) (θ = 0.15, µ = 0, σ = 0.01), size of reply memory is 1M, and optimizer is Adam (Kingma & Ba (2014))”, further disclosing the model-free exploration stage).
The motivation to combine Englert, Eleftheriadis, and Kimura is the same as discussed above with respect to claim 3.


Regarding claim 10, the rejection of claim 1 is incorporated but Englert fails to explicitly disclose performing one obstacle avoidance using the trained dynamics model.
Kimura discloses performing one obstacle avoidance using the trained dynamics model (Page 8, §4.2; the section discloses performing obstacle avoidance using the trained dynamics model. “The agent’s goal is to reach the target while avoiding the obstacle in this case”; and Figure 4).
Englert, Eleftheriadis, and Kimura are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the obstacle avoidance of Kimura with the method of Englert and Eleftheriadis to yield the predictable result of performing one obstacle avoidance using the trained dynamics model. The motivation for doing so would be to reach a target while avoiding an obstacle (Kimura; Page 8, §4.2).

Regarding claim 16, the rejection of claim 15 is incorporated and Englert further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”, which discloses learning a predictor for predicted robot trajectories; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory)
Englert fails to explicitly disclose performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as “model-free” RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know a priori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
The motivation to combine Englert, Eleftheriadis, and Kimura is the same as discussed above with respect to claim 2.

Regarding claim 19, the rejection of claim 18 is incorporated and Englert further discloses learning a predictor which predicts a next state using the trajectories of the expert states (Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”, which discloses learning a predictor for predicted robot trajectories; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory)
Englert fails to explicitly disclose performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics.
Kimura discloses performing model-free inverse reinforcement learning using rewards estimated by using the predictor to sample the environment dynamics (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as “model-free” RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know a priori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
The motivation to combine Englert, Eleftheriadis, and Kimura is the same as discussed above with respect to claim 2.



Claim 11 is rejected under 35 U.S.C. § 103 as being obvious over Englert in view of Eleftheriadis and further in view of Van Seijen et al. (US 20180165603 A1, hereinafter “Van Seijen”).

Regarding claim 11, the rejection of claim 1 is incorporated but Englert fails to explicitly disclose performing transfer learning between at least two agents using the trained dynamic model.
Van Seijen discloses performing transfer learning between at least two agents using the trained dynamic model ([0139]; “The agents were trained in parallel with off-policy learning using Q-learning. An aggregator function summed the Q-values for each action: a ∈ A_flat: Q^sum(a, X_t^flat) := Σ_i Q^i(a, X_t^i), and used ε-greedy action selection with respect to these summed values. The Q-table of both ghost-agents where the same, so benefit was gained from intra-task transfer learning by sharing the Q-table between the two ghost agents, which resulted in the ghost-agents learning twice as fast” (emphasis added), which discloses the transfer learning between two agents using the trained dynamic model; and [0006]; “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Englert, Eleftheriadis, and Van Seijen are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the transfer learning of Van Seijen with the method of Englert and Eleftheriadis to yield the predictable result of performing transfer learning between at least two agents using the trained dynamic model. The motivation for doing so would be to achieve faster learning in two agents (Van Seijen; [0139]).

Claims 21, 24, and 25 are rejected under 35 U.S.C. § 103 as being obvious over Englert in view of Kimura.

Regarding claim 21, Englert discloses [a] computer-implemented method for learning an action policy, comprising: (Abstract; “We present an imitation-learning approach to efficiently learn a task from expert demonstrations. Instead of finding policies indirectly, either via state-action mappings (behavioral cloning), or cost function learning (inverse reinforcement learning), our goal is to find policies directly such that predicted trajectories match observed ones”, which discloses a method for learning an action policy.  Note that the method is inherently implemented on a computer with a processor and memory, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks; and Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”)
learning, by a processor, a predictor which predicts a next state using trajectories of expert states (Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”, which discloses learning a predictor for predicted robot trajectories; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory, and as discussed above, this method is inherently performed by a computer that comprises a processor)
sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state; (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states)
training, by the processor using the sampled environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities to provide a trained dynamics model (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and output, for each new state (next state xt = f(xt−1,ut−1)) state-transition probabilities (probability distribution); and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper; and Page 1924, Column 2; “As training inputs to the GP we used state-action pairs (xt−1,ut−1)”, which discloses the training data).
Englert fails to explicitly disclose performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor.
Kimura discloses performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as “model-free” RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know a priori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
Englert and Kimura are analogous art because both are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the model-free inverse reinforcement learning of Kimura with the sampled environment dynamics and method of Englert to yield the predictable result of performing, by the processor, model-free inverse reinforcement learning using rewards estimated by using the predictor to sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state. The motivation for doing so would be to guide an agent to mimic expert behavior (Kimura; Abstract).

Regarding claim 24, Englert discloses [a] non-transitory article of manufacture tangibly embodying a computer readable program which when executed causes a computer to perform (the method is inherently implemented on a computer with a processor and memory or non-transitory article of manufacture embodying a program, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks)
the steps of claim 21 (see the rejection of claim 21 above, where both Englert and Kimura disclose the steps of claim 21).
The motivation to combine Englert and Kimura is the same as discussed above with respect to claim 21.

Regarding claim 25, Englert discloses [a] computer processing system for learning an action policy, comprising: a memory for storing program code; and a processor, operatively coupled to the memory, for running the program code to (Abstract; “We present an imitation-learning approach to efficiently learn a task from expert demonstrations. Instead of finding policies indirectly, either via state-action mappings (behavioral cloning), or cost function learning (inverse reinforcement learning), our goal is to find policies directly such that predicted trajectories match observed ones”, which discloses a method for learning an action policy.  Note that the system is inherently implemented on a computer with a processor and memory and code, as suggested by the §IV Experiments section where the method is implemented for different simulated tasks; and Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”)
learn a predictor which predicts a next state using trajectories of expert states; and (Page 1922, Column 2; “The key idea is to find a robot-specific policy such that the observed expert trajectories and the predicted robot trajectories match”, which discloses learning a predictor for predicted robot trajectories; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states are used to predict a next state in a trajectory, and as discussed above, this method is inherently performed by a computer that comprises a processor)
sample environment dynamics including triplets of a state, an action, and a next state, wherein the state in each of the triplets is an expert state (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states)
train, using the sampled environment dynamics as training data, a dynamics model which obtains a pair of the state and the action as an input and outputs, for each next state, state-transition probabilities to provide a trained dynamics model (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and output, for each new state (next state xt = f(xt−1,ut−1)) state-transition probabilities (probability distribution); and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper; and Page 1924, Column 2; “As training inputs to the GP we used state-action pairs (xt−1,ut−1)”, which discloses the training data).
Englert fails to explicitly disclose perform model-free inverse reinforcement learning using rewards estimated by using the predictor.
Kimura discloses perform model-free inverse reinforcement learning using rewards estimated by using the predictor (Abstract; “The reward signal is computed by a function of the difference between the actual next state acquired by the agent and the predicted next state given by the learned generative or predictive model. With this inferred reward function, we perform standard reinforcement learning in the inner loop to guide the agent to learn the given task”, the inferred reward function being a product of model-free inverse reinforcement learning, and it uses rewards estimated by using a predictor to sample the environment dynamics; and Page 1, Introduction, ¶1; “Typically, a scalar reward signal is used to guide the agent’s behavior and the agent learns a control policy that maximizes the cumulative reward over a trajectory, based on observations. This type of learning is referred to as “model-free” RL since the agent does not know apriori or learn the dynamics of the environment.”, which describes that model-free learning is performed since the agent does not know a priori or learn the dynamics of the environment, and a reward function is not known; and Page 2, ¶1; “As such, in this work, we propose a reward estimation method that can estimate the underlying reward based only on the expert demonstrations of state trajectories for accomplishing a given task. The estimated reward function can be used in RL algorithms in order to learn a suitable policy for the task”, the estimating of the reward function being accomplished using inverse reinforcement learning; and Page 2, ¶5; “An alternate approach is Inverse Reinforcement Learning (IRL) proposed in the seminal work by Ng & Russell (2000). In this work, the authors try to recover the optimal reward function as a best description behind the given expert demonstrations from humans or other expert agents”, which clarifies that inverse reinforcement learning is the inferencing or estimation of a reward given expert demonstrations, which is what the paper is making use of).
The motivation to combine Englert and Kimura is the same as discussed above with respect to claim 21.

Claims 22 and 23 are rejected under 35 U.S.C. § 103 as being obvious over Englert in view of Kimura and Eleftheriadis.

Regarding claim 22, the rejection of claim 21 is incorporated and Englert further discloses the expert states (Page 1924, §A; “A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1)”, which discloses that the triplet of a state, action, and next state are obtained, and as discussed above, this method is inherently performed by a computer that comprises a processor; and Page 1923, Column 1; “Furthermore, a trajectory as τ comprises a sequence of states x0, . . . , xT for a fixed time horizon T. The policy π maps a state x to a corresponding action u. In the context of imitation learning, our objective is to find a policy π, such that the robot’s predicted trajectory τ π matches the observed expert trajectory τ exp.”, the expert trajectories (or expert sequences of states) being the expert states) and
	the trained dynamics model (Page 1924, §A; “Learning a Probabilistic Forward Model A forward model maps a state xt−1 and action ut−1 of the system to the next state xt = f(xt−1,ut−1). Such a model can be used to represent the transition dynamics of a robot. We represent the model by a probability distribution over models and implemented as a GP”, which discloses training or learning a dynamics model which obtains a pair of state and action (state xt−1 and action ut−1) as the input and output, for each new state (next state xt = f(xt−1,ut−1)) state-transition probabilities (probability distribution); and Page 1925, Column 1; “Based on the PILCO framework, we use the learned GP forward model for iteratively predicting the state distributions p(x1), . . . , p(xT ) for a given policy π and an initial state distribution p(x0) . . . The transition probability p(f(x˜t−1)|x˜t−1) is the GP predictive distribution given in Eqs. (14)–(15)”, which further discloses the state transition probabilities; and Page 1924, Algorithm 1, Line 3; the algorithm discloses pseudocode for the training of a dynamics model which is discussed in further detail in §III.A of the paper)
Englert fails to explicitly disclose learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model.
Eleftheriadis discloses learning, by the processor, the action policy using the trajectories of the expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model ([0055]; “Policy learner 419 receives experience data from experience buffer 425 and implements, at S513, a reinforcement learning algorithm. The specific choice of reinforcement learning algorithms implemented by policy learner 419 is selected by a user and may be chosen depending on the nature of a specific reinforcement learning problem. In a specific example, policy learner 419 implements a temporal-difference learning algorithm, and uses supervised-learning function approximation to frame the reinforcement learning problem as a supervised learning problem, in which each backup plays the role of a training example. Supervised-learning function approximation allows a range of well-known gradient descent methods to be utilised by a learner in order to learn approximate value functions [circumflex over (v)](s, w) or [circumflex over (q)](s, a, w). The policy learner 419 may use the backpropagation algorithm for DNNs, in which case the vector of weights w for each DNN is a vector of connection weights in the DNN” (emphasis added), which discloses learning an action policy using trajectories of states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model; and Figure 8; the figure discloses the processor).
Englert, Kimura, and Eleftheriadis are analogous art because all are concerned with reinforcement learning.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in reinforcement learning to combine the supervised learning and backpropagation of Eleftheriadis with the expert states, dynamics model, and method of Englert and Kimura to yield the predictable result of learning, by the processor, the action policy using trajectories of expert states according to a supervised learning technique by back-propagating error gradients through the trained dynamics model. The motivation for doing so would be to learn approximate value functions for a policy learning system (Eleftheriadis; [0055]).

Regarding claim 23, the rejections of claims 21 and 22 are incorporated, and Englert further discloses wherein parameters of the dynamics model are fixed in the learning of the action policy (Page 1923, Column 1; “Throughout this paper, we use the following notation: States are denoted by x ∈ R^D and actions as u ∈ R^E, respectively. Furthermore, a trajectory as τ comprises a sequence of states x_0, ..., x_T for a fixed time horizon T. The policy π maps a state x to a corresponding action u”, which discloses, under a broadest reasonable interpretation of the claim language, that the states and actions are fixed for a fixed time horizon in learning the action policy, the learning of the action policy being described in detail on page 1924, section III of the paper).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Boularias et al., “Relative Entropy Inverse Reinforcement Learning”, 2011, Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, pp. 182-189.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRENT JOHNSTON HOOVER whose telephone number is (303)297-4403.  The examiner can normally be reached on Monday - Friday 9-5 MST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at 571-272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
 
/BRENT JOHNSTON HOOVER/Examiner, Art Unit 2125                                                                                                                                                                                                        

        1 Note that page 1924, Algorithm 1, line 7 of Englert does disclose a gradient-based policy improvement, but this does not explicitly recite a supervised learning technique by back-propagating error gradients as claimed.
        2 Note that page 1924, Algorithm 1, line 7 of Englert does disclose a gradient-based policy improvement, but this does not explicitly recite a supervised learning technique by back-propagating error gradients as claimed.
        3 Note that page 1924, Algorithm 1, line 7 of Englert does disclose a gradient-based policy improvement, but this does not explicitly recite a supervised learning technique by back-propagating error gradients as claimed.
        4 Note that this reference qualifies as prior art under MPEP §2153.01(a) because the application names fewer joint inventors than the Kimura reference used in the rejection (“If, however, the application names fewer joint inventors than a publication (e.g., the application names as joint inventors A and B, and the publication names as authors A, B and C), it would not be readily apparent from the publication that it is by the inventor (i.e., the inventive entity) or a joint inventor and the publication would be treated as prior art under AIA 35 U.S.C. 102(a)(1).”).  Further note that the Kimura reference is labeled as having an anonymous source of inventorship or authorship, but upon further inspection, it appears that the Kimura reference names “Daiki Kimura, Subhajit Chaudhury, Ryuki Tachibana, and Sakyasingha Dasgupta” as the authors of this publication. See https://openreview.net/forum?id=HktXuGb0-.  Because Sakyasingha Dasgupta is not listed as an inventor in the present application, the Kimura reference qualifies as prior art as discussed above with respect to MPEP §2153.01(a).