Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/26/2021 has been entered. 

Allowable Subject Matter
Claims 5-8 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Applicant's arguments filed on 10/26/2021 have been fully considered but they are not persuasive.
In Remarks, pp. 11-13, Applicant contends: 
“Firstly, referring to the combination of Argall and Daniel, Applicants contend that the combination is impermissible as it is improper and based on impermissible hindsight. … As a conclusion, it appears that any such combination is a clear improper practicing of hindsight utilization as there is no real basis or rational for combining the two references, except when attempting to imitate the disclosed claimed invention. Furthermore, Applicants contend that for one of ordinary skill in the art, when seeking for 

Examiner’s response:
Applicant’s arguments have been considered but are moot because the Argall reference is no longer relied upon and independent claims 1, 14 and 17 are now rejected over a combination of the Guenter, Suleman, Daniel and MNIH references.

In Remarks, pp. 13-14, Applicant contends: 
“Applicants contend that both Argall and Daniel already teach learning techniques to teach robotic machines to perform tasks, each of them in a different and known in the art methodology. The Examiner did not provide sufficient rationale why would anybody seek to specifically add the neural network of Mnih to the methods of Argall and/or Daniel who already present their solution for robotic task learning. That is, adding Mnih has no reasonable motivation or apparent need for any of the methods of Argall and Daniel except when attempting to imitate the claim limitations, which is a clear impermissible hindsight.”

Examiner’s response:
Please see the clarified motivation for the rejections on claims 1, 14 and 17 based on a combination of the Guenter, Suleman, Daniel and MNIH references.
Therefore, the applicant’s arguments are not convincing.

Applicant’s arguments with respect to the limitation “gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one sensor output value associated with said performance of said a robotic actuator” of claims 1, 14 and 17 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection of this limitation. 
Applicant’s arguments with respect to the limitation “updating at least some of said plurality of neural network parameters by receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state comprising at least one sensor output value, while said robotic actuator performs said target task according to said neural network dataset” of claims 1, 14 and 17 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection of this limitation. Please see the rejections for how the existing references have been applied to the limitation.
Applicant’s arguments with respect to the limitation “performing by a robotic actuator a target task according to a neural network dataset; while the robotic actuator performs the target task, gathering a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one sensor output value, by in each of a plurality of reward training iterations: receiving a world state comprising at least one sensor output value” of claim 19 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection of this limitation. 
Applicant’s arguments with respect to the limitation “receiving from the instructor via an input device a score given by the instructor to the world state” of claim 19 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection of this limitation. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 9-10, 12-14 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Guenter et al. (Reinforcement Learning for Imitating Constrained Reaching Movements) in view of Suleman et al. (Learning from demonstration in robots: Experimental comparison of neural architectures), further in view of Daniel et al. (Active Reward Learning), further in view of MNIH et al. (US 2015/0100530 A1).

Regarding claim 1, 

receiving data documenting a plurality of actions demonstrated by a demonstrating actuator performing a target task in a plurality of initial iterations; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The first step is to define the policy of our system. In order to explore the space of the parameters, we introduce a stochastic Gaussian control policy: 
    [equation image: media_image1.png]
, (15) where ξ˙m is the modulation speed of the dynamical system defined in Eq. (5), Σξ˙ is the covariance matrix (see Eq. (7)) of the Gaussian control policy and ξ˙r is the noisy command generated by ρ(ξ˙r, ξ˙m, Σξ˙) and used to explore the parameters space. We consider here that the demonstrations performed by the user are sufficiently informative to allow the robot to reproduce the task in standard conditions. We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly.”)
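For illustration only, the exploration step described in the passage cited above can be sketched as drawing a noisy command from a Gaussian centered on the modulation speed. This is a minimal sketch and not the reference's implementation: the function name, the numeric values, and the simplification to a diagonal covariance are all assumptions made here.

```python
import random

def sample_noisy_command(mod_speed, variances, rng):
    """Draw a noisy command around the modulation speed (the Gaussian mean).

    Assumption: the covariance is diagonal, so each dimension is sampled
    independently. The cited passage refers to a full covariance matrix.
    """
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mod_speed, variances)]

rng = random.Random(0)
mod_speed = [0.10, -0.05, 0.20]   # hypothetical joint-velocity means
variances = [0.01, 0.01, 0.04]    # hypothetical per-joint variances

# Averaging many noisy commands should recover the modulation speed.
samples = [sample_noisy_command(mod_speed, variances, rng) for _ in range(5000)]
mean_estimate = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
```

Keeping the covariance fixed while only the means are learned mirrors the cited statement that the covariance is kept to respect the constraints taught by the demonstrator.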

(Note: Hereinafter, if a limitation has brackets (i.e. [ ]) around claim languages, the bracketed claim languages indicate that they have not been taught yet by the current prior art reference but they will be taught by the other prior art reference(s) afterwards.)

calculating using said data a [neural network] dataset having a plurality of [neural network] parameters and used for mimicking said demonstrated plurality of actions in performing said target task; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] as cited above; e.g., parameters/variables related to the GMM model may read on “a [neural network] dataset having a plurality of [neural network] parameters”.)

gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said [neural network] dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator; (Guenter, [fig 2], [fig 6], [fig 7] “evolution of the reward function”; [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “evaluate the produced trajectory” may read on “score”. In addition, e.g., “reinforcement learning” of fig 2 may read on “reward training iterations”.)
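For illustration only, the two-term structure described in the quoted passage (a similarity term and a distance-to-target term, weighted by c1 and c2) can be sketched as follows. Eq. (16) itself is not reproduced in the text of this action, so the use of Euclidean norms, the negative signs, and every name below are assumptions and not the reference's exact formula.

```python
import math

def episodic_reward(noisy_speeds, first_trial_speeds, final_position, target,
                    c1=1.0, c2=1.0):
    """Hypothetical episodic reward with the two terms named in the citation.

    Term 1: dissimilarity between the tested speeds and the first-trial
            modulation speeds, summed over time steps.
    Term 2: distance between the last trajectory point and the target.
    Both are penalised, so a perfect episode scores 0.
    """
    similarity_cost = sum(
        math.dist(r, m) for r, m in zip(noisy_speeds, first_trial_speeds)
    )
    target_cost = math.dist(final_position, target)
    return -c1 * similarity_cost - c2 * target_cost

# One time step, speeds differ by 1, end point is 5 away from the target.
reward = episodic_reward([[1.0, 0.0]], [[0.0, 0.0]], [0.0, 0.0], [3.0, 4.0])
```

With the default weights, the example evaluates to -6.0: a cost of 1 for the speed mismatch plus a cost of 5 for missing the target.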

calculating, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning.”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset having a second plurality of [neural network] parameters”.)

computing, through [machine learning], a reward function from said reward [neural network] dataset; (Guenter, [figs 2-3]; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning. … The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [equation image: media_image2.png]
, (16)”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset”.)

receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [fig 7] “Reward”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “policy training iterations”.)

updating at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration; and (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [fig 7] “Reward”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “policy training iterations”.)

[outputting] said updated [neural network] dataset. (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] as cited above; e.g., “learning new means µk,ξ˙” may read on “updated [neural network] dataset”.)

Guenter does not teach
calculating using said data a [neural network] dataset having a plurality of [neural network] parameters and used for mimicking said demonstrated plurality of actions in performing said target task;

calculating, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
computing, through [machine learning], a reward function from said reward [neural network] dataset;
receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value;
updating at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration; and
[outputting] said updated [neural network] dataset.

(Note: Hereinafter, underlined claim language in a limitation is taught by the prior art reference currently being applied, while non-underlined claim language has already been taught by one or more of the previously applied references.)

Suleman teaches
calculating using said data a neural network dataset having a plurality of neural network parameters and used for mimicking said demonstrated plurality of actions in performing said target task; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “Since the learner robot would eventually be run via an independent controller responsible to carry on the action performed by the demonstrator robot in the demonstration mode, it must learn the relation between the demonstrator robot controller’s inputs and output, i.e. sensor values and corresponding actuator commands, respectively. Therefore the primary function of this mode is training of a controller, i.e. the learner controller that is logic equivalent to the demonstrator controller logic. The functional depiction of learning mode is given in Fig. 3. The demonstration streams from the demonstration log are fed in to the ANN for training. The training data is formulated such that the sensor values are inputs and the time-wise corresponding actuator values are the output targets. Once the learning is complete the resulting ANN is stored.”; e.g., “learning” may read on “calculating”. Note that Guenter teaches “calculating using said data a [neural network] dataset having a plurality of [neural network] parameters and used for mimicking said demonstrated plurality of actions in performing said target task”.)
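For illustration only, the training arrangement quoted above (sensor values as inputs, time-wise corresponding actuator values as output targets) can be sketched with a single linear unit fitted by gradient descent. This is a deliberately reduced stand-in for the ANN of the reference; the log contents, the function name, and the hyperparameters are assumptions.

```python
def train_linear_controller(log, lr=0.1, epochs=500):
    """Fit actuator = w * sensor + b by stochastic gradient descent.

    `log` is a list of (sensor_value, actuator_value) pairs, i.e. a
    hypothetical demonstration log with one sensor and one actuator.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for sensor, actuator in log:
            err = (w * sensor + b) - actuator   # prediction error on this sample
            w -= lr * err * sensor              # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err                       # gradient of 0.5 * err**2 w.r.t. b
    return w, b

# Hypothetical demonstrator controller: actuator = 0.5 * sensor + 0.2
demo_log = [(s / 10, 0.5 * (s / 10) + 0.2) for s in range(11)]
w, b = train_linear_controller(demo_log)
```

The learned pair (w, b) plays the role of the stored parameters that the learner controller would later use to map sensor readings to actuator commands.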

gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said [neural network] dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator”.)

receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value”.)

updating at least some of said plurality of neural network parameters based on said received reward value of each of said plurality of policy training iteration; and (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “updating at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration”.)

outputting said updated neural network dataset. (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [table 2] “Training demos.”; [table 3] “Best configurations from experiment 1 expressed in terms of hidden neurons (h) and delay (d).”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; e.g., table 3 may read on “outputting”. Note that Guenter teaches “said updated [neural network] dataset”.)

Guenter and Suleman are both in the same field of endeavor, namely processing input signals with an autonomous system, and are analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter with the neural network of Suleman. Doing so would provide a strong candidate for implementing robot control, together with a strong subset of neural architectures and compatible algorithms for supervised learning. (Suleman, [sec 1] “ANNs for example, as general purpose universal problem solvers [6], offer a strong subset of neural architectures and compatible algorithms for supervised learning. ANNs are naturally suited to robot control implementation and thus have been frequently used in robotics beyond LFD [7–14]. … Since LFD is a suitable paradigm to implement robot control whereas ANNs offer a strong candidate to implement robot control, it is useful to gage ANNs performance for LFD application.”)

Guenter and Suleman do not teach
gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator;
calculating, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
computing, through [machine learning], a reward function from said reward [neural network] dataset;
receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value;

Daniel teaches
gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator; (Daniel, [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” may read on “a plurality of scores given by an instructor to a plurality of world states”. Note that Guenter and Suleman teach “gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator”.)

computing, through machine learning, a reward function from said reward [neural network] dataset; (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; e.g., “expert ratings” may read on “scores”. In addition, reward model updating may read on “computing, through machine learning, a reward function” since the reward model is updated based on the reward model update module. Note that Guenter teaches “computing, through [machine learning], a reward function from said reward [neural network] dataset”.)

In the alternative, Daniel can also be interpreted to teach the following limitations:
updating at least some of said plurality of neural network parameters based on said received reward value of each of said plurality of policy training iteration; and (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; [sec II. A] “The reward model relies on the policy π(ω|s) to provide outcomes in interesting, i.e., high reward regions, and the policy relies on the reward model p(R|o) to guide its exploration towards such regions of interest.” e.g., “Policy update” may read on “updating at least some of said plurality of … parameters”. In addition, Table I may read on “based on said received reward value of each of said plurality of policy training iteration” since a reward value in each iteration is calculated based on the latest reward model for updating the policy weights. Note that Suleman teaches “at least some of said plurality of neural network parameters”.)

outputting said updated neural network dataset. (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; [sec II. A] “The reward model relies on the policy π(ω|s) to provide outcomes in interesting, i.e., high reward regions, and the policy relies on the reward model p(R|o) to guide its exploration towards such regions of interest.”; e.g., “Policy update” may read on “outputting said updated … dataset” since the updated policy weights are produced (i.e. outputted). Note that Suleman teaches “neural network”.)

Guenter, Suleman and Daniel are all in the same field of endeavor, namely processing input signals with an autonomous system, and are analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter and Suleman with the instructor rating of Daniel. Doing so would improve the policy model for accomplishing target tasks by updating the reward model based on expert ratings. (Daniel, [secs I-II, pg 2] “Obviously, it is not sufficient to build the reward function model on samples from the initial policy π0(ω|s), as the agent is likely to exhibit poor performance in the early stages of learning and the reward learner would not observe good samples. Therefore, we need to couple the process of learning a good policy π(ω|s) and a good reward function Rˆ(o), such that they are developed interdependently.”)
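For illustration only, the coupling described in the quoted passage (a reward model updated from occasional expert ratings, and a policy updated from that reward model) can be sketched as the loop below. This is a minimal sketch under stated assumptions: the expert, the nearest-neighbour reward model (used here in place of the probabilistic reward model of the reference), the one-dimensional outcome, and the policy update rule are all hypothetical.

```python
import random

def expert_rating(outcome):
    # Hypothetical expert who prefers outcomes near 3.0.
    return -(outcome - 3.0) ** 2

def predict_reward(rated, outcome):
    # Nearest-neighbour stand-in for a learned reward model:
    # return the rating of the closest outcome rated so far.
    nearest = min(rated, key=lambda pair: abs(pair[0] - outcome))
    return nearest[1]

def active_reward_learning(iterations=30, rollouts=10, query_every=2, seed=0):
    rng = random.Random(seed)
    mean = 0.0      # policy parameter: mean of a Gaussian over outcomes
    rated = []      # (outcome, expert rating) pairs gathered so far
    for it in range(iterations):
        outcomes = [rng.gauss(mean, 1.0) for _ in range(rollouts)]
        if it % query_every == 0:
            # Query the expert only on some iterations.
            rated.extend((o, expert_rating(o)) for o in outcomes)
        # Policy update: move toward the rollout the reward model scores highest.
        best = max(outcomes, key=lambda o: predict_reward(rated, o))
        mean += 0.5 * (best - mean)
    return mean, len(rated)

final_mean, n_rated = active_reward_learning()
```

The point of the sketch is the interleaving: the policy supplies new outcomes for the expert to rate, and the updated reward model in turn steers the next policy update.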

Guenter, Suleman and Daniel do not teach
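gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator;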
calculating, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
computing, through machine learning, a reward function from said reward [neural network] dataset;
receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value;

MNIH teaches
gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one sensor output value associated with said performance of said robotic actuator; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “gathering, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value associated with said performance of said robotic actuator”.)

calculating, using said plurality of scores, a reward neural network dataset having a second plurality of neural network parameters (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action. Additionally or alternatively however the reward/cost may be defined by parameters of the system or engineering problem to be solved. … In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. … In some preferred implementations the reward (or cost) is recorded in the experience data store when storing data for a transition.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset having a second plurality of neural network parameters”. Note that Guenter, Suleman and Daniel teach “calculating, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters”.);

computing, through machine learning, a reward function from said reward neural network dataset (MNIH: [figs 2, 5]; [pars 16-19] and [par 71] as cited above; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset”. Note that Guenter, Suleman and Daniel teach “computing, through machine learning, a reward function from said reward [neural network] dataset”.);

sensor output value; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “receiving in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value”.)

Guenter, Suleman, Daniel and MNIH are all in the same field of endeavor of processing input signals with an autonomous system and are analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter, Suleman and Daniel with the reward neural network of MNIH. Doing so would avoid divergences in estimating parameters, substantially speed up processing operations, and allow the system to learn to discriminate features of the sensory input that are relevant to the available actions which can be performed. (MNIH, [pars 11-19] “two neural networks are maintained to avoid the divergences which can otherwise occur when estimating an action-value parameter, in particular where a neural network would otherwise be updated based on its own predictions. … In this way the system can itself learn to discriminate features of the sensory input which are relevant to the available actions which can be performed. … This substantially speeds up operation by effectively processing the actions in parallel, allowing a subsequent selector module (either code/ software, or hardware), coupled to the outputs of the neural network, to select the maximum/minimum output value, the node with this value defining the corresponding action to be taken.”).

Regarding claim 2, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Guenter further teaches 
performing said target task comprises instructing at least one controller to perform one or more of an identified set of controller actions. (Guenter, [fig 1] “Programming a robot by demonstration means that a user demonstrates a task and the robot has to extract the important features of the task in order to be able to reproduce it in different situations. It might happen in some special cases that the robot encounters a new situation where the demonstration does not help to fulfill the task.”; [sec 2.3] “By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly.”)

Regarding claim 3, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Daniel further teaches 
said reward value is in a predefined range of reward values (Daniel: [figs 2, 5]; [sec IV, B] “The scheme presented in Fig. 3 labels grasps as failures with a reward of −1, if the object was not lifted at all or slipped when slightly perturbed. Grasps that were stable but did not keep the intended orientation were given a reward of 0. Finally, grasps that lifted the object and kept the orientation of the object were assigned a reward of 1.”).

Guenter, Suleman, Daniel and MNIH are combinable with Daniel for the same rationale as set forth above with respect to claim 1.
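
For illustration only, the three-level labeling scheme quoted from Daniel [sec IV, B] above can be sketched as follows. The function name and the boolean flags are hypothetical and do not appear in Daniel; the sketch merely shows that every assigned reward falls in the predefined range of -1 to 1.

```python
def grasp_reward(lifted: bool, slipped: bool, kept_orientation: bool) -> int:
    """Hypothetical helper mirroring the labeling scheme quoted from Daniel.

    The three flags are assumed observations about one grasp attempt;
    they are illustrative names, not terms used in the reference.
    """
    # Failure: object not lifted at all, or it slipped when perturbed.
    if not lifted or slipped:
        return -1
    # Stable grasp that did not keep the intended orientation.
    if not kept_orientation:
        return 0
    # Object lifted and orientation kept.
    return 1


# Every possible label lies in the predefined range [-1, 1].
labels = {
    grasp_reward(lifted, slipped, kept)
    for lifted in (False, True)
    for slipped in (False, True)
    for kept in (False, True)
}
```
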

In the alternative, MNIH can also be interpreted to teach this limitation:
MNIH further teaches 
said reward value is in a predefined range of reward values (MNIH: [figs 2, 5]; [par 85] “However, since the scale of scores varies greatly from game to game, all positive rewards were fixed to be 1 and all negative rewards to be -1, leaving 0 rewards unchanged.”).

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.
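
For illustration only, the reward fixing quoted from MNIH [par 85] above (positive rewards set to 1, negative rewards set to -1, zero rewards unchanged) can be sketched as follows. The function name is hypothetical and the sketch is not taken from the reference.

```python
def clip_reward(raw_reward: float) -> int:
    """Hypothetical helper: map any raw score change to -1, 0 or 1."""
    if raw_reward > 0:
        return 1   # all positive rewards fixed to 1
    if raw_reward < 0:
        return -1  # all negative rewards fixed to -1
    return 0       # zero rewards left unchanged


# Raw scores on very different scales end up in the same predefined range.
clipped = [clip_reward(r) for r in (250.0, -3.5, 0.0, 0.01)]
```
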

Regarding claim 9, 
Guenter, Suleman, Daniel and MNIH teach claim 2.

Guenter further teaches 
said data comprises a plurality of sensor output values and a plurality of controller actions instructed by said actuator to perform said target task. (Guenter, [fig 1] “Programming a robot by demonstration means that a user demonstrates a task and the robot has to extract the important features of the task in order to be able to reproduce it in different situations. It might happen in some special cases that the robot encounters a new situation where the demonstration does not help to fulfill the task.”; [sec 3, p. 10] “It means that, for each demonstration, the demonstrator moved the robot arm to perform the task. The robot recorded the trajectories of its 4 DOFs using its own encoders. The specification of the demonstration for the first task was to raise the hand in order to reach the box from above. For the second task, the specification was to grasp the chess queen following either a vertical movement (to grasp it from above) or a horizontal movement that is constraint by the need of a specific direction.”)

Regarding claim 10, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Guenter further teaches 
receiving a preliminary plurality of sensor output values; and (Guenter, [fig 1]; [sec 3, p. 10] “It means that, for each demonstration, the demonstrator moved the robot arm to perform the task. The robot recorded the trajectories of its 4 DOFs using its own encoders. The specification of the demonstration for the first task was to raise the hand in order to reach the box from above. For the second task, the specification was to grasp the chess queen following either a vertical movement (to grasp it from above) or a horizontal movement that is constraint by the need of a specific direction.”)

Daniel further teaches 
calculating a preliminary feature [neural network] dataset used for identifying a preliminary set of features of an environment of said robotic actuator (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “We use the term ‘context’ instead of the more general term ‘state’ to indicate the initial state of the robot and the environment. … The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”);

Guenter, Suleman, Daniel and MNIH are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

MNIH further teaches 
calculating a preliminary feature neural network dataset used for identifying a preliminary set of features of an environment of said robotic actuator (MNIH: [figs 2, 5]; [pars 12-19] “This in turn facilitates the use of very large quantities of training data and, in particular, the use of sensory data such as image data or sound data (waveforms) for the state data. Embodiments of the technique may be trained directly on visual images and/or sound, and thus the reinforcement learning may be applied end to end, from this input to the output actions. This enables learning of features that may be directly relevant to discriminating action-values … In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; Note that Daniel teaches “calculating a preliminary feature [neural network] dataset used for identifying a preliminary set of features of an environment of said robotic actuator”.);

wherein calculating said neural network dataset further comprises using said preliminary feature neural network dataset (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”);

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

Regarding claim 12, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Guenter further teaches 
repeating in each of one or more iterations: 
gathering in a plurality of new reward training iterations a plurality of new scores given by said [instructor] to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated [neural network] dataset; (Guenter, [fig 2], [fig 6] “The dynamical system alone would have produced the trajectory represented by the dotted line. … we use a convergence criterion to stop the learning, in order to show the convergence of the reinforcement learning module”; [fig 7] “evolution of the reward function”; [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 
[Equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “updated [neural network] dataset”. In addition, e.g., “evaluate the produced trajectory” may read on “score”. In addition, e.g., “reinforcement learning” of fig 2 may read on “new reward training iterations”.)

calculating using said plurality of new scores a new reward [neural network] dataset having a fourth plurality of [neural network] parameters and used for computing a new reward function; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning. … The episodic reward function used to evaluate the produced trajectory is defined as follows: 
[Equation image: media_image2.png]
, (16)”; e.g., parameters/variables related to the NAC algorithm may read on “new reward [neural network] dataset having a fourth plurality of [neural network] parameters”.)

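updating at least some of said plurality of [neural network] parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated [neural network] dataset; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [fig 7] “Reward”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 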
[Equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “policy training iterations”.)

Suleman further teaches 
gathering in a plurality of new reward training iterations a plurality of new scores given by said [instructor] to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “gathering in a plurality of new reward training iterations a plurality of new scores given by said [instructor] to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated [neural network] dataset”.)

updating at least some of said plurality of neural network parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “updating at least some of said plurality of [neural network] parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated [neural network] dataset”.)



Daniel further teaches 
gathering in a plurality of new reward training iterations a plurality of new scores given by said instructor to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset; (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” reads on “scores”. Note that Guenter and Suleman teach “gathering in a plurality of new reward training iterations a plurality of new scores given by said [instructor] to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset”.);

In the alternative, Daniel can also be interpreted to teach the following limitation:
updating at least some of said plurality of neural network parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset; (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; “Policy update” reads on “in each of a plurality of new policy training iterations”. In addition, Table I reads on “updating at least some of said plurality of … parameters by receiving in each of a plurality of new policy training iterations a new reward value” since the policy weights are updated based on a reward value. Note that Suleman teaches “neural network”.);
 
Guenter, Suleman, Daniel and MNIH are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

MNIH further teaches 
gathering in a plurality of new reward training iterations a plurality of new scores given by said instructor to a plurality of new world states, each new world state comprising at least one new sensor output value, while said robotic actuator performs said target task according to said updated neural network dataset; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “gathering in a plurality of new reward training iterations a plurality of new scores given by said instructor to a plurality of new world states, each new world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset”.)

calculating using said plurality of new scores a new reward neural network dataset having a fourth plurality of neural network parameters and used for computing a new reward function (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action. Additionally or alternatively however the reward/cost may be defined by parameters of the system or engineering problem to be solved. … In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. … In some preferred implementations the reward (or cost) is recorded in the experience data store when storing data for a transition.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; see also [par 85]; e.g., “first neural network” may read on “new reward neural network dataset having a fourth plurality of neural network parameters” since the first neural network is updated. Note that Guenter, Suleman and Daniel teach “calculating using said plurality of new scores a new reward [neural network] dataset having a fourth plurality of [neural network] parameters and used for computing a new reward function”.); 

updating at least some of said plurality of neural network parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new sensor output value, while said robotic actuator performs said target task according to said updated neural network dataset; (MNIH: [figs 2, 5]; [pars 16-19] and [par 71] as cited above; see also [par 85]; e.g., “experience data is used in conjunction with the first neural network for training the second neural network” may read on “updating at least some of said plurality of neural network parameters by receiving in each of a plurality of new policy training iterations a new reward value”. Note that Guenter, Suleman and Daniel teach “updating at least some of said plurality of neural network parameters by receiving in each of a plurality of new policy training iterations a new reward value computed by applying said new reward function to a new other world state comprising at least one new [sensor] output value, while said robotic actuator performs said target task according to said updated neural network dataset”.).

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

Regarding claim 13, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

MNIH further teaches 
at least one Q-Learning method is used while updating said at least some of said plurality of neural network parameters. (MNIH: [figs 2, 5]; [pars 16-28] “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action. Additionally or alternatively however the reward/cost may be defined by parameters of the system or engineering problem to be solved. … In a related aspect the invention provides a method of Q-learning wherein Q values are determined by a neural network and used to select actions to be performed on a system to move the system between states, wherein a first neural network is used to generate a Q-value for a target for training a second neural network used to select said actions.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; see also [par 85])

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.
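
For illustration only, the two-network Q-learning arrangement quoted from MNIH [pars 16-28] above, in which a first neural network generates the target used to train a second neural network, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of MNIH: the linear function stands in for a neural network, and the function names, the discount factor `gamma` and the learning rate `lr` are hypothetical values drawn from standard Q-learning practice rather than from the quoted passages.

```python
from typing import List

Weights = List[List[float]]  # one row of weights per action (assumed layout)


def q_values(weights: Weights, state: List[float]) -> List[float]:
    """Linear stand-in for a neural network: one Q-value per action."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def q_learning_target(first_net: Weights, reward: float,
                      next_state: List[float], gamma: float = 0.99,
                      terminal: bool = False) -> float:
    """Target generated by the first network for training the second."""
    if terminal:
        return reward
    return reward + gamma * max(q_values(first_net, next_state))


def update_second_net(second_net: Weights, state: List[float], action: int,
                      target: float, lr: float = 0.1) -> Weights:
    """One gradient step on the squared error for the action taken."""
    error = target - q_values(second_net, state)[action]
    updated = [row[:] for row in second_net]  # leave the input unmodified
    updated[action] = [w + lr * error * s
                       for w, s in zip(second_net[action], state)]
    return updated


# One sampled transition (state, action, reward, next state), assumed values.
first_net = [[1.0, 0.0], [0.0, 2.0]]
second_net = [[0.0, 0.0], [0.0, 0.0]]
target = q_learning_target(first_net, reward=1.0,
                           next_state=[1.0, 1.0], gamma=0.5)
second_net = update_second_net(second_net, state=[1.0, 0.0],
                               action=0, target=target)
```

In this sketch only the second network's parameters change during the update; the first network is held fixed while it supplies targets, which corresponds to the quoted rationale of avoiding a network being updated based on its own predictions.
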

Regarding claim 14, 
Guenter teaches

A system for training a computerized mechanical device's neural network dataset, comprising: 
at least one hardware processor, [executing at least one neural network comprising a plurality of convolutional layers]; (Guenter, [sec 4] “PC equipped with a Pentium 4, 2.4 GHz processor”)

at least one sensor electrically connected to an input of said at least one hardware processor; and (Guenter, [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 3, p. 10] “The robot recorded the trajectories of its 4 DOFs using its own encoders.”)

at least one controller, connected to an output of said at least one hardware processor (Guenter, [sec 2.3] “We consider here that the demonstrations performed by the user are sufficiently informative to allow the robot to reproduce the task in standard conditions.”)

wherein said at least one hardware processor is adapted to: 
receive data documenting a plurality of actions demonstrated by a demonstrating actuator performing a target task in a plurality of initial iterations; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The first step is to define the policy of our system. In order to explore the space of the parameters, we introduce a stochastic Gaussian control policy: 
[Equation image: media_image1.png]
, (15) where ξ˙m is the modulation speed of the dynamical system defined in Eq. (5), Σξ˙ is the covariance matrix (see Eq. (7)) of the Gaussian control policy and ξ˙r is the noisy command generated by ρ(ξ˙r, ξ˙m, Σξ˙) and used to explore the parameters space. We consider here that the demonstrations performed by the user are sufficiently informative to allow the robot to reproduce the task in standard conditions. We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly.”)

calculate using said data a [neural network] dataset having a plurality of [neural network] parameters and used for mimicking said demonstrated plurality of actions in performing said target task; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] as cited above; e.g., parameters/variables related to the GMM model may read on “a [neural network] dataset having a plurality of [neural network] parameters”.)

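gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said [neural network] dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator; (Guenter, [fig 2], [fig 6], [fig 7] “evolution of the reward function”; [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 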
[Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “evaluate the produced trajectory” may read on “score”. In addition, e.g., “reinforcement learning” of fig 2 may read on “reward training iterations”.)

calculate, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning.”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset having a second plurality of [neural network] parameters”.)

compute, through [machine learning], a reward function from said reward [neural network] dataset; (Guenter, [figs 2-3]; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning. … The episodic reward function used to evaluate the produced trajectory is defined as follows: 
[Equation image: media_image2.png]
, (16)”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset”.)

receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor]; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [fig 7] “Reward”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 
[Equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “policy training iterations”.)

update at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration; and (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [fig 7] “Reward”; [sec 2.3] “We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly. The episodic reward function used to evaluate the produced trajectory is defined as follows: 
[Equation image: media_image2.png]
, (16)”; e.g., “learning new means” may read on “policy training iterations”.)

[output] said updated [neural network] dataset. (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] as cited above; e.g., “learning new means µk,ξ˙” may read on “updated [neural network] dataset”.)

Guenter does not teach
at least one hardware processor, [executing at least one neural network comprising a plurality of convolutional layers];

gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said [neural network] dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator;
calculate, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
compute, through [machine learning], a reward function from said reward [neural network] dataset;
receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor];
update at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration; and
[output] said updated [neural network] dataset.

Suleman teaches
calculate using said data a neural network dataset having a plurality of neural network parameters and used for mimicking said demonstrated plurality of actions in performing said target task; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “Since the learner robot would eventually be run via an independent controller responsible to carry on the action performed by the demonstrator robot in the demonstration mode, it must learn the relation between the demonstrator robot controller’s inputs and output, i.e. sensor values and corresponding actuator commands, respectively. Therefore the primary function of this mode is training of a controller, i.e. the learner controller that is logic equivalent to the demonstrator controller logic. The functional depiction of learning mode is given in Fig. 3. The demonstration streams from the demonstration log are fed in to the ANN for training. The training data is formulated such that the sensor values are inputs and the time-wise corresponding actuator values are the output targets. Once the learning is complete the resulting ANN is stored.”; e.g., “learning” may read on “calculating”. Note that Guenter teaches “calculate using said data a [neural network] dataset having a plurality of [neural network] parameters and used for mimicking said demonstrated plurality of actions in performing said target task”.)
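For illustration only, the training-data formulation quoted from Suleman (sensor values as inputs, time-wise corresponding actuator values as output targets) can be sketched as below. The function name, the variable names and the sample values are hypothetical and do not come from Suleman; the sketch shows only how the input/target pairs are formed, not the ANN training itself.

```python
def make_training_pairs(sensor_stream, actuator_stream):
    """Pair each sensor sample with the actuator command logged at the same time step."""
    if len(sensor_stream) != len(actuator_stream):
        raise ValueError("sensor and actuator streams must have the same length")
    # Input = sensor values at step t, target = actuator values at step t.
    return list(zip(sensor_stream, actuator_stream))


# Hypothetical two-joint demonstration log (values are illustrative only).
sensors = [(1.0, 1.0), (0.9, 0.95), (0.8, 0.9)]
actuators = [(0.1, 0.0), (0.1, 0.05), (0.0, 0.05)]
pairs = make_training_pairs(sensors, actuators)
```

Each element of `pairs` is one supervised example that a network of any architecture could be trained on.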

gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said [neural network] dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator”.)

receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor]; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The sensor and actuator streams are grabbed by hacking into the sensor readings and the actuator commands and put in the demonstration log. … The sensor stream thus acquired is then stored in the drive log for evaluation.”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said [neural network] dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor]”.)

update at least some of said plurality of neural network parameters based on said received reward value of each of said plurality of policy training iteration; and (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [tables 2-3] “Training demos.”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; Note that Guenter teaches “update at least some of said plurality of [neural network] parameters based on said received reward value of each of said plurality of policy training iteration”.)

output said updated neural network dataset. (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2.5] “a specific set of demonstrations is trained on a particular network” [table 2] “Training demos.”; [table 3] “Best configurations from experiment 1 expressed in terms of hidden neurons (h) and delay (d).”; [sec 3] “Note that experiment 2 is the simplest of all experiments as it drives a network against the single demonstration with which it was trained. Also note that FF performance degrades as the number of training or drive demonstrations increase, i.e. in the remainder of the experiments..”; e.g., table 3 may read on “output”. Note that Guenter teaches “said updated [neural network] dataset”.)

Guenter is combinable with Suleman for the same rationale as set forth above with respect to claim 1.

Guenter and Suleman do not teach
at least one hardware processor, [executing at least one neural network comprising a plurality of convolutional layers];
gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator;
calculate, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
compute, through [machine learning], a reward function from said reward [neural network] dataset;
receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor];

Daniel teaches
gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator; (Daniel, [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” may read on “a plurality of scores given by an instructor to a plurality of world states”. Note that Guenter and Suleman teach “gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator”.)

compute, through machine learning, a reward function from said reward [neural network] dataset; (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; e.g., “expert ratings” may read on “scores”. In addition, reward model updating may read on “computing, through machine learning, a reward function” since the reward model is updated based on the reward model update module. Note that Guenter teaches “compute, through [machine learning], a reward function from said reward [neural network] dataset”.)

In the alternative, Daniel can also be interpreted to teach the following limitations:
update at least some of said plurality of neural network parameters based on said received reward value of each of said plurality of policy training iteration; and (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; [sec II. A] “The reward model relies on the policy π(ω|s) to provide outcomes in interesting, i.e., high reward regions, and the policy relies on the reward model p(R|o) to guide its exploration towards such regions of interest.” e.g., “Policy update” may read on “update at least some of said plurality of … parameters”. In addition, Table I may read on “based on said received reward value of each of said plurality of policy training iteration” since a reward value in each iteration is calculated based on the latest reward model for updating the policy weights. Note that Suleman teaches “at least some of said plurality of neural network parameters”.)

output said updated neural network dataset. (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; [sec II. A] “The reward model relies on the policy π(ω|s) to provide outcomes in interesting, i.e., high reward regions, and the policy relies on the reward model p(R|o) to guide its exploration towards such regions of interest.”; e.g., “Policy update” may read on “output said updated … dataset” since the updated policy weights are produced (i.e. outputted). Note that Suleman teaches “neural network”.)

Guenter and Suleman are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

Guenter, Suleman and Daniel do not teach
at least one hardware processor, [executing at least one neural network comprising a plurality of convolutional layers];
gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator;
calculate, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters;
compute, through machine learning, a reward function from said reward [neural network] dataset;
receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor];

MNIH teaches
at least one hardware processor, executing at least one neural network comprising a plurality of convolutional layers; (MNIH: [fig 4]; [pars 78-81] “The input to the neural network comprises an 84x84x4 image produced by φ. The first hidden layer convolves 16 8x8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4x4 filters with stride 2, again followed by a rectifier nonlinearity”; Note that Guenter, Suleman and Daniel teach “at least one hardware processor, [executing at least one neural network comprising a plurality of convolutional layers]”.);
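For illustration only, the two convolutional layers described in the quoted MNIH passage can be checked arithmetically. The sketch below assumes unpadded ("valid") convolutions, which the quoted passage does not state; the function and variable names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of an unpadded convolution (assumption: no padding)."""
    return (input_size - kernel_size) // stride + 1


# Input image: 84 x 84 x 4, as quoted from MNIH.
h1 = conv_output_size(84, 8, 4)   # first hidden layer: 16 filters of 8x8, stride 4
h2 = conv_output_size(h1, 4, 2)   # second hidden layer: 32 filters of 4x4, stride 2

shape_after_layer1 = (h1, h1, 16)
shape_after_layer2 = (h2, h2, 32)
```

Under that assumption the feature maps are 20 x 20 x 16 after the first layer and 9 x 9 x 32 after the second, which shows a network with more than one convolutional layer.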

In the alternative, MNIH can also be interpreted to teach the following limitation:
at least one sensor electrically connected to an input of said at least one hardware processor; and (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … The above described techniques may be implemented in software, for example as code running on a digital signal processor (DSP) or parallelised across multiple processors, for example GPUs (graphics processing units), or on a general purpose computer system. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data”);

gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one sensor output value received from said at least one sensor and associated with said performance of said robotic actuator; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “gather, in a plurality of reward training iterations of a robotic actuator performing said target task according to said neural network dataset, a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value received from said at least one [sensor] and associated with said performance of said robotic actuator;”.)

calculate, using said plurality of scores, a reward neural network dataset having a second plurality of neural network parameters (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action. Additionally or alternatively however the reward/cost may be defined by parameters of the system or engineering problem to be solved. … In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. … In some preferred implementations the reward (or cost) is recorded in the experience data store when storing data for a transition.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset having a second plurality of neural network parameters”. Note that Guenter, Suleman and Daniel teach “calculate, using said plurality of scores, a reward [neural network] dataset having a second plurality of [neural network] parameters”.);

compute, through machine learning, a reward function from said reward neural network dataset (MNIH: [figs 2, 5]; [pars 16-19] and [par 71] as cited above; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset”. Note that Guenter, Suleman and Daniel teach “compute, through machine learning, a reward function from said reward [neural network] dataset”.);

receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one sensor output value received from said at least one sensor; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “receive in each of a plurality of policy training iterations a reward value computed by applying said reward function to another world state, while said robotic actuator performs said target task according to said neural network dataset, wherein said another world state comprising at least one [sensor] output value received from said at least one [sensor]”.)

Guenter, Suleman and Daniel are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

Regarding claim 16, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

MNIH further teaches 
said at least one controller controls movement of a vehicle (MNIH: [fig 5]; [par 94] “Further applications of the techniques we describe, which are merely given by way of example, include: robot control (such as bipedal or quadrupedal walking or running, navigation, grasping, and other control skills); vehicle control (autonomous vehicle control, steering control, airborne vehicle control such as helicopter or plane control, autonomous mobile robot control); machine control; control of wired or wireless communication systems; control of laboratory or industrial equipment; control or real or virtual resources (such as memory management, inventory management and the like)”);

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

Regarding claim 17, 
Claim 17 is another method claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. 

In addition, Suleman further teaches 
(Suleman, [figs 1-4], [sec 2, 2.1 and 2.2] “A learner controller can be drive tested against this demonstration by first configuring the robot such the sensor values of both joints used in this action is exactly 1 (normalized). This is because the given demonstration starts with both joints at value 1. Then by letting the learner controller to control the robot for exactly the duration required to record 117 samples length sensor stream. This is because the given demonstration is 117 samples long. The sensor stream thus acquired is then stored in the drive log for evaluation.”)

instructing at least one controller to perform one or more of an identified set of controller actions according to said updated neural network dataset in response to receiving said plurality of sensor output values. (Suleman, [figs 1-4] “Demonstration mode” and “Drive mode on learner robot”, [sec 2, 2.1 and 2.2] “The training data is formulated such that the sensor values are inputs and the time-wise corresponding actuator values are the output targets. Once the learning is complete the resulting ANN is stored. … A learner controller can be drive tested against this demonstration by first configuring the robot such the sensor values of both joints used in this action is exactly 1 (normalized). This is because the given demonstration starts with both joints at value 1. Then by letting the learner controller to control the robot for exactly the duration required to record 117 samples length sensor stream. This is because the given demonstration is 117 samples long. The sensor stream thus acquired is then stored in the drive log for evaluation.”)

Guenter, Suleman, Daniel and MNIH are combinable with Suleman for the same rationale as set forth above with respect to claim 1.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Guenter et al. (Reinforcement Learning for Imitating Constrained Reaching Movements) in view of Suleman et al. (Learning from demonstration in robots: Experimental comparison of neural architectures), further in view of Daniel et al. (Active Reward Learning), further in view of MNIH et al. (US 2015/0100530 A1), further in view of Su et al. (On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems).

Regarding claim 4, 
Guenter, Suleman, Daniel and MNIH teach claim 3.

Daniel further teaches 
each of said plurality of scores is a value selected from the set [consisting of -1 and 1] (Daniel: [sec IV] “During the learning phase, the human expert assigned grasp ratings in the range of ±1000.”).

said predefined range of reward values is from -1 to 1, including -1 and 1 (Daniel: [sec IV, B] “The scheme presented in Fig. 3 labels grasps as failures with a reward of −1, if the object was not lifted at all or slipped when slightly perturbed. Grasps that were stable but did not keep the intended orientation were given a reward of 0. Finally, grasps that lifted the object and kept the orientation of the object were assigned a reward of 1.”).

Guenter, Suleman, Daniel and MNIH are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

In the alternative, MNIH can also be interpreted to teach this limitation:
MNIH further teaches 
(MNIH: [par 85] “However, since the scale of scores varies greatly from game to game, all positive rewards were fixed to be 1 and all negative rewards to be -1, leaving 0 rewards unchanged.”).
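For illustration only, the reward fixing described in the quoted MNIH passage (all positive rewards set to 1, all negative rewards set to -1, zero rewards unchanged) can be sketched as below. The function name and the sample scores are hypothetical and do not come from MNIH.

```python
def clip_reward(score):
    """Map a raw score to 1, -1 or 0 according to its sign."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


# Hypothetical raw scores on very different scales.
raw_scores = [250, -30, 0, 0.5]
clipped = [clip_reward(s) for s in raw_scores]
```

Note that the mapping produces values in the set {-1, 0, 1}, so a zero reward remains possible.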

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

However, Guenter, Suleman, Daniel and MNIH do not teach
each of said plurality of scores is a value selected from the set [consisting of -1 and 1].  

Su teaches 
each of said plurality of scores is a value selected from the set consisting of -1 and 1 (Su: [figs 1-2]; [sec 1] “The proposed framework is then described in §3. This consists of the policy learning algorithm, the creation of the dialogue embedding function and the active reward model trained from real user ratings.”; [sec 3.2] “We pose this as a classification problem where the rating is a binary observation y ∈ {−1, 1} that defines failure or success.”);

Guenter, Suleman, Daniel, MNIH and Su are all in the same field of endeavor of processing input signal with the machine learning system and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter, Suleman, Daniel and MNIH with the rating of Su. Doing so would improve the policy learning algorithm based on the active learning model trained from user ratings (Su, sec 1).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Guenter et al. (Reinforcement Learning for Imitating Constrained Reaching Movements) in view of Suleman et al. (Learning from demonstration in robots: Experimental comparison of neural architectures), further in view of Daniel et al. (Active Reward Learning), further in view of MNIH et al. (US 2015/0100530 A1), further in view of Commons (US 8,775,341 B1).

Regarding claim 11, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Daniel further teaches 
[while calculating said neural network dataset, calculating a revised feature neural network dataset used for] identifying a revised set of features of an environment of said robotic actuator; (Daniel: [figs 2]; [table I]; [sec II] “We use the term ‘context’ instead of the more general term ‘state’ to indicate the initial state of the robot and the environment. … Executing a rollout with control parameters ω results in a trajectory τ ∼ p(τ|s, ω), where the trajectory encodes both, the robot’s state transitions as well as relevant environment state transitions. The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”);

Guenter, Suleman, Daniel and MNIH are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

MNIH further teaches 
while calculating said neural network dataset, calculating a [revised feature] neural network dataset used for identifying a revised set of features of an environment of said robotic actuator; (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; e.g., “second neural network” may read on “said neural network dataset”. Note that Daniel teaches “[while calculating said neural network dataset, calculating a revised feature neural network dataset used for] identifying a revised set of features of an environment of said robotic actuator”.)

wherein calculating said reward neural network dataset further comprises using [said revised feature] neural network dataset; (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; e.g., “first neural network” may read on “said reward neural network dataset”.); 

wherein updating said at least some of said plurality of neural network parameters further comprises using [said revised feature] neural network dataset; (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; e.g., “second neural network” may read on “neural network parameters”.)

Guenter, Suleman, Daniel and MNIH are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

However, Guenter, Suleman, Daniel and MNIH do not teach
while calculating said neural network dataset, calculating a [revised feature] neural network dataset used for identifying a revised set of features of an environment of said robotic actuator;
wherein calculating said reward neural network dataset further comprises using [said revised feature] neural network dataset;
wherein updating said at least some of said plurality of neural network parameters further comprises using [said revised feature] neural network dataset.  

Commons teaches
while calculating said neural network dataset, calculating a revised feature neural network dataset used for identifying a revised set of features of an environment of said robotic actuator ([figs 1-3]; [col 35, ln 18 – col 40, ln 3] “Each of neural networks 20, 22, 24, 26, etc., except for the first in the hierarchical stack, neural network 20, can provide feedback 30,32,34,36,38, 40 to a lower stage/order neural network 20, 22, 24, etc. This feedback adjusts weights in lower stage/order neural networks. Neural networks in the hierarchical stack 20, 22, 24, 26... can send a request 50 for sensory input 60 to feed more information to neural network 20. A neural network can send this request when its input does not provide enough information for it to determine an output.”; e.g., Nm+1 of fig 1 may read on “said neural network dataset”, and Nm+2 of fig 1 may read on “revised feature neural network dataset”.)

wherein calculating said reward neural network dataset further comprises using said revised feature neural network dataset ([figs 1-3]; [col 35, ln 18 – col 40, ln 3] as cited above; e.g., Nm of fig 1 may read on “reward neural network dataset”, and Nm+2 of fig 1 may read on “revised feature neural network dataset”.)

wherein updating said at least some of said plurality of neural network parameters further comprises using said revised feature neural network dataset ([figs 1-3]; [col 35, ln 18– col 40, ln 3] as cited above; e.g., Nm+1 of fig 1 may read on “neural network parameters”, and Nm+2 of fig 1 may read on “revised feature neural network dataset”.).

Guenter, Suleman, Daniel, MNIH and Commons are all in the same field of endeavor of processing input signal with the autonomous system and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter, Suleman, Daniel and MNIH with the feature neural network of Commons. Doing so would lead to allowing computers to solve more complex problems based on the hierarchical stacked neural networks (Commons, [col 13, ln 34 – col 15, ln 14] “Hierarchical stacked computer neural networks (Commons and White, 2006) use Commons (Commons, Trudeau, Stein, Richards, and Krause, 1998) Model of Hierarchical Complexity. They accomplish the following tasks: model human development and learning; reproduce the rich repertoire of behaviors exhibited by humans; allow computers to mimic higher order human cognitive processes and make sophisticated distinctions between stimuli; and allow computers to solve more complex problems.”).

Claims 15 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Guenter et al. (Reinforcement Learning for Imitating Constrained Reaching Movements) in view of Suleman et al. (Learning from demonstration in robots: Experimental comparison of neural architectures), further in view of Daniel et al. (Active Reward Learning), further in view of MNIH et al. (US 2015/0100530 A1), further in view of Suay et al. (Effect of Human Guidance and State Space Size on Interactive Reinforcement Learning).

Regarding claim 15, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Guenter, Suleman, Daniel and MNIH do not teach
said at least one sensor is selected from a group consisting of: a light sensor, a camera, a sound sensor, a microphone, a temperature sensor, a contact sensor, a proximity sensor, a distance sensor, a global positioning sensor, a tilt sensor, a pressure sensor, an acceleration sensor, a gyroscope, an electrical current sensor, and an electrical voltage sensor.

Suay teaches
said at least one sensor is selected from a group consisting of: a light sensor, a camera, a sound sensor, a microphone, a temperature sensor, a contact sensor, a proximity sensor, a distance sensor, a global positioning sensor, a tilt sensor, a pressure sensor, an acceleration sensor, a gyroscope, an electrical current sensor, and an electrical voltage sensor. (Suay, [figs 1-2]; [sec III. B] “In this work, we recreate the interactive reward interface for a real robot by providing the human trainer with a fixed view of the environment via a web-cam (Figure 2a). … In our evaluation, the reward interface functions as in the original web application, in which left clicking anywhere in the image brings up a reward bar (Figure 1a) that the user can fill with a positive or negative reward value representing the desirability of the current state.”; [sec IV, p. 3] “The user observes the robot’s actions via a web-cam (see Figure 2a) and provides feedback to the robot through the graphical user interface.”; Note that Guenter, Suleman and Daniel teach “[presenting on a visual display device to an instructor] the at least one sensor output value”. The examiner notes that this claim recites a Markush-type group and that the group member “camera” is selected for the purpose of this rejection.).

Guenter, Suleman, Daniel, MNIH and Suay are all in the same field of endeavor of processing input signal with the autonomous system and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous system of Guenter, Suleman, Daniel and MNIH with the input and display devices of Suay. Doing so would lead to having the guidance significantly reduce the learning rate, and having its positive effects increase with state space size. (Suay, [sec I] “Our results show that guidance significantly reduces the learning rate, and that its positive effects increase with state space size. We conclude with a discussion of the benefits and challenges of using Interactive Reinforcement Learning in real-world robotic applications.”).

Regarding claim 18, 
Guenter, Suleman, Daniel and MNIH teach claim 1.

Guenter further teaches 
said gathering a plurality of scores comprises: 
receiving a world state comprising at least one [sensor] output value; (Guenter, [fig 2], [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 2.3] “Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “current trajectory” may read on “world state”.)

[presenting on a visual display device to an instructor] the at least one [sensor] output value; (Guenter, [fig 2]; [sec 2.3] “Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “current trajectory” may read on “output value”.)

receiving [from the instructor via an input device] a score given by the [instructor] to the world state. (Guenter, [fig 2]; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory. … we test the distance between the last point of the trajectory ξsT and the target. The task is considered as fulfilled if |ξsT − ξg| < d where d ∈ R represents the maximal distance we want to obtain.”; e.g., “evaluate the produced trajectory” may read on “score”.)

creating a mapping between the world state and the score; (Guenter, [fig 2]; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory. … we test the distance between the last point of the trajectory ξsT and the target. The task is considered as fulfilled if |ξsT − ξg| < d where d ∈ R represents the maximal distance we want to obtain.”; e.g., “evaluate the produced trajectory” may read on “creating a mapping between the world state and the score” since the evaluation (i.e. score) is based on the produced trajectory.)

Daniel further teaches 
receiving from the instructor [via an input device] a score given by the instructor to the world state; (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” may read on “score” as well. Note that Guenter and Suleman teach “receiving [from the instructor via an input device] a score given by the [instructor] to the world state”.)

In the alternative, Daniel can also be interpreted to teach the following limitation:
creating a mapping between the world state and the score (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “We use the term ‘context’ instead of the more general term ‘state’ to indicate the initial state of the robot and the environment. … The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., fig 2 and table I read on “creating a mapping between the world state and the score” since the expert rating (i.e. score) is based on trajectories which are based on context (i.e. world state).);

Guenter and Suleman are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

MNIH further teaches 
receiving a world state comprising at least one sensor output value; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data”; Note that Guenter, Suleman and Daniel teach “receiving a world state comprising at least one [sensor] output value”.)

[presenting on a visual display device to an instructor] the at least one sensor output value; (MNIH: [figs 5]; [pars 21-37] as cited above; Note that Guenter, Suleman and Daniel teach “[presenting on a visual display device to an instructor] the at least one [sensor] output value”.)

Guenter, Suleman and Daniel are combinable with MNIH for the same rationale as set forth above with respect to claim 1.

However, Guenter, Suleman, Daniel and MNIH do not teach
[presenting on a visual display device to an instructor] the at least one sensor output value;
receiving from the instructor [via an input device] a score given by the instructor to the world state.

Suay teaches
presenting on a visual display device to an instructor the at least one sensor output value; (Suay, [figs 1-2]; [sec III. B] “In this work, we recreate the interactive reward interface for a real robot by providing the human trainer with a fixed view of the environment via a web-cam (Figure 2a). … In our evaluation, the reward interface functions as in the original web application, in which left clicking anywhere in the image brings up a reward bar (Figure 1a) that the user can fill with a positive or negative reward value representing the desirability of the current state.”; [sec IV, p. 3] “The user observes the robot’s actions via a web-cam (see Figure 2a) and provides feedback to the robot through the graphical user interface.”; Note that Guenter, Suleman and Daniel teach “[presenting on a visual display device to an instructor] the at least one sensor output value”.)

receiving from the instructor via an input device a score given by the instructor to the world state. (Suay, [figs 1-2]; [sec III. B] “In this work, we recreate the interactive reward interface for a real robot by providing the human trainer with a fixed view of the environment via a web-cam (Figure 2a). … In our evaluation, the reward interface functions as in the original web application, in which left clicking anywhere in the image brings up a reward bar (Figure 1a) that the user can fill with a positive or negative reward value representing the desirability of the current state.”; [sec IV, p. 3] “The user observes the robot’s actions via a web-cam (see Figure 2a) and provides feedback to the robot through the graphical user interface.”; Note that Guenter, Suleman and Daniel teach “receiving from the instructor [via an input device] a score given by the instructor to the world state”.)

Guenter, Suleman, Daniel and MNIH are combinable with Suay for the same rationale as set forth above with respect to claim 15.

Regarding claim 19, 
Guenter teaches
A computer implemented method for computing a reward function, comprising: 
performing by a robotic actuator a target task according to a [neural network] dataset calculated from data documenting a plurality of actions demonstrated by a demonstrating actuator performing said target task in a plurality of initial iterations; (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The first step is to define the policy of our system. In order to explore the space of the parameters, we introduce a stochastic Gaussian control policy: 
    [Equation image: media_image1.png]
, (15) where ξ˙m is the modulation speed of the dynamical system defined in Eq. (5), Σξ˙ is the covariance matrix (see Eq. (7)) of the Gaussian control policy and ξ˙r is the noisy command generated by ρ(ξ˙r, ξ˙m, Σξ˙) and used to explore the parameters space. We consider here that the demonstrations performed by the user are sufficiently informative to allow the robot to reproduce the task in standard conditions. We thus choose to keep the covariance matrix Σξ˙ in order to respect the constraints taught by the demonstrator during the exploration process of the RL module. The parameters of the policy that we want to optimize are the means µk,ξ˙ that represent the means of the GMM model. By learning new means µk,ξ˙, we will be able to generate new trajectories ξ˙mq that will help the dynamical system to avoid the obstacle smoothly.”; e.g., parameters/variables related to GMM may read on “dataset”.)

in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations; (Guenter, [fig 2] “This is done by measuring the distance between the target and the last point of the trajectory, this distance must be smaller than a given value to validate the task. If the task fails, we use a reinforcement learning algorithm which produces a new trajectory ξ˙m in order to correct the modulation and fulfill the task”; [fig 6] “The dynamical system alone would have produced the trajectory represented by the dotted line. … we use a convergence criterion to stop the learning, in order to show the convergence of the reinforcement learning module”; [fig 7] “evolution of the reward function”; [sec 2, p. 3] “ξ consists of the joint angles of the robot arm. … Note that the system described below does not make any assumption on the type of data and, thus, ξ could be composed of other variables, such as, for instance, the position of the robot’s end effector or the data projected in a latent space as done in [14].”; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory. … we test the distance between the last point of the trajectory ξsT and the target. The task is considered as fulfilled if |ξsT − ξg| < d where d ∈ R represents the maximal distance we want to obtain.”; e.g., “evaluate the produced trajectory” may read on “score”. In addition, e.g., “reinforcement learning” of fig 2 may read on “reward training iterations”.)

receiving a world state comprising at least one [sensor] output value; (Guenter, [fig 2]; [sec 2.3] “Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “current trajectory” may read on “world state”.)

[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator; (Guenter, [fig 2]; [sec 2.3] “Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory.”; e.g., “current trajectory” may read on “output value”.)

receiving [from the instructor via an input device] a score given by the [instructor] to the world state in real time during said performance of said robotic actuator; and (Guenter, [fig 2]; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory. … we test the distance between the last point of the trajectory ξsT and the target. The task is considered as fulfilled if |ξsT − ξg| < d where d ∈ R represents the maximal distance we want to obtain.”; e.g., “evaluate the produced trajectory” may read on “score”.)

creating a mapping between the world state and the score; (Guenter, [fig 2]; [sec 2.3] as cited above, and “The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16) where c1 > 0, c2 > 0 ∈ R are weighting constants, ξr is the simulated noisy command used to explore the solution space, ξmt,q=1 is the modulation speed of the first trial (see Eq. (7)) and ξg is the target position. Thus the reward function is determined by a term that represents the similarity between the current trajectory and the original modulation given by the GMM and a term that represents the distance between the target and the last point of the tested trajectory. … we test the distance between the last point of the trajectory ξsT and the target. The task is considered as fulfilled if |ξsT − ξg| < d where d ∈ R represents the maximal distance we want to obtain.”; e.g., “evaluate the produced trajectory” may read on “creating a mapping between the world state and the score” since the evaluation (i.e. score) is based on the produced trajectory.)

calculating, using said plurality of scores, and said plurality of world states a reward [neural network] dataset having a plurality of [neural network] parameters; and (Guenter, [figs 2-3] “Demonstration” and “Reinforcement Learning”; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning.”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset having a plurality of [neural network] parameters”.)

computing, through [machine learning], a reward function from said reward [neural network] dataset. (Guenter, [figs 2-3]; [sec 2.3] “The algorithm we have used for the reinforcement learning module of our system is the episodic natural actor-critic (NAC) algorithm presented in [25, 26]. The NAC is a variant of the actor-critic method for reinforcement learning. … The episodic reward function used to evaluate the produced trajectory is defined as follows: 
    [Equation image: media_image2.png]
, (16)”; e.g., parameters/variables related to the NAC algorithm may read on “reward [neural network] dataset”.)

Guenter does not teach
performing by a robotic actuator a target task according to a [neural network] dataset calculated from data documenting a plurality of actions demonstrated by a demonstrating actuator performing said target task in a plurality of initial iterations;
in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations;
receiving a world state comprising at least one [sensor] output value;
[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator;
receiving [from the instructor via an input device] a score given by the [instructor] to the world state in real time during said performance of said robotic actuator; and
calculating, using said plurality of scores, and said plurality of world states a reward [neural network] dataset having a plurality of [neural network] parameters; and
computing, through [machine learning], a reward function from said reward [neural network] dataset.

Suleman teaches
performing by a robotic actuator a target task according to a neural network dataset calculated from data documenting a plurality of actions demonstrated by a demonstrating actuator performing said target task in a plurality of initial iterations; (Suleman, [figs 1-4] “Learning mode via supervised ANN training”; [sec 2, 2.1 and 2.2] “The demonstration streams from the demonstration log are fed in to the ANN for training. The training data is formulated such that the sensor values are inputs and the time-wise corresponding actuator values are the output targets. Once the learning is complete the resulting ANN is stored.”; Note that Guenter teaches “performing by a robotic actuator a target task according to a [neural network] dataset calculated from data documenting a plurality of actions demonstrated by a demonstrating actuator performing said target task in a plurality of initial iterations”.)

Guenter is combinable with Suleman for the same rationale as set forth above with respect to claim 1.

However, Guenter and Suleman do not teach
in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations;
receiving a world state comprising at least one [sensor] output value;
[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator;
receiving [from the instructor via an input device] a score given by the [instructor] to the world state in real time during said performance of said robotic actuator; and
calculating, using said plurality of scores, and said plurality of world states a reward [neural network] dataset having a plurality of [neural network] parameters; and
computing, through [machine learning], a reward function from said reward [neural network] dataset.

Daniel teaches
in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations; (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” may read on “scores” as well. Note that Guenter and Suleman teach “in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an [instructor] to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations”.);

receiving from the instructor [via an input device] a score given by the instructor to the world state in real time during said performance of said robotic actuator; and (Daniel: [figs 2] “Expert Ratings t ∼ p(t|τ)” and “the proposed active reward learning approach, which shares the policy learning component with vanilla RL but models a probabilistic reward model that gets updated by asking for expert ratings sometimes”; [table I] “Query Expert Reward R+” and “Update reward model p(R|o,D)”; [sec I] “While it may be difficult to analytically design a reward function or to give demonstrations, it is often easy for an expert to rate an agent’s executions of a task. Thus, a promising alternative is to use the human not as an expert in performing the task, but as an expert in evaluating task executions.”; [sec II, pg 2] “The agent starts with an initial control policy π0(ω|s) in iteration 0 and performs a predetermined number of rollouts.”; e.g., “expert ratings” may read on “score” as well. Note that Guenter and Suleman teach “receiving [from the instructor via an input device] a score given by the [instructor] to the world state in real time during said performance of said robotic actuator”.)

computing, through machine learning, a reward function from said reward [neural network] dataset. (Daniel: [figs 2], [table I] and [secs I-II, pg 2] as cited above; e.g., “expert ratings” may read on “scores” as well. In addition, reward model updating may read on “computing, through machine learning, a reward function” since the reward model is updated based on the reward model update module. Note that Guenter and Suleman teach “computing, through [machine learning], a reward function from said reward [neural network] dataset”.)

Guenter and Suleman are combinable with Daniel for the same rationale as set forth above with respect to claim 1.

Guenter, Suleman and Daniel do not teach
in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations;
receiving a world state comprising at least one [sensor] output value;
[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator;
receiving from the instructor [via an input device] a score given by the instructor to the world state in real time during said performance of said robotic actuator; and
calculating, using said plurality of scores, and said plurality of world states a reward [neural network] dataset having a plurality of [neural network] parameters; and
computing, through machine learning, a reward function from said reward [neural network] dataset.

MNIH teaches
in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one sensor output value, by in each of a plurality of reward training iterations; (MNIH: [figs 5]; [pars 21-37] “More generally a state may be defined by sensory information from one or more sensors, or by data captured from or over a computer network, or by real-world data in general and, potentially, by data representing any real or virtual system which may be affected by actions of a software agent. … In a further related aspect the invention provides a control system, the system, comprising: a data input to receive sensor data; a data output to provide action control data;”; Note that Guenter, Suleman and Daniel teach “in real time and while the robotic actuator performs the target task, gathering a plurality of scores given by an instructor to a plurality of world states, each world state comprising at least one [sensor] output value, by in each of a plurality of reward training iterations”.)

receiving a world state comprising at least one sensor output value; (MNIH: [figs 5]; [pars 21-37] as cited above; Note that Guenter, Suleman and Daniel teach “receiving a world state comprising at least one [sensor] output value”.)

[presenting on a visual display device to an instructor] the at least one sensor output value, in real time during said performance of said robotic actuator; (MNIH: [figs 5]; [pars 21-37] as cited above; Note that Guenter, Suleman and Daniel teach “[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator”.)

neural network dataset having a plurality of neural network parameters; and (MNIH: [figs 2, 5]; [pars 16-19] “In embodiments the experience data includes reward (or cost) data relating to a reward (or cost) of the action in moving from a current state to a subsequent state. The reward/cost may be measured from the system, for example by inputting data defining a reward or cost collected/incurred by the action. Additionally or alternatively however the reward/cost may be defined by parameters of the system or engineering problem to be solved. … In embodiments the experience data is used in conjunction with the first neural network for training the second neural network. More particularly a transition comprising a first, starting state, action, and next state is sampled from the stored experience data. … In some preferred implementations the reward (or cost) is recorded in the experience data store when storing data for a transition.”; [par 71] “The stored experience may then still be used to update the first neural network, which in turn generates targets for training the second neural network.”; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset having a plurality of neural network parameters”. Note that Guenter, Suleman, Daniel and Suay teach “calculating, using said plurality of scores, and said plurality of world states a reward [neural network] dataset having a plurality of [neural network] parameters”.);

computing, through machine learning, a reward function from said reward neural network dataset. (MNIH: [figs 2, 5]; [pars 16-19] and [par 71] as cited above; see also [par 85]; e.g., “first neural network” may read on “reward neural network dataset”. Note that Guenter, Suleman, Daniel and Suay teach “computing, through machine learning, a reward function from said reward [neural network] dataset”.);



Suay teaches
presenting on a visual display device to an instructor the at least one sensor output value, in real time during said performance of said robotic actuator; (Suay, [figs 1-2]; [sec III. B] “In this work, we recreate the interactive reward interface for a real robot by providing the human trainer with a fixed view of the environment via a web-cam (Figure 2a). … In our evaluation, the reward interface functions as in the original web application, in which left clicking anywhere in the image brings up a reward bar (Figure 1a) that the user can fill with a positive or negative reward value representing the desirability of the current state.”; [sec IV, p. 3] “The user observes the robot’s actions via a web-cam (see Figure 2a) and provides feedback to the robot through the graphical user interface.”; Note that Guenter, Suleman and Daniel teach “[presenting on a visual display device to an instructor] the at least one [sensor] output value, in real time during said performance of said robotic actuator”.)

receiving from the instructor [via an input device] a score given by the instructor to the world state in real time during said performance of said robotic actuator; and (Suay, [figs 1-2]; [sec III. B] “In this work, we recreate the interactive reward interface for a real robot by providing the human trainer with a fixed view of the environment via a web-cam (Figure 2a). … In our evaluation, the reward interface functions as in the original web application, in which left clicking anywhere in the image brings up a reward bar (Figure 1a) that the user can fill with a positive or negative reward value representing the desirability of the current state.”; [sec IV, p. 3] “The user observes the robot’s actions via a web-cam (see Figure 2a) and provides feedback to the robot through the graphical user interface.”; Note that Guenter, Suleman and Daniel teach “receiving from the instructor [via an input device] a score given by the instructor to the world state in real time during said performance of said robotic actuator”.)

Guenter, Suleman, Daniel and MNIH are combinable with Suay for the same rationale as set forth above with respect to claim 15.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Guenter et al. (Using Reinforcement Learning to Adapt an Imitation Task) teaches reinforcement learning to adapt demonstration tasks.
Hersch et al. (Learning Dynamical System Modulation for Constrained Reaching Tasks) teaches dynamical system modulation based on learning from demonstration.
Rezvani et al. (Towards Trustworthy Automation: User Interfaces that Convey Internal and External Awareness) teaches a user interface for a driving simulation.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM, whose telephone number is (571) 270-7409. The examiner can normally be reached Monday through Thursday, 7:00 AM to 5:00 PM.
Examiner interviews are available via telephone, in person, and by video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.





/S.K./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129