DETAILED ACTION
	This is the first office action in response to U.S. application 17/020,294. All claims are pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-4, 6, 8, 11-16 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ide (US 20190272477).
Regarding claim 1, Ide teaches a computer-implemented method (Figs. 5-6) comprising: 
maintaining robot experience data characterizing robot interactions with an environment, the robot experience data comprising a plurality of experiences that each comprise an observation and an action performed by a respective robot in response to the observation ([0041] discusses sensors supplying sensor data to the input control part 31 where [0045]-[0046] discuss the input control including observation variables which are input into the state estimating part which supplies state information and detects the instructed action where the state information and action are interpreted as an experience where because the system operates in a cycle as shown by Figs. 5-6 it would maintain a plurality of experiences); 
obtaining annotation data that assigns, to each experience in a first subset of the experiences in the robot experience data, a respective task-specific reward for a particular task ([0049] discusses the reward estimating part 35 receiving the observation variables and estimating a reward for the action of the apparatus based on the user input where the user input is interpreted to be the annotation data where [0051] further describes the history producing part which maintains a reward history associated with the action history); 
training, on the annotation data, a reward model that receives as input an input observation and generates as output a reward prediction that is a prediction of a task-specific reward for the particular task that should be assigned to the input observation ([0049] “The reward estimating part 35 executes estimation of the reward imparted by the user for the action of the information processing apparatus 10 on the basis of the reward model constructed by a reward model learning part 52 and the observation variables based on the input data”); 
generating task-specific training data for the particular task that associates each of a plurality of experiences with a task-specific reward for the particular task ([0053]-[0054] discuss the learning part 40 which comprises a motion model learning part 51 and a reward model learning part 52 which generate learning data based on the stored experiences), comprising, for each experience in a second subset of the experiences in the robot experience data: 
processing the observation in the experience using the trained reward model to generate a reward prediction ([0054] “The reward model learning part 52 executes learning of the reward model used in the estimation of the reward to be imparted by the user for the action of the information processing apparatus 10 on the basis of the reward history stored in the storage part 39. The reward model learning part 52 supplies the constructed reward model to the reward estimating part 35”), and 
associating the reward prediction with the experience ([0049]-[0054] discuss the experience information being updated from the reward estimating part by being stored in the history producing part 38 where the learning part uses the stored associated rewards and action histories); and 
training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task ([0045]-[0053] and Fig. 1 show how using the learning part 40, a reward is associated with an input action and stored to update the learning models of learning part 40 where the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).
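For illustration only, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Ide or from the instant application: the names Experience, train_reward_model and generate_task_data are hypothetical, the reward model is assumed to be a simple linear model fit by gradient descent, and the toy annotations are invented. The sketch stops at producing the task-specific training data, which would then be supplied to whatever learner trains the policy neural network.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Experience:
    """One stored experience: an observation and the action taken in response."""
    observation: Sequence[float]
    action: Sequence[float]
    reward: Optional[float] = None  # stays None until a reward is associated


def train_reward_model(
    annotated: List[Tuple[Experience, float]], lr: float = 0.1, epochs: int = 500
) -> Callable[[Sequence[float]], float]:
    """Fit r(obs) = w . obs + b to the annotated rewards (assumed linear model)."""
    dim = len(annotated[0][0].observation)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for exp, target in annotated:
            err = sum(wi * oi for wi, oi in zip(w, exp.observation)) + b - target
            w = [wi - lr * err * oi for wi, oi in zip(w, exp.observation)]
            b -= lr * err
    return lambda obs: sum(wi * oi for wi, oi in zip(w, obs)) + b


def generate_task_data(
    reward_model: Callable[[Sequence[float]], float], experiences: List[Experience]
) -> List[Experience]:
    """Associate a predicted task-specific reward with each experience."""
    return [
        Experience(e.observation, e.action, reward_model(e.observation))
        for e in experiences
    ]


# Maintained experience data (toy values; no rewards attached yet).
experience_data = [Experience([x / 10.0], [0.0]) for x in range(11)]

# First subset: experiences that receive annotated rewards (hypothetical labels).
first_subset = [experience_data[i] for i in (0, 5, 10)]
annotation_data = [(e, e.observation[0]) for e in first_subset]

# Train the reward model on the annotation data only.
reward_model = train_reward_model(annotation_data)

# Second subset: experiences labeled by the trained reward model.
second_subset = [experience_data[i] for i in (1, 2, 3, 4, 6, 7, 8, 9)]
task_specific_data = generate_task_data(reward_model, second_subset)
```

In this sketch the stored experiences themselves are left unmodified; the predicted rewards live only in the generated task-specific data.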

Regarding claim 2, Ide teaches controlling a robot while the robot performs the particular task using the trained policy neural network ([0045]-[0053] and Fig. 1 show how using the learning part 40, the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Regarding claim 3, Ide teaches providing data specifying the trained policy neural network for use in controlling a robot while the robot performs the particular task ([0045]-[0053] and Fig. 1 show how using the learning part 40, the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Regarding claim 4, Ide teaches obtaining experiences generated as a result of controlling the robot using the policy neural network to perform the particular task; and adding the experiences to the robot experience data ([0137]-[0138] discuss that, while the system is operating with the reward model, the history producing part 38 correlates the reward and action histories based on user input to produce a motion history).

Regarding claim 6, Ide teaches wherein the experiences in the robot experience data are not associated with any rewards for any of the plurality of different tasks ([0049] “Moreover, the reward estimating part 35 supplies the observation information including the observation variables used in the estimation of the reward, to the buffer 37.” Further, as described above, the reward is not associated with the experience until the history producing part correlates the data for the reward model).

Regarding claim 8, Ide teaches wherein training the policy neural network comprises training the policy neural network using an off-policy reinforcement learning technique ([0144] “In addition, any optional approach is usable for the learning of the motion model and, for example, reinforced learning is used. For example, the parameters of the motion model are learned using a gradient method such that a predicted reward function defined in advance is maximized. Moreover, in the case where the reinforced learning is used, the motion model can be constructed without preparing any large amount of leaning data including the input and the correct solution.” This use of reinforced learning correlates with the instant application’s use of “off-policy reinforcement learning” as described in page 16 line 25 through page 17 line 13.).
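For illustration only, the term "off-policy" can be sketched as learning from stored transitions regardless of which policy produced them. The following is a minimal sketch, assuming tabular Q-learning as a stand-in example; neither Ide nor the cited passage of the instant application is stated here to use this particular technique, and the function name, states, actions and transitions are all hypothetical.

```python
from collections import defaultdict


def q_learning_from_buffer(transitions, actions, alpha=0.5, gamma=0.9, sweeps=50):
    """Tabular Q-learning over a fixed buffer of (s, a, r, s_next, done) tuples.

    The update bootstraps from the best next action, not from the action the
    data-collecting policy actually took, which is what makes it off-policy.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q


# Hypothetical logged experiences: state 2 is terminal and reached from state 1.
actions = ["left", "right"]
buffer = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
    (0, "left", 0.0, 0, False),
    (1, "left", 0.0, 0, False),
]
q_values = q_learning_from_buffer(buffer, actions)

# Greedy policy derived from the learned values.
greedy = {s: max(actions, key=lambda a: q_values[(s, a)]) for s in (0, 1)}
```

The learned values depend only on the contents of the buffer, so the same routine applies whether the transitions came from a user-controlled robot, an earlier policy, or a mixture of both.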

Regarding claim 11, Ide teaches wherein obtaining annotation data comprises: providing, for presentation to a user, a representation of one or more of the experiences in the first subset of experience data; and obtaining, from the user, inputs defining the rewards for the one or more experiences ([0096] “after the action by the information processing apparatus 10 comes to an end, the input control part 31 accepts an input of the response that is a user input to impart a reward for the action for a predetermined time period (hereinafter, referred to as “initial response time period”). The input control part 31 supplies the input data supplied from the input part 11 in the initial response time period, to the reward estimating part 35”).

Regarding claim 12, Ide teaches wherein training the reward model comprises training the reward model to optimize a hinge loss function that measures differences in reward predictions between different experiences from a same task episode ([0144] discusses the reward model using reinforcement learning with a gradient method to maximize the predicted reward function with [0153]-[0155] further discussing imparting a reward for a more proper action which correlates with the instant application’s use of “hinge loss function” as described in page 15 lines 15-18).
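For illustration only, a hinge loss that measures differences in reward predictions between experiences from the same task episode can be sketched as a pairwise ranking loss. This is a minimal sketch under stated assumptions: the loss actually described at page 15 of the instant application is not reproduced in this action, so the pairwise form, the margin parameter and the function name below are hypothetical.

```python
def pairwise_hinge_loss(pred_rewards, annotated_rewards, margin=1.0):
    """Mean hinge loss over pairs (i, j) from one episode.

    For every pair where the annotation ranks experience i above experience j,
    the predicted reward for i should exceed that for j by at least `margin`.
    """
    losses = []
    n = len(pred_rewards)
    for i in range(n):
        for j in range(n):
            if annotated_rewards[i] > annotated_rewards[j]:
                gap = pred_rewards[i] - pred_rewards[j]
                losses.append(max(0.0, margin - gap))
    return sum(losses) / len(losses) if losses else 0.0


# One episode with three experiences (toy values).
annotated = [0.0, 1.0, 2.0]
loss_correct_order = pairwise_hinge_loss([0.0, 1.0, 2.0], annotated)
loss_reversed_order = pairwise_hinge_loss([2.0, 1.0, 0.0], annotated)
```

Predictions that preserve the annotated ordering with the required margin incur no loss, while predictions that reverse the ordering are penalized in proportion to how far each pair is from satisfying the margin.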

Regarding claim 13, Ide teaches a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations (Fig. 1 where [0044] “The information processing part 12 includes, for example, a processor, a storage apparatus, and the like”) comprising: 
maintaining robot experience data characterizing robot interactions with an environment, the robot experience data comprising a plurality of experiences that each comprise an observation and an action performed by a respective robot in response to the observation ([0041] discusses sensors supplying sensor data to the input control part 31 where [0045]-[0046] discuss the input control including observation variables which are input into the state estimating part which supplies state information and detects the instructed action where the state information and action are interpreted as an experience where because the system operates in a cycle as shown by Figs. 5-6 it would maintain a plurality of experiences);
obtaining annotation data that assigns, to each experience in a first subset of the experiences in the robot experience data, a respective task-specific reward for a particular task ([0049] discusses the reward estimating part 35 receiving the observation variables and estimating a reward for the action of the apparatus based on the user input where the user input is interpreted to be the annotation data where [0051] further describes the history producing part which maintains a reward history associated with the action history);
training, on the annotation data, a reward model that receives as input an input observation and generates as output a reward prediction that is a prediction of a task-specific reward for the particular task that should be assigned to the input observation ([0049] “The reward estimating part 35 executes estimation of the reward imparted by the user for the action of the information processing apparatus 10 on the basis of the reward model constructed by a reward model learning part 52 and the observation variables based on the input data”);  
generating task-specific training data for the particular task that associates each of a plurality of experiences with a task-specific reward for the particular task ([0053]-[0054] discuss the learning part 40 which comprises a motion model learning part 51 and a reward model learning part 52 which generate learning data based on the stored experiences), comprising, for each experience in a second subset of the experiences in the robot experience data: 
processing the observation in the experience using the trained reward model to generate a reward prediction ([0054] “The reward model learning part 52 executes learning of the reward model used in the estimation of the reward to be imparted by the user for the action of the information processing apparatus 10 on the basis of the reward history stored in the storage part 39. The reward model learning part 52 supplies the constructed reward model to the reward estimating part 35”), and 
associating the reward prediction with the experience ([0049]-[0054] discuss the experience information being updated from the reward estimating part by being stored in the history producing part 38 where the learning part uses the stored associated rewards and action histories); and 
training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task ([0045]-[0053] and Fig. 1 show how using the learning part 40, a reward is associated with an input action and stored to update the learning models of learning part 40 where the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Regarding claim 14, Ide teaches controlling a robot while the robot performs the particular task using the trained policy neural network ([0045]-[0053] and Fig. 1 show how using the learning part 40, the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Regarding claim 15, Ide teaches providing data specifying the trained policy neural network for use in controlling a robot while the robot performs the particular task ([0045]-[0053] and Fig. 1 show how using the learning part 40, the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Regarding claim 16, Ide teaches obtaining experiences generated as a result of controlling the robot using the policy neural network to perform the particular task; and adding the experiences to the robot experience data ([0137]-[0138] discuss that, while the system is operating with the reward model, the history producing part 38 correlates the reward and action histories based on user input to produce a motion history).

Regarding claim 20, Ide teaches one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations (Fig. 1 where [0044] “The information processing part 12 includes, for example, a processor, a storage apparatus, and the like”) comprising: 
maintaining robot experience data characterizing robot interactions with an environment, the robot experience data comprising a plurality of experiences that each comprise an observation and an action performed by a respective robot in response to the observation ([0041] discusses sensors supplying sensor data to the input control part 31 where [0045]-[0046] discuss the input control including observation variables which are input into the state estimating part which supplies state information and detects the instructed action where the state information and action are interpreted as an experience where because the system operates in a cycle as shown by Figs. 5-6 it would maintain a plurality of experiences);
obtaining annotation data that assigns, to each experience in a first subset of the experiences in the robot experience data, a respective task-specific reward for a particular task ([0049] discusses the reward estimating part 35 receiving the observation variables and estimating a reward for the action of the apparatus based on the user input where the user input is interpreted to be the annotation data where [0051] further describes the history producing part which maintains a reward history associated with the action history);
training, on the annotation data, a reward model that receives as input an input observation and generates as output a reward prediction that is a prediction of a task-specific reward for the particular task that should be assigned to the input observation ([0049] “The reward estimating part 35 executes estimation of the reward imparted by the user for the action of the information processing apparatus 10 on the basis of the reward model constructed by a reward model learning part 52 and the observation variables based on the input data”);  
generating task-specific training data for the particular task that associates each of a plurality of experiences with a task-specific reward for the particular task ([0053]-[0054] discuss the learning part 40 which comprises a motion model learning part 51 and a reward model learning part 52 which generate learning data based on the stored experiences), comprising, for each experience in a second subset of the experiences in the robot experience data: 
processing the observation in the experience using the trained reward model to generate a reward prediction ([0054] “The reward model learning part 52 executes learning of the reward model used in the estimation of the reward to be imparted by the user for the action of the information processing apparatus 10 on the basis of the reward history stored in the storage part 39. The reward model learning part 52 supplies the constructed reward model to the reward estimating part 35”), and 
associating the reward prediction with the experience ([0049]-[0054] discuss the experience information being updated from the reward estimating part by being stored in the history producing part 38 where the learning part uses the stored associated rewards and action histories); and 
training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task ([0045]-[0053] and Fig. 1 show how using the learning part 40, a reward is associated with an input action and stored to update the learning models of learning part 40 where the motion model learning part 51 transmits the motion model data to the motion producing part 33 where the motion control part 34 then uses this data to control the robot where the motion control is interpreted as a control policy where [0150] discusses the reward model as a neural network).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 7, 9-10, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Ide in view of McKinley (US 20200039064).
Regarding claim 5, Ide teaches using stored experience data from any action with a reward model as described above but does not explicitly teach wherein the robot experience data comprises data collected from interactions of a plurality of robots while performing a plurality of different tasks.
McKinley teaches wherein the robot experience data comprises data collected from interactions of a plurality of robots while performing a plurality of different tasks ([0112] discusses the robots being used for a variety of tasks where [0114] further states “Arrayed groups of these robots can be used as ‘arm-farms’ for autonomously collecting data used in Reinforcement Learning.”).
Ide teaches using stored experience data from any action with a reward model. McKinley teaches using multiple robots to amass experience data. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the invention of Ide and modify it with the multiple robots of McKinley, as this allows more data to be collected, and more data yields a more accurate algorithm for controlling the robot.

Regarding claim 7, Ide teaches wherein the second subset of experience data was collected as a result of a robot performing one or more tasks that are different from the particular task ([0137]-[0138] discuss the history producing part storing state, action and reward information to be used by the reward model where [0022] discusses that a reward for any action can be imparted where it is interpreted that the storage would maintain information on different actions).

Regarding claim 9, Ide teaches using stored experience data with a reward model as described above but does not explicitly teach wherein the first subset of experience data comprises demonstration experiences collected as a robot performs one or more episodes of the particular task.
McKinley teaches wherein the first subset of experience data comprises demonstration experiences collected as a robot performs one or more episodes of the particular task ([0070] “All stated tasks can be teleoperated by humans in the short term, while allowing mass collection of data to enable learning from demonstrations and lead towards higher levels of automation”).
Ide teaches using stored experience data with a reward model. McKinley teaches using human demonstrations to teach a robot an action. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the invention of Ide and modify it with the human demonstrations of McKinley, as McKinley teaches that this allows the system to discover the salient details of a human's skill and decision process and to figure out what information about its environment the algorithm may ignore, making the system more accurate (McKinley [0093]).

Regarding claim 10, Ide teaches using stored experience data with a reward model as described above but does not explicitly teach wherein the robot is controlled by a user while performing the one or more episodes of the particular task.
McKinley teaches wherein the robot is controlled by a user while performing the one or more episodes of the particular task ([0070] “All stated tasks can be teleoperated by humans in the short term, while allowing mass collection of data to enable learning from demonstrations and lead towards higher levels of automation”).
Ide teaches using stored experience data with a reward model. McKinley teaches using human demonstrations to teach a robot an action. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the invention of Ide and modify it with the human demonstrations of McKinley, as McKinley teaches that this allows the system to discover the salient details of a human's skill and decision process and to figure out what information about its environment the algorithm may ignore, making the system more accurate (McKinley [0093]).

Regarding claim 17, Ide teaches using stored experience data from any action with a reward model as described above but does not explicitly teach wherein the robot experience data comprises data collected from interactions of a plurality of robots while performing a plurality of different tasks.
McKinley teaches wherein the robot experience data comprises data collected from interactions of a plurality of robots while performing a plurality of different tasks ([0112] discusses the robots being used for a variety of tasks where [0114] further states “Arrayed groups of these robots can be used as ‘arm-farms’ for autonomously collecting data used in Reinforcement Learning.”).
Ide teaches using stored experience data from any action with a reward model. McKinley teaches using multiple robots to amass experience data. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the invention of Ide and modify it with the multiple robots of McKinley, as this allows more data to be collected, and more data yields a more accurate algorithm for controlling the robot.

Regarding claim 18, Ide teaches wherein the second subset of experience data was collected as a result of a robot performing one or more tasks that are different from the particular task ([0137]-[0138] discuss the history producing part storing state, action and reward information to be used by the reward model where [0022] discusses that a reward for any action can be imparted where it is interpreted that the storage would maintain information on different actions).

Regarding claim 19, Ide teaches using stored experience data with a reward model as described above but does not explicitly teach wherein the first subset of experience data comprises demonstration experiences collected as a robot performs one or more episodes of the particular task.
McKinley teaches wherein the first subset of experience data comprises demonstration experiences collected as a robot performs one or more episodes of the particular task ([0070] “All stated tasks can be teleoperated by humans in the short term, while allowing mass collection of data to enable learning from demonstrations and lead towards higher levels of automation”).
Ide teaches using stored experience data with a reward model. McKinley teaches using human demonstrations to teach a robot an action. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the invention of Ide and modify it with the human demonstrations of McKinley, as McKinley teaches that this allows the system to discover the salient details of a human's skill and decision process and to figure out what information about its environment the algorithm may ignore, making the system more accurate (McKinley [0093]).



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Kim (US 20220032450) and Luciw (US 20190061147) teach storing a plurality of experiences that include reward information; Liu (US 20210308863) teaches episode-based reinforcement learning; Porter (US 10766136) teaches using a model to determine a level of success for a robotic task; Wouhaybi (US 20190047149) teaches multiple robots performing subtasks based on reward functions; and Jaderberg (2016, “Reinforcement Learning with Unsupervised Auxiliary Tasks”) teaches deep reinforcement learning that maximizes reward functions using sparse rewards.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIELLE M JACKSON whose telephone number is (303)297-4364. The examiner can normally be reached Monday-Friday 7:00-4:30 MT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abby Lin, can be reached at 571-270-3976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/D.M.J./
Examiner, Art Unit 3664

/ABBY Y LIN/
Supervisory Patent Examiner, Art Unit 3664