Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
1. 	This action is responsive to application communication filed on 11/30/2020.
2. 	Claims 1-19 are pending in the case. 
3.	Claims 1, 12 and 17 are independent claims.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-10 and 17-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
Claims 1 and 17 each recite a “method” comprising various method steps. None of these steps is tied to any machine, whether explicitly or inherently; therefore, claims 1 and 17 do not satisfy the first prong of the machine-or-transformation test. Additionally, these steps do not transform an article into a different state or thing. Examiner submits that the method steps recite what could be broadly but reasonably construed as software code or a person performing the steps, even mentally. To the extent recited in the claims, the steps are not necessarily stated or claimed to be embodied in hardware.
Accordingly, the claims fail to recite statutory subject matter as defined in 35 U.S.C. § 101.

Dependent claims 2-10, 18 and 19:
Dependent claims 2-10, 18 and 19 are also rejected for failing to resolve the deficiencies of independent claims 1 and 17, respectively.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Levine et al. (hereinafter “Levine”), U.S. Published Application No. 20190232488 A1.
Claim 1:
Levine teaches A method of updating a policy associated with controlling an action of a robot, the method comprising: (e.g., with each learning process iteration, updating a policy associated with controlling physical actions of a robot par. 3; The one 

receiving a plurality of learning datasets generated by a plurality of heterogeneous agents; (e.g., receiving experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes.)

generating a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents; 
 (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy 
and updating the policy associated with controlling the action of the robot based on the weighted learning database to generate an updated policy.  (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)
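As an illustrative sketch only, the three steps recited in claim 1 (receiving per-agent learning datasets, generating a weighted learning database, and updating the policy from that database) can be pictured as follows. Neither the claims nor Levine specify this code. The function names, the tuple layout, the reading of a "weight set" as a per-agent sampling proportion, and the tabular policy are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# Hypothetical learning data item: (state, action, reward).
Transition = Tuple[Hashable, Hashable, float]


def build_weighted_database(
    datasets: Dict[str, List[Transition]],
    weights: Dict[str, float],
    size: int,
    seed: int = 0,
) -> List[Transition]:
    """Sample items from each agent's dataset in proportion to its weight (assumed meaning)."""
    rng = random.Random(seed)
    total = sum(weights.values())
    database: List[Transition] = []
    for agent, data in datasets.items():
        count = round(size * weights[agent] / total)
        # Sampling with replacement lets a small dataset fill a large share.
        database.extend(rng.choices(data, k=count))
    return database


def update_policy(
    policy: Dict[Hashable, Hashable], database: List[Transition]
) -> Dict[Hashable, Hashable]:
    """Toy tabular update: for each state, prefer the action with the highest mean reward."""
    sums = defaultdict(lambda: [0.0, 0])
    for state, action, reward in database:
        entry = sums[(state, action)]
        entry[0] += reward
        entry[1] += 1
    updated = dict(policy)
    best_mean: Dict[Hashable, float] = {}
    for (state, action), (total, count) in sums.items():
        mean = total / count
        if state not in best_mean or mean > best_mean[state]:
            best_mean[state] = mean
            updated[state] = action
    return updated
```

In this sketch an agent with weight 3 contributes three times as many items as an agent with weight 1, so its experience dominates the update.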

Claim 2 depends on claim 1:
Levine teaches wherein a first agent of the plurality of heterogeneous agents is configured to generate a first learning dataset of the plurality of learning datasets such that the first learning dataset includes a plurality of learning data items including a current state, the action, and a reward, (each iteration of experience data is a learning data that include a current state, action and reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state,  the current state including information on a surrounding environment of the first agent measured by the first agent, the action being performed by the first agent for the current state, and the reward being an assessment value of the action.  (e.g., current state indicating environmental objects and reward being an assessment value of the action par. 51; As described herein, in various implementations a neural network may parametrize the action-value functions and policies. In some of those implementations, various state representations may be utilized, as input to the model, in generating output indicative of an action to be implemented based on the policies. The state representations can indicate the state of the robot and optionally the state of one or more environmental objects. As one example, a robot state representation may include joint angles and end-effector positions, as well as their time derivatives. In some implementations, a success signal (e.g., a target position) may be appended to a robot state representation. As described herein, the success signal may be utilized in determining a reward for an 

Claim 3 depends on claim 1:
Levine teaches wherein the plurality of learning datasets include a first learning dataset generated by a first agent of the plurality of heterogeneous agents and a second learning dataset generated by a second agent of the plurality of heterogeneous agents, and the weight sets include a first weight set associated with the first agent and a second weight set associated with the second agent, and the generating the weighted learning database comprises: 
generating at least one first weighted learning data item based on the first learning dataset and the first weight set; 
generating at least one second weighted learning data item based on the second learning dataset and the second weight set; 
and generating the weighted learning database including the first weighted learning data item and the second weighted learning data item. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, Examiner notes that weight of the algorithm can be modified to correspond with data from a particular robot (i.e., varying the weight of algorithm to obtain a first and second weight set) par. 13; Each of the iterations of the iteratively generating can include off-

Claim 4 depends on claim 3:
Levine teaches wherein the generating of the first weighted learning data item comprises: calculating a number of data items corresponding to the first weight set for the first agent; (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.) and generating the first weighted learning data item based on the number of data items and the first learning dataset. (e.g., weighted learning function algorithm using the experience data to update the policy par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the 

Claim 5 depends on claim 1:
Levine teaches wherein the updating the policy comprises: updating the policy such that a reward value for the action of the robot increases.  (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on a reward function for the reinforcement learning policy. Par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold.  Par. 99; For example, the reward function can be composed of two parts: the closeness of the end-effector to the handle, and the measure of how much the door is opened in the right direction. The first part of the reward function depends on the distance between end-effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the quaternion of the handle q and its value when the handle is turned and door is opened q.sub.O.)
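As an illustrative sketch only, the two-part reward function described in the cited par. 99 can be pictured as below. The passage states only that the first part depends on the distance between the end-effector position e and the handle position h, and the second part on the distance between the handle quaternion q and its opened value q.sub.O. The Euclidean distance, the coefficients `c1` and `c2`, and the sign convention are assumptions made for the example, not Levine's stated formula.

```python
import math
from typing import Sequence


def door_reward(
    e: Sequence[float],
    h: Sequence[float],
    q: Sequence[float],
    q_open: Sequence[float],
    c1: float = 1.0,  # hypothetical weighting of the reach term
    c2: float = 1.0,  # hypothetical weighting of the handle-turn term
) -> float:
    """Reward grows (toward zero) as the end effector nears the handle and the handle nears its opened pose."""
    reach_distance = math.dist(e, h)       # first part: end effector vs. handle
    turn_distance = math.dist(q, q_open)   # second part: handle quaternion vs. opened value
    return -(c1 * reach_distance + c2 * turn_distance)
```

Under these assumptions, a policy update that raises the reward corresponds to moving the end effector closer to the handle or turning the handle closer to its opened orientation.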

Claim 6 depends on claim 1:
Levine teaches further comprising: acquiring direct learning data of the robot generated based on the updated policy; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
generating a direct learning database including the direct learning data; (e.g., generating a collected batch of experience data (i.e., direct learning database) including state information par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data.)
and updating the policy based on the direct learning database.  (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode. )
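As an illustrative sketch only, the experience data described in the cited par. 6 (beginning state, action, subsequent state, and an optional reward, collected during an episode into a batch) can be pictured as below. The class name `Experience`, the function `collect_episode`, and the `policy` and `step` callables are hypothetical names introduced for the example and do not appear in Levine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass(frozen=True)
class Experience:
    state: State                    # current/beginning state
    action: Action                  # action executed from that state
    next_state: State               # subsequent state transitioned to
    reward: Optional[float] = None  # par. 6 describes the reward as optional


def collect_episode(
    policy: Callable[[State], Action],
    step: Callable[[State, Action], Tuple[State, float]],
    initial_state: State,
    num_steps: int,
) -> List[Experience]:
    """Run one episode with the current policy and return the collected batch."""
    buffer: List[Experience] = []
    state = initial_state
    for _ in range(num_steps):
        action = policy(state)                    # action chosen by the current policy
        next_state, reward = step(state, action)  # environment transition (assumed interface)
        buffer.append(Experience(state, action, next_state, reward))
        state = next_state
    return buffer
```

The returned list plays the role of the batch of collected experience data from which the policy parameters are subsequently updated.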

Claim 7 depends on claim 6:
Levine teaches wherein the weighted learning database includes the direct learning database such that the updating the policy based on the direct learning database comprises: updating the policy based on the weighted learning database. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; 

Claim 8 depends on claim 6:
Levine teaches wherein the updating the policy based on the direct learning database comprises: updating the policy in response to a set number of items of the direct learning data being generated.  (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)
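As an illustrative sketch only, the claim 8 limitation of updating the policy in response to a set number of direct learning data items being generated can be pictured as a counter-based trigger. The class `BatchTrigger` and its `add` method are hypothetical names introduced for the example; neither the claims nor Levine specify this mechanism.

```python
from typing import Any, List, Optional


class BatchTrigger:
    """Accumulate items and release a batch once a set number has been generated."""

    def __init__(self, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._pending: List[Any] = []

    def add(self, item: Any) -> Optional[List[Any]]:
        """Return the full batch when the set number is reached, otherwise None."""
        self._pending.append(item)
        if len(self._pending) < self.batch_size:
            return None
        batch, self._pending = self._pending, []
        return batch
```

A caller would run a policy update only on the calls where `add` returns a batch, so updates occur once per set number of generated items.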

Claim 9 depends on claim 6:
Levine teaches wherein the updating the policy based on the direct learning database comprises: updating the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system 

Claim 10 depends on claim 6:
Levine teaches wherein the acquiring the direct learning data of the robot based on the updated policy comprises: generating a current state of the robot using at least one sensor associated with the robot; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
controlling the action of the robot using the updated policy; (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for 
calculating a reward for the action of the robot; (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)
and generating the direct learning data including the current state of the robot, the action of the robot, and the reward for the action of the robot. (each iteration of experience data is a learning data that include a current state, action and reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by:  

Claim 11 depends on claim 1:
Levine teaches A non-transitory computer-readable medium comprising computer readable instructions that, when executed by a computer, cause the computer to perform the method of claim 1. (e.g., Figure 7; storage subsystem par. 31; Other implementations may include a non-transitory computer readable storage medium storing instructions executable by one or more processors (e.g., one or more central processing units (CPUs).)

Claim 12:
Levine teaches An electronic device configured to update a policy associated with controlling an action of a robot, the electronic device comprising:  (e.g., device configured  to update a policy associated with controlling physical actions of a robot par. 103; For example, all or aspects of control system 660 may be implemented on one or more computing devices that are in wired and/or wireless communication with the robot 620, such as computing device 710. par. 3; The one or more robots may perform in accordance with each new improved iteration of the 
a memory configured to store a program for updating the action of the robot; (e.g., storage configured to store a program for updating the action of the robot based on current state data par. 107; Storage subsystem 724 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 724 may include the logic to perform selected aspects of the method of FIGS. 3, 4, and/or 5. Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state.)

and a processor configured to execute the program to, receive a plurality of learning datasets generated by a plurality of heterogeneous agents, (e.g., processor 714 of Figure 7 configured to execute the program to, receive experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes.) generate a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents, (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
acquire direct learning data of the robot generated based on the weighted learning database (e.g., acquiring current state sensor data of the robot based on the updated weighted trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).) and the policy associated with controlling the action of the robot, (e.g., update parameters of and update the policy based on at least the direct learning data. (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode.)

Claim 13 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a set number of items of the direct learning data being generated. (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)

Claim 14 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)


Claim 15 depends on claim 1:
Levine teaches wherein the processor is configured to update the policy by, updating the policy such that a reward value for the action of the robot increases. (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on 

Claim 16 depends on claim 12:
Levine teaches wherein the processor is configured to acquire the direct learning data by, generating a current state of the robot using at least one sensor associated with the robot, (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).) controlling the action of the robot using the policy, (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy) calculating a reward for the action of the robot, (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).) and generating the direct learning data including the current state, the action, and the reward for the action of the robot.
(each iteration of experience data is a learning data that include a current state, action and reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current  

Claim 17:
Claim 17 is substantially encompassed in claim 12, therefore, Examiner relies on the same rationale set forth in claim 12 to reject claim 17.

Claim 18 depends on claim 17:
Claim 18 is substantially encompassed in claim 13, therefore, Examiner relies on the same rationale set forth in claim 13 to reject claim 18.

Claim 19 depends on claim 17:
Claim 19 is substantially encompassed in claim 16, therefore, Examiner relies on the same rationale set forth in claim 16 to reject claim 19.



Conclusion

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Adam Queler, can be reached at (571)272-4140. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

HENRY ORR
Primary Examiner
Art Unit 2145



/HENRY ORR/Primary Examiner, Art Unit 2145