Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1. 	This action is responsive to applicant’s amendment dated 2/16/2022.
2. 	Claims 1-20 are pending in the case. 
3.	Claim 20 is newly added.
4.	Claims 1, 12 and 17 are independent claims.

Applicant’s Response
5.	In Applicant’s response dated 2/16/2022, applicant has amended the following:
a) Claims 1, 3-10, 12 and 17-19
Based on Applicant’s amendments and remarks, the following rejections previously set forth in Office Action dated 11/16/2021 are withdrawn:
a) 35 U.S.C. 101 rejection of claims 1-10 and 17-19


Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 20 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim 20:
Claim 20 recites: “the weighted learning database based on the plurality of learning datasets and the weight sets associated with the plurality of heterogeneous agents such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogenous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets.” (emphasis added)
The original Specification does not mention this newly added limitation. Thus, the limitation includes subject matter that was not described in the original Specification.
	If the examiner has overlooked the portion of the original Specification that describes this feature of the present invention, then Applicant should point it out (by page number and line number) in the response to this Office Action.
	


	
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 20 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The term “similar” in claim 20 is a relative term which renders the claim indefinite. The term “similar” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. 


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Levine et al. (hereinafter “Levine”), U.S. Published Application No. 20190232488 A1.
Claim 1:
Levine teaches a method of updating a policy associated with controlling an action of a robot using an electronic device including a memory and a processor, the method comprising: (e.g., with each learning process iteration, updating a policy associated with controlling physical actions of a robot; Figure 7 illustrates memory system and processor; par. 3; The one or more robots may perform in accordance with each new improved iteration of the policy/approach for a particular task as the new iterations are passed through to the computing apparatus(es) responsible for controlling the robots' physical actions.)

receiving a plurality of learning datasets generated by a plurality of heterogeneous agents through performance of respective heterogeneous actions; (e.g., receiving experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes. Par. 100; FIG. 6 schematically depicts an example architecture of a robot 640. The robot 640 includes a robot control system 660, one or more operational components 640a-640n, and one or more sensors 642a-642m. The sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.  )

generating, by the processor, a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents; (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
and updating, by the processor, the policy associated with controlling the action of the robot based on the weighted learning database to generate an updated policy.  (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)

Claim 2 depends on claim 1:
Levine teaches wherein a first agent of the plurality of heterogeneous agents is configured to generate a first learning dataset of the plurality of learning datasets such that the first learning dataset includes a plurality of learning data items including a current state, the action, and a reward, (e.g., each iteration of experience data is learning data that includes a current state, action and reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.) the current state including information on a surrounding environment of the first agent measured by the first agent, the action being performed by the first agent for the current state, and the reward being an assessment value of the action. (e.g., current state indicating environmental objects and reward being an assessment value of the action par. 51; As described herein, in various implementations a neural network may parametrize the action-value functions and policies. In some of those implementations, various state representations may be utilized, as input to the model, in generating output indicative of an action to be implemented based on the policies. The state representations can indicate the state of the robot and optionally the state of one or more environmental objects. As one example, a robot state representation may include joint angles and end-effector positions, as well as their time derivatives. In some implementations, a success signal (e.g., a target position) may be appended to a robot state representation. As described herein, the success signal may be utilized in determining a reward for an action and/or for other purposes. Par. 58; This may continue to be performed iteratively (e.g., at each control cycle of the robot) until the success signal is achieved (e.g., as determined based on a reward satisfying a criteria) and/or other criteria is met.)

Claim 3 depends on claim 1:
Levine teaches wherein the plurality of learning datasets include a first learning dataset generated by a first agent of the plurality of heterogeneous agents and a second learning dataset generated by a second agent of the plurality of heterogeneous agents, and the weight sets include a first weight set associated with the first agent and a second weight set associated with the second agent, and the generating the weighted learning database comprises: 
generating, by the processor, at least one first weighted learning data item based on the first learning dataset and the first weight set; 
generating, by the processor, at least one second weighted learning data item based on the second learning dataset and the second weight set; 
and generating, by the processor, the weighted learning database including the first weighted learning data item and the second weighted learning data item. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, Examiner notes that weight of the algorithm can be modified to correspond with data from a particular robot (i.e., varying the weight of algorithm to obtain a first and second weight set) par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)

Claim 4 depends on claim 3:
Levine teaches wherein the generating of the first weighted learning data item comprises: calculating, by the processor, a number of data items corresponding to the first weight set for the first agent; (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.) and generating, by the processor, the first weighted learning data item based on the number of data items and the first learning dataset. (e.g., weighted learning function algorithm using the experience data to update the policy par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
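For illustration only, the claim language of claims 3 and 4 (a per-agent weight set, a number of data items calculated from that weight set, and a weighted learning database assembled from the resulting items) can be read as in the following minimal sketch. The sketch is not taken from Levine or from the Specification. It assumes a single scalar weight per agent, proportional allocation of items, and sampling with replacement; the names `LearningItem` and `build_weighted_database` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LearningItem:
    """One learning data item: current state, action, and reward (cf. claim 2)."""
    state: tuple
    action: str
    reward: float


def build_weighted_database(
    datasets: Dict[str, List[LearningItem]],
    weights: Dict[str, float],
    total_items: int,
    seed: int = 0,
) -> List[LearningItem]:
    """Hypothetical reading: each agent contributes items in proportion to its weight."""
    rng = random.Random(seed)
    weight_sum = sum(weights[agent] for agent in datasets)
    database: List[LearningItem] = []
    for agent, items in datasets.items():
        # Number of data items corresponding to this agent's weight (cf. claim 4).
        count = round(total_items * weights[agent] / weight_sum)
        if not items or count == 0:
            continue
        # Assumption: sample with replacement so a small dataset can fill its quota.
        database.extend(rng.choices(items, k=count))
    return database
```

Under these assumptions, an agent with weight 3.0 contributes three times as many items as an agent with weight 1.0, which is one possible way a database could "favor" certain learning datasets.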
Claim 5 depends on claim 1:
Levine teaches wherein the updating, by the processor, the policy comprises: updating the policy such that a reward value for the action of the robot increases.  (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on a reward function for the reinforcement learning policy. Par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold.  Par. 99; For example, the reward function can be composed of two parts: the closeness of the end-effector to the handle, and the measure of how much the door is opened in the right direction. The first part of the reward function depends on the distance between end-effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the quaternion of the handle q and its value when the handle is turned and door is opened q.sub.O.)

Claim 6 depends on claim 1:
Levine teaches further comprising: acquiring, by the processor, direct learning data of the robot generated based on the updated policy; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
generating, by the processor, a direct learning database including the direct learning data; (e.g., generating a collected batch of experience data (i.e., direct learning database) including state information par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data.)
and updating, by the processor,  the policy based on the direct learning database.  (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode. )

Claim 7 depends on claim 6:
Levine teaches wherein the weighted learning database includes the direct learning database such that the updating the policy based on the direct learning database comprises: updating, by the processor, the policy based on the weighted learning database. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)

Claim 8 depends on claim 6:
Levine teaches wherein the updating, by the processor, the policy based on the direct learning database comprises: updating the policy in response to a set number of items of the direct learning data being generated.  (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)
Claim 9 depends on claim 6:
Levine teaches wherein the updating the policy based on the direct learning database comprises: updating, by the processor, the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)
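For illustration only, the update conditions recited in claims 8 and 9 (a set number of direct learning data items being generated, and a reward value being greater than or equal to a set value) can be expressed as a simple predicate. The sketch below is not taken from Levine or from the Specification. Combining the two conditions with a logical OR is an assumption, since the claims recite them separately; the function and parameter names are hypothetical.

```python
from typing import Optional


def should_update_policy(
    num_new_items: int,
    latest_reward: Optional[float],
    item_threshold: int,
    reward_threshold: float,
) -> bool:
    """Hypothetical trigger for a policy update based on direct learning data."""
    # Claim 8 style condition: a set number of direct learning items exists.
    if num_new_items >= item_threshold:
        return True
    # Claim 9 style condition: the reward is greater than or equal to a set value.
    return latest_reward is not None and latest_reward >= reward_threshold
```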

Claim 10 depends on claim 6:
Levine teaches wherein the acquiring the direct learning data of the robot based on the updated policy comprises: generating, by the processor, a current state of the robot using at least one sensor associated with the robot; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
controlling, by the processor, the action of the robot using the updated policy; (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)
calculating, by the processor, a reward for the action of the robot; (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)
and generating, by the processor, the direct learning data including the current state of the robot, the action of the robot, and the reward for the action of the robot. (each iteration of experience data is a learning data that include a current state, action and reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.)  
Claim 11 depends on claim 1:
Levine teaches a non-transitory computer-readable medium comprising computer readable instructions that, when executed by a computer, cause the computer to perform the method of claim 1. (e.g., Figure 7; storage subsystem par. 31; Other implementations may include a non-transitory computer readable storage medium storing instructions executable by one or more processors (e.g., one or more central processing units (CPUs)).)

Claim 12:
Levine teaches an electronic device configured to update a policy associated with controlling an action of a robot, the electronic device comprising: (e.g., device configured to update a policy associated with controlling physical actions of a robot par. 103; For example, all or aspects of control system 660 may be implemented on one or more computing devices that are in wired and/or wireless communication with the robot 620, such as computing device 710. par. 3; The one or more robots may perform in accordance with each new improved iteration of the policy/approach for a particular task as the new iterations are passed through to the computing apparatus(es) responsible for controlling the robots' physical actions.)
a memory configured to store a program for updating the action of the robot; (e.g., storage configured to store a program for updating the action of the robot based on current state data par. 107; Storage subsystem 724 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 724 may include the logic to perform selected aspects of the method of FIGS. 3, 4, and/or 5. Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state.)

and a processor configured to execute the program to, receive a plurality of learning datasets generated by a plurality of heterogeneous agents through performance of respective heterogeneous actions, (e.g., processor 714 of Figure 7 configured to execute the program to, receive experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes. Par. 100; FIG. 6 schematically depicts an example architecture of a robot 640. The robot 640 includes a robot control system 660, one or more operational components 640a-640n, and one or more sensors 642a-642m. The sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.)
generate a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents, (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
acquire direct learning data of the robot generated based on the weighted learning database (e.g. acquiring current state sensor data of the robot based the updated weighted trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).) and the policy associated with controlling the action of the robot, (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.) and update the policy based on at least the direct learning data. (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode. )

Claim 13 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a set number of items of the direct learning data being generated. (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)

Claim 14 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)


Claim 15 depends on claim 1:
Levine teaches wherein the processor is configured to update the policy by, updating the policy such that a reward value for the action of the robot increases. (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on a reward function for the reinforcement learning policy. par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. par. 99; For example, the reward function can be composed of two parts: the closeness of the end-effector to the handle, and the measure of how much the door is opened in the right direction. The first part of the reward function depends on the distance between end-effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the quaternion of the handle q and its value when the handle is turned and door is opened q.sub.O.)

Claim 16 depends on claim 12:
Levine teaches wherein the processor is configured to acquire the direct learning data by, generating a current state of the robot using at least one sensor associated with the robot, (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).) controlling the action of the robot using the policy, (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.) calculating a reward for the action of the robot, (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).) and generating the direct learning data including the current state, the action, and the reward for the action of the robot. (e.g., each instance of experience data is learning data that includes a current state, an action and a reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.)
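For illustration only, the instance of experience data described in Levine par. 6 may be pictured as a record holding a beginning state, an action, a subsequent state, and an optional reward. The sketch below is a minimal example under that reading; the names `Experience` and `make_experience`, the tuple-of-floats representation, and the example reward function are hypothetical and do not appear in Levine.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Experience:
    """One instance of experience data (hypothetical layout)."""
    state: Vector            # current/beginning state
    action: Vector           # robotic action executed from that state
    next_state: Vector       # subsequent state transitioned to
    reward: Optional[float] = None  # optional reward for the action


def make_experience(
    state: Sequence[float],
    action: Sequence[float],
    next_state: Sequence[float],
    reward_fn: Optional[Callable[[Vector, Vector, Vector], float]] = None,
) -> Experience:
    """Bundle one transition; compute the reward only if a function is given."""
    s, a, n = tuple(state), tuple(action), tuple(next_state)
    reward = reward_fn(s, a, n) if reward_fn is not None else None
    return Experience(s, a, n, reward)


# Example reward (an assumption): progress along the first state coordinate.
example = make_experience([0.0], [1.0], [1.0], lambda s, a, n: n[0] - s[0])
```

Leaving `reward` optional mirrors the "optionally a reward for the action" wording of par. 6.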

Claim 17:
Claim 17 is substantially encompassed in claim 12, therefore, Examiner relies on the same rationale set forth in claim 12 to reject claim 17.

Claim 18 depends on claim 17:
Claim 18 is substantially encompassed in claim 13, therefore, Examiner relies on the same rationale set forth in claim 13 to reject claim 18.

Claim 19 depends on claim 17:
Claim 19 is substantially encompassed in claim 16, therefore, Examiner relies on the same rationale set forth in claim 16 to reject claim 19.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Levine as cited above and applied to claim 1, in view of Li et al. (hereinafter “Li”), U.S. Published Application No. 20190250568 A1.

Claim 20 depends on claim 1:
Levine fails to expressly teach wherein the generating the weighted learning database comprises: generating, by the processor, the weighted learning database based on the plurality of learning datasets and the weight sets associated with the plurality of heterogeneous agents such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogenous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets.

However, Li teaches wherein the generating the weighted learning database comprises: generating, by the processor, the weighted learning database based on the plurality of learning datasets and the weight sets associated with the plurality of heterogeneous agents such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogenous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets. (e.g., applying weights to learning datasets (i.e., weighted learning database favors ones of the plurality of learning datasets) for a training iteration to update learning policy associated with robots par. 8; During the training of the learning agent, the first combined signal is used to control, in real-time, the object in the performance of the task. A supervisor coefficient weights the combination of the learning signal and the supervisor signal. During training iterations of the learning agent, training data is accumulated. After this initial training of the learning agent terminates, a pioneer agent is updated to include a learning policy of the trained learning agent. The supervisor coefficient is reduced. The pioneer agent may then be trained based on the training data accumulated during the previous training of the learning agent. The training of the pioneer agent may be further based on a second combined signal. The second combined signal includes a combination of the supervisor signal and a pioneer signal generated by the pioneer agent. The second combined signal is weighted by the reduced supervisor coefficient. After this training of the pioneer agent terminates, the learning agent is updated to include a pioneer policy of the trained pioneer agent. The updated learning agent may then be re-trained, via the reduced supervisor coefficient.)
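For illustration only, the weighting described in Li par. 8 may be pictured as a coefficient that blends a supervisor signal with an agent signal. The quoted passage does not state the form of the combination; the convex blend below, the range check on the coefficient, and the name `combine_signals` are assumptions made for this sketch.

```python
def combine_signals(agent_signal: float, supervisor_signal: float, k: float) -> float:
    """Blend two control signals using a supervisor coefficient k.

    Assumed form (not taken from Li): k in [0, 1], with k = 1 giving the
    supervisor full control and k = 0 giving the agent full control.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("supervisor coefficient must lie in [0, 1]")
    return k * supervisor_signal + (1.0 - k) * agent_signal


# Reducing the coefficient between training phases shifts control to the agent.
first_phase = combine_signals(agent_signal=2.0, supervisor_signal=4.0, k=0.5)
second_phase = combine_signals(agent_signal=2.0, supervisor_signal=4.0, k=0.25)
```

Under this assumed form, the second call weights the agent signal more heavily, which corresponds to the reduced supervisor coefficient described in par. 8.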

In the analogous art of deep learning, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning method as taught by Levine to apply weights as taught by Li to provide the benefit of safe, effective, and successful control agents within acceptable training times (see Li; par. 6).

Response to Arguments
Applicant's arguments filed 2/16/2022 have been fully considered but they are not persuasive. 
Prior Art Rejections
Applicant argues that Levine does not disclose that the robots are heterogenous robots that each perform different actions, and, thus, generate different experience data.

In contrast, Levine simply discloses collecting experience data through the operation of multiple ones of the same type of robot (i.e., homogenous robots). (see Response; page 11)

Examiner respectfully disagrees. 
Levine teaches robots with one or more operational components and one or more sensors (par. 100; FIG. 6 schematically depicts an example architecture of a robot 640. The robot 640 includes a robot control system 660, one or more operational components 640a-640n, and one or more sensors 642a-642m. The sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.). (emphasis added) Examiner submits that robots with the same operational components and sensors may be considered “homogeneous robots” and robots with different operational components and sensors may be considered “heterogenous robots”. Therefore, Levine teaches both homogeneous and heterogenous robots. Examiner further submits that, where a first robot has more operational components and sensors than a second robot, the additional operational components and sensors allow the first robot to perform actions different from those of the second robot, which has fewer operational components and sensors. Thus, Levine’s robots with varying numbers of a variety of operational components and sensors teach or suggest heterogenous robots that each perform different actions and, thus, generate different experience data.

With respect to newly added claim 20, a new ground of rejection has been applied using the “Li et al.” reference. (see office action above)

For at least the foregoing reasons, the claims are not in condition for allowance.


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY ORR whose telephone number is (571)270-1308. The examiner can normally be reached 9AM-5PM EST M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Adam Queler, can be reached on (571)272-4140. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

HENRY ORR
Primary Examiner
Art Unit 2145



/HENRY ORR/           Primary Examiner, Art Unit 2145