Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 7/5/2022 has been entered.
 
DETAILED ACTION
1. 	This action is responsive to applicant’s amendment dated 7/5/2022.
2. 	Claims 1-19 are pending in the case. 
3.	Claim 20 is cancelled.
4.	Claims 1, 12 and 17 are independent claims.

Applicant’s Response
5.	In Applicant’s response dated 7/5/2022, applicant has amended the following:
a) Claims 1, 12 and 17


	
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-9, 11-15, 17 and 18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Claim 1 generally recites accumulating data, weighting the data, and then using the data to update a policy based on the weight. The limitation of generating a weighted learning database, under its broadest reasonable interpretation (BRI), recites an abstract idea (i.e., a judicial exception) because it covers acts that can be performed in the human mind or with the aid of pen and paper, where the weights are based on how similar the data is to the action of the robot. The claim provides no details on how the weighting is done, and under BRI the weighting could simply be weighting one piece of data from each of the four agents.

This judicial exception is not integrated into a practical application because the updating limitation only updates a policy associated with a robot. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because receiving a plurality of learning datasets is merely insignificant pre-solution data gathering and the updating limitation is merely post-solution activity of updating a policy associated with a robot. Therefore, the data gathering and the policy updating do not integrate the recited abstract idea into a practical application. In other words, there is no additional element reflecting an improvement in the functioning of a computer or an improvement to other technology or technical field. The preamble recitation of the memory and processor describes generic computer components and amounts to mere instructions to implement the abstract idea on a computer. Therefore, the additional elements are not sufficient to make the claim patent eligible. Thus, the claim is not patent eligible.
Dependent claims 2-9 and 11 do not recite any additional elements that integrate the judicial exception of base claim 1 into a practical application. Therefore, the claimed invention does not recite additional limitations that amount to significantly more. 


Claim 12 generally recites accumulating data, weighting the data, and then using the data to update a policy based on the weight. The limitation of generating a weighted learning database, under its broadest reasonable interpretation (BRI), recites an abstract idea (i.e., a judicial exception) because it covers acts that can be performed in the human mind or with the aid of pen and paper, where the weights are based on how similar the data is to the action of the robot. The claim provides no details on how the weighting is done, and under BRI the weighting could simply be weighting one piece of data from each of the four agents.
This judicial exception is not integrated into a practical application because the updating limitation only updates a policy associated with a robot. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because receiving a plurality of learning datasets is merely insignificant pre-solution data gathering and the acquiring and updating limitations are merely post-solution activity of acquiring data and updating a policy associated with a robot. Therefore, the data gathering and the policy updating do not integrate the recited abstract idea into a practical application. In other words, there is no additional element reflecting an improvement in the functioning of a computer or an improvement to other technology or technical field. Thus, the claim is not patent eligible.
Applicant’s Specification recites the additional elements of at least one processor and memory. By describing these computer-related components at a high level without details of structure or implementation, Applicant’s Specification indicates that these additional elements were well understood, routine and conventional. (See par. 123; any other device capable of responding to and executing instructions in a defined manner par. 124; in any type of machine, component, physical or virtual equipment, computer storage medium par. 12; they may be of the kind well-known and available to those having skill in the computer software arts.) Furthermore, Applicant’s Specification does not indicate that consideration of these conventional elements as an ordered combination adds any significance beyond the additional elements, as considered individually. Therefore, claim 12 does not recite additional elements that, either individually or as an ordered combination, amount to significantly more than the judicial exception within the meaning of the 2019 Guidance. 2019 Guidance, 84 Fed. Reg. at 52-55; MPEP § 2106.05(d).
Dependent claims 13-15 do not recite any additional elements that integrate the judicial exception of base claim 12 into a practical application. Therefore, the claimed invention does not recite additional limitations that amount to significantly more.



Claim 17 generally recites accumulating data, weighting the data, and then using the data to update a policy based on the weight. The limitation of generating a weighted learning database, under its broadest reasonable interpretation (BRI), recites an abstract idea (i.e., a judicial exception) because it covers acts that can be performed in the human mind or with the aid of pen and paper, where the weights are based on how similar the data is to the action of the robot. The claim provides no details on how the weighting is done, and under BRI the weighting could simply be weighting one piece of data from each of the four agents.
This judicial exception is not integrated into a practical application because the updating limitation only updates a policy associated with a robot. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because receiving a plurality of learning datasets is merely insignificant pre-solution data gathering and the acquiring and updating limitations are merely post-solution activity of acquiring data and updating a policy associated with a robot. Therefore, the data gathering and the policy updating do not integrate the recited abstract idea into a practical application. In other words, there is no additional element reflecting an improvement in the functioning of a computer or an improvement to other technology or technical field. The preamble recitation of the memory and processor describes generic computer components and amounts to mere instructions to implement the abstract idea on a computer. Therefore, the additional elements are not sufficient to make the claim patent eligible. Thus, the claim is not patent eligible.
Dependent claim 18 does not recite any additional elements that integrate the judicial exception of base claim 17 into a practical application. Therefore, the claimed invention does not recite additional limitations that amount to significantly more.



	
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 12 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The term “similar” in claims 1, 12 and 17 is a relative term which renders the claim indefinite. The term “similar” is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-19 are rejected under 35 U.S.C. 103 as being unpatentable over Levine et al. (hereinafter “Levine”), U.S. Patent Application Publication No. 2019/0232488 A1, in view of Ponulak et al. (hereinafter “Ponulak”), U.S. Patent No. 9,008,840 B1.
Claim 1:
Levine teaches A method of updating a policy associated with controlling an action of a robot using an electronic device including a memory and a processor, the method comprising: (e.g., with each learning process iteration, updating a policy associated with controlling physical actions of a robot Figure 7; illustrates memory system and processor par. 3; The one or more robots may perform in accordance with each new improved iteration of the policy/approach for a particular task as the new iterations are passed through to the computing apparatus(es) responsible for controlling the robots' physical actions.)

receiving a plurality of learning datasets generated by a plurality of heterogeneous agents through performance of respective heterogeneous actions; (e.g., receiving experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes. Par. 100; FIG. 6 schematically depicts an example architecture of a robot 640. The robot 640 includes a robot control system 660, one or more operational components 640a-640n, and one or more sensors 642a-642m. The sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.  )

generating, by the processor, a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents; (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)

and updating, by the processor, the policy associated with controlling the action of the robot based on the weighted learning database to generate an updated policy.  (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)






Levine fails to expressly teach 
wherein the plurality of heterogeneous agents includes at least two of a first agent corresponding to human demonstration or teaching, a second agent corresponding to a motion planner that performs motion planning, a third agent corresponding to computer simulation, and a fourth agent corresponding to a method of directly controlling the robot using a controller;
generating, by the processor, the weighted learning database based on the plurality of learning datasets and the weight sets associated with the plurality of heterogeneous agents such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogenous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets. (emphasis added)

However, Ponulak teaches 
wherein the plurality of heterogeneous agents includes at least two of a first agent corresponding to human demonstration or teaching, a second agent corresponding to a motion planner that performs motion planning, a third agent corresponding to computer simulation, and a fourth agent corresponding to a method of directly controlling the robot using a controller; (e.g., human first agent and computerized third agent for implementing reinforcement learning process Col. 28 line 59; During individual trials, an external agent (e.g., a human and/or a computerized agent) may provide reinforcement signal guiding the controller learning.)

such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogenous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets; (e.g., database of tables favoring a learning data set by adjusting the weights associated with the neuron network col. 25 line 57; In one or more implementations, the predictor state may comprise one or more lookup tables (e.g., as described in U.S. patent application Ser. No. 13/842,562 entitled "ADAPTIVE PREDICTOR APPARATUS AND METHODS FOR ROBOTIC CONTROL", incorporated supra), a database comprising one or more tables; and/or a hash-table. Col. 25 line 61; In some implementations of a predictor comprising a spiking neuron network, the association information may comprise one or more network connectivity, neuron state, and/or connection efficacy (e.g., weights). Col. 28 line 59; During individual trials, an external agent (e.g., a human and/or a computerized agent) may provide reinforcement signal guiding the controller learning. Col. 19 line 34; Upon receiving the reinforcement signal 234, the spiking neural network of the robot controller 252, 256 may change its parameters (e.g., neuron connection weights) in order to maximize control policy performance function (e.g., maximize the reward and minimize the punishment).)

In the analogous art of reinforcement learning, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning agents as taught by Levine to include heterogeneous agents while adjusting weights associated with the learning data as taught by Ponulak, in order to provide the benefit of maximizing a control policy performance function during a task.
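
For illustration only, the "weighted learning database" limitation discussed above can be pictured with a minimal Python sketch. Nothing in it is taken from the claims, Levine, or Ponulak: the function names, the agent labels, the (state, action, reward) record layout, and the choice to attach one scalar weight per agent are all hypothetical assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (state, action, reward).
Transition = Tuple[str, str, float]
WeightedItem = Tuple[Transition, float]


def build_weighted_database(
    datasets: Dict[str, List[Transition]],
    weights: Dict[str, float],
) -> List[WeightedItem]:
    """Attach each agent's weight to every transition that agent produced."""
    database: List[WeightedItem] = []
    for agent, transitions in datasets.items():
        # Assumption: an agent with no listed weight contributes with weight 0.
        weight = weights.get(agent, 0.0)
        for transition in transitions:
            database.append((transition, weight))
    return database


def weighted_mean_reward(database: List[WeightedItem]) -> float:
    """Weighted average of the rewards, a stand-in for a weighted update."""
    total_weight = sum(weight for _, weight in database)
    if total_weight == 0:
        return 0.0
    return sum(t[2] * weight for t, weight in database) / total_weight


# Two hypothetical agents; the demonstration data is favored by its weight.
datasets = {
    "human_demonstration": [("s0", "grasp", 1.0)],
    "simulation": [("s0", "grasp", 0.0), ("s1", "push", 0.0)],
}
weights = {"human_demonstration": 0.5, "simulation": 0.25}
database = build_weighted_database(datasets, weights)
```

In this sketch, a larger weight makes one agent's data count for more in whatever quantity is later computed from the database; how the weights themselves would be chosen is left open here.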


Claim 2 depends on claim 1:
Levine teaches wherein a first agent of the plurality of heterogeneous agents is configured to generate a first learning dataset of the plurality of learning datasets such that the first learning dataset includes a plurality of learning data items including a current state, the action, and a reward, (e.g., each instance of experience data is learning data that includes a current state, an action, and a reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.) the current state including information on a surrounding environment of the first agent measured by the first agent, the action being performed by the first agent for the current state, and the reward being an assessment value of the action. (e.g., current state indicating environmental objects and reward being an assessment value of the action par. 51; As described herein, in various implementations a neural network may parametrize the action-value functions and policies. In some of those implementations, various state representations may be utilized, as input to the model, in generating output indicative of an action to be implemented based on the policies. 
The state representations can indicate the state of the robot and optionally the state of one or more environmental objects. As one example, a robot state representation may include joint angles and end-effector positions, as well as their time derivatives. In some implementations, a success signal (e.g., a target position) may be appended to a robot state representation. As described herein, the success signal may be utilized in determining a reward for an action and/or for other purposes. Par. 58; This may continue to be performed iteratively (e.g., at each control cycle of the robot) until the success signal is achieved (e.g., as determined based on a reward satisfying a criteria) and/or other criteria is met.)

Claim 3 depends on claim 1:
Levine teaches wherein the plurality of learning datasets include a first learning dataset generated by a first agent of the plurality of heterogeneous agents and a second learning dataset generated by a second agent of the plurality of heterogeneous agents, and the weight sets include a first weight set associated with the first agent and a second weight set associated with the second agent, and the generating the weighted learning database comprises: 
generating, by the processor, at least one first weighted learning data item based on the first learning dataset and the first weight set; 
generating, by the processor, at least one second weighted learning data item based on the second learning dataset and the second weight set; 
and generating, by the processor, the weighted learning database including the first weighted learning data item and the second weighted learning data item. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, Examiner notes that weight of the algorithm can be modified to correspond with data from a particular robot (i.e., varying the weight of algorithm to obtain a first and second weight set) par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)

Claim 4 depends on claim 3:
Levine teaches wherein the generating of the first weighted learning data item comprises: calculating, by the processor, a number of data items corresponding to the first weight set for the first agent; (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.) and generating, by the processor, the first weighted learning data item based on the number of data items and the first learning dataset. (e.g., weighted learning function algorithm using the experience data to update the policy par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
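
As a reading aid for the claim 4 limitation of calculating "a number of data items corresponding to the first weight set," the following minimal Python sketch shows one way a per-agent weight could be turned into an item count. It is an assumption made for illustration, not the method of the application or of Levine: the function names, the proportional-rounding rule, and the cycling rule for short datasets are hypothetical.

```python
from typing import Dict, List


def items_per_agent(weights: Dict[str, float], batch_size: int) -> Dict[str, int]:
    """Convert per-agent weights into per-agent item counts.

    Assumption: counts are proportional to the weights and rounded to the
    nearest integer, so they may not sum exactly to batch_size.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {agent: round(batch_size * w / total) for agent, w in weights.items()}


def take_items(dataset: List[dict], count: int) -> List[dict]:
    """Take `count` items from a dataset, cycling if the dataset is shorter."""
    if not dataset:
        return []
    return [dataset[i % len(dataset)] for i in range(count)]


# Hypothetical weights for a first and a second agent.
counts = items_per_agent({"first_agent": 0.75, "second_agent": 0.25}, batch_size=8)
first_items = take_items([{"reward": 1.0}, {"reward": 2.0}], counts["first_agent"])
```

Under these assumptions, the first agent contributes six of the eight items and the second agent contributes two.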
Claim 5 depends on claim 1:
Levine teaches wherein the updating, by the processor, the policy comprises: updating the policy such that a reward value for the action of the robot increases.  (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on a reward function for the reinforcement learning policy. Par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold.  Par. 99; For example, the reward function can be composed of two parts: the closeness of the end-effector to the handle, and the measure of how much the door is opened in the right direction. The first part of the reward function depends on the distance between end-effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the quaternion of the handle q and its value when the handle is turned and door is opened q.sub.O.)
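
The two-part reward described in the cited passage (par. 99) can be sketched in Python as follows. This is a hedged illustration only: the passage says the reward depends on two distances but the exact functional form is not reproduced here, so the negative weighted sum, the coefficients `c1` and `c2`, the use of Euclidean distance between quaternion components, and the function name `door_reward` are all assumptions.

```python
import math
from typing import Sequence


def door_reward(
    e: Sequence[float],       # end-effector position
    h: Sequence[float],       # handle position in its neutral state
    q: Sequence[float],       # current handle quaternion
    q_open: Sequence[float],  # handle quaternion when the door is opened
    c1: float = 1.0,          # hypothetical coefficient for the position term
    c2: float = 1.0,          # hypothetical coefficient for the rotation term
) -> float:
    """Reward that grows as both distances shrink (assumed form)."""
    position_term = math.dist(e, h)       # closeness of end-effector to handle
    rotation_term = math.dist(q, q_open)  # how far the handle is from "opened"
    return -(c1 * position_term + c2 * rotation_term)


# The reward is highest (zero) when both distances vanish.
best = door_reward((0, 0, 0), (0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0))
worse = door_reward((3, 4, 0), (0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0))
```

With this assumed form, a policy update that increases the reward corresponds to moving the end-effector toward the handle and the handle toward its opened orientation.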

Claim 6 depends on claim 1:
Levine teaches further comprising: acquiring, by the processor, direct learning data of the robot generated based on the updated policy; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
generating, by the processor, a direct learning database including the direct learning data; (e.g., generating a collected batch of experience data (i.e., direct learning database) including state information par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data.)
and updating, by the processor, the policy based on the direct learning database. (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode.)

Claim 7 depends on claim 6:
Levine teaches wherein the weighted learning database includes the direct learning database such that the updating the policy based on the direct learning database comprises: updating, by the processor, the policy based on the weighted learning database. (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)

Claim 8 depends on claim 6:
Levine teaches wherein the updating, by the processor, the policy based on the direct learning database comprises: updating the policy in response to a set number of items of the direct learning data being generated.  (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)
Claim 9 depends on claim 6:
Levine teaches wherein the updating the policy based on the direct learning database comprises: updating, by the processor, the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)

Claim 10 depends on claim 6:
Levine teaches wherein the acquiring the direct learning data of the robot based on the updated policy comprises: generating, by the processor, a current state of the robot using at least one sensor associated with the robot; (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
controlling, by the processor, the action of the robot using the updated policy; (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)
calculating, by the processor, a reward for the action of the robot; (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)
and generating, by the processor, the direct learning data including the current state of the robot, the action of the robot, and the reward for the action of the robot. (e.g., each instance of experience data is learning data that includes a current state, an action, and a reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.)
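For illustration only, the direct learning data recited in this claim can be pictured as a record combining a current state, an action, and a reward, gathered in the order the claim recites. The sketch below is not taken from Levine or from Applicant's specification; the names Experience and collect_direct_learning_data, and the toy sensor, policy, and reward functions, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Experience:
    """One item of direct learning data: (current state, action, reward)."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float


def collect_direct_learning_data(
    read_sensors: Callable[[], Sequence[float]],
    policy: Callable[[Sequence[float]], Sequence[float]],
    reward_fn: Callable[[Sequence[float], Sequence[float]], float],
    steps: int,
) -> List[Experience]:
    """Hypothetical collection loop mirroring the four recited steps."""
    data: List[Experience] = []
    for _ in range(steps):
        state = read_sensors()             # generate the current state from a sensor
        action = policy(state)             # control the action using the policy
        reward = reward_fn(state, action)  # calculate a reward for the action
        data.append(Experience(state, action, reward))  # generate the learning data
    return data


# Toy stand-ins (assumptions for illustration, not drawn from any reference).
def toy_sensor():
    return [0.5, -0.25]


def toy_policy(state):
    return [-s for s in state]


def toy_reward(state, action):
    return -sum(abs(s + a) for s, a in zip(state, action))


samples = collect_direct_learning_data(toy_sensor, toy_policy, toy_reward, steps=3)
```

Each element of samples then carries the three components the limitation lists, which is the same shape as the experience data instances described in par. 6 of Levine.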
Claim 11 depends on claim 1:
Levine teaches A non-transitory computer-readable medium comprising computer readable instructions that, when executed by a computer, cause the computer to perform the method of claim 1. (e.g., Figure 7; storage subsystem par. 31; Other implementations may include a non-transitory computer readable storage medium storing instructions executable by one or more processors (e.g., one or more central processing units (CPUs).)

Claim 12:
Levine teaches An electronic device configured to update a policy associated with controlling an action of a robot, the electronic device comprising:  (e.g., device configured  to update a policy associated with controlling physical actions of a robot par. 103; For example, all or aspects of control system 660 may be implemented on one or more computing devices that are in wired and/or wireless communication with the robot 620, such as computing device 710. par. 3; The one or more robots may perform in accordance with each new improved iteration of the policy/approach for a particular task as the new iterations are passed through to the computing apparatus(es) responsible for controlling the robots' physical actions.)
a memory configured to store a program for updating the action of the robot; (e.g., storage configured to store a program for updating the action of the robot based on current state data par. 107; Storage subsystem 724 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 724 may include the logic to perform selected aspects of the method of FIGS. 3, 4, and/or 5. Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state.)

and a processor configured to execute the program to, receive a plurality of learning datasets generated by a plurality of heterogeneous agents through performance of respective heterogeneous actions, (e.g., processor 714 of Figure 7 configured to execute the program to, receive experience data (i.e., learning datasets) from robots (i.e., heterogeneous agents) par. 43; For example, various implementations disclosed herein collect experience data from multiple robots that operate asynchronously from one another. Moreover, various implementations utilize the collected experience data in training a policy neural network asynchronously from (but simultaneous with) the operation of the multiple robots. For example, a buffer of the collected experience data from an episode of one of the robots can be utilized to update the policy neural network, and updated policy parameters from the updated policy neural network provided for implementation by one or more of the multiple robots before performance of corresponding next episodes. Par. 100; FIG. 6 schematically depicts an example architecture of a robot 640. The robot 640 includes a robot control system 660, one or more operational components 640a-640n, and one or more sensors 642a-642m. The sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.)
generate a weighted learning database based on the plurality of learning datasets and weight sets associated with the plurality of heterogeneous agents, (e.g., generating a weighted buffer database based on the experience data and weight sets associated with the robots based on learning functions algorithm as shown in the table of par. 50, par. 13; Each of the iterations of the iteratively generating can include off-policy learning in view of a group of one or more of the instances of the robot experience data in the buffer during the generating iteration. For example, the off-policy learning can be Q-learning, such as Q-learning that utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm. Par. 50; Presented below is an overview of one example algorithm for performing asynchronous NAF with N collector threads and one training thread. Par. 90; For example, the system may initialize a target policy network with weight par. 93; For instance, the system may update the weight of the Q network by minimizing the loss)
acquire direct learning data of the robot generated based on the weighted learning database (e.g., acquiring current state sensor data of the robot based on the updated weighted trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
and the policy associated with controlling the action of the robot, (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)
and update the policy based on at least the direct learning data. (e.g., updating policy based on collected experience data par. 6; The collected experience data is generated during the episodes and is used to train the policy network by iteratively updating policy parameters of the policy network based on a batch of collected experience data. Further, prior to performance of each of a plurality of episodes performed by the robots, the current updated policy parameters can be provided (or retrieved) for utilization in performance of the episode.)

Levine fails to expressly teach 
wherein the plurality of heterogeneous agents includes at least two of a first agent corresponding to human demonstration or teaching, a second agent corresponding to a motion planner that performs motion planning, a third agent corresponding to computer simulation, and a fourth agent corresponding to a method of directly controlling the robot using a controller;
generating, by the processor, the weighted learning database based on the plurality of learning datasets and the weight sets associated with the plurality of heterogeneous agents such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogeneous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets. (emphasis added)

However, Ponulak teaches 
wherein the plurality of heterogeneous agents includes at least two of a first agent corresponding to human demonstration or teaching, a second agent corresponding to a motion planner that performs motion planning, a third agent corresponding to computer simulation, and a fourth agent corresponding to a method of directly controlling the robot using a controller; (e.g., human first agent and computerized third agent for implementing reinforcement learning process Col. 28 line 59; During individual trials, an external agent (e.g., a human and/or a computerized agent) may provide reinforcement signal guiding the controller learning.)

such that the weighted learning database favors ones of the plurality of learning datasets associated with ones of the plurality of heterogeneous agents performing actions similar to the action of the robot over other ones of the plurality of learning datasets; (e.g., database of tables favoring a learning data set by adjusting the weights associated with the neuron network col. 25 line 57; In one or more implementations, the predictor state may comprise one or more lookup tables (e.g., as described in U.S. patent application Ser. No. 13/842,562 entitled "ADAPTIVE PREDICTOR APPARATUS AND METHODS FOR ROBOTIC CONTROL", incorporated supra), a database comprising one or more tables; and/or a hash-table. Col. 25 line 61; In some implementations of a predictor comprising a spiking neuron network, the association information may comprise one or more network connectivity, neuron state, and/or connection efficacy (e.g., weights). Col. 28 line 59; During individual trials, an external agent (e.g., a human and/or a computerized agent) may provide reinforcement signal guiding the controller learning. Col. 19 line 34; Upon receiving the reinforcement signal 234, the spiking neural network of the robot controller 252, 256 may change its parameters (e.g., neuron connection weights) in order to maximize control policy performance function (e.g., maximize the reward and minimize the punishment).)

In the analogous art of reinforcement learning, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning agents as taught by Levine to include heterogeneous agents while adjusting weights associated with the learning data as taught by Ponulak to provide the benefit of maximizing control policy performance functions during a task.
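For illustration only, the weighted learning database limitation can be sketched as follows. The claim does not recite how the weights are computed, so the sketch assumes one possible scheme: each agent's weight decreases with the Euclidean distance between that agent's representative action and the action of the robot. This is not taken from Levine, Ponulak, or Applicant's specification; the names similarity_weights and build_weighted_database and the sample agent data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Action = Sequence[float]


def similarity_weights(
    agent_actions: Dict[str, Action], robot_action: Action
) -> Dict[str, float]:
    """Assumed scheme: weight = 1 / (1 + distance), normalized to sum to 1."""
    raw = {
        name: 1.0 / (1.0 + math.dist(action, robot_action))
        for name, action in agent_actions.items()
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def build_weighted_database(
    datasets: Dict[str, List[Action]], weights: Dict[str, float]
) -> List[Tuple[str, Action, float]]:
    """Tag every learning item with the weight of the agent that produced it."""
    return [
        (name, item, weights[name])
        for name, items in datasets.items()
        for item in items
    ]


# Hypothetical data: two heterogeneous agents and the robot's own action.
agent_actions = {"human_demo": [1.0, 0.0], "simulation": [0.0, 1.0]}
robot_action = [1.0, 0.0]
datasets = {
    "human_demo": [[1.0, 0.0], [0.9, 0.1]],
    "simulation": [[0.0, 1.0]],
}

weights = similarity_weights(agent_actions, robot_action)
database = build_weighted_database(datasets, weights)
```

Under this assumed scheme the human demonstration data, whose representative action matches the robot's action, receives a larger weight than the simulation data, so the resulting database favors it.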
Claim 13 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a set number of items of the direct learning data being generated. (e.g., updating policy based on all available experience data being processed par. 96; At block 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on: determining that convergence has been achieved, a threshold quantity of iterations of blocks 558-564 have occurred, all available experience data has been processed, a threshold amount of time has passed, and/or other criteria has been satisfied.)

Claim 14 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy in response to a reward value calculated based on the policy being greater than or equal to a set value. (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)


Claim 15 depends on claim 12:
Levine teaches wherein the processor is configured to update the policy by, updating the policy such that a reward value for the action of the robot increases. (e.g., updating the policy such that a reward value for the action increases to reach predetermined threshold par. 18; The reward for the action can be generated based on a reward function for the reinforcement learning policy. Par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 99; For example, the reward function can be composed of two parts: the closeness of the end-effector to the handle, and the measure of how much the door is opened in the right direction. The first part of the reward function depends on the distance between end-effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the quaternion of the handle q and its value when the handle is turned and door is opened q.sub.O.)
Claim 16 depends on claim 12:
Levine teaches wherein the processor is configured to acquire the direct learning data by, generating a current state of the robot using at least one sensor associated with the robot, (e.g., acquiring current state sensor data of the robot based on the updated trained policy par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. The current state can include the state of the robot (e.g., angles of joints of the robot, position(s) of end effector(s) of the robot, and/or their time derivatives) and/or the current state of one or more components in the robot's environment (e.g., a current state of sensor(s) in the robot's environment, current pose(s) of target object(s) in the robot's environment).)
controlling the action of the robot using the policy, (e.g., update parameters of policy associated with controlling a robotic action based on a weighted buffer to generate an updated policy Par. 5; Implementations disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining a robotic action based on a current state. par. 95; For example, the system may provide updated policy parameters and/or other parameters for use by the robots in upcoming episodes.)
calculating a reward for the action of the robot, (e.g., updating the policy based on thresholds being met by the reward value par. 73; At block 366, the system determines if success or other criteria has been met. For example, the system may determine success if the reward observed at block 362 satisfies a threshold. Par. 74; If the system determines success or other criteria has been met, the system proceeds to block 352 and starts a new episode. It is noted that in the new episode the system can, at block 354 of the new episode, sync the policy parameters with one or more updated parameters that are updated relative to those in the immediately preceding episode (as a result of simultaneous updating of those parameters per method 500 of FIG. 5 and/or other methods).)
and generating the direct learning data including the current state, the action, and the reward for the action of the robot. (e.g., each instance of experience data is learning data that includes a current state, an action, and a reward par. 6; Each instance of experience data can indicate a corresponding: current/beginning state, subsequent state transitioned to from the beginning state, robotic action executed to transition from the beginning state to the subsequent state (where the action is based on application of the beginning state to the policy network and its current policy parameters), and optionally a reward for the action (as determined based on the reward function). Par. 25; In some implementations, the method further includes generating a given exploration of the explorations during the given episode by: applying a current state representation as input to the policy network, the current state representation indicating a current state of at least the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.)
Claim 17:
Claim 17 is substantially encompassed in claim 12, therefore, Examiner relies on the same rationale set forth in claim 12 to reject claim 17.
Claim 18 depends on claim 17:
Claim 18 is substantially encompassed in claim 13, therefore, Examiner relies on the same rationale set forth in claim 13 to reject claim 18.
Claim 19 depends on claim 17:
Claim 19 is substantially encompassed in claim 16, therefore, Examiner relies on the same rationale set forth in claim 16 to reject claim 19.


Response to Arguments

Applicant’s arguments with respect to the previously cited prior art failing to disclose the new limitations have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of the newly applied “Ponulak” reference.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
NPL, Wang teaches wherein the plurality of heterogeneous agents includes at least two of a first agent corresponding to human demonstration or teaching, a second agent corresponding to a motion planner that performs motion planning, a third agent corresponding to computer simulation, and a fourth agent corresponding to a method of directly controlling the robot using a controller; (e.g., second agent as software agents of Figure 2 corresponding to coordination planning (i.e., motion planner) and fourth agent as controller agents that correspond to a method of directly controlling robots see abstract; In this architecture, four software agents form a high-level coordination subsystem while two heterogeneous robots constitute the low-level control subsystem.)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY ORR whose telephone number is (571)270-1308. The examiner can normally be reached 9AM-5PM EST M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Adam Queler can be reached on (571)272-4140. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

HENRY ORR
Primary Examiner
Art Unit 2145



/HENRY ORR/           Primary Examiner, Art Unit 2145