DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-7, 10-17, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US20170228662 [hereinafter Gu].
Regarding claim 1, Gu teaches:
A method of training an action selection neural network used to control an agent interacting with an environment to perform a plurality of different tasks, wherein the action selection neural network has a plurality of parameters and is configured to (Gu; 0007-0008; One or more programs can be configured to perform operations or actions… Implementations can include: agent interacting with a real-world environment. The agent may be a robot configured to perform a task in an environment based on actions):
receive network inputs each comprising (i) a goal signal identifying a task from the plurality of tasks that is being performed by the agent and (ii) an observation characterizing a state of the environment (Gu; Claim 11 “for the training observation…”; Experience tuple includes a training observation characterizing a training state of the environment. Obtaining an output action in the set of actions that lie on the continuous domain…),
Examiner notes that obtaining an output action in the set of actions that lie on the continuous domain maps to (i) a goal signal identifying a task from the plurality of tasks that is being performed by the agent. Under the broadest reasonable interpretation, and in light of the specification, a goal signal from the plurality of tasks is interpreted as identifying a specific task from the continuous stream of K tasks being performed (see [36] in the specification).
and process each network input in accordance with the parameters to generate a respective policy output for each network input that defines a control policy for the agent for performing the task identified by the goal signal, the method comprising (Gu; 0009; Processing in accordance with current values of the parameters of the policy subnetwork, the training observation to generate an ideal point in the continuous action space for the training observation… determining an update to the current values of the parameters of policy subnetwork…):
Examiner notes that the generated ideal point in the continuous action space maps to a goal signal under the broadest reasonable interpretation, as previously stated.
obtaining a first trajectory of transitions generated while the agent was performing an episode of the first task from the plurality of tasks, each transition in the first trajectory comprising an initial observation characterizing a state of the environment, an action performed by the agent in response to the observation, a reward received as a result of the agent performing the action, and another observation characterizing a subsequent state of the environment (Gu; 0009; System configured to compute Q values for actions to be performed by an agent interacting with an environment from a continuous action space of actions comprising: obtaining a tuple identifying training observation characterizing a training state of the environment, an action performed by the agent in response to the observation, a reward received as a result of the agent performing the action, a subsequent observation characterizing a subsequent state of the environment.);
Examiner notes that the actions to be performed by an agent interacting with an environment from a continuous action space of actions map to a task from the plurality of tasks.
and training the action selection neural network on the first trajectory to adjust the control policies for the plurality of tasks, comprising, for each transition in the first trajectory (Gu; 0009; Processing the subsequent observation using the value subnetwork to generate a new value estimate for the subsequent state, determining an update to the current values of the parameters of the policy.):
Examiner notes that a person having ordinary skill in the art understands update and adjust to be synonymous in the ways that they are used in this instance.
generating, in accordance with current values of the parameters, respective policy outputs for the initial observation in the transition for each task in a subset of the plurality of tasks that includes the first task and at least one other task (Gu; 0009; Processing, using the policy subnetwork in accordance with current values of the parameters of the policy… training observation to generate an ideal point in the continuous space.);
Examiner notes that Gu processes through ideal points (or actions) within a continuous action space, meaning it goes through specific actions from a set or plurality of possible tasks.
generating respective target policy outputs for each task in the subset using at least the reward in the transition, and (Gu; 0009; Combining the reward and the new value estimate to generate a target Q value for a particular action in the continuous action space.)
determining an update to the current values of the parameters based on, for each task in the subset, a gradient of a loss between the policy output for the task and the target policy output for the task (Gu; 0009; Determining an update to the current values of the parameters of the policy using an error between the Q value for the particular action and the target Q value.).
Examiner notes that in this case, the word “loss” in the claims and “error” in the reference are synonymous in meaning, even if the words themselves are different.
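For illustration only, the target and error computation paraphrased from Gu at 0009 can be sketched as follows. This is a minimal sketch and not Gu's implementation: the discount factor gamma, the squared-error form of the error, and both function names are assumptions that do not appear in the cited passages.

```python
def target_q_value(reward: float, next_value_estimate: float, gamma: float = 0.99) -> float:
    """Combine the reward with the value estimate for the subsequent state.

    gamma is an assumed discount factor; the cited passage only says the
    reward and the new value estimate are combined.
    """
    return reward + gamma * next_value_estimate


def squared_error_and_gradient(q_value: float, target: float):
    """Return an assumed squared-error loss and its derivative w.r.t. q_value."""
    error = q_value - target
    loss = 0.5 * error ** 2
    gradient = error  # d(loss)/d(q_value)
    return loss, gradient


# Example: reward 1.0, next-state value estimate 2.0, assumed gamma 0.5.
example_target = target_q_value(1.0, 2.0, gamma=0.5)
example_loss, example_gradient = squared_error_and_gradient(3.0, example_target)
```

In this sketch the gradient of the error with respect to the Q value is what a parameter update would be propagated through; how that update reaches the network parameters is left out.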
Regarding claim 2, Gu teaches:
The method of claim 1, further comprising: generating the first trajectory of transitions by selecting actions to be performed by the agent while performing the episode of the first task using the action selection neural network and in accordance with the current values of the parameters (Gu; 0022; System selects actions to be performed by the agent interacting with the environment… the system receives observations, with each observation characterizing a current state of the environment…).
Examiner notes that as previously stated, the observations are processed to generate an ideal point in a continuous space of actions (see 0024).
Regarding claim 3, Gu teaches:
The method of claim 1, further comprising: selecting, by each of a plurality of actor computing units that each control a respective instance of the agent, a respective task from the plurality of tasks (Gu; 0008, 0018; Selecting an action performed by the agent in response to the particular observation… action from a set of actions.);
generating, by each of the plurality of actor computing units and in parallel, a respective trajectory of transitions by selecting actions to be performed by the agent while performing an episode of the selected task using the action selection neural network in accordance with the current values of the parameters and while the action selection neural network is conditioned on the goal signal for the selected task (Gu; 0008, 0018, 0077; Selecting an action performed by the agent in response to the particular observation… action from a set of actions… multitasking and parallel processing may be advantageous.); and
Examiner notes that by definition, if it is run in parallel, then there will be a plurality of actor computing units. Further, under the broadest reasonable interpretation, units do not necessarily have to be physical computing parts. Parallel processing is anticipated by Gu.
adding, by each of the plurality of actor computing units, the generated trajectory to a queue of trajectories (Gu; 0008; Replay memory storing experience tuples used to train the network… Tuples include a particular observation, the selection action, and the next observation.).
Examiner notes that the memory storing experience tuples acts as the queue of trajectories. Tuples are mapped to trajectories.
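For illustration only, the data structures named in this limitation (a transition, a trajectory of transitions, and a queue of trajectories) can be sketched as follows. This is a minimal sketch under stated assumptions: the class name Transition, the helper add_trajectory, and the use of Python's standard queue.Queue are hypothetical choices that appear in neither the claims nor Gu.

```python
import queue
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Transition:
    """One transition: observation, action, reward, subsequent observation."""
    observation: Any
    action: Any
    reward: float
    next_observation: Any


# A trajectory is modeled here as an ordered list of transitions.
Trajectory = List[Transition]


def add_trajectory(trajectory_queue: "queue.Queue", trajectory: Trajectory) -> None:
    """Hypothetical helper: an actor adds its generated trajectory to the queue."""
    trajectory_queue.put(trajectory)


# Two hypothetical actors each contribute one single-transition trajectory.
trajectory_queue = queue.Queue()
add_trajectory(trajectory_queue, [Transition("s0", "a0", 1.0, "s1")])
add_trajectory(trajectory_queue, [Transition("s1", "a1", 0.0, "s2")])

# The queue hands trajectories back in first-in, first-out order.
first_out = trajectory_queue.get()
```

queue.Queue is thread-safe, which is why it is used in this sketch for several producers; whether a first-in, first-out queue and a replay memory are equivalent structures is not something the sketch decides.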
Regarding claim 4, Gu teaches:
The method of claim 3, further comprising: generating a batch of trajectories from the trajectories in the queue, wherein the batch includes the first trajectory (Gu; 0008; Subsystem generates rollouts, wherein each rollout is a synthetic experience tuple; and rollouts are added to replay memory.);
Examiner notes that, as previously explained, the tuples are mapped to trajectories. Each rollout, as defined by Gu, is a synthetic experience tuple, and the rollouts are then added to the replay memory, which was previously mapped to the queue of trajectories.
training the action selection neural network on each trajectory in the batch to determine a respective update to the current values of the parameters for each trajectory (Gu; 0009; Training subnetwork to compute Q values for actions performed by agent comprises: obtaining an experience tuple… determining an update to the current values of the parameters of the policy using error between Q value of particular action and target Q value.);
and generating updated values of the parameters from the current values using the updates for the trajectories (Gu; 0008-0009; Rollouts, or tuples are generated… determine an update to the current values of the parameters involves using a generated target Q value.).
Examiner notes that Gu processes through ideal points (or actions) within a continuous action space, meaning it goes through specific actions from a set or plurality of possible tasks.
Regarding claim 5, Gu teaches:
The method of claim 1, wherein generating, in accordance with current values of the parameters, respective policy outputs for the initial observation in the transition for each task in a subset of the plurality of tasks that includes the first task and at least one other task comprises (Gu; 0009; Processing, using the policy subnetwork in accordance with current values of the parameters of the policy… training observation to generate an ideal point in the continuous space.):
Examiner notes that Gu processes through ideal points (or actions) within a continuous action space, meaning it goes through specific actions from a set or plurality of possible tasks.
processing, for each task in the subset, a network input comprising the initial observation in the transition and the goal signal for the task using the action selection neural network and in accordance with the current values of the parameters (Gu; 0009; Computing values for actions to be performed comprises obtaining an experience tuple… processing, using the policy in accordance with current values of the parameters.).
Examiner notes that Gu mentions subsequent observations characterizing a subsequent state of the environment, which inherently means that the original state that was observed is “initial,” under the broadest reasonable interpretation.
Regarding claim 6, Gu teaches:
The method of claim 1, wherein the reward includes a respective pseudo-reward for each task in the subset, and wherein generating respective target policy outputs for each task in the subset using at least the reward in the transition comprises (Gu; 0022; Reinforcement learning system receives a reward. Each reward is a numerical value received from the environment as a consequence of the agent performing an action… the reward will be different depending on the state that the environment transitions into as a result of the agent performing the action.):
Examiner notes that there is a reward calculated/associated with each action an agent may perform. Examiner further notes that a pseudo-reward is a value representing progress towards performing the corresponding task. Gu teaches that the reward and the new value estimate are combined to generate a target Q value for the particular action. This means that the reward is not the full reward for the action, but rather a partial, or in-progress, reward. It is complete after being combined with the new value estimate.
generating the target policy outputs for each task in the subset using the pseudo-reward for the task and not using the pseudo-rewards for any of the other tasks in the subset (Gu; 0009; Generating a Q value for the particular action…).
Examiner notes that the Q value that is used in combination with the partial reward, as previously explained, is generated for a particular action. This inherently means that it is done for one task first, and none of the others.
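For illustration only, the limitation as recited in claim 6 (each task's target uses that task's pseudo-reward and no other task's pseudo-reward) can be sketched as follows. This is a minimal sketch of the claim language, not of Gu: the one-step bootstrapped form, the discount factor gamma, the task names, and the function name per_task_targets are all assumptions.

```python
from typing import Dict


def per_task_targets(
    pseudo_rewards: Dict[str, float],
    next_value_estimates: Dict[str, float],
    gamma: float = 0.99,
) -> Dict[str, float]:
    """Build one target per task from that task's own pseudo-reward only.

    The one-step form (pseudo-reward + gamma * next value) is an assumption
    made for this sketch.
    """
    return {
        task: pseudo_rewards[task] + gamma * next_value_estimates[task]
        for task in pseudo_rewards
    }


# Hypothetical tasks "reach" and "lift" with an assumed gamma of 0.5.
targets = per_task_targets(
    {"reach": 1.0, "lift": 0.0},
    {"reach": 2.0, "lift": 4.0},
    gamma=0.5,
)
```

Changing the pseudo-reward for "lift" in this sketch leaves the target for "reach" unchanged, which is the property the limitation describes.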
Regarding claim 7, Gu teaches:
The method of claim 1, wherein the policy output includes a respective Q value for each action in a set of possible actions that can be performed by the agent, wherein the Q value is an estimate of a return that would be received if the agent performed the action in response to the observation (Gu; 0006; Q value for the particular action that is an estimate of an expected return resulting from the agent performing the particular action when the environment is the current state.).
Regarding claim 10, Gu teaches:
The method of claim 1, wherein the action selection neural network is configured to generate an internal goal-independent representation of the state of the environment and to generate the policy output for the task based on the goal signal identifying the task and the goal-independent representation (Gu; 0051; system for selecting actions to be performed by an agent interacting with an environment from a continuous action space of actions… system comprising a value subnetwork configured to receive an observation characterizing a current state of the environment… policy subnetwork configured to receive the observation, and process the observation to generate an ideal point in the continuous action space.).
Examiner notes that as previously noted, generating an ideal point in the continuous action space is considered selecting an action. Further, the environment in Gu does not take into account the final goal or target. Thus, it is considered a goal-independent environment representation.
Regarding claim 11, Gu teaches all the limitations and motivations of claim 1 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to those elements of claim 11. Claim 11 additionally recites “one or more non-transitory computer readable storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations...” Gu teaches “one or more non-transitory computer storage media encoded with computer program instructions configured to be executed by a plurality of computers…” (Gu; Claim 27).
Regarding claim 12, Gu teaches all the limitations and motivations of claim 2 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 2 applies equally as well to those elements of claim 12. Claim 12 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 13, Gu teaches all the limitations and motivations of claim 3 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 3 applies equally as well to those elements of claim 13. Claim 13 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 14, Gu teaches all the limitations and motivations of claim 4 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 4 applies equally as well to those elements of claim 14. Claim 14 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 15, Gu teaches all the limitations and motivations of claim 5 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 5 applies equally as well to those elements of claim 15. Claim 15 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 16, Gu teaches all the limitations and motivations of claim 6 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 6 applies equally as well to those elements of claim 16. Claim 16 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 17, Gu teaches all the limitations and motivations of claim 7 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 7 applies equally as well to those elements of claim 17. Claim 17 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
Regarding claim 20, Gu teaches all the limitations and motivations of claim 1 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to those elements of claim 20. Claim 20 additionally recites “a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform a plurality of different tasks”. Gu teaches “a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, the system configured to cause the one or more computers to perform a method for training a policy neural network…” (Gu; Claim 21).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 8-9 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Gu in view of Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction Second Edition, in Progress. The MIT Press, 2015 [hereinafter Sutton].
Regarding claim 8, Sutton teaches:
The method of claim 7, wherein generating respective target policy outputs for each task in the subset using at least the reward in the transition comprises: generating respective n-step returns for each task in the subset (Sutton; 6.1 TD Prediction, pages 143-147, 7.1 n-Step TD Prediction, pages 168-172; Rt(n) is the n-step return at time t. T is a variable representing time in this example, but it can also be used to represent specific/incremental tasks.).
It would have been obvious before the effective filing date for a person having ordinary skill in the art to combine the n-step teachings of Sutton with the teachings of Gu because both are in the field of reinforcement learning and n-step returns and calculations, and n-step methods can potentially perform better than either of the previously explored extreme methods (Sutton; page 172).
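For illustration only, the n-step return referenced from Sutton section 7.1 can be sketched as the discounted sum of n rewards plus a discounted bootstrap value. This is a minimal sketch: the function name n_step_return, the argument names, and the example numbers are assumptions chosen for this illustration.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float = 0.99) -> float:
    """Discounted sum of n rewards plus gamma**n times a bootstrap value estimate.

    rewards holds the n rewards observed after time t, in order;
    bootstrap_value is the value estimate for the state reached after n steps.
    """
    total = 0.0
    discount = 1.0
    for reward in rewards:
        total += discount * reward
        discount *= gamma
    return total + discount * bootstrap_value


# Three rewards of 1.0, bootstrap value 10.0, assumed gamma 0.5:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
example_return = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a single reward the function reduces to the one-step target, and with no bootstrap value and all rewards to the end of the episode it reduces to the full return, which are the two extremes that n-step methods sit between.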
Regarding claim 9, Gu teaches: 
The method of claim 8, wherein generating n-step returns for each task in the subset comprises: determining whether the performed action in the transition is the action having the highest Q value in the policy output for the task (Gu; 0028; Action value subsystem determines the advantage estimate in such a way that the action having the highest Q value is always the action represented by the ideal point.);
Gu does not explicitly teach and when the performed action in the transition is not the action having the highest Q value, truncating the n-step return using bootstrapping.
Sutton teaches:
and when the performed action in the transition is not the action having the highest Q value, truncating the n-step return using bootstrapping (Sutton; 7.7 Off-policy without importance sampling: The n-step backup tree algorithm, pages 188-189; If there are no samples for the unselected actions, they are bootstrapped with their estimated values).
Examiner notes that the action in the transition that is not the action having the highest Q value would not have been selected by the system of Gu. Sutton then teaches that unselected actions are bootstrapped.
The motivation for combining the teachings of Sutton and Gu for claim 9 is the same as the motivation previously set forth for claim 8.
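For illustration only, one way to read "truncating the n-step return using bootstrapping" is sketched below: rewards are accumulated until a step whose performed action is not the highest-Q action, at which point the return is cut off and completed with the highest Q value at that step. This is a sketch under stated assumptions and is taken from neither Gu nor Sutton: the tuple layout, the rule that the first step is never truncated, the use of the maximum Q value as the bootstrap, and the function name are all hypothetical.

```python
from typing import Sequence, Tuple

# Assumed layout of one step: (action taken, reward received, Q values at the step's state).
Step = Tuple[int, float, Sequence[float]]


def truncated_n_step_return(
    steps: Sequence[Step],
    final_q_values: Sequence[float],
    gamma: float = 0.99,
) -> float:
    """Accumulate discounted rewards, truncating at the first non-greedy action."""
    total = 0.0
    discount = 1.0
    for index, (action, reward, q_values) in enumerate(steps):
        greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
        # Assumption: the first step's action is always kept; later non-greedy
        # actions end the return with a bootstrap from the highest Q value.
        if index > 0 and action != greedy_action:
            return total + discount * max(q_values)
        total += discount * reward
        discount *= gamma
    # No truncation occurred: bootstrap from the state after the last step.
    return total + discount * max(final_q_values)


# Second step takes action 1 while action 0 has the highest Q value, so the
# return is truncated there: 1.0 + 0.5 * 4.0 = 3.0
example_truncated = truncated_n_step_return(
    [(0, 1.0, [5.0, 1.0]), (1, 1.0, [4.0, 2.0]), (0, 1.0, [3.0, 0.0])],
    final_q_values=[10.0, 0.0],
    gamma=0.5,
)
```

When every performed action is the highest-Q action, the function returns the ordinary n-step return with a bootstrap from the final state.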
Regarding claim 18, Gu in view of Sutton [hereinafter Gu-Sutton] teaches all the limitations and motivations of claim 8 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 8 applies equally as well to those elements of claim 18. Claim 18 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
The motivation for combining the teachings of Sutton and Gu for claim 18 is the same as the motivation previously set forth for claim 8.
Regarding claim 19, Gu-Sutton teaches all the limitations and motivations of claim 9 in apparatus/system form rather than method form. Therefore, the supporting rationale of the rejection to claim 9 applies equally as well to those elements of claim 19. Claim 19 additionally recites “one or more computer readable storage media”. Gu teaches “one or more non-transitory computer storage media” (Gu; Claim 27).
The motivation for combining the teachings of Sutton and Gu for claim 19 is the same as the motivation previously set forth for claim 8.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC WU whose telephone number is (571)272-3380. The examiner can normally be reached Monday-Friday between 9AM and 6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, OMAR FERNANDEZ RIVAS can be reached on (571)272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ERIC C WU/               Examiner, Art Unit 2128
/ERIC NILSSON/               Primary Examiner, Art Unit 2122