DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim does not fall within at least one of the four categories of patent eligible subject matter because the claim is directed to one or more computer readable storage media.  Computer readable storage media include non-statutory embodiments such as signal carriers (see MPEP 2106).  Claim 12 may be amended to be directed to a non-transitory computer readable medium to overcome this rejection.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 10, 12-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Application Publication 2016/0232445 to Srinivasan et al. (Srinivasan) in view of the article “Prioritized Experience Replay,” found at https://arxiv.org/abs/1511.05952v4, published February 2016 by Schaul et al. (Schaul).
Claim 1
With regard to a plurality of actor computing units, each of the actor computing units configured to maintain a respective replica of the action selection neural network and to perform actor operations, Srinivasan teaches a plurality of actors that each have an actor Q network replica (Fig. 2, Actors 220A-N; pars. 36, 37).
With regard to receiving an observation characterizing a current state of an instance of the environment, Srinivasan teaches that each actor receives current observations that characterize the current state of the environment (pars. 15, 23, 37).
With regard to selecting an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, Srinivasan teaches that the actor determines the action to perform in response to the observation (pars. 24, 37).
With regard to obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, Srinivasan teaches obtaining another observation of the environment after the actor has performed the determined action (pars. 25, 37).
With regard to generating an experience tuple from the observation, the selected action, and the transition data, Srinivasan teaches that the actors generate experience tuples that are stored in a central replay memory (par. 37).
With regard to storing the experience tuple in a shared memory that is accessible to each of the actor computing units, Srinivasan teaches storing the experience tuple in a central replay memory (par. 37).
With regard to one or more learner computing units, wherein each of the one or more learner computing units is configured to perform learner operations, Srinivasan teaches learners (Fig. 2, learners 230A-N; par. 38).
With regard to sampling a batch of experience tuples from the shared memory, Srinivasan teaches that the learners select experience tuples from the central replay memory (par. 38).
With regard to determining, using the sampled experience tuples, an update to the network parameters using a reinforcement learning technique, Srinivasan teaches that the learner updates parameters of the Q network in a reinforcement learning training system (par. 38).
Srinivasan does not teach determining a priority for the experience tuple, storing the experience tuple in association with the priority, or that the sampling is biased by the priorities for the experience tuples in the shared memory.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
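For illustration only, the combination discussed above (a shared replay memory whose entries carry a priority and whose sampling is biased by that priority) can be sketched as follows. The sketch follows the proportional form of stochastic prioritization in Schaul, Section 3.3, where the probability of sampling transition i is p_i^alpha / sum_k p_k^alpha. The class name, method names, and default values of alpha and epsilon are assumptions made for this sketch and are not taken from either reference or from the claims.

```python
import random


class PrioritizedReplayMemory:
    """Hypothetical shared memory holding experience tuples with priorities."""

    def __init__(self, alpha=0.6, epsilon=1e-6):
        self.alpha = alpha      # assumed default; 0 would give uniform sampling
        self.epsilon = epsilon  # assumed small constant so no priority is zero
        self.tuples = []
        self.priorities = []

    def add(self, experience, td_error):
        # Actor side: store the tuple in association with its priority.
        self.tuples.append(experience)
        self.priorities.append(abs(td_error) + self.epsilon)

    def sample(self, batch_size, rng=random):
        # Learner side: sampling probability is proportional to p_i ** alpha.
        weights = [p ** self.alpha for p in self.priorities]
        indices = rng.choices(range(len(self.tuples)), weights=weights, k=batch_size)
        return indices, [self.tuples[i] for i in indices]
```

In this sketch any number of actors may call add while one or more learners call sample; synchronization between them is omitted.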
Claim 2
Srinivasan does not teach that determining the priority for the experience tuple comprises: determining a learning error for the selected action according to the reinforcement learning technique; and determining the priority from the learning error.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 3
Srinivasan does not teach that the priority is an absolute value of the learning error.  Schaul teaches using a magnitude of the TD error (Section 3.2).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include using a magnitude of the TD error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
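For illustration only, the relationship between a learning error and a priority equal to its magnitude can be sketched as follows. The sketch assumes the standard one-step Q-learning TD error (reward plus discounted maximum next-state value, minus the current estimate); the function and parameter names are hypothetical and do not appear in the references.

```python
def td_error(reward, discount, q_next_max, q_current):
    """One-step TD error, assuming a standard Q-learning target."""
    return reward + discount * q_next_max - q_current


def priority_from_error(error):
    """Priority taken as the absolute value (magnitude) of the learning error."""
    return abs(error)
```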
Claim 4
Srinivasan teaches that two or more of the actor computing units select actions using different exploration policies (pars. 44, 45, 50, 53).
Claim 5
Srinivasan teaches that the different exploration policies are epsilon-greedy policies with different values of epsilon (par. 45).
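For illustration only, actors that use epsilon-greedy exploration policies with different values of epsilon can be sketched as follows. The function name, the list of epsilon values, and the use of a per-actor random number generator are assumptions made for this sketch and are not taken from Srinivasan.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical per-actor exploration settings: one epsilon value per actor.
actor_epsilons = [0.0, 0.1, 0.5]
actor_rngs = [random.Random(seed) for seed in range(len(actor_epsilons))]


def select_actions(q_values):
    """Each actor selects its own action for the same Q-value estimates."""
    return [epsilon_greedy(q_values, eps, rng)
            for eps, rng in zip(actor_epsilons, actor_rngs)]
```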
Claim 6
Srinivasan does not teach determining for each sampled experience tuple a respective updated priority; and updating the shared memory to associate the updated priorities with the sampled experience tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1). 
Claim 7
Srinivasan does not teach determining whether criteria for removing any experience tuples from the shared memory are satisfied; and when the criteria are satisfied, updating the shared memory to remove one or more of the tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 10
Srinivasan teaches determining whether criteria for updating the actor computing units are satisfied; and when the criteria are satisfied, transmitting updated parameter values to the actor computing units (Fig. 5, Determine whether to accept or discard gradient; pars. 64-66).
Claim 12
With regard to maintaining a plurality of actor computing units, each of the actor computing units configured to maintain a respective replica of the action selection neural network, Srinivasan teaches a plurality of actors that each have an actor Q network replica (Fig. 2, Actors 220A-N; pars. 36, 37).
With regard to for each of the plurality of actor computing units: receiving an observation characterizing a current state of an instance of the environment, Srinivasan teaches that each actor receives current observations that characterize the current state of the environment (pars. 15, 23, 37).
With regard to selecting, using the actor computing unit, an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, Srinivasan teaches that the actor determines the action to perform in response to the observation (pars. 24, 37).
With regard to obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, Srinivasan teaches obtaining another observation of the environment after the actor has performed the determined action (pars. 25, 37).
With regard to generating an experience tuple from the observation, the selected action, and the transition data, Srinivasan teaches that the actors generate experience tuples that are stored in a central replay memory (par. 37).
With regard to storing the experience tuple in a shared memory that is accessible to each of the plurality of actor computing units, Srinivasan teaches storing the experience tuple in a central replay memory (par. 37).
With regard to maintaining one or more learner computing units, Srinivasan teaches learners (Fig. 2, learners 230A-N; par. 38).
With regard to for each of the one or more learner computing units: sampling, using the learner computing unit, a batch of experience tuples from the shared memory, Srinivasan teaches that the learners select experience tuples from the central replay memory (par. 38).
With regard to determining, using the sampled experience tuples, an update to the network parameters using a reinforcement learning technique, Srinivasan teaches that the learner updates parameters of the Q network in a reinforcement learning training system (par. 38).
Srinivasan does not teach determining a priority for the experience tuple, storing the experience tuple in association with the priority, or that the sampling is biased by the priorities for the experience tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 13
With regard to maintaining a plurality of actor computing units, each of the actor computing units configured to maintain a respective replica of the action selection neural network, Srinivasan teaches a plurality of actors that each have an actor Q network replica (Fig. 2, Actors 220A-N; pars. 36, 37).
With regard to for each of the plurality of actor computing units: receiving an observation characterizing a current state of an instance of the environment, Srinivasan teaches that each actor receives current observations that characterize the current state of the environment (pars. 15, 23, 37).
With regard to selecting, using the actor computing unit, an action to be performed by the agent using the action selection neural network replica and in accordance with current values of the network parameters, Srinivasan teaches that the actor determines the action to perform in response to the observation (pars. 24, 37).
With regard to obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action, Srinivasan teaches obtaining another observation of the environment after the actor has performed the determined action (pars. 25, 37).
With regard to generating an experience tuple from the observation, the selected action, and the transition data, Srinivasan teaches that the actors generate experience tuples that are stored in a central replay memory (par. 37).
With regard to storing the experience tuple in a shared memory that is accessible to each of the plurality of actor computing units, Srinivasan teaches storing the experience tuple in a central replay memory (par. 37).
With regard to maintaining one or more learner computing units, Srinivasan teaches learners (Fig. 2, learners 230A-N; par. 38).
With regard to for each of the one or more learner computing units: sampling, using the learner computing unit, a batch of experience tuples from the shared memory, Srinivasan teaches that the learners select experience tuples from the central replay memory (par. 38).
With regard to determining, using the sampled experience tuples, an update to the network parameters using a reinforcement learning technique, Srinivasan teaches that the learner updates parameters of the Q network in a reinforcement learning training system (par. 38).
Srinivasan does not teach determining a priority for the experience tuple, storing the experience tuple in association with the priority, or that the sampling is biased by the priorities for the experience tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 14
Srinivasan does not teach determining a learning error for the selected action according to the reinforcement learning technique; and determining the priority from the learning error.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 15
Srinivasan does not teach determining for each sampled experience tuple a respective updated priority; and updating, using the learner computing unit, the shared memory to associate the updated priorities with the sampled experience tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 16
Srinivasan does not teach determining whether criteria for removing any experience tuples from the shared memory are satisfied; and when the criteria are satisfied, updating, using the learner computing unit, the shared memory to remove one or more of the tuples.  Schaul teaches prioritizing a replay using a transition’s TD error (page 3, Section 3.2 Prioritizing TD-Error; Section 3.3 Stochastic Prioritization).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training, as taught by Srinivasan, to include prioritizing transitions in replay with TD-error, as taught by Schaul, because then experience replay would have been more efficient (Schaul, page 1, Section 1, second paragraph of Section 1).
Claim 19
Srinivasan teaches determining whether criteria for updating the actor computing units are satisfied; and when the criteria are satisfied, transmitting updated parameter values to the actor computing units (Fig. 5, Determine whether to accept or discard gradient; pars. 64-66).
Claims 8, 11, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Srinivasan in view of Schaul as applied to claims 1 and 13 above, and further in view of the article “Rainbow: Combining Improvements in Deep Reinforcement Learning,” found at https://arxiv.org/abs/1710.02298, published in October 2017 by Hessel et al. (Hessel).
Claim 8
Srinivasan and Schaul teach all the limitations of claim 1 upon which claim 8 depends.  Srinivasan and Schaul do not teach that the reinforcement learning technique is an n-step Q learning technique.  Hessel teaches an n-step Q learning technique (page 3, left hand column, lines 14-24, multi-step learning).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include multi-step learning, as taught by Hessel, because then faster learning would have been achieved (Hessel, page 3, left hand column, lines 14-24).
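For illustration only, the multi-step (n-step) target referred to above can be sketched as follows: the discounted sum of the next n rewards plus the discounted bootstrap value of the state reached after n steps. The function and parameter names are hypothetical, and the bootstrap value is assumed to be the maximum action value estimated at the state reached after the n steps.

```python
def n_step_target(rewards, discount, bootstrap_value):
    """n-step Q-learning target for n = len(rewards).

    rewards: the n rewards observed after the selected action, in order.
    bootstrap_value: assumed max action value at the state n steps ahead.
    """
    n = len(rewards)
    n_step_return = sum((discount ** k) * r for k, r in enumerate(rewards))
    return n_step_return + (discount ** n) * bootstrap_value
```

With a single reward the function reduces to the ordinary one-step target.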
Claim 11
Srinivasan and Schaul teach all the limitations of claim 1 upon which claim 11 depends.  Srinivasan and Schaul do not teach that obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action comprises: selecting additional actions to be performed by the agent in response to subsequent observations using the action selection neural network replica to generate an n-step transition.  Hessel teaches an n-step Q learning technique (page 3, left hand column, lines 14-24, multi-step learning).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include multi-step learning, as taught by Hessel, because then faster learning would have been achieved (Hessel, page 3, left hand column, lines 14-24).
Claim 17
Srinivasan and Schaul teach all the limitations of claim 13 upon which claim 17 depends.  Srinivasan and Schaul do not teach that the reinforcement learning technique is an n-step Q learning technique.  Hessel teaches an n-step Q learning technique (page 3, left hand column, lines 14-24, multi-step learning).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include multi-step learning, as taught by Hessel, because then faster learning would have been achieved (Hessel, page 3, left hand column, lines 14-24).
Claim 20
Srinivasan and Schaul teach all the limitations of claim 13 upon which claim 20 depends.  Srinivasan and Schaul do not teach that obtaining transition data characterizing the environment instance subsequent to the agent performing the selected action comprises: selecting additional actions to be performed by the agent in response to subsequent observations using the action selection neural network replica to generate an n-step transition.  Hessel teaches an n-step Q learning technique (page 3, left hand column, lines 14-24, multi-step learning).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include multi-step learning, as taught by Hessel, because then faster learning would have been achieved (Hessel, page 3, left hand column, lines 14-24).
Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Srinivasan in view of Schaul as applied to claims 1 and 13 above, and further in view of the article “The Reactor: A Sample-Efficient Actor-Critic Architecture,” found at https://arxiv.org/abs/1704.04651v1, published in April 2017 by Gruslys et al. (Gruslys).
Claim 9
Srinivasan and Schaul teach all the limitations of claim 1 upon which claim 9 depends.  Srinivasan and Schaul do not teach that the reinforcement learning technique is an actor-critic technique.  Gruslys teaches an actor-critic algorithm (page 2, Section 2, the actor critic algorithm).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include the actor-critic algorithm, as taught by Gruslys, because then reinforcement learning with a prioritized replay would have been useful with an additional reinforcement learning technique.
Claim 18
Srinivasan and Schaul teach all the limitations of claim 13 upon which claim 18 depends.  Srinivasan and Schaul do not teach that the reinforcement learning technique is an actor-critic technique.  Gruslys teaches an actor-critic algorithm (page 2, Section 2, the actor critic algorithm).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning training combination, as taught by Srinivasan and Schaul, to include the actor-critic algorithm, as taught by Gruslys, because then reinforcement learning with a prioritized replay would have been useful with an additional reinforcement learning technique.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MANUEL L BARBEE whose telephone number is (571)272-2212. The examiner can normally be reached M-F: 9-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John E Breene, can be reached at 571-272-4107. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MANUEL L BARBEE/Primary Examiner, Art Unit 2864