DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Claims 1-18, 20 and 24 are allowed. 
The following is an examiner’s statement of reasons for allowance: With respect to claim 1, the prior art on record does not teach or fairly suggest the combination of the limitations: “obtaining a sequence of one or more experience tuples wherein each experience tuple comprises: (i) an observation characterizing a state of an instance of the environment at a respective time step, (ii) an action that was selected to be performed by the agent at the respective time step using a behavior policy, (iii) a behavior policy score assigned to the selected action by the behavior policy when the action was selected, (iv) a subsequent observation characterizing a subsequent state of the environment instance subsequent to the agent performing the selected action, and (iv) a reward received subsequent to the agent performing the selected action; adjusting current parameter values of a state value neural network, wherein the state value neural network is configured to process an input comprising an observation of the environment in accordance with current parameter values of the state value neural network to generate an output comprising a state value for the observation, the adjusting including: determining, using the state value neural network, in accordance with current parameter values of the state value neural network, and based on the observation included in the first experience tuple in the sequence, a state value for the observation included in the first experience tuple in the sequence; for each experience tuple of the sequence of experience tuples: determining, using the action selection neural network, in accordance with current parameter values of the action selection neural network, and based on the observation included in the experience tuple, a learner policy score for the selected action from the experience tuple; determining a trace coefficient based on a ratio of the learner policy score for the selected action and the behavior policy score for the selected action; determining a correction factor for the experience tuple based on: (i) the trace coefficient for the experience tuple, and (ii) the trace coefficients for any experience tuples that precede the experience tuple in the sequence; determining a state value temporal difference for the experience tuple based on at least: (i) the reward included in the experience tuple, (ii) a state value for the observation included in the experience tuple generated by processing the observation included in the experience tuple in accordance with current parameter values of the state value neural network, and (iii) a state value for the subsequent observation included in the experience tuple generated by processing the subsequent observation included in the experience tuple in accordance with current parameter values of the state value neural network; determining a state value target for the observation included in the first experience tuple in the sequence based on at least: (i) the correction factors, (ii) the state value temporal differences, and (iii) the state value for the observation included in the first experience tuple in the sequence; determining a gradient of a state value loss function with respect to parameters of the state value neural network, wherein the state value loss function is based on at least the state value target; and adjusting the current parameter values of the state value neural network based on the gradient; and adjusting current parameter values of the action selection neural network based at least on: (i) a ratio of the learner policy score and the behavior policy score for the selected action from the first experience tuple of the sequence, and (ii) state values generated by the state value neural network by processing observations included in one or more experience tuples in accordance with current parameter values of the state value neural network.” Claims 2-12 depend on claim 1 and are allowed with the same rationale thereto. 
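For orientation only, the state value target recited in claim 1 can be sketched as follows. This is a minimal, hypothetical Python illustration and not the applicant's implementation: the function name, the discount factor, the clipping threshold, and the use of a running product for the correction factor are assumptions that do not appear in the claim language.

```python
from typing import Sequence


def state_value_target(
    learner_scores: Sequence[float],   # learner policy score for each selected action
    behavior_scores: Sequence[float],  # behavior policy score for each selected action
    rewards: Sequence[float],          # reward in each experience tuple
    values: Sequence[float],           # state value of each observation
    next_values: Sequence[float],      # state value of each subsequent observation
    discount: float = 0.99,            # assumed; the claim does not recite a discount
    clip: float = 1.0,                 # assumed truncation of the ratio
) -> float:
    """Hypothetical sketch of a state value target for the first observation."""
    target = values[0]   # (iii) state value for the first observation
    correction = 1.0     # running product of trace coefficients
    for t in range(len(rewards)):
        # Trace coefficient: based on the ratio of learner to behavior score.
        trace = min(clip, learner_scores[t] / behavior_scores[t])
        # Correction factor: this tuple's coefficient times all preceding ones.
        correction *= trace
        # State value temporal difference for this tuple.
        td = rewards[t] + discount * next_values[t] - values[t]
        # Accumulate the corrected, discounted temporal difference.
        target += (discount ** t) * correction * td
    return target
```

When the learner and behavior policy scores are equal and the assumed discount is 1, every correction factor is 1 and the sketch reduces to the sum of rewards plus the final bootstrap value.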
With respect to claim 13, the prior art on record does not teach or fairly suggest the combination of the limitations: “a plurality of actor computing units, each of the actor computing units configured to maintain a respective actor action selection neural network and to perform actor operations comprising: generating a trajectory of one or more experience tuples, wherein generating an experience tuple comprises: receiving an observation characterizing a current state of an instance of the environment, determining, using the actor action selection neural network, in accordance with current parameter values of the actor action selection neural network, and based on the observation, a selected action to be performed by the agent and a policy score for the selected action; obtaining transition data including: (i) a subsequent observation characterizing a subsequent state of the environment instance subsequent to the agent performing the selected action and (ii) a reward received subsequent to the agent performing the selected action; generating an experience tuple from the observation, the selected action, the policy score for the selected action, the subsequent observation, and the reward; storing the trajectory of experience tuples in a queue, wherein the queue is accessible to each of the actor computing units, and the queue comprises an ordered sequence of different experience tuple trajectories; and one or more learner computing units, wherein each of the one or more learner computing units is configured to perform learner operations comprising: obtaining a batch of experience tuple trajectories from the queue; and determining, using the batch of experience tuple trajectories, an update to the learner action selection neural network parameters using a reinforcement learning technique.” Claims 14-18 and 20 depend on claim 13 and are allowed with the same rationale thereto. 
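For orientation only, the actor and learner arrangement recited in claim 13 can be sketched as follows. This is a minimal, hypothetical Python illustration and not the applicant's implementation: the `Experience` type, the `run_actor` and `take_batch` helpers, the stand-in uniform policy, and the toy environment are all assumptions, and the learner's parameter update is deliberately omitted.

```python
import queue
import random
import threading
from typing import List, NamedTuple


class Experience(NamedTuple):
    observation: int
    action: int
    policy_score: float
    next_observation: int
    reward: float


def run_actor(actor_id: int, trajectory_queue: queue.Queue,
              num_trajectories: int = 2, length: int = 3) -> None:
    """Hypothetical actor: generates trajectories and stores them in the shared queue."""
    rng = random.Random(actor_id)
    observation = 0
    for _ in range(num_trajectories):
        trajectory: List[Experience] = []
        for _ in range(length):
            # Stand-in for the actor action selection network: uniform over two actions.
            action = rng.randrange(2)
            policy_score = 0.5
            # Stand-in transition data from a toy environment.
            next_observation = observation + 1
            reward = float(action)
            trajectory.append(
                Experience(observation, action, policy_score, next_observation, reward)
            )
            observation = next_observation
        trajectory_queue.put(trajectory)


def take_batch(trajectory_queue: queue.Queue, batch_size: int) -> List[List[Experience]]:
    """Hypothetical learner step: obtains a batch of trajectories from the queue."""
    return [trajectory_queue.get() for _ in range(batch_size)]


# One queue accessible to every actor; trajectories are held in arrival order.
trajectory_queue: queue.Queue = queue.Queue()
actors = [threading.Thread(target=run_actor, args=(i, trajectory_queue)) for i in range(3)]
for actor in actors:
    actor.start()
for actor in actors:
    actor.join()

# The learner would compute a parameter update from this batch (omitted here).
batch = take_batch(trajectory_queue, batch_size=4)
```

The queue decouples trajectory generation from learning, so the learner may consume trajectories produced under older actor parameters, which is why each tuple carries the policy score assigned when the action was selected.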
With respect to claim 24, the prior art on record does not teach or fairly suggest the combination of the limitations: “obtaining a sequence of one or more experience tuples wherein each experience tuple comprises: (i) an observation characterizing a state of an instance of the environment at a respective time step, (ii) an action that was selected to be performed by the agent at the respective time step using a behavior policy, (iii) a behavior policy score assigned to the selected action by the behavior policy when the action was selected, (iv) a subsequent observation characterizing a subsequent state of the environment instance subsequent to the agent performing the selected action, and (iv) a reward received subsequent to the agent performing the selected action; adjusting current parameter values of a state value neural network, wherein the state value neural network is configured to process an input comprising an observation of the environment in accordance with current parameter values of the state value neural network to generate an output comprising a state value for the observation, the adjusting including: determining, using the state value neural network, in accordance with current parameter values of the state value neural network, and based on the observation included in the first experience tuple in the sequence, a state value for the observation included in the first experience tuple in the sequence; for each experience tuple of the sequence of experience tuples: determining, using the action selection neural network, in accordance with current parameter values of the action selection neural network, and based on the observation included in the experience tuple, a learner policy score for the selected action from the experience tuple; determining a trace coefficient based on a ratio of the learner policy score for the selected action and the behavior policy score for the selected action; determining a correction factor for the experience tuple based on: (i) the trace coefficient for the experience tuple, and (ii) the trace coefficients for any experience tuples that precede the experience tuple in the sequence; determining a state value temporal difference for the experience tuple based on at least: (i) the reward included in the experience tuple, (ii) a state value for the observation included in the experience tuple generated by processing the observation included in the experience tuple in accordance with current parameter values of the state value neural network, and (iii) a state value for the subsequent observation included in the experience tuple generated by processing the subsequent observation included in the experience tuple in accordance with current parameter values of the state value neural network; determining a state value target for the observation included in the first experience tuple in the sequence based on at least: (i) the correction factors, (ii) the state value temporal differences, and (iii) the state value for the observation included in the first experience tuple in the sequence; determining a gradient of a state value loss function with respect to parameters of the state value neural network, wherein the state value loss function is based on at least the state value target; and adjusting the current parameter values of the state value neural network based on the gradient; and adjusting current parameter values of the action selection neural network based at least on: (i) a ratio of the learner policy score and the behavior policy score for the selected action from the first experience tuple of the sequence, and (ii) state values generated by the state value neural network by processing observations included in one or more experience tuples in accordance with current parameter values of the state value neural network.”
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BEEMNET W DADA whose telephone number is (571)272-3847. The examiner can normally be reached Monday-Friday, 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Joseph Hirl, can be reached at 571-272-3685. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

BEEMNET W. DADA
Primary Examiner
Art Unit 2435



/BEEMNET W DADA/Primary Examiner, Art Unit 2435