DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Claims 26-29 are objected to as being dependent upon a rejected base claim, but would be allowable if the double patenting rejections discussed below are overcome and the claims are rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Double Patenting
The non-statutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A non-statutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on non-statutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 21, 25-36 and 40-41 are provisionally rejected on the ground of non-statutory double patenting as being unpatentable over claims 1-20 of Patent No. US 10,650,310 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because the instant claims are broader than the patent claims.


Comparison of Instant Application 16/866,365 with Patent No. US 10,650,310 B2:

Instant Application, Claim 21:
A method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment, each piece of experience data having a respective expected learning progress measure, wherein for each of one or more pieces of experience data, the respective expected learning process measure is derived from a result of a preceding time that the piece of experience data was used in training the neural network and is computed based on an error measured at least with respect to a target expected return resulting from the interaction; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures; and training, using a reinforcement learning technique, the neural network on the selected piece of experience data.

Patent No. US 10,650,310 B2, Claim 1:
[...] with an environment by performing actions that cause the environment to transition states, the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that are generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data having a respective expected learning progress measure; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher [...] training, using a reinforcement learning technique, the neural network on the selected piece of experience data, wherein training the neural network on the selected piece of experience data comprises determining a temporal difference learning error for the selected piece of experience data; determining an updated expected learning progress measure for the selected piece of experience data based on an absolute value of the temporal difference learning error; and associating, in the replay memory, the selected piece of experience data with the updated expected learning progress measure.



Instant Application, Claim 36:
A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the operations comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment, each piece of experience data having a respective expected learning progress measure, wherein for each of one or more pieces of experience data, the respective expected learning process measure is derived from a result of a preceding time that the piece of experience data was used in training the [...]

Patent No. US 10,650,310 B2, Claim 14:
[...] maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that are generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data having a respective expected learning progress measure; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures, comprising: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective [...]



Instant Application, Claim 41:
A non-transitory computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the operations comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment, each piece of experience data having a respective expected learning progress measure, wherein for each of one or more pieces of experience data, the respective expected learning process measure is derived from a result of a preceding time that the piece of experience data was used in training the neural network and is computed based on [...]

Patent No. US 10,650,310 B2, Claim 19:
[...] actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that are generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data having a respective expected learning progress measure; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures, comprising: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience [...] in accordance with the determined probabilities; training, using a reinforcement learning technique, the neural network on the selected piece of experience data, wherein training the neural network on the selected piece of experience data comprises determining a temporal difference learning error for the selected piece of experience data; determining an updated expected learning progress measure for the selected piece of experience data based on an absolute value of the temporal difference learning error; and associating, in the replay memory, the selected piece of experience data with the updated expected learning progress measure.




The dependent claims 25-35 of the instant application are fully anticipated by claims 2-13 of U.S. Patent No. 10,650,310, respectively.
The independent claim 36 of the instant application is fully anticipated by claim 14 of U.S. Patent No. 10,650,310.
The dependent claim 40 of the instant application is fully anticipated by claim 15 of U.S. Patent No. 10,650,310.
The independent claim 41 of the instant application is fully anticipated by claim 19 of U.S. Patent No. 10,650,310.

Claims 21, 25-36 and 40-41 are provisionally rejected on the ground of non-statutory double patenting as being unpatentable over claims 1-13, 16-17 and 20 of Patent No. US 10,282,662 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because the instant claims are broader than the patent claims.

Comparison of Instant Application 16/866,365 with Patent No. US 10,282,662 B2:

Instant Application, Claim 21:
A method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment, each piece of experience data having a respective expected learning progress measure, wherein for each of one or more pieces of experience data, the respective expected learning process measure is derived from a result of a preceding time that the piece of experience data was used in training the neural network and is computed based on an error measured at least with respect to a target expected return resulting from the interaction; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures; and training, using a reinforcement learning technique, the neural network on the selected piece of experience data.

Patent No. US 10,282,662 B2, Claim 1:
A method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment
the method comprising:
maintaining a replay memory, the replay memory storing pieces of experience data for use in training the neural network,
wherein:
each piece of experience data has been generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data comprises a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next observation characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action, a plurality of the pieces of experience data are each associated with a respective expected learning progress [...] would be made in the training of the neural network if the neural network is trained on the piece of experience data and (ii) is derived from a result of a preceding time that the piece of experience data was used in training the neural network; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures, comprising: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory, and sampling a piece of experience data from the replay memory in accordance with the determined probabilities; training, using a reinforcement learning technique, the neural network on the selected piece of experience data; and associating, in the replay memory, the selected piece of experience data with a new expected learning progress measure derived from a result of training the neural network on the selected piece of experience data.


Instant Application, Claim 36:
A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the operations comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment, each piece of experience data having a respective expected [...]

Patent No. US 10,282,662 B2, Claim 16:
[...] more computers, to cause the one or more computers to perform operations for a method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states, the method comprising:
maintaining a replay memory, the replay memory storing pieces of experience data for use in training the neural network, wherein: each piece of experience data has been generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data comprises a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next observation characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action, a plurality of the pieces of experience data are each associated with a respective expected learning progress measure that (i) is a measure of an expected amount of progress that would be made in the training of the neural network if the neural network is trained on the piece of experience data and (ii) is derived from a result of a preceding time that the piece of experience data was used in training the neural network; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures, comprising: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory, and sampling a piece of experience data from the replay memory in accordance with the determined probabilities; training, using a reinforcement learning technique, the neural network on the selected piece of experience data; and associating, in the replay memory, the selected piece of experience data with a new expected learning progress measure derived from a result of training the neural network on the selected piece of experience data.


Instant Application, Claim 41:
A non-transitory computer storage medium encoded with instructions that, when executed [...]

Patent No. US 10,282,662 B2, Claim 20:
[...] maintaining a replay memory, the replay memory storing pieces of experience data for use in training the neural network, wherein: each piece of experience data has been generated as a result of the reinforcement learning agent interacting with the environment, each piece of experience data comprises a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next observation characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action, a plurality of the pieces of experience data are each associated with a respective expected learning progress measure that (i) is a measure of an expected amount of progress that would be made in the training of the neural network if the neural network is trained on the piece of experience data and (ii) is derived from a result of a preceding time that the piece of experience data was used in training the neural network; selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures, comprising: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory, and sampling a piece of experience data from the replay memory in accordance with the determined probabilities; training, using a reinforcement learning technique, the neural network on the selected piece of experience data; and associating, in the replay memory, the selected piece of experience data with a new expected learning progress measure derived from a result of training the neural network on the selected piece of experience data.


The independent claim 21 of the instant application is fully anticipated by claim 1 of U.S. Patent No. 10,282,662.
The dependent claims 25-35 of the instant application are fully anticipated by claims 2-13 of U.S. Patent No. 10,282,662, respectively.
The independent claim 36 of the instant application is fully anticipated by claim 16 of U.S. Patent No. 10,282,662.
The dependent claim 40 of the instant application is fully anticipated by claim 17 of U.S. Patent No. 10,282,662.
The independent claim 41 of the instant application is fully anticipated by claim 20 of U.S. Patent No. 10,282,662.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



Claims 21-24, 30, 36-39, and 41 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mnih et al. (Playing Atari with Deep Reinforcement Learning).
 
Regarding claim 21,
Mnih discloses a method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states (see P6 §5.1 ¶1 “In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner”), 
the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment (see page 4 §4 ¶3, a technique known as experience replay [13], which stores the agent’s experiences at each time step), each piece of experience data having a respective expected learning progress measure wherein for each of one or more pieces of experience data (see page 4 §4 ¶3 and page 5, Algorithm 1; a replay memory is maintained which stores experiences from each time step of the agent interacting with the environment, and Q is a function for an expected amount of progress), the respective expected learning process measure is derived from a result of a preceding time that the piece of experience data was used in training the neural network and is computed based on an error measured at least with respect to a target expected return resulting from the interaction (see page 4 §4 ¶3 and page 5, Algorithm 1: [image: media_image1.png] and [image: media_image2.png], i.e., a replay memory is maintained which stores experiences from each time step of the agent interacting with the environment, and Q is a function for an expected amount of progress);
selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures (see page 5, Algorithm 1, and page 3, equation (1), “max_a Q*” [i.e., max_a Q* selects a piece of experience data based on maximum expected progress], and [image: media_image3.png]);
and training, using a reinforcement learning technique, the neural network on the selected piece of experience data (see page 6 §5 ¶2, [image: media_image3.png], and see page 6 §5.1 ¶1, “In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner”, i.e., training using reinforcement learning on the selected piece of experience data).

Regarding claim 22,
Mnih discloses the method of claim 21,
Mnih further discloses wherein: the target expected return resulting from the interaction comprises a target expected total reward that could have been received by the agent following the interaction characterized by the selected piece of experience data (see page 6 §5.1, “In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.”); and training the neural network on the selected piece of experience data comprises determining, with respect to the target expected total reward, an updated error for the selected piece of experience data (see page 6 §5, “Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games”; also see page 6 §5.1, “Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress. Another, more stable, metric is the policy’s estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state”, i.e., a target total reward [action-value function Q] and an updated error for the experience data).

Regarding claim 23,
Mnih discloses the method of claim 22,
Mnih further discloses determining an updated expected learning progress measure for the selected piece of experience data based on an absolute value of the updated error (see page 4 §4 ¶2, “Tesauro’s TD-Gammon architecture provides a starting point for such an approach. This architecture updates the parameters of a network that estimates the value function, directly from on-policy samples of experience, s_t, a_t, r_t, s_{t+1}, a_{t+1}, drawn from the algorithm’s interactions with the environment (or by self-play, in the case of backgammon).”); and associating, in the replay memory, the selected piece of experience data with the updated expected learning progress measure (see page 4 §4 ¶3 and page 5, Algorithm 1: [image: media_image1.png]).
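
For illustration only, the limitation of claim 23 (an updated expected learning progress measure based on the absolute value of the updated error, stored back with the experience) can be sketched as follows. This is a minimal sketch under stated assumptions and is taken from neither the application nor Mnih: the names StoredExperience, td_error and update_priority, the tabular Q representation, and the one-step Q-learning target are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class StoredExperience:
    # Hypothetical record: one transition plus its expected learning progress measure.
    state: int
    action: int
    reward: float
    next_state: int
    priority: float = 1.0


def td_error(q, exp, gamma=0.99):
    """One-step Q-learning error: r + gamma * max_a' Q(s', a') - Q(s, a).

    `q` is assumed to be a table indexed as q[state][action].
    """
    target = exp.reward + gamma * max(q[exp.next_state])
    return target - q[exp.state][exp.action]


def update_priority(q, exp, gamma=0.99):
    """Associate the experience with a new measure equal to |TD error|."""
    exp.priority = abs(td_error(q, exp, gamma))
    return exp.priority
```

In this sketch the stored measure is overwritten each time the experience is used, so the value seen at the next selection reflects the preceding use.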

Regarding claim 24,
Mnih discloses the method of claim 21,
Mnih further discloses wherein selecting the piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures (see page 5, Algorithm 1, and page 3, equation (1), “max_a Q*” [i.e., max_a Q* selects a piece of experience data based on maximum expected progress, and ε can be set to be 0.1, which is interpreted as relatively lower than the maximum expected progress; thus the max Q value is selected with relatively higher frequency]; also see page 6 §5 ¶2, [image: media_image3.png]) comprises: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory (see page 4 §4 ¶3 and page 5, Algorithm 1: [image: media_image1.png]); and sampling a piece of experience data from the replay memory in accordance with the determined probabilities (see page 4 §4 ¶3 and page 5, Algorithm 1, which also discloses sampling a piece of experience data from the replay memory according to the probabilities).
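
For illustration only, the two steps recited in claim 24 (determining a respective probability for each piece of experience data from its expected learning progress measure, then sampling in accordance with those probabilities) can be sketched as follows. The claim does not fix a formula; the proportional rule used here, and the names sampling_probabilities and sample_index, are assumptions made for the example.

```python
import random


def sampling_probabilities(priorities):
    """Turn non-negative priorities into probabilities that sum to 1.

    Assumed rule: probability proportional to the priority value.
    """
    total = sum(priorities)
    if total <= 0:
        raise ValueError("at least one priority must be positive")
    return [p / total for p in priorities]


def sample_index(priorities, rng=random):
    """Draw one index into the replay memory according to the probabilities."""
    probs = sampling_probabilities(priorities)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this assumed rule, an experience with three times the priority of another is three times as likely to be drawn.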

Regarding claim 30,
Mnih discloses the method of claim 21,
Mnih further discloses wherein each piece of experience data is an experience tuple that comprises a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action (see page 4 §4 ¶3, “In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay [13] where we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay memory” [i.e., s_t is a current state, a_t is an action, r_t is a reward and s_{t+1} is a next state]).
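
For illustration only, the experience tuple and replay memory described in the cited passage can be sketched as follows. The names Experience and ReplayMemory, the fixed capacity, and the uniform sampling without replacement are assumptions made for the example rather than details taken from the reference.

```python
import random
from collections import deque, namedtuple

# One experience tuple e_t = (s_t, a_t, r_t, s_{t+1}).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-capacity store that keeps only the most recent experiences."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)  # oldest entries are dropped first

    def store(self, state, action, reward, next_state):
        self._data.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement (an assumption of this sketch).
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)
```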

Claims 36-39 recite a system to perform the method recited in claims 21-24. Therefore the rejection of claims 21-24 above applies equally here.
Claim 41 recites a non-transitory computer storage medium to perform the method recited in claim 21. Therefore the rejection of claim 21 above applies equally here.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
 
Claims 25 and 40 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (Playing Atari with Deep Reinforcement Learning) in view of Narasimhan et al. (Language Understanding for Text-based Games using Deep Reinforcement Learning).
 
Regarding claim 25. 
Mnih teaches the method of claim 24, 
Mnih teaches determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory (see page 4 §4 ¶3 as cited in claim 24).
However, Mnih does not explicitly teach determining a respective probability for each piece of experience data such that pieces of experience data having higher expected learning progress measures have higher probabilities than pieces of experience data having relatively lower expected learning progress measures.
Narasimhan teaches determining a respective probability for each piece of experience data such that pieces of experience data having higher expected learning progress measures have higher probabilities than pieces of experience data having relatively lower expected learning progress measures (see page 5 ¶ 2, “The simplest method to create these minibatches from the experience memory D is to sample uniformly at random. However, certain experiences are more valuable than others for the agent to learn from. For instance, rare transitions that provide positive rewards can be used more often to learn optimal Q-values faster. In our experiments, we consider such positive-reward transitions to have higher priority and keep track of them in D. We use prioritized sampling (inspired by Moore and Atkeson (1993)) to sample a fraction p of transitions from the higher priority pool and a fraction 1-p from the rest.”).
Both Mnih and Narasimhan pertain to the problem of training a neural network to play games using reinforcement learning, and are thus analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Narasimhan to determine a probability based on a relative expected learning progress as taught by Narasimhan. The motivation for doing so would be to use prioritization, which allows higher value training experiences to be used more often so that optimal Q-values are learned faster (see Narasimhan page 5 ¶2).
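
For illustration only, the prioritized sampling quoted from Narasimhan (a fraction p of transitions drawn from the higher priority pool and a fraction 1-p from the rest) can be sketched as follows. The function name prioritized_minibatch, the rounding of p times the batch size, and sampling without replacement are assumptions made for the example and are not taken from the reference.

```python
import random


def prioritized_minibatch(high_priority, rest, batch_size, p, rng=random):
    """Build a minibatch with roughly a fraction p from the high-priority pool.

    `high_priority` and `rest` are lists of transitions. If a pool is too
    small, as many transitions as it holds are taken from it.
    """
    n_high = min(len(high_priority), round(p * batch_size))
    n_rest = min(len(rest), batch_size - n_high)
    batch = rng.sample(high_priority, n_high) + rng.sample(rest, n_rest)
    rng.shuffle(batch)  # avoid a fixed ordering of the two pools
    return batch
```

With p = 0.25 and a batch size of 8, this sketch draws 2 transitions from the higher priority pool and 6 from the remainder.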

Claim 40 recites a system to perform the method recited in claim 25. Therefore the rejection of claim 25 above applies equally here.

Claims 31-35 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (Playing Atari with Deep Reinforcement Learning) in view of Maei et al. (Toward Off-Policy Learning Control with Function Approximation).
 
Regarding claim 31. 
Mnih teaches the method of claim 22, 
Mnih further discloses wherein training the neural network on the selected piece of experience data (see page 6 § 5.1, “In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.”) 
However, Mnih does not teach using the updated error in adjusting values of the parameters of the neural network.
Maei teaches using the updated error in adjusting values of the parameters of the neural network (see page 3 ¶4, [image: media_image4.png], wherein the temporal difference error, i.e., the updated error, adjusts values of the parameters of the neural network).
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, and are thus analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (see Maei abstract and page 3 ¶4).

Regarding claim 32. 
Mnih and Maei teach the method of claim 31, 
Maei further teaches wherein using the updated error in adjusting the values of the parameters comprises: determining a weight for the updated error using the expected learning progress measure for the selected experience tuple; adjusting the updated error using the weight; and using the adjusted error as a target error for adjusting the values of the parameters of the neural network (see page 3 ¶4
[Figure: media_image4.png]
 and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, wt, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”)
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (see Maei abstract and page 3 ¶4).


Regarding claim 33. 
Mnih and Maei teach the method of claim 32, 
further comprising annealing an exponent used in computing the weight during the training of the neural network (see page 6 §5 ¶2 “In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behavior policy during training was ε-greedy with ε annealed linearly from 1 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 10 million frames and used a replay memory of one million most recent frames.”).
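For illustration only, the linear schedule described in the quoted passage (ε going from 1 to 0.1 over the first million frames, then held at 0.1) can be sketched as follows. The function name and parameter names are assumptions introduced for the example and do not come from the cited reference.

```python
def annealed_epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly interpolate from `start` to `end`, then hold at `end`.

    `frame` is the number of frames seen so far (hypothetical counter).
    """
    if frame >= anneal_frames:
        return end
    fraction = frame / anneal_frames
    return start + fraction * (end - start)


eps_first = annealed_epsilon(0)
eps_half = annealed_epsilon(500_000)
eps_late = annealed_epsilon(2_000_000)
```

Halfway through the annealing window the value sits midway between the two endpoints, and any frame count past the window returns the fixed final value.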
 
Regarding claim 34. 
Mnih teaches the method of claim 30, 
Mnih does not teach wherein the expected learning progress measure for each experience tuple in the replay memory is a derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network.
Maei teaches wherein the expected learning progress measure for each experience tuple in the replay memory is a derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network (see page 4 ¶3
[Figure: media_image5.png]
 and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, wt, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”)
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (see Maei abstract and page 3 ¶4).

Regarding claim 35. 
Mnih teaches the method of claim 30, 
Mnih does not teach wherein the expected learning progress measure for each experience tuple in the replay memory is a norm of an induced weight-change by using the experience tuple to train the neural network.
Maei teaches wherein the expected learning progress measure for each experience tuple in the replay memory is a norm of an induced weight-change by using the experience tuple to train the neural network (see page 4 ¶3
[Figure: media_image5.png]
 and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, wt, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”)
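Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (see Maei abstract and page 3 ¶4).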

					Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IMAD M KASSIM whose telephone number is (571) 272-2958.  The examiner can normally be reached Monday through Friday, 7:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private 






/IMAD KASSIM/Examiner, Art Unit 2125                                                                                                                                                                                                        
/MICHAEL J HUNTLEY/Primary Examiner, Art Unit 2116