DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Claims 26-29 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Regarding the double patenting rejection: Applicant has filed a terminal disclaimer for Patent No. 10,282,662 and Patent No. 10,650,310; therefore, the double patenting rejection has been withdrawn.
Applicant's amendments/arguments filed 12/13/2021 have been fully considered but they are not persuasive.
Regarding the amendment clarifying ‘piece of experience data’ (“piece of experience data is an experience tuple that comprises a respective current observation characterizing a respective current state of the environment, and a respective current action performed by the agent in response to the current observation”): Examiner has cited Mnih as disclosing wherein each piece of experience data is an experience tuple (see page 4 §4 ¶3, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, …, e_N, [where data-set D = e_1, …, e_N is the experience tuple], also see page 5, “In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates”) that comprises a respective current observation characterizing a respective current state of the environment, and a respective current action performed by the agent in response to the current observation (see page 4 §4 ¶3, “In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay [13] where we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, …, e_N, pooled over many episodes into a replay memory” [i.e. s_t is a current state, a_t is an action, r_t is a reward and s_{t+1} is a next state]).
Regarding the selecting of a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures: after performing experience replay, the agent selects an action according to an ε-greedy policy. The ε-greedy policy chooses the action with maximum expected return or highest estimated reward, which corresponds to prioritizing for selection pieces of experience data having relatively higher expected learning progress measures. See page 4, section 4, Deep Reinforcement Learning.
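For orientation, the two mechanisms referred to above, as described in the cited passages of Mnih (a replay memory holding the last N experience tuples with uniform random sampling, and ε-greedy action selection), can be sketched as follows. This is a minimal illustration only and is not code from the reference; the names Experience, ReplayMemory, and epsilon_greedy are hypothetical.

```python
import random
from collections import deque, namedtuple

# Experience tuple e_t = (s_t, a_t, r_t, s_{t+1}); field names are illustrative.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Keeps only the last `capacity` experience tuples (hypothetical helper)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, experience):
        # Appending to a full deque silently drops the oldest tuple.
        self.buffer.append(experience)

    def sample_uniform(self, batch_size, rng=random):
        # Uniform random sampling: every stored tuple is equally likely.
        return rng.sample(list(self.buffer), batch_size)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In this sketch, sample_uniform draws stored tuples for training updates, while epsilon_greedy picks the next action to perform from a list of estimated action values.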

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



Claims 21-24, 30, 36-39, and 41 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mnih et al. (Playing Atari with Deep Reinforcement Learning).
 
Regarding claim 21,
Mnih discloses a method for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment by performing actions that cause the environment to transition states (see P6 §5.1 ¶1 “In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner”), 
the method comprising: maintaining a replay memory, the replay memory storing a plurality of pieces of experience data that each represents information about an interaction of the agent with the environment (see page 4 §4 ¶3, a technique known as experience replay [13], which stores the agent’s experiences at each time-step),
wherein each piece of experience data is an experience tuple (see page 4 §4 ¶3, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, …, e_N, [where data-set D = e_1, …, e_N is the experience tuple], also see page 5, “In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates”) that comprises a respective current observation characterizing a respective current state of the environment, and a respective current action performed by the agent in response to the current observation (see page 4 §4 ¶3, “In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay [13] where we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, …, e_N, pooled over many episodes into a replay memory” [i.e. s_t is a current state, a_t is an action, r_t is a reward and s_{t+1} is a next state]);
associating, with each piece of experience data in the replay memory having a respective expected learning progress measure, wherein for each of one or more pieces of experience data (see P4 §4 ¶3, and p5 Algorithm 1, a replay memory is maintained which stores experiences from each time step from the agent interacting with the environment, Q is a function for an expected amount of progress, “During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, e ∼ D, drawn at random from the pool of stored samples”), the respective expected learning progress measure is derived from a result of a preceding time that the piece of experience data was used in training the neural network and is computed based on an error measured at least with respect to a target expected return resulting from the interaction (see P4 §4 ¶3, and p5 Algorithm 1, [image: media_image1.png], [image: media_image2.png]) (i.e. a replay memory is maintained which stores experiences from each time step from the agent interacting with the environment, Q is a function for an expected amount of progress);
selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures (see P4 §4 ¶3, and p5 Algorithm 1, [image: media_image1.png], wherein, after performing experience replay, the agent selects an action according to an ε-greedy policy; the ε-greedy policy chooses the action with maximum expected return or highest estimated reward);
and training, using a reinforcement learning technique, the neural network on the selected piece of experience data (see Pages 4-5 §4 Deep Reinforcement Learning, training using reinforcement learning on the selected piece of experience data; also see page 6 §5.1 ¶1, “In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner”, i.e. training using reinforcement learning on the selected piece of experience data).

Regarding claim 22. 
Mnih discloses the method of claim 21, 
Mnih further discloses wherein: the target expected return resulting from the interaction comprises a target expected total reward that could have been received by the agent following the interaction characterized by the selected piece of experience data (see page 6 § 5.1, “In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.”); and training the neural network on the selected piece of experience data comprises determining, with respect to the target expected total reward, an updated error for the selected piece of experience data (see page 6 §5, “Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games”, also see page 6 section 5.1, “Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress. Another, more stable, metric is the policy’s estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state”, target total [i.e. action-value function Q] updated error for experience data).

Regarding claim 23. 
Mnih discloses the method of claim 22,
Mnih further discloses further comprising: determining an updated expected learning progress measure for the selected piece of experience data based on an absolute value of the updated error (see page 4, §4 ¶2, “Tesauro’s TD-Gammon architecture provides a starting point for such an approach. This architecture updates the parameters of a network that estimates the value function, directly from on-policy samples of experience, s_t, a_t, r_t, s_{t+1}, a_{t+1}, drawn from the algorithm’s interactions with the environment (or by self-play, in the case of backgammon).”); and associating, in the replay memory, the selected piece of experience data with the updated expected learning progress measure (see P4 §4 ¶3, and p5 Algorithm 1, [image: media_image1.png]).

Regarding claim 24. 
Mnih discloses the method of claim 21, 
Mnih further discloses wherein selecting the piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress comprises: determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory (see P4 §4 ¶3, and p5 Algorithm 1, [image: media_image1.png]); and sampling a piece of experience data from the replay memory in accordance with the determined probabilities (see P4 §4 ¶3, and p5 Algorithm 1, which also discloses sampling a piece of experience data from the replay memory according to the probabilities).
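The two steps recited in claim 24 (determining a respective probability for each piece of experience data from its expected learning progress measure, then sampling in accordance with those probabilities) can be illustrated with a minimal sketch. The proportional rule used here, P(i) = m_i / Σ_k m_k, is an assumption chosen for illustration; the claim language as quoted does not specify the mapping, and the function names are hypothetical.

```python
import random


def selection_probabilities(measures):
    """Map non-negative measures to probabilities (assumed: proportional rule)."""
    total = sum(measures)
    if total <= 0:
        # Fall back to uniform when no measure carries information.
        return [1.0 / len(measures)] * len(measures)
    return [m / total for m in measures]


def sample_index(probabilities, rng=random):
    """Draw one index in accordance with the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
```

When every measure is equal, this rule reduces to uniform random sampling.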

Regarding claim 30.

Mnih further discloses wherein each piece of experience data comprises a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action (see page 4 §4 ¶3, “In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay [13] where we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}) in a data-set D = e_1, …, e_N, pooled over many episodes into a replay memory” [i.e. s_t is a current state, a_t is an action, r_t is a reward and s_{t+1} is a next state]).

Claims 36-39 recite a system to perform the method recited in claims 21-24. Therefore the rejection of claims 21-24 above applies equally here.
Claim 41 recites a non-transitory computer storage medium to perform the method recited in claim 21. Therefore the rejection of claim 21 above applies equally here.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
 
Claims 25 and 40 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (Playing Atari with Deep Reinforcement Learning) in view of Narasimhan et al. (Language Understanding for Text-based Games using Deep Reinforcement Learning).
 
Regarding claim 25. 
Mnih teaches the method of claim 24, 
Mnih teaches wherein determining, based on the respective expected learning progress measures for the pieces of experience data, a respective probability for each of the pieces of experience data in the replay memory (see P4 §4 ¶3 as cited in claim 24).
However, Mnih does not teach determining a respective probability for each piece of experience data such that pieces of experience data having higher expected learning progress measures have higher probabilities than pieces of experience data having relatively lower expected learning progress measures.
Narasimhan teaches determining a respective probability for each piece of experience data such that pieces of experience data having higher expected learning progress measures have higher probabilities than pieces of experience data having relatively lower expected learning progress measures (see page 5 ¶ 2, “The simplest method to create these minibatches from the experience memory D is to sample uniformly at random. However, certain experiences are more valuable than others for the agent to learn from. For instance, rare transitions that provide positive rewards can be used more often to learn optimal Q-values faster. In our experiments, we consider such positive-reward transitions to have higher priority and keep track of them in D. We use prioritized sampling (inspired by Moore and Atkeson (1993)) to sample a fraction p of transitions from the higher priority pool and a fraction 1-p from the rest.”).
Both Mnih and Narasimhan pertain to the problem of training a neural network to play games using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Narasimhan to determine a probability based on a relative expected learning progress as taught by Narasimhan. The motivation for doing so would be to use prioritization which allows for higher value training experiences to be encountered more often and to yield predictable results (See Narasimhan abstract and see page 5 ¶ 2).
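The prioritized sampling described in the quoted Narasimhan passage (a fraction p of transitions drawn from the higher priority pool and a fraction 1-p from the rest) can be sketched as below. This is an illustrative sketch under stated assumptions, not the reference's implementation: the function name is hypothetical, sampling is assumed to be with replacement, and the split is rounded to whole transitions.

```python
import random


def prioritized_minibatch(high_priority, rest, batch_size, p, rng=random):
    """Draw about p * batch_size transitions from the high-priority pool
    and the remainder from the rest (assumed: sampling with replacement).
    If either pool is empty, its share is skipped and the batch is smaller."""
    n_high = round(p * batch_size) if high_priority else 0
    n_rest = batch_size - n_high if rest else 0
    batch = [rng.choice(high_priority) for _ in range(n_high)]
    batch += [rng.choice(rest) for _ in range(n_rest)]
    return batch
```

With p = 0.25 and a minibatch of 8, for example, this draws 2 transitions from the higher priority pool and 6 from the remaining transitions.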

Claim 40 recites a system to perform the method recited in claim 25. Therefore the rejection of claim 25 above applies equally here.

Claims 31-35 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (Playing Atari with Deep Reinforcement Learning) in view of Maei et al. (Toward Off-Policy Learning Control with Function Approximation).
 
Regarding claim 31. 
Mnih teaches the method of claim 22, 
Mnih further discloses wherein training the neural network on the selected piece of experience data (see page 6 § 5.1, “In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.”).
However, Mnih does not teach using the updated error in adjusting values of the parameters of the neural network.
Maei teaches: using the updated error in adjusting values of the parameters of the neural network (see page 3 ¶4, [image: media_image3.png], wherein the temporal difference error, i.e. the updated error, adjusts values of the parameters of the neural network).
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (See Maei abstract and see page 3 ¶4).
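As a minimal illustration of using a temporal difference error to adjust parameter values, the sketch below applies a plain Q-learning update with linear function approximation, Q(s, a) = θ · φ(s, a). It deliberately omits the correction term and second weight vector of Maei's Greedy-GQ, and it uses a linear approximator rather than a neural network; the feature vectors, step size alpha, and discount gamma are assumed values, and the function names are hypothetical.

```python
def td_error(theta, phi_sa, reward, next_phis, gamma):
    """delta = r + gamma * max_a' Q(s', a') - Q(s, a), with Q = theta . phi."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    q_sa = dot(theta, phi_sa)
    # An empty next_phis list is treated as a terminal state (value 0).
    q_next = max((dot(theta, phi) for phi in next_phis), default=0.0)
    return reward + gamma * q_next - q_sa


def q_learning_step(theta, phi_sa, reward, next_phis, alpha=0.1, gamma=0.99):
    """Move theta along phi(s, a), scaled by the TD error."""
    delta = td_error(theta, phi_sa, reward, next_phis, gamma)
    new_theta = [w + alpha * delta * x for w, x in zip(theta, phi_sa)]
    return new_theta, delta
```

The returned delta is the temporal difference error for the transition, and new_theta is the parameter vector after one update.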

Regarding claim 32. 
Mnih and Maei teach the method of claim 31,
Maei further teaches wherein using the updated error in adjusting the values of the parameters comprises: determining a weight for the updated error using the expected learning progress measure for the selected experience tuple; adjusting the updated error using the weight; and using the adjusted error as a target error for adjusting the values of the parameters of the neural network (see p3 ¶4, [image: media_image3.png], and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, w_t, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”).
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (See Maei abstract and see page 3 ¶4).


Regarding claim 33. 
Mnih and Maei teach the method of claim 32,
Mnih further teaches further comprising annealing an exponent used in computing the weight during the training of the neural network (see page 6 §5 ¶2 “In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behavior policy during training was ε-greedy with ε annealed linearly from 1 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 10 million frames and used a replay memory of one million most recent frames.”).
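The schedule in the quoted passage (the exploration rate annealed linearly from 1 to 0.1 over the first million frames, then fixed at 0.1) can be written as a short function. This is a sketch for illustration; the function and parameter names are hypothetical, and only the numeric values come from the quotation.

```python
def annealed_epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    if frame >= anneal_frames:
        return end
    fraction = frame / anneal_frames
    return start + fraction * (end - start)
```

Halfway through the annealing period (frame 500,000) this gives a value of approximately 0.55.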
 
Regarding claim 34. 
Mnih teaches the method of claim 30, 
Mnih does not teach wherein the expected learning progress measure for each experience tuple in the replay memory is a derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network.
Maei teaches wherein the expected learning progress measure for each experience tuple in the replay memory is a derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network (see page 4 ¶3, [image: media_image4.png], and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, w_t, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”).
Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (See Maei abstract and see page 3 ¶4).

Regarding claim 35. 
Mnih teaches the method of claim 30, 
Mnih does not teach wherein the expected learning progress measure for each experience tuple in the replay memory is a norm of an induced weight-change by using the experience tuple to train the neural network.
Maei teaches wherein the expected learning progress measure for each experience tuple in the replay memory is a norm of an induced weight-change by using the experience tuple to train the neural network (see page 4 ¶3, [image: media_image4.png], and page 4 right column ¶5 “Greedy-GQ uses an update-rule for parameter θ analogous to that of Q-learning with function approximation except that we have a correction term. The update of the second set of weights, w_t, follows the least mean square (LMS) rule. These weights are normally initialized to zero. As promised, the computation of an update takes linear time in the dimension of the features, d.”).
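Both Mnih and Maei pertain to the problem of training a neural network using reinforcement learning, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Mnih and Maei to determine a temporal difference error for adjusting the values of the neural network as taught by Maei. The motivation for doing so would be to improve stability when using approximation and to yield predictable results (See Maei abstract and see page 3 ¶4).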

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
	MNIH et al. (US 2015/0100530 A1), METHODS AND APPARATUS FOR REINFORCEMENT LEARNING. This invention relates to improved techniques for reinforcement learning, in particular Q-learning, and to related data processors and processor control code. A method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights of the second neural network at intervals.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IMAD M KASSIM whose telephone number is (571)272-2958. The examiner can normally be reached Monday-Friday, 7:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is 





/I.K./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129