DETAILED ACTION
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	This communication is in response to the Applicant’s submissions filed 24 April 2020 and 23 April 2020, where:
Claims 1-20 have been cancelled.
New claims 21-40 are presented for examination.
Claims 21-40 are pending.
Claims 21-40 are rejected.
Examiner notes the Applicant’s priority claim to the US Provisional Application 62463558, filed 24 February 2017.
Information Disclosure Statement
3.	Information disclosure statements were submitted on 23 April 2020 and 16 December 2020. See 37 CFR 1.98(d). With regard to the information disclosure statement submitted 23 April 2020, Examiner notes that the parent application, US Application S/N 16455523, includes [a] legible copy of (i) [e]ach foreign patent; and (ii) [e]ach publication or that portion which caused it to be listed, other than U.S. patents and U.S. patent application publications unless required by the Office. 37 CFR 1.98(d)(2). The submissions comply with the provisions of 37 CFR 1.97. Accordingly, the Examiner considered the information disclosure statements.
Double Patenting
4.	The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
5.	Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 4-7, and 9-12 of U.S. Patent No. 10664753. Although the claims at issue are not identical, they are not patentably distinct from each other as follows:

Instant Application ‘527 (first column) compared with US Patent 10664753 (second column)
Instant Claim 21

Claims 1 & 12

21. A computer-implemented method for training a neural episodic controller that comprises an embedding neural network, the neural episodic controller maintaining episodic memory data that comprises a respective episodic memory module for each action of a plurality of actions that may be performed by an agent in response to an observation, the episodic memory module for each action mapping each of a respective plurality of key embeddings to a respective return estimate, the method comprising:

sampling, by one or more computers, a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

processing the training observation using the embedding neural network in accordance with current values of parameters of the embedding neural network to generate a training key embedding for the training observation;

determining whether the training key embedding associated with the training observation matches any of the key embeddings in the episodic memory module for the training selected action;

when the training key embedding matches a key embedding in the episodic memory module for the training selected action, updating the episodic memory module to map the matching key embedding to a new return estimate that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) an episodic memory learning rate;

determining, by the one or more computers, a Q value for the training selected action from the training observation, wherein the Q value for the selected action is a predicted return that would result from the agent performing the training selected action in response to the training observation; and

backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate.
1. A method comprising:

maintaining, by one or more computers, respective episodic memory data for each action of a plurality of actions, wherein the episodic memory data for each action maps each of a respective plurality of key embeddings to a respective return estimate;



receiving, by the one or more computers, a current observation characterizing a current state of an environment being interacted with by an agent;


processing, by the one or more computers, the current observation using an embedding neural network in accordance with current values of parameters of the embedding neural network to generate a current key embedding for the current observation;

for each action of the plurality of actions:

determining, by one or more computers, the p nearest key embeddings in the episodic memory data for the action to the current key embedding according to a distance measure, and

determining, by one or more computers, a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action, wherein the Q value for the action is a predicted return that would result from the agent performing the action in response to the current observation;

selecting, by the one or more computers and using the Q values for the actions, an action from the plurality of actions as the action to be performed by the agent in response to the current observation;

determining a current return resulting from the agent performing the selected action in response to the current observation;

determining whether the current key embedding matches any of the key embeddings in the episodic memory data for the selected action; and

when the current key embedding matches a key embedding in the episodic memory data for the selected action, updating the episodic memory data to map the matching key embedding to a new return estimate that is computed based on the current return, the return estimate currently mapped to by the matching key embedding, and a learning rate,



wherein the learning rate of the updating of the episodic memory module is greater than an embedding neural network learning rate of updating current values of parameters of the embedding neural network during joint training of the embedding neural network and the episodic memory module.

** Dependent claim 12 **
12. The method of claim 1, further comprising:

during the joint training of the embedding neural network and the episodic memory module:

sampling a training tuple from a replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

determining a Q value for the training selected action from the training observation; and

backpropagating a gradient of an error between the Q value for the training selected action and the training return to update the key embeddings, the estimated returns, and the current values of the parameters of the embedding neural network using the embedding neural network learning rate.
Instant Claim 22

Claim 1
The method of claim 21, wherein determining the Q value for the training selected action comprises:

determining the p nearest key embeddings in the episodic memory module for the training selected action to the training key embedding according to a distance measure, and

determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action.


* * *

determining, by one or more computers, a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action, 

wherein the Q value for the action is a predicted return that would result from the agent performing the action in response to the current observation;

* * *
Instant Claim 23

Claim 4
The method of claim 22, wherein determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action comprises:

determining a respective weight for each of the p nearest key embeddings in the episodic memory module for the training selected action from distances between the p nearest key embeddings and the training key embedding according to the distance measure; and

for each of the p nearest key embeddings in the episodic memory module for the training selected action, weighting the return estimate mapped to the training key embedding by the weight for the training key embedding to determine a respective weighted estimated return.
The method of claim 1, wherein determining a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action comprises:



determining a respective weight for each of the p nearest key embeddings in the episodic memory data for the action from distances between the p nearest key embeddings and the current key embedding according to the distance measure; and

for each of the p nearest key embeddings in the episodic memory data for the action, weighting the estimated return mapped to the key embedding by the weight for the key embedding to determine a respective weighted estimated return.

Instant Claim 24

Claim 5
The method of claim 23, wherein determining the Q value for the training selected action comprises:

summing the weighted return estimates for the training selected action; and

using the summed weighted return estimate as the Q value.
The method of claim 4, wherein determining the Q value for the action comprises:

summing the weighted estimated returns for the action; and

using the summed weighted estimated return as the Q value.
Instant Claim 25

Claim 6
The method of claim 23, wherein determining the Q value for the training selected action comprises:

summing the weighted return estimates for the training selected action; and

processing a network input that comprises the summed weighted return estimate through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
6. The method of claim 4, wherein determining the Q value for the action comprises:

summing the weighted estimated returns for the action; and

processing a network input that comprises the summed weighted estimated return through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.

Instant Claim 26

Claim 7
The method of claim 23, wherein determining the Q value for the training selected action comprises:

processing a network input that comprises the weighted return estimates through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
7. The method of claim 4, wherein determining the Q value for the action comprises:

processing a network input that comprises the weighted estimated returns through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.

Instant Claim 27

Claim 9
The method of claim 21, wherein the embedding neural network is a convolutional neural network.
The method of claim 1, wherein the embedding neural network is a convolutional neural network.
Instant Claim 28

Claim 10
The method of claim 21, wherein the embedding neural network comprises one or more recurrent neural network layers.
The method of claim 1, wherein the embedding neural network comprises one or more recurrent neural network layers.
Instant Claim 29

Claim 11
The method of claim 21, further comprising:

when the training key embedding does not match any of the key embedding in the episodic memory module for the training selected action, 

adding data mapping the current key embedding to the training return to the episodic memory module for the training selected action.
The method of claim 1, further comprising:

when the current key embedding does not match any of the key embeddings in the episodic memory data for the selected action,

adding data mapping the current key embedding to the current return to the episodic memory data for the selected action.
Instant Claim 30
Claim 1

One or more non-transitory computer storage media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training a neural episodic controller that comprises an embedding neural network, the neural episodic controller maintaining episodic memory data that comprises a respective episodic memory module for each action of a plurality of actions that may be performed by an agent in response to an observation, the episodic memory module for each action mapping each of a respective plurality of key embeddings to a respective return estimate, the operations comprising:

sampling, by one or more computers, a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

processing the training observation using the embedding neural network in accordance with current values of the parameters of the embedding neural network to generate a training key embedding for the training observation;

determining whether the training key embedding associated with the training observation matches any of the key embeddings in the episodic memory module for the training selected action;

when the training key embedding matches a key embedding in the episodic memory module for the training selected action, updating the episodic memory module to map the matching key embedding to a new return estimate that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) an episodic memory learning rate;

determining, by the one or more computers, a Q value for the training selected action from the training observation, wherein the Q value for the selected action is a predicted return that would result from the agent performing the training selected action in response to the training observation; and

backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate.
1. A method comprising:

maintaining, by one or more computers, respective episodic memory data for each action of a plurality of actions, wherein the episodic memory data for each action maps each of a respective plurality of key embeddings to a respective return estimate;



receiving, by the one or more computers, a current observation characterizing a current state of an environment being interacted with by an agent;


processing, by the one or more computers, the current observation using an embedding neural network in accordance with current values of parameters of the embedding neural network to generate a current key embedding for the current observation;

for each action of the plurality of actions:

determining, by one or more computers, the p nearest key embeddings in the episodic memory data for the action to the current key embedding according to a distance measure, and

determining, by one or more computers, a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action, wherein the Q value for the action is a predicted return that would result from the agent performing the action in response to the current observation;

selecting, by the one or more computers and using the Q values for the actions, an action from the plurality of actions as the action to be performed by the agent in response to the current observation;

determining a current return resulting from the agent performing the selected action in response to the current observation;

determining whether the current key embedding matches any of the key embeddings in the episodic memory data for the selected action; and

when the current key embedding matches a key embedding in the episodic memory data for the selected action, updating the episodic memory data to map the matching key embedding to a new return estimate that is computed based on the current return, the return estimate currently mapped to by the matching key embedding, and a learning rate,



wherein the learning rate of the updating of the episodic memory module is greater than an embedding neural network learning rate of updating current values of parameters of the embedding neural network during joint training of the embedding neural network and the episodic memory module.

** Dependent claim 12 **
12. The method of claim 1, further comprising:

during the joint training of the embedding neural network and the episodic memory module:

sampling a training tuple from a replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

determining a Q value for the training selected action from the training observation; and

backpropagating a gradient of an error between the Q value for the training selected action and the training return to update the key embeddings, the estimated returns, and the current values of the parameters of the embedding neural network using the embedding neural network learning rate.
Instant Claim 31

Claim 1
The one or more non-transitory computer storage media of claim 30, wherein the operations for determining the Q value for the training selected action comprise:

determining the p nearest key embeddings in the episodic memory module for the training selected action to the training key embedding according to a distance measure, and

determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action.



* * *

determining, by one or more computers, a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action, 

wherein the Q value for the action is a predicted return that would result from the agent performing the action in response to the current observation;

* * *
Instant Claim 32

Claim 4
The one or more non-transitory computer storage media of claim 31, wherein the operations for determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action comprise:

determining a respective weight for each of the p nearest key embeddings in the episodic memory module for the training selected action from distances between the p nearest key embeddings and the training key embedding according to the distance measure; and

for each of the p nearest key embeddings in the episodic memory module for the training selected action, weighting the return estimate mapped to the training key embedding by the weight for the training key embedding to determine a respective weighted estimated return.
The method of claim 1, wherein determining a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action comprises:




determining a respective weight for each of the p nearest key embeddings in the episodic memory data for the action from distances between the p nearest key embeddings and the current key embedding according to the distance measure; and

for each of the p nearest key embeddings in the episodic memory data for the action, weighting the estimated return mapped to the key embedding by the weight for the key embedding to determine a respective weighted estimated return.

Instant Claim 33

Claim 5
The one or more non-transitory computer storage media of claim 32, wherein the operations for determining the Q value for the training selected action comprise:
summing the weighted return estimates for the training selected action; and
using the summed weighted return estimate as the Q value.
The method of claim 4, wherein determining the Q value for the action comprises:

summing the weighted estimated returns for the action; and

using the summed weighted estimated return as the Q value.
Instant Claim 34

Claim 6
The one or more non-transitory computer storage media of claim 32, wherein the operations for determining the Q value for the training selected action comprise:

summing the weighted return estimates for the training selected action; and

processing a network input that comprises the summed weighted return estimate through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
The method of claim 4, wherein determining the Q value for the action comprises:


summing the weighted estimated returns for the action; and

processing a network input that comprises the summed weighted estimated return through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.

Instant Claim 35

Claim 7
The one or more non-transitory computer storage media of claim 32, wherein the operations for determining the Q value for the training selected action comprise:

processing a network input that comprises the weighted return estimates through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
7. The method of claim 4, wherein determining the Q value for the action comprises:


processing a network input that comprises the weighted estimated returns through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.

Instant Claim 36

Claim 9
The one or more non-transitory computer storage media of claim 30, wherein the embedding neural network is a convolutional neural network.
The method of claim 1, wherein the embedding neural network is a convolutional neural network.
Instant Claim 37

Claim 10
The one or more non-transitory computer storage media of claim 30, wherein the embedding neural network comprises one or more recurrent neural network layers.
The method of claim 1, wherein the embedding neural network comprises one or more recurrent neural network layers.
Instant Claim 38

Claim 11
The one or more non-transitory computer storage media of claim 30, wherein the operations further comprise:

when the training key embedding does not match any of the key embedding in the episodic memory module for the training selected action, 

adding data mapping the current key embedding to the training return to the episodic memory module for the training selected action.
The method of claim 1, further comprising:


when the current key embedding does not match any of the key embeddings in the episodic memory data for the selected action,

adding data mapping the current key embedding to the current return to the episodic memory data for the selected action.
Instant Claim 39

Claim 1
A system comprising one or more computers and one or more non-transitory computer storage media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training a neural episodic controller that comprises an embedding neural network, the neural episodic controller maintaining episodic memory data that comprises a respective episodic memory module for each action of a plurality of actions that may be performed by an agent in response to an observation, the episodic memory module for each action mapping each of a respective plurality of key embeddings to a respective return estimate, the operations comprising:

sampling, by one or more computers, a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

processing the training observation using the embedding neural network in accordance with current values of the parameters of the embedding neural network to generate a training key embedding for the training observation;

determining whether the training key embedding associated with the training observation matches any of the key embeddings in the episodic memory module for the training selected action;

when the training key embedding matches a key embedding in the episodic memory module for the training selected action, updating the episodic memory module to map the matching key embedding to a new return estimate that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) an episodic memory learning rate;

determining, by the one or more computers, a Q value for the training selected action from the training observation, wherein the Q value for the selected action is a predicted return that would result from the agent performing the training selected action in response to the training observation; and

backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate.
1. A method comprising:

maintaining, by one or more computers, respective episodic memory data for each action of a plurality of actions, wherein the episodic memory data for each action maps each of a respective plurality of key embeddings to a respective return estimate;



receiving, by the one or more computers, a current observation characterizing a current state of an environment being interacted with by an agent;


processing, by the one or more computers, the current observation using an embedding neural network in accordance with current values of parameters of the embedding neural network to generate a current key embedding for the current observation;

for each action of the plurality of actions:

determining, by one or more computers, the p nearest key embeddings in the episodic memory data for the action to the current key embedding according to a distance measure, and

determining, by one or more computers, a Q value for the action from the return estimates mapped to by the p nearest key embeddings in the episodic memory data for the action, wherein the Q value for the action is a predicted return that would result from the agent performing the action in response to the current observation;

selecting, by the one or more computers and using the Q values for the actions, an action from the plurality of actions as the action to be performed by the agent in response to the current observation;

determining a current return resulting from the agent performing the selected action in response to the current observation;

determining whether the current key embedding matches any of the key embeddings in the episodic memory data for the selected action; and

when the current key embedding matches a key embedding in the episodic memory data for the selected action, updating the episodic memory data to map the matching key embedding to a new return estimate that is computed based on the current return, the return estimate currently mapped to by the matching key embedding, and a learning rate,



wherein the learning rate of the updating of the episodic memory module is greater than an embedding neural network learning rate of updating current values of parameters of the embedding neural network during joint training of the embedding neural network and the episodic memory module.

** Dependent claim 12 **
12. The method of claim 1, further comprising:

during the joint training of the embedding neural network and the episodic memory module:

sampling a training tuple from a replay memory, the training tuple comprising a training observation, a training selected action, and a training return;

determining a Q value for the training selected action from the training observation; and

backpropagating a gradient of an error between the Q value for the training selected action and the training return to update the key embeddings, the estimated returns, and the current values of the parameters of the embedding neural network using the embedding neural network learning rate.
Instant Claim 40

Claim 11
The system of claim 39, wherein the operations further comprise:


when the training key embedding does not match any of the key embedding in the episodic memory module for the training selected action, 

adding data mapping the current key embedding to the training return to the episodic memory module for the training selected action.
The method of claim 1, further comprising:


when the current key embedding does not match any of the key embeddings in the episodic memory data for the selected action,

adding data mapping the current key embedding to the current return to the episodic memory data for the selected action.
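
For illustration only, the per-action episodic memory operations compared in the table above (writing a return estimate to a matching or new key embedding, and reading a Q value from the p nearest key embeddings) can be sketched in Python. The sketch is not taken from the instant application or from US Patent 10664753 and is not a construction of any claim. The names EpisodicMemoryModule, memory_lr, match_tol, and delta are hypothetical, and the squared Euclidean distance, the inverse-distance weighting, and the update rule new = old + memory_lr * (training_return - old) are assumptions; the claims recite only that the new estimate is computed based on the training return, the current estimate, and a learning rate. The embedding neural network and the backpropagation step are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class EpisodicMemoryModule:
    """Hypothetical memory for ONE action: key embeddings -> return estimates."""

    def __init__(self, memory_lr=0.5, match_tol=1e-9):
        self.keys = []               # stored key embeddings
        self.returns = []            # return estimate mapped to by each key
        self.memory_lr = memory_lr   # assumed episodic memory learning rate
        self.match_tol = match_tol   # assumed tolerance for a "match"

    def write(self, key, training_return):
        """Update the matching entry, or add a new entry if none matches."""
        for i, stored in enumerate(self.keys):
            if squared_distance(stored, key) <= self.match_tol:
                # Assumed update: move the old estimate toward the new return.
                self.returns[i] += self.memory_lr * (training_return - self.returns[i])
                return
        self.keys.append(list(key))
        self.returns.append(float(training_return))

    def q_value(self, key, p=2, delta=1e-3):
        """Weighted sum of the return estimates of the p nearest keys."""
        if not self.keys:
            raise ValueError("memory module is empty")
        order = sorted(range(len(self.keys)),
                       key=lambda i: squared_distance(self.keys[i], key))
        nearest = order[:p]
        # Assumed inverse-distance kernel; delta avoids division by zero.
        kernels = [1.0 / (squared_distance(self.keys[i], key) + delta)
                   for i in nearest]
        total = sum(kernels)
        return sum(k / total * self.returns[i] for k, i in zip(kernels, nearest))


memory = EpisodicMemoryModule(memory_lr=0.5)
memory.write([0.0, 0.0], 1.0)   # no match: entry added
memory.write([1.0, 0.0], 3.0)   # no match: entry added
memory.write([0.0, 0.0], 2.0)   # match: estimate moves from 1.0 to 1.5
q = memory.q_value([0.5, 0.0], p=2)  # query equidistant from both keys
```

In this sketch the query is equidistant from the two stored keys, so both weights are equal and the Q value is the plain average of the two return estimates.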


Specification
6.	The disclosure is objected to because of the following informalities:
	Paragraph 0009, “form the predetermined set of actions” should read --from the predetermined set of actions--.
	Paragraph 0036, “reply memory 114” should read --replay memory 114--.
	Paragraph 0049, in two places, “reply memory 114” should read --replay memory 114--.
	Paragraph 0070, “reply memory” should read --replay memory--.
Appropriate correction is required.
Claim Rejections - 35 U.S.C. § 103 
7.	The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
8.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. § 103 are summarized as follows:
1. 	Determining the scope and contents of the prior art.
2. 	Ascertaining the differences between the prior art and the claims at issue.
3. 	Resolving the level of ordinary skill in the pertinent art.
4. 	Considering objective evidence present in the application indicating obviousness or nonobviousness.
9.	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
10.	Claims 21-23, 27, 28, 30-32, 36, 37, and 39 are rejected under 35 U.S.C. § 103 as being unpatentable over Oh et al., “Control of Memory, Active Perception, and Action in Minecraft,” International Conference on Machine Learning (2016) [hereinafter Oh] in view of Scott Ellison Reed, “Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis” (Thesis) [hereinafter Reed] and Paul J. Werbos, “Backpropagation Through Time: What it Does and How to Do It,” IEEE (1990) [hereinafter Werbos].
Regarding claims 21, 30, and 39, Oh teaches [a] computer-implemented method (Oh teaches [o]ur implementation is based on Torch7 (Collobert et al., 2011), a public DQN implementation (Mnih et al., 2015), and a Minecraft Forge Mod) for training a neural episodic controller that comprises an embedding neural network, the neural episodic controller maintaining episodic memory data that comprises a respective episodic memory module for each action of a plurality of actions that may be performed by an agent in response to an observation, the episodic memory module for each action mapping each of a respective plurality of key embeddings to a respective return estimate (Oh, Fig. 3(c)-(e) teaches (Examiner annotations in text boxes):

    [Figure: Oh, Fig. 3(c)-(e), with Examiner annotations in text boxes]

in which Oh, left column of p. 3, “4. Architectures,” second paragraph, teaches [d]epending on how the context vector is constructed, we obtain three new architectures: Memory Q-Network (MQN), Recurrent Memory Q-Network (RMQN), and Feedback Recurrent Memory Q-Network (FRMQN)), [o]ne or more non-transitory computer storage media (Oh, left column of p. 2, “1. Introduction,” first full paragraph, teaches [o]ur proposed architectures store recent observations into their memory and retrieve relevant memory based on the temporal context), and [a] system (Oh, left column of p. 1, “1. Introduction,” first paragraph, teaches the Deep Q-Network (DQN) . . . architecture has been shown to successfully learn to play many Atari 2600 games in the Arcade Learning Environment (ALE) benchmark . . . by learning visual features useful for control directly from raw pixels using Q-Learning), . . . comprising:
sampling, by one or more computers, a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return (Oh, left column of p. 3, “3. Background: Deep Q-Learning,” first paragraph, teaches [denoting] the state, immediate reward, and action at time t as s_t, r_t, a_t, respectively (that is, a training tuple). In the DQN framework, every transition T_t = (s_t, s_{t+1}, a_t, r_t) is stored in a replay memory (that is, sampling . . . a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return));
processing the training observation using the embedding neural network in accordance with current values of parameters of the embedding neural network to generate a training key embedding for the training observation (Oh at page 3, left column, “4. Architectures,” first and second full paragraphs, & FIG. 3, teaches [t]he importance of retrieving a prior observation (that is, the training observation) from memory depends on the current context. . . . Our proposed architectures (Figure 3c-e) consist of convolutional networks (that is, embedding neural network) for extracting high-level features (that is, processing) from images (that is, the training observation) . . . , and a context vector (that is, to generate a training key embedding) used both for memory retrieval and . . . action-value estimation . . . .);
* * *
determining, by the one or more computers, a Q value for the training selected action from the training observation (Oh at page 4, left column, “4.3 Context,” second full paragraph, & FIG. 3, teaches the architectures [of Figure 3] estimate action-values (that is, determining . . . a Q-value for the training selected action from the training observation) by incorporating the retrieved memory and context vector . . . .), wherein the Q value for the selected action is a predicted return that would result from the agent performing the training selected action in response to the training observation (Oh at page 4, left column, “4.3 Context,” second full paragraph, & FIG. 3, teaches q_t = φ^q(h_t, o_t), where q_t ∈ ℝ^a is the estimated action-value (that is, predicted return that would result from the agent performing the action in response to the current observation), and φ^q is a multi-layer perceptron . . . taking two inputs); and
* * *
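For illustration only, the replay-memory sampling limitation discussed above can be sketched as follows. The class name `ReplayMemory`, the tuple layout `TrainingTuple`, the fixed capacity, and the use of uniform sampling are assumptions made for this sketch; none of them is drawn from the claims or from Oh.

```python
import random
from collections import deque, namedtuple

# Hypothetical layout mirroring the recited training tuple.
TrainingTuple = namedtuple(
    "TrainingTuple", ["observation", "action", "training_return"]
)


class ReplayMemory:
    """Fixed-capacity buffer of training tuples (illustrative only)."""

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest tuple once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, observation, action, training_return):
        self._buffer.append(TrainingTuple(observation, action, training_return))

    def sample(self):
        # Uniform sampling is an assumption; the claim only recites "sampling".
        return self._rng.choice(self._buffer)

    def __len__(self):
        return len(self._buffer)


memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    memory.store(observation=[float(step)], action=step % 2,
                 training_return=0.5 * step)
sampled = memory.sample()
```

In this sketch only the three most recent tuples remain available for sampling, because the buffer capacity is three.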
Though Oh teaches the feature of new memory-based deep reinforcement learning architectures, Oh does not explicitly teach -
* * *
determining whether the training key embedding associated with the training observation matches any of the key embeddings in the episodic memory module for the training selected action;
when the training key embedding matches a key embedding in the episodic memory module for the training selected action, updating the episodic memory module to map the matching key embedding to a new return estimate that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) an episodic memory learning rate;
* * *
But Reed teaches -
* * *
determining whether the training key embedding associated with the training observation matches any of the key embeddings in the episodic memory module for the training selected action (Reed, Fig. 5.2, teaches:

    [Figure: Reed, Fig. 5.2]

Reed, at p. 45, “5.1 Introduction,” second paragraph, teaches a compositional architecture that learns to represent and interpret programs. We refer to this architecture as the Neural Programmer-Interpreter (NPI). The core module is an LSTM-based sequence model that takes as input a learnable program embedding, program arguments passed on by the calling program, and a feature representation of the environment. . . . In addition to the recurrent core, the NPI architecture includes a learnable key-value memory of program embeddings (that is, training key embedding associated with training observation). This program-memory is essential for learning and re-using programs in a continual manner. Figures 5.1 and 5.2 illustrate the NPI on two different tasks);
when the training key embedding matches a key embedding in the episodic memory module for the training selected action (Reed, at p. 51, “5.3.1 Inference,” first full paragraph, teaches [t]he feed-forward steps of program inference are summarized [by equations 5.1, 5.2, and 5.3], where rt, kt and at+1 correspond (that is, match) to the end-of-program probability, program key embedding, and output arguments at time t, respectively), updating the episodic memory module to map the matching key embedding to a new return estimate (Reed at p. 60, “5.4.3 Learning new programs with a fixed core,” first & second paragraphs, teaches [o]ne challenge for continual learning of neural-network-based agents is that training on new tasks and experiences can lead to degraded performance in old tasks. . . . When adding a new program the core module’s routing computation will be completely unaffected; all the learning for a new task occurs in program embedding space. Of course, the addition of new programs to the memory adds a new choice of program at each time step, and an old program could mistakenly call a newly added program. To overcome this, when learning a new set of program vectors with a fixed core, in practice we train not only on example traces of the new program, but also traces of existing programs. Alternatively, a simpler approach is to prevent existing programs from calling subsequently added programs, allowing addition of new programs without ever looking back at training data for known programs. 
In either case, note that only the memory slots of the new programs are updated, and all other weights, including other program embeddings, are fixed (that is, when the training key embedding matches a key embedding in the episodic memory module for the training selected action, updating the episodic memory module to map the matching key embedding to a new return estimate)) that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) an episodic memory learning rate (Reed, at p. 52, “5.4 Experiments,” first paragraph, teaches [w]e trained the NPI model and all program embeddings jointly using RMSprop with base learning rate 0.0001, batch size 1, and decayed the learning rate by a factor of 0.95 every 10,000 steps [Examiner note - for an update to episodic memory to occur, a match, such as a key match, functions to identify the memory to update]);
* * *
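For illustration only, the match-and-update limitation addressed above can be sketched as follows. The claim recites only that the new return estimate is computed from the training return, the currently stored return estimate, and an episodic memory learning rate; the interpolation form used here, the tolerance-based notion of a "match", and the name `update_on_match` are assumptions of this sketch and are not drawn from the claims or from Reed.

```python
import numpy as np


def update_on_match(keys, returns, training_key, training_return,
                    memory_lr, tol=1e-6):
    """Update the stored return for a matching key; return its index or None.

    keys:    (n, d) array of key embeddings for one action (hypothetical layout).
    returns: (n,) array of return estimates, modified in place on a match.
    """
    if len(keys) == 0:
        return None
    distances = np.linalg.norm(keys - training_key, axis=1)
    i = int(np.argmin(distances))
    if distances[i] > tol:
        return None  # no stored key matches the training key embedding
    # Assumed update form: move the stored estimate toward the training return.
    returns[i] = returns[i] + memory_lr * (training_return - returns[i])
    return i


keys = np.array([[0.0, 1.0], [1.0, 0.0]])
returns = np.array([2.0, 4.0])
matched = update_on_match(keys, returns, np.array([1.0, 0.0]),
                          training_return=8.0, memory_lr=0.5)
missed = update_on_match(keys, returns, np.array([3.0, 3.0]),
                         training_return=8.0, memory_lr=0.5)
```

With a learning rate of 0.5, the matched entry moves halfway from its stored estimate (4.0) toward the training return (8.0); the non-matching key leaves the memory unchanged.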
Oh and Reed are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of Oh pertaining to memory-based deep reinforcement learning with the key embedding of Reed.
The motivation for doing so is to train a reinforcement learning model on execution traces instead of input and output pairs, which, in exchange for this richer supervision, improves data efficiency on several problems. (Reed, at p. 48, “5.2 Related work,” second full paragraph).
Though Oh and Reed teach the features of memory-based deep reinforcement learning with training reinforcement learning models on execution traces, the combination of Oh and Reed, however, does not explicitly teach -
* * *
and backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate.
But Werbos teaches -
and backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate (Werbos, left column of p. 1559, “IV.D. Speeding Up Convergence,” second paragraph, teaches [i]n using backpropagation through time, we usually need to use much smaller learning rates than we do in basic back propagation (that is, because the learning rate is much smaller, this is backpropagating . . . to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate)).
Oh, Reed, and Werbos are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Werbos teaches backpropagation through time (BPTT) having a reduced learning rate as applied to neural network training. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of the combination of Oh and Reed pertaining to memory-based deep reinforcement learning using key embedding to a memory with the BPTT of Werbos.
The motivation for doing so is that BPTT extends the method of backpropagation, with its characteristics of efficiency and exactness, to dynamic systems to maximize performance over time. (Werbos, right column of p. 1550, “I. Introduction,” last paragraph).
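For illustration only, the backpropagation limitation addressed above can be sketched with a single gradient step. The embedding neural network is replaced here by a linear map, the memory lookup by a fixed readout vector, and the error by a squared difference; these simplifications, the two learning-rate values, and the name `embedding_gradient_step` are assumptions of this sketch and are not drawn from the claims or from Werbos.

```python
import numpy as np

MEMORY_LR = 0.5      # hypothetical episodic memory learning rate
EMBEDDING_LR = 0.01  # hypothetical, deliberately smaller embedding learning rate


def embedding_gradient_step(weights, readout, observation, training_return, lr):
    """One gradient step on a linear stand-in for the embedding network.

    Q = readout . (weights @ observation)
    loss = 0.5 * (Q - training_return) ** 2
    """
    q_value = float(readout @ (weights @ observation))
    error = q_value - training_return
    # Chain rule: d(loss)/d(weights) = error * outer(readout, observation).
    gradient = error * np.outer(readout, observation)
    return weights - lr * gradient, q_value


weights = np.eye(2)
readout = np.array([1.0, -1.0])     # fixed stand-in for the memory lookup
observation = np.array([1.0, 2.0])
new_weights, q_before = embedding_gradient_step(
    weights, readout, observation, training_return=1.0, lr=EMBEDDING_LR
)
q_after = float(readout @ (new_weights @ observation))
```

In this sketch the Q value moves from -1.0 to -0.8, that is, toward the training return of 1.0, by a small increment governed by the smaller embedding learning rate.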
	Examiner notes that the Applicant’s preambles do not afford patentable weight to the Applicant’s claims because the claim preamble is not “necessary to give life, meaning, and vitality” to the claim. Moreover, because the Applicant’s preamble merely states the purpose or intended use of the invention rather than any distinct definition of any of the claimed invention’s limitations, the preamble is not considered a limitation and is of no significance to claim construction.
Regarding claims 22 and 31, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 21 and 30, respectively, as described above in detail.
Oh teaches -
wherein determining the Q value for the training selected action comprises:
determining the p nearest key embeddings in the episodic memory module for the training selected action to the training key embedding (Oh at page 3, right column, Section 4.2, third full paragraph, & FIG. 2b, teaches [t]he reading mechanism of the memory is based on soft attention . . . . Given a context vector ht ∈ ℝm (current key embedding) . . . , the memory module draws (determining) soft attention over memory locations (the nearest key embeddings in the episodic memory data for the action)) . . . , and
determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action (Oh at page 4, left column, Section 4.3, second full paragraph, & FIG. 3, teaches the architectures [of Figure 3] estimate action-values (determining . . . a Q-value for the action)) by incorporating the retrieved memory (the episodic memory data for the action mapped to by the p nearest key embeddings) and context vector . . . .).
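For illustration only, a soft-attention memory read of the kind the Examiner relies upon in Oh can be sketched as follows. The softmax over inner products between the context vector and the key memory is one standard form of soft attention and is assumed here; the function name `soft_attention_read` and the array layouts are hypothetical.

```python
import numpy as np


def soft_attention_read(context, key_memory, value_memory):
    """Return attention weights over memory slots and the retrieved vector.

    context:      (m,) context vector.
    key_memory:   (slots, m) array, one key per memory slot.
    value_memory: (slots, e) array, one value block per memory slot.
    """
    scores = key_memory @ context
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()
    # Linear sum of the value memory blocks, weighted by attention.
    retrieved = weights @ value_memory
    return weights, retrieved


key_memory = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
value_memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context = np.array([2.0, 0.0])
weights, retrieved = soft_attention_read(context, key_memory, value_memory)
```

The slot whose key is most aligned with the context vector (the first slot in this sketch) receives the largest attention weight.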
Regarding claims 23 and 32, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 22 and 31, respectively, as described above in detail.
Oh teaches -
wherein determining the Q value for the training selected action from the return estimates mapped to by the p nearest key embeddings in the episodic memory module for the training selected action comprises:
determining a respective weight for each of the p nearest key embeddings in the episodic memory module for the training selected action from distances between the p nearest key embeddings and the training key embedding according to the distance measure (Oh at page 7, left column, Section 5.2, second full paragraph, teaches a FRMQN [architecture] pays attention to both rooms, gradually moving weight from one [room] to the other as time progresses (according to the distance measure), which means that the context vector is repeatedly refined based on the encoded features of the room retrieved through its feedback connections (determining a respective weight for each of the p nearest key embeddings in the episodic memory data for the action from distances between the p nearest key embeddings and the current key embedding according to the distance measure)); and
for each of the p nearest key embeddings in the episodic memory module for the training selected action, weighting the return estimate mapped to the training key embedding by the weight for the training key embedding to determine a respective weighted estimated return (Oh at page 3, right column, Section 4.2, third paragraph, & equation (4) teaches where pt,i ∈ ℝ is attention weight (weighting the estimated return mapped to the key embedding by the weight for the key embedding to determine a respective weighted estimated return) for the i-th memory block (t-i time-step) (for each of the p nearest key embeddings in the episodic memory data for the action)).
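For illustration only, the limitations of claims 22 and 23 (selecting the p nearest key embeddings, weighting each by distance, and weighting the mapped return estimates) can be sketched as follows. The squared Euclidean distance, the inverse-distance kernel with smoothing constant `delta`, the normalization of the weights, and the name `q_from_nearest` are assumptions of this sketch; the claims recite only that the weights are determined from distances according to a distance measure.

```python
import numpy as np


def q_from_nearest(keys, returns, query, p, delta=1e-3):
    """Compute a Q value from the p stored keys nearest to the query.

    Returns (indices of the p nearest keys, their weights, the Q value).
    """
    distances_sq = np.sum((keys - query) ** 2, axis=1)
    nearest = np.argsort(distances_sq)[:p]
    # Assumed kernel: closer keys receive larger weights.
    kernel = 1.0 / (distances_sq[nearest] + delta)
    weights = kernel / kernel.sum()
    weighted_returns = weights * returns[nearest]
    return nearest, weights, float(weighted_returns.sum())


keys = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
returns = np.array([1.0, 3.0, 100.0])
nearest, weights, q_value = q_from_nearest(
    keys, returns, query=np.array([0.0, 0.0]), p=2, delta=1.0
)
```

With p = 2, the distant third key is excluded; the two remaining keys receive weights of 2/3 and 1/3, giving a Q value of 5/3 in this sketch.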
Regarding claims 27 and 36, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 21 and 30, respectively, as described above in detail. 
Oh teaches -
wherein the embedding neural network is a convolutional neural network (Oh at page 3, right column, Section 4.1, first full paragraph, teaches that we use a CNN to encode the observation (that is, the embedding neural network is a convolutional neural network)).
Regarding claims 28 and 37, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 21 and 30, respectively, as described above in detail.
Oh teaches -
wherein the embedding neural network comprises one or more recurrent neural network layers (Oh at page 5, left column, Section 5, first partial paragraph, teaches DRQN and our architectures can take arbitrary number of input frames using their recurrent layers (that is, the embedding neural network comprises one or more recurrent neural network layers)).
11.	Claims 24 and 33 are rejected under 35 U.S.C. § 103 as being unpatentable over Oh et al., “Control of Memory, Active Perception, and Action in Minecraft,” International Conference on Machine Learning (2016) [hereinafter Oh] in view of Scott Ellison Reed, “Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis” (Thesis) [hereinafter Reed] and Paul J. Werbos, “Backpropagation Through Time: What it Does and How to Do It,” IEEE (1990) [hereinafter Werbos], and further in view of US Published Application 20180165603 to Van Seijen et al. [hereinafter Van Seijen].
Regarding claims 24 and 33, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 23 and 32, respectively, as described above in detail.
Oh teaches -
wherein determining the Q value for the training selected action comprises:
summing the weighted return estimates for the training selected action (Oh at page 3, right column, Section 4.2, third full paragraph, & Fig. 2b, teaches [t]he output of the read operation is the linear sum (summing) of the value memory blocks based on the attention weights (summing the weighted estimated returns for the action)); and
* * *
Though Oh, Reed, and Werbos teach the features of a memory-based deep reinforcement learning using key embedding to a memory with BPTT, the combination of Oh, Reed, and Werbos, however, does not explicitly teach -
* * *
and using the summed weighted estimated return as the Q value.
But Van Seijen teaches -
and using the summed weighted return estimate as the Q value (Van Seijen ¶ 0073 teaches [a] goal (Q value) [of reinforcement learning] is to maximize the discounted sum of rewards, also referred to as the return (using the summed weighted estimated return) . . . .).
Oh, Reed, Werbos, and Van Seijen are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Werbos teaches backpropagation through time (BPTT) having a reduced learning rate as applied to neural network training. Van Seijen teaches a reinforcement learning framework with weighting of several learners on a global task. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of the combination of Oh, Reed, and Werbos pertaining to memory-based deep reinforcement learning using key embedding to a memory with the BPTT of Werbos and further with the weighting of Van Seijen.
The motivation for doing so is to improve reinforcement learning by causing the overall value function to be much smoother and more easily approximated to enable more effective learning. (Van Seijen ¶ 0010).
12.	Claims 25, 26, 34, and 35 are rejected under 35 U.S.C. § 103 as being unpatentable over Oh et al., “Control of Memory, Active Perception, and Action in Minecraft,” International Conference on Machine Learning (2016) [hereinafter Oh] in view of Scott Ellison Reed, “Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis” (Thesis) [hereinafter Reed] and Paul J. Werbos, “Backpropagation Through Time: What it Does and How to Do It,” IEEE (1990) [hereinafter Werbos], and further in view of US Published Application 20150100530 to Mnih et al. [hereinafter Mnih].
Regarding claims 25 and 34, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 23 and 32, respectively, as described above in detail.
Oh teaches -
wherein determining the Q value for the training selected action comprises:
summing the weighted return estimates for the training selected action (Oh at page 3, right column, Section 4.2, third full paragraph, & Fig. 2b, teaches [t]he output of the read operation is the linear sum (summing) of the value memory blocks based on the attention weights (summing the weighted estimated returns for the action)); and
* * *
Though Oh, Reed, and Werbos teach the features of a memory-based deep reinforcement learning using key embedding to a memory with BPTT, the combination of Oh, Reed, and Werbos, however, does not explicitly teach -
* * *
processing a network input that comprises the summed weighted return estimate through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
But Mnih teaches -
* * *
processing a network input that comprises the summed weighted return estimate through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value (Mnih ¶ 0061 teaches [t]he procedure employs first (that is, embedding neural network) and second (that is, return neural network) neural networks . . . , each of which ends up being trained through implementation of the procedure (that is, processing a network input that comprises the summed weighted estimated return) to provide action-value parameters, more particularly Q-values (that is, to generate the Q value), for each action or each definable input state).
Oh, Reed, Werbos, and Mnih are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Werbos teaches backpropagation through time (BPTT) having a reduced learning rate as applied to neural network training. Mnih teaches reinforcement learning based on arcade learning, in which an agent’s experiences are stored in a data-set as replay memory. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of the combination of Oh and Reed pertaining to memory-based deep reinforcement learning using key embedding to a memory with the BPTT of Werbos and the accessing of stored agent experiences via the first and second networks of Mnih.
The motivation for doing so is to improve techniques for reinforcement learning, in particular Q-learning, in particular when neural networks are employed. (Mnih ¶ 0009).
Regarding claims 26 and 35, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 23 and 32, respectively, as described above in detail.
Though Oh, Reed, and Werbos teach the features of a memory-based deep reinforcement learning using key embedding to a memory with BPTT, the combination of Oh, Reed, and Werbos, however, does not explicitly teach -
wherein determining the Q value for the training selected action comprises:
processing a network input that comprises the weighted return estimates through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value.
But Mnih teaches -
* * *
processing a network input that comprises the summed weighted return estimate through a return neural network in accordance with current values of parameters of the return neural network to generate the Q value (Mnih ¶ 0061 teaches [t]he procedure employs first (embedding neural network) and second (return neural network) neural networks . . . , each of which ends up being trained through implementation of the procedure (processing a network input that comprises the summed weighted estimated return) to provide action-value parameters, more particularly Q-values (to generate the Q value), for each action or each definable input state).
Oh, Reed, Werbos, and Mnih are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Werbos teaches backpropagation through time (BPTT) having a reduced learning rate as applied to neural network training. Mnih teaches reinforcement learning based on arcade learning, in which an agent’s experiences are stored in a data-set as replay memory. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of the combination of Oh and Reed pertaining to memory-based deep reinforcement learning using key embedding to a memory with the BPTT of Werbos and the accessing of stored agent experiences via the first and second networks of Mnih.
The motivation for doing so is to improve techniques for reinforcement learning, in particular Q-learning, in particular when neural networks are employed. (Mnih ¶ 0009).
13.	Claims 29, 38, and 40 are rejected under 35 U.S.C. § 103 as being unpatentable over Oh et al., “Control of Memory, Active Perception, and Action in Minecraft,” International Conference on Machine Learning (2016) [hereinafter Oh] in view of Scott Ellison Reed, “Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis” (Thesis) [hereinafter Reed] and Paul J. Werbos, “Backpropagation Through Time: What it Does and How to Do It,” IEEE (1990) [hereinafter Werbos], and further in view of US Published Application 20140237163 to Maharana [hereinafter Maharana].
Regarding claims 29, 38, and 40, the combination of Oh, Reed, and Werbos teaches all of the limitations of claims 21, 30, and 39, respectively, as described in detail above.
Though Oh, Reed, and Werbos teach the features of a memory-based deep reinforcement learning using key embedding to a memory with BPTT, the combination of Oh, Reed, and Werbos, however, does not explicitly teach -
when the training key embedding does not match any of the key embedding in the episodic memory module for the training selected action, adding data mapping the current key embedding to the training return to the episodic memory module for the training selected action.
But Maharana teaches -
when the training key embedding does not match any of the key embedding in the episodic memory module for the training selected action (Maharana ¶ 0004 teaches that if the hash key (current key embedding) does not match one of the hash values (when the current key embedding does not match any of the key embeddings in the episodic memory data for the action)), adding data mapping the current key embedding to the training return to the episodic memory module for the training selected action (Maharana ¶ 0004 teaches [t]he memory manager can write the received data to the cache memory (adding data mapping the current key embedding to the current return to the episodic memory data for the action)).
Oh, Reed, Werbos, and Maharana are from the same or similar field of endeavor. Oh teaches memory-based deep reinforcement learning architectures via arcade games that draw soft attention over memory locations. Reed teaches teaching machines to learn programs and to conditionally execute these programs automatically, including reinforcement learning policies. Werbos teaches backpropagation through time (BPTT) having a reduced learning rate as applied to neural network training. Mnih teaches reinforcement learning based on arcade learning, in which an agent’s experiences are stored in a data-set as replay memory. Maharana teaches determining whether to add or to update data to memory locations. Accordingly, it would have been obvious to one of ordinary skill in the art as of the effective filing date of Applicant’s invention to implement the teachings of the combination of Oh and Reed pertaining to memory-based deep reinforcement learning using key embedding to a memory with the BPTT of Werbos and the memory update determination of Maharana.
The motivation for doing so is memory efficiency: comparing memory indices, such as hash values, instead of entire data entries keeps the comparison process quick and efficient to meet the demands of throughput and latency. (Maharana ¶ 0025).
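For illustration only, the add-on-no-match limitation of claims 29, 38, and 40 can be sketched with a hash-keyed store. A Python dictionary stands in for the per-action episodic memory module, and a tuple stands in for the key embedding so that it can be hashed; both choices, and the name `write_on_miss`, are assumptions of this sketch and are not drawn from the claims or from Maharana.

```python
def write_on_miss(memory_for_action, key, training_return):
    """Add key -> training_return only when the key is not already stored.

    Returns True if a new entry was added, False if the key already matched.
    """
    # Dictionary membership compares hashes before comparing full keys.
    if key in memory_for_action:
        return False
    memory_for_action[key] = training_return
    return True


# Hypothetical memory for a single action, keyed by a hashable embedding.
memory_for_action = {(0.0, 1.0): 2.0}
added = write_on_miss(memory_for_action, (1.0, 0.0), 5.0)
added_again = write_on_miss(memory_for_action, (1.0, 0.0), 9.0)
```

The second call finds a matching key and therefore leaves the stored return estimate (5.0) untouched; updating a matching entry is a separate step under the claims.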
Conclusion
14.	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI, can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.L.S./
Examiner, Art Unit 2122
/BABOUCARR FAAL/Primary Examiner, Art Unit 2184