DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application, filed on 06/25/2018, claims benefit of provisional Application No. 62/524,183 filed on 06/23/2017. Claims 1-22 are pending and have been examined.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 07/26/2018. The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 1-22 are objected to because of the following informalities:  
Claim 1 line 11: “the one or more stored features” should be “.  Appropriate correction is required.
Claim 8 line 15: “the one or more stored features” should be “
Claim 16 line 12: “the one or more stored features” should be “
Each of the dependent claims is objected to based on the same rationale as the claim from which it depends.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites “the coordinates of the memory architecture” in lines 8-9; this limitation lacks clarity because it is unclear if the recited “the coordinates” refer to “coordinates corresponding to coordinates of the multi-dimensional environment” in lines 3-4 or to a new set of coordinates. For examination purposes, “the coordinates of the memory architecture” has been interpreted as referring to “coordinates corresponding to coordinates of the multi-dimensional environment”. 
Claim 1 recites “the coordinates in the memory architecture” in line 14 (emphasis added); this limitation lacks clarity because it is unclear if the recited “the coordinates in the memory architecture” refer to “the coordinates of the memory architecture” in lines 8-9 or to a new set of coordinates. For examination purposes, “the coordinates in the memory architecture” has been interpreted as referring to “the coordinates of the memory architecture”.
Claim 2 recites the limitation "the steps" in line 2.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the steps" has been interpreted as "steps".

Claim 5 recites “generating a new memory architecture comprising the data stored in the memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” (emphasis added); this limitation lacks clarity because it is unclear whether “one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” are stored in the new memory architecture in view of the recitation of “except with”. For examination purposes, “generating a new memory architecture comprising the data stored in the memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” has been interpreted as “generating a new memory architecture comprising the data stored in the memory architecture with the one or more candidate features that correspond to one or more stored features written to a plurality of corresponding coordinates of the new memory architecture”.
Claim 8 recites “the coordinates of the memory architecture” in line 12; this limitation lacks clarity because it is unclear if the recited “the coordinates” refer to “coordinates corresponding to coordinates of the multi-dimensional environment” in lines 5-6 or to a new set of coordinates. For examination purposes, “the coordinates of the memory architecture” has been interpreted as referring to “coordinates corresponding to coordinates of the multi-dimensional environment”. 
Claim 8 recites “the coordinates in the memory architecture” in line 17 (emphasis added); this limitation lacks clarity because it is unclear if the recited “the coordinates in the memory architecture” refer to “the coordinates of the memory architecture” in line 12 or to a new set of coordinates. For examination purposes, “the coordinates in the memory architecture” has been interpreted as referring to “the coordinates of the memory architecture”.
Claim 10 recites the limitation "the steps" in line 2.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the steps" has been interpreted as "steps".
Claim 13 recites the limitation "the corresponding coordinates" in line 7.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the corresponding coordinates" has been interpreted as "a plurality of corresponding coordinates".
Claim 13 recites “generate a new memory architecture comprising the data stored in the memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” (emphasis added); this limitation lacks clarity because it is unclear whether “one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” are stored in the new memory architecture in view of the recitation of “except with”. For examination purposes, “generate a new memory architecture comprising the data stored in the memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture” has been interpreted as “generate a new memory architecture comprising the data stored in the memory architecture with the one or more candidate features that correspond to one or more stored features written to a plurality of corresponding coordinates of the new memory architecture”.
Claim 16 recites the limitation "the transformed memory architecture" in line 10.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the transformed memory architecture" has been interpreted as "a transformed memory architecture".
Claim 16 recites “the center coordinates in the memory architecture” in line 15 (emphasis added); this limitation lacks clarity because it is unclear if the recited “the coordinates” refer to “center of the transformed memory architecture” in line 10 or to a new set of center coordinates. For examination purposes, “the center coordinates in the memory architecture” has been interpreted as referring to “center coordinates of the transformed memory architecture”.
Claim 17 recites the limitation "the steps" in line 2.  There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the steps" has been interpreted as "steps".
Claim 20 recites “generating a new memory architecture comprising the data stored in the transformed memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the center coordinates of the new memory architecture” (emphasis added); this limitation lacks clarity because it is unclear whether “one or more candidate features that correspond to one or more stored features written to the center coordinates of the new memory architecture” are stored in the new memory architecture in view of the recitation of “except with”. For examination purposes, “generating a new memory architecture comprising the data stored in the transformed memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the center coordinates of the new memory architecture” has been interpreted as “generating a new memory architecture comprising the data stored in the transformed memory architecture with the one or more candidate features that correspond to one or more stored features written to the center coordinates of the new memory architecture”.
The recitation of “relative movement of the agent in the multi-dimensional environment” (emphasis added) in claim 21 lacks clarity because neither the claim nor Specification establishes what is considered “relative movement” of the agent. For example, “movement” of the agent relative to what position or metric would be considered “relative movement”? Therefore, claim 21 is indefinite. For examination purposes, “relative movement of the agent in the multi-dimensional environment” has been interpreted as “ of the agent in the multi-dimensional environment”.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 5, 7-10, 13, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Dulac-Arnold et al. (US 10,885,432 B1) in view of Nair et al. (“Massively Parallel Methods for Deep Reinforcement Learning”).
Regarding Claim 1,
Dulac-Arnold et al. teaches A computer-implemented method for storing data associated with an agent in a multi-dimensional environment via a memory architecture, the memory architecture storing one or more features at coordinates corresponding to coordinates of the multi-dimensional environment, the method comprising (Col. 3 lines 23-33: “Each action in the action space is represented by a respective point in a multi-dimensional space...As another example, in implementations where the environment is a real-world environment and the agent is a robot or autonomous vehicle, each dimension of the space can correspond to a different control dimension of the robot or autonomous vehicle” and Col. 6 lines 46-50: “The system generates an experience tuple that includes the current observation, the selected action, the reward, and the next observation and stores the generated experience tuple in a replay memory for use in training the actor policy neural network and the Q network (step 312)” teach storing generated experience tuples in a replay memory (corresponds to memory architecture) wherein the experience tuples contain data including the current observation, the reward, and the next observation (correspond to features) corresponding to the selected action as represented by a point in a multi-dimensional space (corresponds to coordinates of the multi-dimensional environment); Col. 3 lines 4-9 teaches computer-implemented):
...(b) retrieving, by the set of processor cores, one or more candidate features corresponding to a position at which the agent is located from the coordinates of the memory architecture corresponding to the position (Col. 6 lines 60-63: “The system obtains an experience tuple (step 402). The experience tuple is one of the experience tuples in a minibatch of experience tuples sampled from the replay memory by the system” teaches obtaining (corresponds to retrieving) experience tuples from the replay memory (corresponds to memory architecture); Col. 3 lines 23-33: “Each action in the action space is represented by a respective point in a multi-dimensional space...As another example, in implementations where the environment is a real-world environment and the agent is a robot or autonomous vehicle, each dimension of the space can correspond to a different control dimension of the robot or autonomous vehicle” and Col. 6 lines 46-50: “The system generates an experience tuple that includes the current observation, the selected action, the reward, and the next observation and stores the generated experience tuple in a replay memory for use in training the actor policy neural network and the Q network (step 312)” teach experience tuples in the replay memory (corresponds to memory architecture) contain data including the current observation, the reward, and the next observation (correspond to candidate features) corresponding to an action as represented by a point in a multi-dimensional space (corresponds to a position at which the agent is located) selected from the multiple actions represented by the points in the multi-dimensional space; Col. 9 lines 53-64 teaches processors);
Dulac-Arnold et al. does not appear to explicitly teach (a) generating, by a set of one or more processor cores, a summary of the one or more features stored throughout the memory architecture;...(c) determining, by the set of processor cores, whether the one or more candidate features correspond to the summary of the one or more stored features of the memory architecture; and (d) updating, by the set of processor cores, the memory architecture with the one or more candidate features at the coordinates in the memory architecture that correspond to the summary of the one or more stored features of the memory architecture.
However, Nair et al. teaches (a) generating, by a set of one or more processor cores, a summary of the one or more features stored throughout the memory architecture (pg. 4 second full paragraph: “The experience tuples generated by the actors are stored in a replay memory...First, a local replay memory stores each actor’s experience...locally on that actor’s machine...Second, a global replay memory aggregates the experience into a distributed database” teaches that a global replay memory aggregates the experience tuples (corresponds to generating summary of features) stored throughout the local replay memories (corresponds to the memory architecture); pg. 2 first and second full paragraphs: “As in DistBelief, the parameters of the Q-network may also be distributed over many machines. We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm” teaches distributed computing is used to implement the distributed reinforcement learning system over many machines, such that one or more computers (with processors) are utilized);
...(c) determining, by the set of processor cores, whether the one or more candidate features correspond to the summary of the one or more stored features of the memory architecture (Figure 2 and pg. 4 third full paragraph: “Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q-network. For each learner update k, a minibatch of experience tuples e = (s; a; r; s’) is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm such as DQN...to this minibatch of experience, in order to generate a gradient vector gi...The gradients gi are communicated to the parameter server; and the parameters of the Q-network are updated periodically from the parameter server” teaches determining the gradients that correspond to the experience tuples stored in the global replay memory (correspond to the summary of the one or more stored features of the memory architecture) wherein the gradients are sent to the parameter server to update the parameters used to generate corresponding additional experience tuples (correspond to one or more candidate features) that will be stored in local replay memory; also see pg. 2 first and second full paragraphs); 
and (d) updating, by the set of processor cores, the memory architecture with the one or more candidate features at the coordinates in the memory architecture that correspond to the summary of the one or more stored features of the memory architecture (Figure 2 and pg. 4 third full paragraph: “Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q-network. For each learner update k, a minibatch of experience tuples e = (s; a; r; s’) is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm such as DQN...to this minibatch of experience, in order to generate a gradient vector gi...The gradients gi are communicated to the parameter server; and the parameters of the Q-network are updated periodically from the parameter server” teaches determining the gradients that correspond to the experience tuples stored in the global replay memory (correspond to the summary of the one or more stored features of the memory architecture) wherein the gradients are sent to the parameter server to update the parameters used to generate corresponding additional experience tuples (correspond to one or more candidate features) that will be stored in local replay memory, thus updating the local replay memory (corresponds to updating memory architecture); pg. 3 Section 4: “Each actor i generates its own trajectories of experience...within the environment, and as a result each actor may visit different parts of the state space” teaches that each agent generates its own trajectories of experience in the state space (trajectories track states the agent has been in, which correspond to coordinates); also see pg. 2 first and second full paragraphs).
Dulac-Arnold et al. and Nair et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Nair et al. to the disclosed invention of Dulac-Arnold et al.
One of ordinary skill in the arts would have been motivated to make this modification to leverage distributed reinforcement learning techniques because of the following advantages: “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm. We applied Gorila DQN to 49 games on the Atari 2600 platform. We outperformed single GPU DQN on 41 games and outperformed human professional on 25 games. Gorila DQN also trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster for most games” (Nair et al. pg. 2 second full paragraph).
Regarding Claim 2,
Dulac-Arnold et al. in view of Nair et al. teaches the computer-implemented method of claim 1.
Dulac-Arnold et al. further teaches further comprising iteratively repeating [step (b)] for each position of the agent as the agent traverses the multi-dimensional environment (Col. 6 lines 60-63: “The system obtains an experience tuple (step 402). The experience tuple is one of the experience tuples in a minibatch of experience tuples sampled from the replay memory by the system” teaches obtaining (corresponds to retrieving) experience tuples from the replay memory (corresponds to memory architecture); Col. 3 lines 23-33: “Each action in the action space is represented by a respective point in a multi-dimensional space...As another example, in implementations where the environment is a real-world environment and the agent is a robot or autonomous vehicle, each dimension of the space can correspond to a different control dimension of the robot or autonomous vehicle” and Col. 6 lines 46-50: “The system generates an experience tuple that includes the current observation, the selected action, the reward, and the next observation and stores the generated experience tuple in a replay memory for use in training the actor policy neural network and the Q network (step 312)” teach experience tuples in the replay memory (corresponds to memory architecture) contain data including the current observation, the reward, and the next observation (correspond to candidate features) corresponding to an action as represented by a point in a multi-dimensional space (corresponds to a position at which the agent is located) selected from the multiple actions represented by the points in the multi-dimensional space, which corresponds to step (b); Col. 4 lines 45-50: “To train the actor policy neural network 110 and the Q network 120 using the training components 130, the reinforcement learning system 100 repeatedly selects minibatches of experience tuples from the replay memory 140. Each minibatch of experience tuples includes a predetermined number of randomly selected experience tuples” teaches iteratively processing experience tuples for each position of the agent as the agent traverses the multi-dimensional environment).
Nair et al. further teaches further comprising iteratively repeating the steps (a), [(c),] (d) for each position of the agent as the agent traverses the multi-dimensional environment (pg. 4 second full paragraph: “The experience tuples generated by the actors are stored in a replay memory...First, a local replay memory stores each actor’s experience...locally on that actor’s machine...Second, a global replay memory aggregates the experience into a distributed database” teaches that a global replay memory aggregates the experience tuples (corresponds to generating summary of features) stored throughout the local replay memories (corresponds to the memory architecture), which corresponds to step (a); Figure 2 and pg. 4 third full paragraph: “Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q-network. For each learner update k, a minibatch of experience tuples e = (s; a; r; s’) is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm such as DQN...to this minibatch of experience, in order to generate a gradient vector gi...The gradients gi are communicated to the parameter server; and the parameters of the Q-network are updated periodically from the parameter server” teaches determining the gradients that correspond to the experience tuples stored in the global replay memory (correspond to the summary of the one or more stored features of the memory architecture) wherein the gradients are sent to the parameter server to update the parameters used to generate corresponding additional experience tuples (correspond to one or more candidate features) that will be stored in local replay memory, thus updating the local replay memory (corresponds to updating memory architecture); pg. 3 Section 4: “Each actor i generates its own trajectories of experience...within the environment, and as a result each actor may visit different parts of the state space” teaches that each agent generates its own trajectories of experience in the state space (trajectories track states the agent has been in, which correspond to coordinates), which corresponds to steps (c) and (d); Figure 2 and Algorithm 1 teach the distributed reinforcement learning system is implemented via iteratively repeating the steps of the algorithm as the agent traverses the state space, which corresponds to multi-dimensional environment, see pg. 3 Section 4: “Each actor i generates its own trajectories of experience... within the environment, and as a result each actor may visit different parts of the state space”).
Dulac-Arnold et al. and Nair et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Nair et al. to the disclosed invention of Dulac-Arnold et al.

One of ordinary skill in the arts would have been motivated to make this modification to leverage distributed reinforcement learning techniques because of the following advantages: “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm. We applied Gorila DQN to 49 games on the Atari 2600 platform. We outperformed single GPU DQN on 41 games and outperformed human professional on 25 games. Gorila DQN also trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster for most games” (Nair et al. pg. 2 second full paragraph).
Regarding Claim 5,
Dulac-Arnold et al. in view of Nair et al. teaches the computer-implemented method of claim 1.
Nair et al. further teaches wherein updating the memory architecture with the one or more candidate features at the coordinates in the memory architecture that correspond to the summary of the one or more stored features of the memory architecture comprises: generating a new memory architecture comprising the data stored in the memory architecture, except with the one or more candidate features that correspond to one or more stored features written to the corresponding coordinates of the new memory architecture (Figure 2 and pg. 4 third full paragraph: “Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q-network. For each learner update k, a minibatch of experience tuples e = (s; a; r; s’) is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm such as DQN...to this minibatch of experience, in order to generate a gradient vector gi...The gradients gi are communicated to the parameter server; and the parameters of the Q-network are updated periodically from the parameter server” teaches determining the gradients that correspond to the experience tuples stored in the global replay memory (correspond to the summary of the one or more stored features of the memory architecture) wherein the gradients are sent to the parameter server to update the parameters used to generate corresponding additional experience tuples (correspond to one or more candidate features) that will be stored in local replay memory, thus updating the local replay memory wherein the updated local replay memory corresponds to a new memory architecture updated with candidate features that correspond to stored features; pg. 3 Section 4: “Each actor i generates its own trajectories of experience...within the environment, and as a result each actor may visit different parts of the state space” teaches that each agent generates its own trajectories of experience in the state space (trajectories track states the agent has been in, which correspond to coordinates); also see pg. 2 first and second full paragraphs).
Dulac-Arnold et al. and Nair et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Nair et al. to the disclosed invention of Dulac-Arnold et al.
One of ordinary skill in the arts would have been motivated to make this modification to leverage distributed reinforcement learning techniques because of the following advantages: “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm. We applied Gorila DQN to 49 games on the Atari 2600 platform. We outperformed single GPU DQN on 41 games and outperformed human professional on 25 games. Gorila DQN also trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster for most games” (Nair et al. pg. 2 second full paragraph).
Regarding Claim 7,
Dulac-Arnold et al. in view of Nair et al. teaches the computer-implemented method of claim 1.
Nair et al. further teaches wherein the agent comprises a deep reinforcement learning agent (Figure 2 teaches a distributed reinforcement learning system with a deep reinforcement learning agent).
Dulac-Arnold et al. and Nair et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Nair et al. to the disclosed invention of Dulac-Arnold et al.
One of ordinary skill in the arts would have been motivated to make this modification to leverage distributed reinforcement learning techniques because of the following advantages: “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm. We applied Gorila DQN to 49 games on the Atari 2600 platform. We outperformed single GPU DQN on 41 games and outperformed human professional on 25 games. Gorila DQN also trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster for most games” (Nair et al. pg. 2 second full paragraph).
Regarding Claim 8,
Claim 8 recites analogous limitations to claim 1 and is rejected based on the same rationale as claim 1.
Dulac-Arnold et al. further teaches A computer system for storing data associated with an agent in a multi-dimensional environment, the computer system comprising: a set of one or more processor cores; a memory coupled to the processor cores, the memory storing: a memory architecture (Col. 3 lines 23-33: “Each action in the action space is represented by a respective point in a multi-dimensional space...As another example, in implementations where the environment is a real-world environment and the agent is a robot or autonomous vehicle, each dimension of the space can correspond to a different control dimension of the robot or autonomous vehicle” and Col. 6 lines 46-50: “The system generates an experience tuple that includes the current observation, the selected action, the reward, and the next observation and stores the generated experience tuple in a replay memory for use in training the actor policy neural network and the Q network (step 312)” teach storing generated experience tuples in a replay memory (corresponds to memory architecture) wherein the experience tuples contain data including the current observation, the reward, and the next observation (correspond to features) corresponding to the selected action as represented by a point in a multi-dimensional space (corresponds to coordinates of the multi-dimensional environment); Col. 10 lines 5-14 teaches computer system with processor, memory, and program instructions).
Regarding Claim 9,
Dulac-Arnold et al. in view of Nair et al. teaches the computer system of claim 8.
Dulac-Arnold et al. further teaches wherein the memory comprises a first memory storing the memory architecture and a second memory storing the instructions (Fig. 1 element 140 teaches first memory storing the replay memory (corresponds to memory architecture); Col. 10 lines 5-14 teaches computer system with second memory storing program instructions).
Regarding Claim 10,
Claim 10 recites analogous limitations to claim 2 and is rejected based on the same rationale as claim 2.
Dulac-Arnold et al. further teaches wherein the instructions further cause the computer system (Col. 10 lines 5-14 teaches computer system with processor, memory, and program instructions). 
Regarding Claim 13,
Claim 13 recites analogous limitations to claim 5 and is rejected based on the same rationale as claim 5.
Dulac-Arnold et al. further teaches wherein the instructions further cause the computer system (Col. 10 lines 5-14 teaches computer system with processor, memory, and program instructions). 
Regarding Claim 15,
Claim 15 recites analogous limitations to claim 7 and is rejected based on the same rationale as claim 7.

Claims 3 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Dulac-Arnold et al. (US 10,885,432 B1) in view of Nair et al. (“Massively Parallel Methods for Deep Reinforcement Learning”) and further in view of Zhu et al. (“On Improving Deep Reinforcement Learning for POMDPs”).
Regarding Claim 3,
Dulac-Arnold et al. in view of Nair et al. teaches the computer-implemented method of claim 1.
Nair et al. further teaches wherein generating the summary of the one or more features stored throughout the memory architecture comprises (pg. 4 second full paragraph: “The experience tuples generated by the actors are stored in a replay memory...First, a local replay memory stores each actor’s experience...locally on that actor’s machine...Second, a global replay memory aggregates the experience into a distributed database” teaches that a global replay memory aggregates the experience tuples (corresponds to generating summary of features) stored throughout the local replay memories (corresponds to the memory architecture); pg. 2 first and second full paragraphs: “As in DistBelief, the parameters of the Q-network may also be distributed over many machines. We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm” teaches distributed computing is used to implement the distributed reinforcement learning system over many machines, such that one or more computers (with processors) are utilized).
Dulac-Arnold et al. and Nair et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Nair et al. to the disclosed invention of Dulac-Arnold et al.
One of ordinary skill in the arts would have been motivated to make this modification to leverage distributed reinforcement learning techniques because of the following advantages: “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm. We applied Gorila DQN to 49 games on the Atari 2600 platform. We outperformed single GPU DQN on 41 games and outperformed human professional on 25 games. Gorila DQN also trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster for most games” (Nair et al. pg. 2 second full paragraph).
Dulac-Arnold et al. in view of Nair et al. does not appear to explicitly teach passing the memory architecture through a neural network to generate a C-dimensional feature vector, wherein C is a number of features associated with the environment.
However, Zhu et al. teaches passing the memory architecture through a neural network to generate a C-dimensional feature vector, wherein C is a number of features associated with the environment (pg. 4 first full paragraph: “we modified the transition (s_t, a_t, r_t, s_{t+1}) in the experience replay mechanism of the conventional DQN to <({a_{t-1}, o_t}, a_t, r_t, o_{t+1})> in order to allow the framework to fetch the action-observation pair more conveniently. During the decision process for a given frame within training or the updating process of the neural network, the LSTM layer requires a sequence of action-observation pairs as its input. Thus, we store the transitions sequentially <({a_{t-1}, o_t}, a_t, r_t, o_{t+1})> within each episode in the replay memory” and Figure 2 and caption: “The input action is an 18-D vector followed by a fully connected (IP) layer with 512-D outputs...The 512-D outputs of LSTM are fed to another fully connected layer and produce 18-D Q-values corresponding to 18 actions in Atari games” teach passing a sequence of action-observation pairs stored as part of the replay memory (memory architecture) through a LSTM layer of a neural network to generate a 512-dimensional output feature vector wherein C=512 features associated with the environment; Figure 2 further teaches fully connected layers of the neural network process data in vector format).
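For illustration only, the sequential storage of modified transitions quoted above can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Zhu et al.: the names Transition, build_episode, and action_observation_pairs are hypothetical, actions are represented as integer indices, and the placeholder previous action for the first step (0) is an assumption.

```python
from typing import List, NamedTuple, Sequence, Tuple

Observation = Tuple[float, ...]


class Transition(NamedTuple):
    """Modified transition ({a_{t-1}, o_t}, a_t, r_t, o_{t+1})."""
    prev_action: int
    observation: Observation
    action: int
    reward: float
    next_observation: Observation


def build_episode(observations: Sequence[Observation],
                  actions: Sequence[int],
                  rewards: Sequence[float],
                  initial_prev_action: int = 0) -> List[Transition]:
    """Store one episode's transitions in time order.

    Expects len(observations) == len(actions) + 1.
    """
    episode = []
    prev = initial_prev_action  # assumed placeholder before the first step
    for t, (a, r) in enumerate(zip(actions, rewards)):
        episode.append(
            Transition(prev, observations[t], a, r, observations[t + 1]))
        prev = a
    return episode


def action_observation_pairs(episode: Sequence[Transition], end: int,
                             length: int) -> List[Tuple[int, Observation]]:
    """Fetch up to `length` consecutive (a_{t-1}, o_t) pairs ending at `end`."""
    start = max(0, end - length + 1)
    return [(tr.prev_action, tr.observation)
            for tr in episode[start:end + 1]]
```

Because each stored transition already carries the pair (a_{t-1}, o_t), the input sequence for a recurrent layer can be read off with a single slice of the episode.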
Dulac-Arnold et al., Nair et al., and Zhu et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Zhu et al. to the disclosed invention of Dulac-Arnold et al. in view of Nair et al.
One of ordinary skill in the arts would have been motivated to make this modification to leverage a model that “is able to remember the past actions, particularly the last performed action” and “[allows] the framework to fetch the action-observation pair more conveniently” (Zhu et al. pg. 4 first full paragraph).
Regarding Claim 11,
Claim 11 recites analogous limitations to claim 3 and is rejected based on the same rationale as claim 3.

Claims 6 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Dulac-Arnold et al. (US 10,885,432 B1) in view of Nair et al. (“Massively Parallel Methods for Deep Reinforcement Learning”) and further in view of Chalmers et al. (“Learning to Predict Consequences as a Method of Knowledge Transfer in Reinforcement Learning”).
Regarding Claim 6,
Dulac-Arnold et al. in view of Nair et al. teaches the computer-implemented method of claim 1.
Dulac-Arnold et al. in view of Nair et al. does not appear to explicitly teach wherein the position of the agent corresponds to an absolute position of the agent in the multi-dimensional environment.
However, Chalmers et al. teaches wherein the position of the agent corresponds to an absolute position of the agent in the multi-dimensional environment (pg. 2261 fourth full paragraph: “That is, individual transition probabilities and rewards from state s can be estimated as a function of the corresponding agent-centric state. This requires that the environmental state space S have consistent dynamics (making state transitions at least partially predictable) and that the agent is equipped with sensors sufficient to allow the predictions...Navigation-style problems are a good example of a domain fitting the above description. In a navigation problem, the environment-centric state s may encode the agent’s absolute location in the environment, while the agent-centric state d encodes the location of roads, obstacles, and so on, relative to the agent. Given an environment-centric state s (e.g., “10 m West, 5 m North, heading due East”), a sufficiently detailed agent-centric state d (e.g., “obstacle to the right, all clear ahead”), and an action a (e.g., “move forward 1 m”), the subsequent environment-centric state can usually be predicted (“9 m West, 5 m North, heading due East”)” teaches absolute location (position) of the agent in a multi-dimensional environment for a navigation-style problem).
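For illustration only, the navigation example quoted above (an absolute, environment-centric state updated by a "move forward" action) can be sketched as follows. This is a minimal sketch, not code from Chalmers et al.: the names EnvState and predict_next, the four-heading encoding, and the choice to leave the state unchanged when the path ahead is blocked are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvState:
    """Environment-centric state: absolute location plus heading."""
    west_m: float   # distance west of the origin, in meters
    north_m: float  # distance north of the origin, in meters
    heading: str    # one of "N", "E", "S", "W" (assumed encoding)


# (change in west_m, change in north_m) per meter moved along each heading.
_STEP = {"N": (0, 1), "E": (-1, 0), "S": (0, -1), "W": (1, 0)}


def predict_next(state: EnvState, clear_ahead: bool,
                 distance_m: float = 1.0) -> EnvState:
    """Predict the next absolute state for a 'move forward' action.

    `clear_ahead` stands in for the agent-centric state; if the path is
    blocked, this sketch assumes the agent stays where it is.
    """
    if not clear_ahead:
        return state
    d_west, d_north = _STEP[state.heading]
    return EnvState(state.west_m + d_west * distance_m,
                    state.north_m + d_north * distance_m,
                    state.heading)
```

With the quoted example, predict_next(EnvState(10, 5, "E"), clear_ahead=True) yields a state 9 m West, 5 m North, heading East.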
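Dulac-Arnold et al., Nair et al., and Chalmers et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Chalmers et al. to the disclosed invention of Dulac-Arnold et al. in view of Nair et al.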
One of ordinary skill in the arts would have been motivated to make this modification to leverage “a framework that exploits predictability in the environment-centric space to achieve knowledge transfer between environments that share a common agent-centric space. The framework consists of an environment-centric RL system, which learns to solve the task Mi in the environment-centric space Si” (Chalmers et al. pg. 2261 fifth full paragraph).
Regarding Claim 14,
Claim 14 recites analogous limitations to claim 6 and is rejected based on the same rationale as claim 6.
Allowable Subject Matter
Claims 4 and 12 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
Claim 16 would be allowable if rewritten or amended to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action.
Claims 17-22 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Gu et al. (US 2017/0228662 A1) teaches computing Q values for actions to be performed by .
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YING YU CHEN whose telephone number is (571)270-1484. The examiner can normally be reached Monday-Friday 7:30 am-5:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YING YU CHEN/               Examiner, Art Unit 2125