DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on May 5, 2020; November 10, 2020; July 28, 2021; and December 21, 2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Léon et al., “Options Discovery with Budgeted Reinforcement Learning,” in arXiv preprint arXiv:1611.06824 (2016) (“Léon”) in view of Dayan et al., “Feudal Reinforcement Learning,” in 5 Advances in Neural Info. Processing Sys. 271-78 (1993) (“Dayan”) and further in view of Koutník et al., “A Clockwork RNN,” in arXiv preprint arXiv:1402.3511 (2014) (“Koutník”).
Regarding claim 1, Léon discloses “[a] system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions (an actor model [worker neural network subsystem] updates an actor state ht and computes a next action at; the next action at is drawn from the distribution softmax(fact(ht)) [predetermined set = entire distribution] – Léon, section 4.1, subsection entitled “Actor Model”), the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers (in the computer science domain, works involving sequentially solving sub-tasks have led to the hierarchical reinforcement learning paradigm [implying that reinforcement learning is conducted by a computer containing a storage device with instructions] – Léon, sec. 1, second paragraph), cause the one or more computers to implement: 

generate a … representation … of a current state of the environment at the time step (in the BONN (Budgeted Options Neural Network) architecture, the agent can choose to acquire a high-level observation yt [representation of current state of environment] that will provide more relevant information in addition to a low-level observation xt – Léon, sec. 4.1, first paragraph);
generate, based at least in part on the … representation of the current state of the environment at the time step, an initial goal vector that defines, in [a] latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment (structure of BONN includes an option model [manager neural network subsystem] that uses observations xt and yt to compute a new option denoted ot [goal vector] as a vector in a latent space – Léon, sec. 4.1, second paragraph; see also Fig. 2, top row denoted “option model”; options are the mechanism by which sub-tasks [objectives] are modeled, giving rise to the question of how to select actions to apply in the environment based on the chosen option – id. at sec. 1, second paragraph); and 
pool … initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step (acquisition model determines whether a new high-level observation needs to be acquired by drawing a binary number according to a Bernoulli distribution; if that number is 1, the option model computes a new option state following $o_t = \mathrm{gru}_{opt}(x_t, y_t, o_{last})$, in which olast [goal vector for preceding time step] is the lastly computed option before time step t and gruopt represents a GRU cell [new ot = final goal vector] – Léon, section 4.1, subsections entitled “Acquisition Model” and “Option Model”); 
a worker neural network subsystem that is configured to, at each of the plurality of time steps: 
generate a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step (an actor model [worker neural network subsystem] updates an actor state ht and computes a next action at; the next action at is drawn from the distribution softmax(fact(ht)) [action score; predetermined set = entire distribution] – Léon, section 4.1, subsection entitled “Actor Model”; see also Fig. 2, bottom row denoted “Actor Model”); and 
select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores (an actor model updates an actor state ht and computes a next action at; the next action at is drawn [selected] from the distribution softmax(fact(ht)) [action score; predetermined set = entire distribution] – Léon, section 4.1, subsection entitled “Actor Model”).”
Léon appears not to disclose explicitly the further limitations of the claim.  However, Dayan discloses “generat[ing] a latent representation, in a latent space, of a current state of the environment (in a maze task in a feudal reinforcement learning system, a grid is split up into successively finer grains and managers are assigned to separable parts of the maze at each level – Dayan, sec. 3, first paragraph; see also Fig. 1 [4-square grid of the maze [environment], for instance, is a latent representation of the 16-square grid])…; [and]
generat[ing], based at least in part on [a] latent representation of the current state of the environment at the time step, a[] … goal vector (in a maze task in a feudal reinforcement learning system, a goal is set at the lowest level of granularity of the grid, and the feudal system learns to navigate to the goal by learning not to try impossible actions or moving to another level at inappropriate places; if the system decides at a high level that the goal is in one part of the maze, then it has the capacity to specify large scale actions at that level to take it there – Dayan, p. 275, second full paragraph and Fig. 1 [for instance, in the example of Fig. 1, if the system quickly discovers that the goal is in the southwest quadrant of the maze, it can specify the goal vector as 1-(1,1) in the 4-square latent space and explore the 16- and 32- square grids for the goal])….”
Léon and Dayan both relate to hierarchical reinforcement learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Léon to generate a vector representing a goal in a latent space representing a state of the See Dayan, sec. 1 (indicating that high-level managers can send agents directly to a region of the state space with a high probability of reward without forcing it to explore in detail).
Neither Léon nor Dayan appears to disclose explicitly the further limitations of the claim.  However, Koutník discloses “pool[ing] the initial [output] vector for the time step and initial [output] vectors for one or more preceding time steps (long-term dependency problem is solved by having different parts (modules) of the RNN hidden layer running at different clock speeds, timing their computation with different, discrete clock periods – Koutník, sec. 1, third paragraph; at each CW-RNN time step t, only the output of modules i that satisfy (t MOD Ti) = 0 are executed, where Ti is a clock period, and the hidden weight matrix is zeroed out for (t MOD Ti) nonzero [so that the output vectors are combined/pooled for all time steps that are 0 mod Ti] – id. at sec. 3, paragraphs 3-5)….”
Léon, Dayan, and Koutník all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon and Dayan to pool the initial output vector for a given time step and for multiple previous time steps, as disclosed by Koutník, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would allow the network to have memory such that the output is reflective of multiple points in the past.  See Koutník, abstract.
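For illustration only, the claimed pooling of an initial goal vector with the initial goal vectors of preceding time steps can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from Léon, Dayan, or Koutník: the pooling operation is assumed to be an element-wise sum, the window length of three time steps is arbitrary, and the names `pool_goals`, `window`, and `final_goals` are hypothetical.

```python
from collections import deque
from typing import Deque, List


def pool_goals(history: Deque[List[float]]) -> List[float]:
    """Pool the goal vectors in the window by element-wise summation (assumed operator)."""
    dim = len(history[0])
    return [sum(goal[i] for goal in history) for i in range(dim)]


# Hypothetical window: the current time step plus the two preceding time steps.
window: Deque[List[float]] = deque(maxlen=3)
final_goals: List[List[float]] = []

# Made-up initial goal vectors, one per time step.
for initial_goal in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]):
    window.append(initial_goal)             # oldest vector drops out once the window is full
    final_goals.append(pool_goals(window))  # final goal vector for this time step
```

In this sketch the final goal vector at each time step reflects several points in the past, which corresponds to the memory rationale given above for the combination.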

Claim 19 is a non-transitory computer storage medium claim corresponding to system claim 1 and is rejected for the same reasons as given in the rejection of that claim.  Similarly, claim 20 is a method claim corresponding to system claim 1 and is rejected for the same reasons as given in the rejection of that claim.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Koutník and further in view of Heess et al., “Learning and Transfer of Modulated Locomotor Controllers,” in arXiv preprint arXiv:1610.05182 (2016) (“Heess”).
Regarding claim 2, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Heess discloses that “generating the initial goal vector[] comprises: 
processing the … representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the … representation and to process the … representation in accordance with a hidden state of the goal recurrent neural network to generate the initial goal vector and to update the hidden state of the goal recurrent neural network (recurrent high-level controller [goal recurrent neural network] processes the observation [representation, denoted ot-1 in the figure] and generates a control signal [goal vector, denoted ct-1 in the figure] in accordance with the internal state of the system at the previous time step [denoted, say, by zt-2 in the figure] and also updates the internal state [denoted by zt-1] – Heess, Fig. 1; see also sec. 2, p. 3; note that Dayan teaches the use of latent representations and that the general procedure claimed and disclosed by Heess could be equally applied to a latent representation with predictable results, see KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007)).”
Léon, Dayan, Koutník, and Heess all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Koutník to receive a representation of the state of the environment and process it in accordance with a hidden state of an RNN to update a hidden state and generate a goal vector, as disclosed by Heess, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would ensure that the high-level goal is always up to date so that the manager can always appropriately inform the low-level worker.  See Heess, Fig. 1 and accompanying text.

Claims 3 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Koutník and further in view of Schaul et al., “Universal Value Function Approximators,” in Intl. Conf. Machine Learning 1312-1320 (2015) (“Schaul”).
Regarding claim 3, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Schaul discloses “generating the respective action score for each action in the predetermined set of actions comprises: 
generating a respective … embedding vector in an embedding space for each [object] in the predetermined set of [objects] (data are viewed as a sparse table of values that contains one row for each observed state s and one column for each observed goal g, and find a low-rank factorization of the table into state embeddings and goal embeddings – Schaul, penultimate paragraph before section 2 [note that while Schaul teaches state embeddings rather than action embeddings, the embedding procedure disclosed in Schaul could be applied also to actions, as performing such a swap would involve merely performing the same action on different objects with predictable results, see KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007)]); 
projecting the final goal vector for the time step to the embedding space to generate a goal embedding vector (data are viewed as a sparse table of values that contains one row for each observed state s and one column for each observed goal g, and find a low-rank factorization of the table into state embeddings and goal embeddings – Schaul, penultimate paragraph before section 2); and 
modulating the respective … embedding vector for each [object] by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions (one possible function approximator simply concatenates state and goal together as a joint input; the mapping from concatenated input to regression target can then be dealt with by a non-linear function approximator such as a multi-layer perceptron – Schaul, sec. 3, second paragraph; the result is an approximation of the action-value function Q(s, a, g) [action score for each action] – id. at Algorithm 1, last line [“modulate” is being interpreted here to mean “combine”]).”
See Schaul, penultimate paragraph before section 2.

Regarding claim 6, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Schaul discloses that “the final goal vector has a higher dimensionality than the goal embedding vector (all the values are laid out in a data matrix with one row for each observed state s and one column for each observed goal g, and that matrix is factorized, finding a low-rank approximation that defines n-dimensional embedding spaces for both states and goals [“low-rank approximation” implies that n is less than the goal space dimensionality] – Schaul, section 3.1, bullet point 1).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Koutník to reduce the dimensionality of the goal vector, as disclosed by Schaul, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to perform learning faster than a naïve approach without embedding.  See Schaul, penultimate paragraph before section 2.
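For illustration only, the three steps recited in claim 3, together with the dimensionality relationship of claim 6, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Schaul: the projection is assumed to be linear, the combining ("modulating") operation is assumed to be a dot product, and the matrix `proj`, the vectors, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def project(goal: List[float], proj: Matrix) -> List[float]:
    """Linearly project a goal vector into the lower-dimensional embedding space."""
    return [sum(w * g for w, g in zip(row, goal)) for row in proj]


def action_scores(action_embeddings: Matrix, goal_embedding: List[float]) -> List[float]:
    """Combine each action embedding with the goal embedding (dot product assumed)."""
    return [sum(a * g for a, g in zip(emb, goal_embedding)) for emb in action_embeddings]


final_goal = [1.0, 2.0, 3.0, 4.0]           # made-up 4-dimensional final goal vector
proj = [[1.0, 0.0, 0.0, 0.0],               # hypothetical 2x4 projection matrix
        [0.0, 0.0, 0.0, 1.0]]
action_embeddings = [[1.0, 0.0],            # one made-up embedding row per action
                     [0.0, 1.0],
                     [1.0, 1.0]]

goal_embedding = project(final_goal, proj)                 # 2-dimensional goal embedding
scores = action_scores(action_embeddings, goal_embedding)  # one score per action
```

Here the final goal vector has four dimensions and the goal embedding vector has two, so the goal vector has the higher dimensionality, as recited in claim 6.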

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Koutník and further in view of Graepel et al. (US 20180032864) (“Graepel”).
Regarding claim 4, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Graepel discloses “selecting the action comprises selecting the action having a highest action score (system selects the action represented by the outgoing edge having the highest action score as the action to be performed by the agent in response to the current observation – Graepel, paragraph 81).”
Léon, Dayan, Koutník, and Graepel all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Koutník to select the action having the highest action score, as disclosed by Graepel, and an ordinary artisan could reasonably have expected to do so successfully.  One motivation for doing so would be to maximize the likelihood that the agent will complete the objectives if the action is performed.  See Graepel, paragraph 49.
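For illustration only, the difference between drawing an action from a softmax distribution (as in Léon's actor model) and selecting the action having the highest action score (as in Graepel) can be sketched as follows. This is a minimal sketch under stated assumptions: the logits are made up, and the names `softmax`, `select_greedy`, and `rng` are hypothetical rather than drawn from either reference.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into a probability distribution over actions."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_greedy(scores: List[float]) -> int:
    """Return the index of the action having the highest action score."""
    return max(range(len(scores)), key=lambda i: scores[i])


logits = [0.5, 2.0, -1.0]                        # made-up actor outputs for three actions
scores = softmax(logits)

chosen = select_greedy(scores)                   # highest-score selection

rng = random.Random(0)                           # fixed seed, for repeatability only
sampled = rng.choices(range(len(scores)), weights=scores, k=1)[0]  # stochastic draw
```

Greedy selection always returns the same action for the same scores, whereas the stochastic draw may return any action with nonzero probability.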

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, and Schaul and further in view of Lample et al., “Playing FPS Games with Deep Reinforcement Learning,” in arXiv preprint arXiv:1609.05521 (2016) (“Lample”).
Regarding claim 5, neither Léon, Dayan, Koutník, nor Schaul appears to disclose explicitly the further limitations of the claim.  However, Lample discloses that “generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises:
processing a representation of the current state of the environment using an action score recurrent neural network, in accordance with a hidden state of the action score recurrent neural network, to generate the action embedding vectors and to update the hidden state of the action score recurrent neural network (in a deep reinforcement learning system to play video games, the output of a CNN [representation of current state of environment] is given to a LSTM [action score recurrent neural network] that predicts a score [embedding vector] for each action based on a current frame and its hidden state – Lample, sec. 3.1, first paragraph and Fig. 2; see also Fig. 3 (showing that, for instance, the hidden state h4 of the LSTM is updated to h5 upon the input of observation o4 and outputs the action score Q(h5, a5))).” 
Léon, Dayan, Koutník, Schaul, and Lample all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed See Lample, sec. 3.1, first paragraph.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, and Schaul and further in view of Mei et al. (US 20150356199) (“Mei”).
Regarding claim 7, neither Léon, Dayan, Koutník, nor Schaul appears to disclose explicitly the further limitations of the claim.  However, Mei discloses that “the dimensionality of the final goal vector is at least ten times higher than the dimensionality of the goal embedding vector (click-through-based cross-view learning techniques can reduce feature dimension by several orders of magnitude (e.g., from thousands to tens) – Mei, paragraph 10 [note that, while the original and reduced spaces of Mei are not the “goal” and “embedding” spaces, respectively, Schaul discloses these spaces, and the general concept of dimensionality reduction by orders of magnitude disclosed in Mei can be applied to the spaces of Schaul without inventive effort]).”
Léon, Dayan, Koutník, Schaul, and Mei all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, and Schaul to reduce the dimensionality of the data by at least ten times, as disclosed by Mei, and an ordinary artisan could reasonably have expected to do so successfully.  One motivation for doing so would be to produce memory savings.  See Mei, paragraph 10.

Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Léon, Dayan, and Koutník and further in view of Kulkarni et al., “Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation,” in Advances in Neural Info. Processing Systems 3675-3683 (2016) (“Kulkarni”).
Regarding claim 8, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Kulkarni discloses that “the worker neural network subsystem has been trained to generate action scores that maximize a time discounted combination of rewards, wherein each reward is a combination of an external reward received as a result of the agent performing selected actions and an intrinsic reward dependent upon goal vectors generated by the manager neural network subsystem (objective function for a controller is to maximize cumulative intrinsic reward and the objective of a meta-controller [manager neural network subsystem] is to optimize the cumulative extrinsic reward, where the cumulative extrinsic reward is a function of the agent being in a state after taking an action, the cumulative intrinsic reward is a function of a goal g, and the discounting in the cumulative extrinsic reward is over sequences of goals; objective of the agent is to maximize the extrinsic reward function over long periods of time; policy over actions and policy over goals are produced by estimating action-value functions [action scores] – Kulkarni, pp. 3-4, sec. 3, up through “Temporal Abstractions”; see also Fig. 1).”
Léon, Dayan, Koutník, and Kulkarni all relate to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Koutník to generate a combination of reward functions, one of which is extrinsic and another of which is intrinsic, as disclosed by Kulkarni, and an ordinary artisan could reasonably have expected to do so successfully.  One motivation for doing so would be to allow the agent to explore behavior for its own sake, thereby helping the agent solve tasks posed by the environment.  See Kulkarni, abstract.

Regarding claim 9, Léon, as modified by Dayan, Koutník, and Kulkarni, discloses that “the manager neural network subsystem has been trained to generate initial goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent meta-controller receives a state and chooses a goal in the set of all possible goals; goal remains in place for the next few time steps until either it is achieved or a terminal state is reached – Kulkarni, p. 3, paragraph titled “Temporal Abstractions”; meta-controller looks at the raw states and produces a policy over goals by estimating an action-value function to maximize expected future extrinsic reward, controller takes in states and the current goal and produces a policy over actions by estimating a second action-value function [action score] to solve the predicted goal by maximizing expected future intrinsic reward; internal critic provides a positive reward to the controller iff the goal is reached – id. at Fig. 1 caption [so that an action will be chosen if it moves the agent closer to the goal and thus increases external rewards; note that while the state space of Kulkarni is not necessarily a “latent state space”, Esteban teaches latent representations, and the framework disclosed by Kulkarni could be applied to latent state spaces without inventive effort]).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Koutník to generate goal vectors that encourage the selection of actions that move the agent closer to the goal, as disclosed by Kulkarni, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to help the agent solve tasks posed by the environment.  See Kulkarni, abstract.
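For illustration only, a "time discounted combination of rewards" in which each reward combines an external reward and an intrinsic reward, as recited in claim 8, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Kulkarni: the combination is assumed to be a weighted sum, and the weight `alpha`, the discount factor `gamma`, and the reward values are hypothetical.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Compute sum over t of gamma**t * rewards[t], accumulating from the last step backward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


external = [0.0, 0.0, 1.0]    # made-up rewards received from the environment
intrinsic = [0.5, 0.5, 0.0]   # made-up goal-dependent rewards
alpha = 0.5                   # hypothetical weight on the intrinsic reward
gamma = 0.9                   # hypothetical discount factor

# Per-step combined reward: external plus weighted intrinsic (assumed combination).
combined = [e + alpha * i for e, i in zip(external, intrinsic)]
ret = discounted_return(combined, gamma)
```

With these values the combined per-step rewards are 0.25, 0.25, and 1.0, and the discounted return is 0.25 + 0.9(0.25) + 0.81(1.0) = 1.285.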

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Léon, Dayan, and Koutník and further in view of Bichler (US 20170200078) (“Bichler”).
Regarding claim 10, neither Léon, Dayan, nor Koutník appears to disclose explicitly the further limitations of the claim.  However, Bichler discloses that “generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: 
processing an observation characterizing the current state of the environment using a convolutional neural network (convolutional neural networks are feedforward neural networks that allow for a learning of intermediate [latent] representations of objects that are smaller and can be generalized for similar objects – Bichler, paragraph 7).”
See Bichler, paragraph 7.

Claims 11, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, and Heess and further in view of Hochreiter et al., “Long Short-Term Memory,” in 9.8 Neural Computation 1735-1780 (1997) (“Hochreiter”) and Yu et al., “Multi-Scale Context Aggregation by Dilated Convolutions,” in ICLR 2016 (2016) (“Yu”).
Regarding claim 11, Léon, as modified by Dayan, Koutník, and Heess, discloses that “the goal recurrent neural network is … configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one (neurons in hidden layer of clockwork RNN are partitioned into g [r] modules of size k, each of which is assigned a clock period Tn – Koutník, section 3, first paragraph [the use of the plural “modules” suggests that g > 1]), and … the … neural network is configured to, at each time step in the plurality of time steps: 
receive a network input for the time step (input weight matrix WI is partitioned into g blocks-rows – Koutník, section 3, bottom of right hand column, especially equation 3); 
select a sub-state from the r sub-states (at each CW-RNN time step t, only the output of modules i [sub-state] that satisfy (t MOD Ti) = 0 are executed – Koutník, sec. 3, last full paragraph on p. 3); and 
process current values of the selected sub-state and the network input for the time step using a[] … neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of …network parameters (at each forward pass time step, only the block-rows of input weight matrix WI and WH, a block-upper triangular hidden weight matrix [network parameters], that correspond to the executed modules are used for evaluation and the corresponding parts of an output vector yH are updated; standard RNN state update equation is governed by $y_H^{(t)} = f_H\left(W_H\, y^{(t-1)} + W_I\, x^{(t)}\right)$, where WO is the output weight matrix, the output is governed by $y_O^{(t)} = f_O\left(W_O\, y_H^{(t)}\right)$ [so in CW-RNN, only the state and output corresponding to the module i are updated] – Koutník, sec. 3, pp. 3-4).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Heess to select a sub-state and use a neural network to process the values of the sub-state and the input to update current values of the substate and generate an output, as disclosed by Koutník, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the number of parameters of the network, improve the performance of the network, and speed up the network evaluation.  See Koutník, abstract.
Hochreiter discloses “a[n] … LSTM neural network, wherein the … LSTM neural network is configured to maintain an internal state (gradient-based method known as long short-term memory introduced to learn to store information over extended time intervals via recurrent backpropagation – Hochreiter, abstract; gradient-based algorithm for an architecture enforcing constant error flow through internal states of special units achieves the bridging of time intervals in excess of 1000 steps – id. at pp. 1-2, first paragraph under “The remedy”) ….”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, and Heess to make the network in question an LSTM network, as disclosed by Hochreiter, and an ordinary artisan could reasonably have expected to do so successfully.  One motivation for doing so would be to ensure that long time intervals can be bridged even in case of noisy input sequences without loss of short time lag capabilities.  See Hochreiter, p. 1, subsection entitled “The remedy” under section 1.
Yu discloses “a dilated … neural network (module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution – Yu, abstract)….”
See Yu, abstract.

Regarding claim 16, Léon, as modified by Dayan, Koutník, Heess, Hochreiter, and Yu, discloses that “the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps (each of the modules is assigned a clock period $T_n \in \{T_1, \ldots, T_g\}$ – Koutník, sec. 3, first paragraph [g of Koutník = T of the claim]), wherein each sub-state is assigned an index ranging from 1 to r (neurons of the hidden layer are partitioned into g modules of size k – Koutník sec. 3, first paragraph [k = r]), and … selecting a sub-state from the r sub-states comprises: 
selecting the sub-state having an index that is equal to the index of the time step modulo r (at each CW-RNN time step t, only the output of modules i that satisfy (t MOD Ti) = 0 are executed – Koutník, sec. 3 [note that the Ti in this equation could be changed to g to correspond to the claim language without inventive effort]).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Heess, Hochreiter, and Yu to select a sub-state equal to the index of a time step modulo an index number, as disclosed by Koutník, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the number of parameters of the network, improve the performance of the network, and speed up the network evaluation.  See Koutník, abstract.
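The two selection rules compared in the rejection of claim 16 can be illustrated with a minimal sketch; it is offered only as an aid to reading and is not taken from Koutník or from the application. The function names, the clock periods [1, 2, 4, 8], and the mapping of a remainder of 0 to index r (so that indices stay within the claimed range of 1 to r) are assumptions made for illustration.

```python
def select_substate_index(t: int, r: int) -> int:
    """Sub-state index for 1-indexed time step t, kept within 1..r.

    Assumption: a remainder of 0 is mapped to index r so that the result
    stays in the claimed index range of 1 to r.
    """
    if r <= 1:
        raise ValueError("r must be an integer greater than one")
    if t < 1:
        raise ValueError("time steps are indexed starting from 1")
    remainder = t % r
    return remainder if remainder != 0 else r


def active_modules(t: int, periods: list[int]) -> list[int]:
    """Indices i of modules whose clock period T_i satisfies (t MOD T_i) = 0."""
    return [i for i, period in enumerate(periods) if t % period == 0]


if __name__ == "__main__":
    # Claimed rule: exactly one sub-state is selected per time step.
    print([select_substate_index(t, 4) for t in range(1, 9)])
    # Rule quoted from Koutník: several modules may be executed at one step.
    print([active_modules(t, [1, 2, 4, 8]) for t in range(1, 5)])
```

Under these assumptions, the claimed rule cycles through the sub-states one at a time, whereas the rule quoted from Koutník may execute more than one module at a given time step.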


setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step (at each forward pass time step, only the block-rows of the hidden layer weight matrix WH that correspond to the executed modules are used for evaluation, such that each block row $W_{H_i} = W_{H_i}$ for $(t \bmod T_i) = 0$; $0$ otherwise [so that, when the internal state y(t-1) is multiplied by WH, only the values corresponding to the sub-state are considered] – Koutník, section 3, pp. 3-4).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Heess, Hochreiter, and Yu to set an internal state of the network to current values of a sub-state, as disclosed by Koutník, and an ordinary artisan could reasonably expect to have done so successfully.  Doing so would reduce the number of parameters of the network, improve the performance of the network, and speed up the network evaluation.  See Koutník, abstract.
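The block-row rule cited above can be sketched as a single hidden-state update in plain Python. This is a hedged illustration, not the implementation of Koutník or of the application: the function name `cw_rnn_step`, the use of tanh for the hidden activation, the dense (unstructured) weight matrices, and the choice to leave non-executed modules at their previous values are all assumptions.

```python
import math


def cw_rnn_step(W_H, W_I, y_prev, x_t, periods, k, t):
    """One hidden-state update in which only executed modules are recomputed.

    W_H: hidden weights, (g*k) x (g*k); W_I: input weights, (g*k) x len(x_t).
    periods: clock period T_i for each of the g modules; k: module size.
    Assumption: modules with (t MOD T_i) != 0 keep their previous values.
    """
    y_new = list(y_prev)
    for i, period in enumerate(periods):
        if t % period != 0:
            continue  # block row i is treated as zero: module i is skipped
        for row in range(i * k, (i + 1) * k):
            pre = sum(W_H[row][j] * y_prev[j] for j in range(len(y_prev)))
            pre += sum(W_I[row][j] * x_t[j] for j in range(len(x_t)))
            y_new[row] = math.tanh(pre)  # assumed activation f_H
    return y_new


if __name__ == "__main__":
    W_H = [[0.5, 0.0], [0.0, 0.5]]
    W_I = [[1.0], [1.0]]
    state = [0.0, 0.0]
    for t in (1, 2):
        state = cw_rnn_step(W_H, W_I, state, [1.0], periods=[1, 2], k=1, t=t)
        print(t, state)
```

With the assumed periods [1, 2], the second module is untouched at t = 1 and is first updated at t = 2, which mirrors the examiner's reading that only the values corresponding to the executed modules are considered.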

Claims 12, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, Heess, Hochreiter, and Yu and further in view of Dai et al., “Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning,” in MediaEval (2015) (“Dai”).
Regarding claim 12, neither Léon, Dayan, Koutník, Heess, Hochreiter, nor Yu appears to disclose explicitly the further limitations of the claim.  However, Dai discloses that “the dilated LSTM neural network is further configured to, for each of the time steps: 
pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step (LSTM model trained with another video dataset is adopted and the average [pooled] output from all the time-steps of the last LSTM layers is used as the feature [final output] – Dai, sec. 1.1, last paragraph before “Conventional features”).”
Léon, Dayan, Koutník, Heess, Hochreiter, Yu, and Dai are all in the field of machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Heess, Hochreiter, and Yu to pool the network outputs of previous time steps, as disclosed by Dai.  In so doing, an ordinary artisan before the effective filing date would merely be applying the known method of pooling outputs for previous time steps, as disclosed by Dai, to the known LSTM disclosed by Koutník, with the predictable result that historical data are taken into consideration in producing the output.  See KSR Int’l. Co. v. Teleflex Inc., 550 U.S. 398, 127 S. Ct. 1727, 167 L. Ed. 2d 705 (2007).

Regarding claim 14, Léon, as modified by Dayan, Koutník, Heess, Hochreiter, Yu, and Dai, discloses “pooling the network outputs comprises averaging the network outputs (LSTM model trained with another video dataset is adopted and the average output from all the time-steps of the last LSTM layers is used as the feature – Dai, sec. 1.1, last paragraph before “Conventional features”).”  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Heess, Hochreiter, and Yu to average the network outputs of the LSTM, as disclosed by Dai.  In so doing, an ordinary artisan before the effective filing date would merely be applying the known method of pooling outputs for previous time steps, as disclosed by Dai, to the known LSTM disclosed by Koutník, with the predictable result that historical data are taken into consideration in producing the output.  See KSR Int’l. Co. v. Teleflex Inc., 550 U.S. 398, 127 S. Ct. 1727, 167 L. Ed. 2d 705 (2007).

Regarding claim 17, Léon, as modified by Dayan, Koutník, Heess, Hochreiter, Yu, and Dai, discloses that “the LSTM neural network comprises a plurality of LSTM layers (LSTM model trained with another video dataset is adopted and the average output from all the time-steps of the last LSTM layers is used as the feature – Dai, sec. 1.1, last paragraph before “Conventional features” [the plural “layers” implies that the network has multiple layers]).”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Heess, Hochreiter, and Yu to introduce multiple LSTM layers, as disclosed by Dai.  In so doing, an ordinary artisan before the effective filing date would merely be adding layers to the known LSTM disclosed by Koutník, with the predictable result that the model becomes more capable of representing the data over different time scales.  See KSR Int’l. Co. v. Teleflex Inc., 550 U.S. 398, 127 S. Ct. 1727, 167 L. Ed. 2d 705 (2007).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and further in view of Koutník, Heess, Hochreiter, Yu, Dai, and Cox (US 20180247199) (“Cox”).
Regarding claim 13, neither Léon, Dayan, Koutník, Heess, Hochreiter, Yu, nor Dai appears to disclose explicitly the further limitations of the claim.  However, Cox discloses “pooling the network outputs comprises summing the network outputs (output sequence at each time step may be linearly summed to obtain a first sum – Cox, paragraph 86).”
Léon, Dayan, Koutník, Heess, Hochreiter, Yu, Dai, and Cox are all in the field of neural networks and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Hochreiter, and Yu to sum the network outputs, as disclosed by Cox.  In so doing, an ordinary artisan before the effective filing date would merely be applying the known method of pooling outputs, as disclosed by Cox, to the known LSTM disclosed by Koutník, with the predictable result that historical data are taken into consideration in producing the output.  See KSR Int’l. Co. v. Teleflex Inc., 550 U.S. 398, 127 S. Ct. 1727, 167 L. Ed. 2d 705 (2007).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, Heess, Hochreiter, Yu, and Dai and further in view of Vinyals et al. (US 20150356401) (“Vinyals”).
Regarding claim 15, Léon, as modified by Dayan, Koutník, Heess, Hochreiter, Yu, Dai, and Vinyals, discloses “pooling the network outputs comprises selecting a highest network output (system processes a selected output using a decoder LSTM neural network to generate a set of next output scores and then selects a highest-scoring output according to the next output scores as the next output in the target sequence – Vinyals, paragraphs 36-37).”
Léon, Dayan, Koutník, Heess, Hochreiter, Yu, Dai and Vinyals are all related to machine learning and are analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of  Léon, Dayan, Koutník, Heess, Hochreiter, Yu, and Dai to select a highest-scoring output among the outputs of the LSTM, as disclosed by Vinyals.  In so doing, an ordinary artisan before the effective filing date would merely be applying the known method of pooling outputs, as disclosed by Vinyals, to the known LSTM disclosed by Koutník, with the predictable result that historical data are taken into consideration in producing the output.  See KSR Int’l. Co. v. Teleflex Inc., 550 U.S. 398, 127 S. Ct. 1727, 167 L. Ed. 2d 705 (2007).
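The pooling variants addressed in the rejections of claims 12-15 (averaging, summing, and selecting a highest output over the current and up to a predetermined number of preceding time steps) can be sketched as follows. This is an illustrative sketch only, not code from Dai, Cox, Vinyals, or the application; the name `make_pooler`, the parameter `num_preceding`, and the use of scalar outputs are assumptions.

```python
from collections import deque


def make_pooler(num_preceding: int, mode: str = "mean"):
    """Return a function that pools each new output with earlier outputs.

    num_preceding: maximum number of preceding time steps retained.
    mode: "mean", "sum", or "max" (assumed names for the three variants).
    """
    if mode not in ("mean", "sum", "max"):
        raise ValueError(f"unknown pooling mode: {mode}")
    # The window holds the current output plus up to num_preceding earlier ones.
    history = deque(maxlen=num_preceding + 1)

    def pool(output: float) -> float:
        history.append(output)
        if mode == "sum":
            return sum(history)
        if mode == "max":
            return max(history)
        return sum(history) / len(history)

    return pool


if __name__ == "__main__":
    pool = make_pooler(num_preceding=2, mode="mean")
    print([pool(v) for v in (1.0, 3.0, 5.0, 7.0)])
```

During the first few time steps fewer than `num_preceding` earlier outputs exist, so the pooled value is computed over whatever history is available, consistent with the "up to a predetermined number" wording of claim 12.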

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159.  See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4 and 6-19 of U.S. Patent No. 10,679,126. Although the claims at issue are not identical, they are not patentably distinct from each other because the instant independent claims are in general broader than their counterparts in the ‘126 patent and the dependent claims are largely identical to each other.  Other than a few minor differences in wording, the claims of the reference patent contain limitations that are substantively identical to those of the instant application.  The only limitation of the reference patent that even arguably differs substantively from that of the instant application is that the reference patent recites receiving an intermediate representation and mapping the intermediate representation to a latent representation, whereas the instant application merely recites generating a latent representation.  In other .
Instant Application
Reference Patent
1. A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: 
a manager neural network subsystem that is configured to, at each of a plurality of time steps: 
generate a latent representation, in a latent space, of a current state of the environment at the time step; 
generate, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; and 

a worker neural network subsystem that is configured to, at each of the plurality of time steps: 
generate a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and 
select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores.
2. The system of claim 1, wherein generating the initial goal vector, comprises: processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a hidden state of the goal recurrent neural network to generate the initial goal vector and to update the hidden state of the goal recurrent neural network.
3. The system of claim 1, wherein generating the respective action score for each action in the predetermined set of actions comprises: 

projecting the final goal vector for the time step to the embedding space to generate a goal embedding vector; and 
modulating the respective action embedding vector for each action by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions.  


a manager neural network subsystem that is configured to, at each of a plurality of time steps: 
receive an intermediate representation of a current state of the environment at the time step, 
map the intermediate representation to a latent representation of the current state in a latent state space, 
process the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent 
pool the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; 
a worker neural network subsystem that is configured to, at each of the plurality of time steps: 
receive the intermediate representation of the current state of the environment at the time step, 
map the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions, 
project the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector, and 
modulate the respective action embedding vector for each action by the goal embedding vector to 
an action selection subsystem, wherein the action selection subsystem is configured to, at each of the plurality of time steps: 
receive an observation characterizing the current state of the environment at the time step, 
generate the intermediate representation from the observation, 
provide the intermediate representation as input to the manager neural network to generate the final goal vector for the time step, 
provide the intermediate representation and the final goal vector as input to the worker neural network to generate the action scores, and 
select an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.

2. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score.
5. The system of claim 3, wherein generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises:


processing the intermediate representation using an action score recurrent neural network, wherein the action score recurrent neural network is configured to receive the intermediate representation and to process the intermediate representation in accordance with a current hidden state of the action score recurrent neural network to generate the action embedding vectors and to update the hidden state of the action score neural network.

6. The system of claim 1, wherein the goal space has a higher dimensionality than the embedding space.
7. The system of claim 6, wherein the dimensionality of the final goal vector is at least ten times higher than the dimensionality of the goal embedding vector.
7. The system of claim 6, wherein the dimensionality of the goal space is at least ten times higher than the dimensionality of the embedding space.
8. The system of claim 1, wherein the worker neural network subsystem has been trained to generate action scores that maximize a time discounted combination of rewards, wherein each reward is a combination of an external reward received as a result of the agent performing selected actions and an intrinsic reward dependent 


9. The system of claim 8, wherein the manager neural network subsystem has been trained to generate initial goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions.
10. The system of claim 1, wherein generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: processing an observation characterizing the current state of the environment using a convolutional neural network.
3. The system of claim 1, wherein generating the intermediate representation from the observation comprises processing the observation using a convolutional neural network.
11. The system of claim 2, wherein the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in the plurality of time steps: 
receive a network input for the time step; 
select a sub-state from the r sub-states; and 


receive a network input for the time step; 
select a sub-state from the r sub-states; and 


pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step.
11. The system of claim 10, wherein the dilated LSTM neural network is further configured to, for each of the time steps: 
pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step.
13. The system of claim 12, wherein pooling the network outputs comprises summing the network outputs.
12. The system of claim 11, wherein pooling the network outputs comprises summing the network outputs.
14. The system of claim 12, wherein pooling the network outputs comprises averaging the network outputs.
13. The system of claim 11, wherein pooling the network outputs comprises averaging the network outputs.
15. The system of claim 12, wherein pooling the network outputs comprises selecting a highest network output.
14. The system of claim 11, wherein pooling the network outputs comprises selecting a highest network output.
16. The system of claim 11, wherein the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time 
selecting the sub-state having an index that is equal to the index of the time step modulo r.

selecting the sub-state having an index that is equal to the index of the time step modulo r.

16. The system of claim 10, wherein the LSTM neural network comprises a plurality of LSTM layers.
18. The system of claim 11, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: 
setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step.
17. The system of claim 10, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: 
setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step.
19. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts 
generating a latent representation, in a latent space, of a current state of the environment at the time step; 
generating, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; 
pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; 
generating a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and 
selecting an action from the predetermined set of actions to be performed by the agent at the time step using the action scores.

receiving an observation characterizing a current state of the environment at the time step; 
generating an intermediate representation of the current state of the environment at the time step from the observation; 
mapping the intermediate representation to a latent representation of the current state in a latent state space; 
processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update an internal state of the goal recurrent neural network, wherein the initial goal vector defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; 
pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time 
mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions; 
projecting the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector; 
modulating the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and 
selecting an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.

generating a latent representation, in a latent space, of a current state of the environment at the time step; 

pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; 
generating a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and 
selecting an action from the predetermined set of actions to be performed by the agent at the time step using the action scores.

receiving an observation characterizing a current state of the environment at the time step; 

mapping the intermediate representation to a latent representation of the current state in a latent state space; 
processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update an internal state of the goal recurrent neural network, wherein the initial goal vector defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; 
pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; 
mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions; 
projecting the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector; 
modulating the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and 
selecting an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RYAN C VAUGHN whose telephone number is (571)272-4849.  The examiner can normally be reached on M-R 7a-5:30p ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at 571-272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.






/R.C.V./             Examiner, Art Unit 2125    

/KAMRAN AFSHAR/             Supervisory Patent Examiner, Art Unit 2125