DETAILED ACTION
This action is in response to claims filed on the 9th of April, 2021 for application 16/548560 filed on the 22nd of August, 2019. 
Currently, claims 1-21 are pending and claims 1, 8, and 17 have been amended.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 6-10, 15-17, 19 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Shin, Hanul, et al. "Continual learning with deep generative replay." Advances in Neural Information Processing Systems. 2017 (“Shin”) in view of Ha, David, et al. "Recurrent world models facilitate policy evolution." Advances in Neural Information Processing Systems. 2018 (“Ha”).
Regarding claim 1, Shin teaches an autonomous or semi-autonomous system comprising: 
a temporal prediction network configured to process a first set of samples from an environment of the system during performance of a first task (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the generator G represents the temporal prediction network);
a controller configured to process the first set of samples from the environment (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence
$T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the solver S represents the controller) by the temporal prediction network (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence
$T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the generator G represents the temporal prediction network);
a preserved copy of the temporal prediction network (Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(b) details, “A new generator is trained to mimic a mixed data distribution of real samples x and replayed inputs $x'$ from previous generator.” Note: It is being interpreted that the previous generator represents a preserved copy of the temporal prediction network); and
a preserved copy of the controller (Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(c) details, “A new solver learns from real input-target pairs (x, y) and replayed input-target pairs
$(x', y')$ where replayed response $y'$ is obtained by feeding generated inputs into previous solver” Note: It is being interpreted that the previous solver represents a preserved copy of the controller);
wherein the preserved copy of the temporal prediction network and the preserved copy of the controller (Shin pg., 4, sec. 3.1 Proposed Method, fig. 1(elements a, b, c), “Training the [new] scholar model from… [the old] scholar [model] involves two independent procedures of training the generator and the solver of [the new scholar model] …[r]eal and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.” Note: It is being interpreted that the old scholar model represents the preserved copy of the temporal prediction network and the preserved copy of the controller and the replayed samples represent simulated rollouts); and
wherein the system is configured to interleave the simulated rollouts with a second set of samples from the environment during performance of a second task to preserve knowledge of the temporal prediction network for performing the first task (Shin, pg., 4, sub-sec. 3.1, fig. 1(elements a, b, c), “Training the scholar model from another scholar involves two independent procedures of training the generator and the solver. First, the new generator receives current task input x and replayed inputs $x'$ from previous tasks. Real and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.”).
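The mixing of real and replayed samples at a task-dependent ratio, as described in the passage of Shin cited above, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Shin's implementation: the function name `interleave_batch`, the parameter `real_ratio`, the batch labels, and the uniform sampling with replacement are all hypothetical choices made only for illustration.

```python
import random


def interleave_batch(real_samples, replayed_samples, batch_size, real_ratio, rng=None):
    """Build one training batch that mixes real and replayed samples.

    real_ratio (hypothetical parameter) is the fraction of the batch drawn
    from the current task; the remainder is drawn from replayed samples
    produced for earlier tasks.
    """
    rng = rng or random.Random()
    n_real = round(batch_size * real_ratio)
    n_replay = batch_size - n_real
    # Sample with replacement from each source and tag the origin.
    batch = [("real", rng.choice(real_samples)) for _ in range(n_real)]
    batch += [("replay", rng.choice(replayed_samples)) for _ in range(n_replay)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch


# Example: 8 samples per batch, 75% from the current task.
batch = interleave_batch([1, 2, 3], [10, 20], batch_size=8, real_ratio=0.75,
                         rng=random.Random(0))
```

With these example values the batch holds six real and two replayed samples in shuffled order; raising `real_ratio` weights the new task more heavily relative to the older tasks.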
Shin does not teach: and a hidden state output. 
However, Ha teaches: and a hidden state output (Ha, pg., 10, sec. A.4 Controller, “In the Car Racing task, this hidden state is the output vector h…”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s system in view of Ha to teach: and a hidden state output. The motivation to do so would be to have a controller that is able to select better actions using both known and hidden variables (Ha, pg., 2, sec. 2 Agent Model, “While V's role is to compress what the agent sees at each time frame, we also want to compress what happens over time.”).
Shin also does not teach: are configured to generate temporally consistent simulated rollouts.
However, Ha teaches: are configured to generate temporally consistent simulated rollouts (Ha, pg., 5, sec. 4.1 Experiment Setup, “We first build an OpenAI Gym environment interface by wrapping a gym.Env…interface over our M as if it were a real Gym environment, and then train our agent inside of this virtual environment instead of using the actual environment. Thus in our simulation…our agent will therefore only train entirely in a more efficient latent space environment…Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers…” Note: It is being interpreted that the virtual environment that simulates the real environment, in which both M and C are first trained, represents temporally consistent simulated rollouts for the agent interacting in the real environment thereafter).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s system in view of Ha to teach: are configured to generate temporally consistent simulated rollouts.  The motivation to do so would be to train reinforcement learning agents in a simulated environment using different permutations of the simulated environment to construct various agents with different real world policies based on previous experiences learned in the virtual environment (Ha, pg., 8, sec. 6 Discussion, “We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment. Agents that are trained incrementally to simulate reality may prove to be useful for transferring policies back to the real world. Our approach may complement sim2real approaches….” & see also Ha, pg., 1, sec 1 Introduction, “In fact, our M will be a large RNN that learns to predict the future given the past in an unsupervised manner. M's internal representations of memories of past observations and actions are perceived and exploited.”). 
Regarding claim 2, Shin as modified in view of Ha teaches the system of claim 1, further comprising an auto-encoder, wherein the auto-encoder is configured to embed the first set of samples from the environment of the system into a latent space (Ha, pg. 2, sec. 2, fig. 2 (diagram showing how V, M and C interact with the environment), “The environment provides our agent with a high dimensional input observation at each time step. This input is usually a 2D image frame that is part of a video sequence. The role of V is to learn an abstract, compressed representation of each observed input frame. Here, we use a Variational Autoencoder (VAE)…as V to compress each image frame into a latent vector z.”).
Regarding claim 3, Shin as modified in view of Ha teaches the system of claims 1 and 2, wherein the auto-encoder is a convolutional variational auto-encoder (Ha, pg., 9, sub-sec. A.2, fig. 3 (left), “We trained a Convolutional Variational Autoencoder (ConvVAE) model as our agent's V.”).
Regarding claim 6, Shin as modified in view of Ha teaches the system of claim 1, wherein the temporal prediction network comprises: a Long Short-Term Memory (LSTM) layer; and a Mixture Density Network (Ha, pg., 10, sub-sec. A.3, fig. 3 (right), “To implement M, we use an LSTM…recurrent neural network combined with a Mixture Density Network…as the output layer.”).
Regarding claim 7, Shin as modified in view of Ha teaches the system of claim 1, wherein the controller is configured to output an action distribution, and wherein sampled actions from the action distribution maximize an expected reward on the first task (Ha, pg., 2-3, sec. 2, figure 2, “C is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout of the environment… C is a simple single layer linear model that maps $z_t$ and $h_t$ directly to action $a_t$ at each time $a_t = W_c [z_t \; h_t] + b_c$. In this linear model, $W_c$ and $b_c$ are the parameters that map the concatenated input vector $[z_t \; h_t]$ to the output action vector $a_t$.”).
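The linear controller quoted from Ha, which maps the concatenated vector of $z_t$ and $h_t$ to an action through $W_c$ and $b_c$, can be illustrated with a short sketch. The function name `controller_action`, the list-based representation, and the example values are assumptions made for illustration only and are not taken from Ha or from the application.

```python
def controller_action(z_t, h_t, W_c, b_c):
    """Compute a_t = W_c [z_t h_t] + b_c for a single time step.

    z_t, h_t: sequences of floats (latent vector and hidden state).
    W_c: list of rows, one row per action dimension.
    b_c: sequence of floats, one bias per action dimension.
    """
    x = list(z_t) + list(h_t)  # concatenated input vector [z_t h_t]
    if len(W_c) != len(b_c) or any(len(row) != len(x) for row in W_c):
        raise ValueError("W_c and b_c shapes do not match the concatenated input")
    # One dot product plus bias per action dimension.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_c, b_c)]


# Example with a 2-dimensional z_t, a 1-dimensional h_t, and 2 action dimensions.
a_t = controller_action(
    z_t=[1.0, 2.0],
    h_t=[3.0],
    W_c=[[1, 0, 0], [0, 1, 1]],
    b_c=[0.5, -1.0],
)
# a_t == [1.5, 4.0]
```

The sketch shows only the deterministic linear map; how an action distribution is formed or sampled from it is outside what the quoted passage states.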
Regarding claim 8, Shin teaches a non-transitory computer-readable storage medium having software instructions stored therein, which, when executed by a processor, cause the processor to:
train a temporal prediction network on a first set of samples from an environment of an autonomous or semi-autonomous system during performance of a first task (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the generator G represents the temporal prediction network);
train a controller on the first set of samples from the environment (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence
$T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the solver S represents the controller) by the temporal prediction network (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence
$T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.” Note: It is being interpreted that the generator G represents the temporal prediction network);
store a preserved copy of the temporal prediction network (Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(b) details, “A new generator is trained to mimic a mixed data distribution of real samples x and replayed inputs $x'$ from previous generator.” Note: It is being interpreted that the previous generator represents a preserved copy of the temporal prediction network);
store a preserved copy of the controller (Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(c) details, “A new solver learns from real input-target pairs (x, y) and replayed input-target pairs
$(x', y')$ where replayed response $y'$ is obtained by feeding generated inputs into previous solver” Note: It is being interpreted that the previous solver represents a preserved copy of the controller);
from the preserved copy of the temporal prediction network and the preserved copy of the controller (Shin pg., 4, sec. 3.1 Proposed Method, fig. 1(elements a, b, c), “Training the [new] scholar model from… [the old] scholar [model] involves two independent procedures of training the generator and the solver of [the new scholar model] …[r]eal and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.” Note: It is being interpreted that the old scholar model represents the preserved copy of the temporal prediction network and the preserved copy of the controller and the replayed samples represent simulated rollouts); and
interleave the simulated rollouts with a second set of samples from the environment during performance of a second task to preserve knowledge of the temporal prediction network for performing the first task (Shin, pg., 4, sub-sec. 3.1, fig. 1(elements a, b, c), “Training the scholar model from another scholar involves two independent procedures of training the generator and the solver. First, the new generator receives current task input x and replayed inputs $x'$ from previous tasks. Real and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.”).
Shin does not teach: and a hidden state output. 
However, Ha teaches: and a hidden state output (Ha, pg., 10, sec. A.4 Controller, “In the Car Racing task, this hidden state is the output vector h…”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s computer-readable medium in view of Ha to teach: and a hidden state output. The motivation to do so would be to have a controller that is able to select better actions using both known and hidden variables (Ha, pg., 2, sec. 2 Agent Model, “While V's role is to compress what the agent sees at each time frame, we also want to compress what happens over time.”).
Shin also does not teach: generate temporally consistent simulated rollouts. 
However, Ha teaches: generate temporally consistent simulated rollouts (Ha, pg., 5, sec. 4.1 Experiment Setup, “We first build an OpenAI Gym environment interface by wrapping a gym.Env…interface over our M as if it were a real Gym environment, and then train our agent inside of this virtual environment instead of using the actual environment. Thus in our simulation…our agent will therefore only train entirely in a more efficient latent space environment…Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers. By learning only from raw image data collected from random episodes, it learns how to simulate the essential aspects of the game…After training, our controller learns to navigate around the virtual environment and escape from deadly fireballs launched by monsters generated by M. Our agent achieved an average score of 918 time steps in the virtual environment. We then took the agent trained inside of the virtual environment…” Note: It is being interpreted that the virtual environment that simulates the real environment, in which both M and C are first trained, represents temporally consistent simulated rollouts for the agent interacting in the real environment thereafter).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s computer-readable medium in view of Ha to teach: generate temporally consistent simulated rollouts. The motivation to do so would be to train reinforcement learning agents in a simulated environment, using different permutations of the simulated environment to construct various agents with different real world policies based on previous experiences learned in the virtual environment (Ha, pg., 8, sec. 6 Discussion, “We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment. Agents that are trained incrementally to simulate reality may prove to be useful for transferring policies back to the real world. Our approach may complement sim2real approaches….” & see also Ha, pg., 1, sec. 1 Introduction, “In fact, our M will be a large RNN that learns to predict the future given the past in an unsupervised manner. M's internal representations of memories of past observations and actions are perceived and exploited.”). 
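For illustration only, the following minimal sketch shows one way a rollout can be generated entirely inside a learned model rather than the actual environment, as the quoted passage of Ha describes. The functions `toy_world_model` and `toy_controller` and all of their coefficients are hypothetical stand-ins and are not taken from Ha or Shin; the only point shown is that each simulated step begins from the latent state produced by the previous step.

```python
def toy_world_model(z, a, h):
    """Hypothetical stand-in for M: returns the next latent z and hidden state h."""
    h_next = 0.9 * h + 0.1 * (z + a)      # hidden state summarizes the past
    z_next = z + 0.5 * a + 0.1 * h_next   # predicted next latent observation
    return z_next, h_next


def toy_controller(z, h):
    """Hypothetical stand-in for C: picks an action from z and h."""
    return -0.5 * z - 0.1 * h


def simulated_rollout(z0, h0, steps):
    """Roll the controller forward inside the model, never touching a real environment."""
    z, h = z0, h0
    trajectory = []
    for _ in range(steps):
        a = toy_controller(z, h)
        z_next, h_next = toy_world_model(z, a, h)
        trajectory.append((z, a, z_next))  # (state, action, predicted next state)
        z, h = z_next, h_next              # the next step starts where this one ended
    return trajectory


trajectory = simulated_rollout(z0=1.0, h0=0.0, steps=5)
```

Under this sketch, the predicted next state of step t is exactly the starting state of step t+1, which is one reading of a rollout being consistent over time.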
Regarding claim 9, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8, wherein the software instructions, when executed by the processor, further cause the processor to embed, with an auto-encoder, the first set of samples into a latent space (Ha, pg. 2, sec. 2, fig. 2 (diagram showing how V, M and C interact with the environment), “The environment provides our agent with a high dimensional input observation at each time step….”).
Regarding claim 10, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claims 8 and 9, wherein the auto-encoder is a convolutional variational auto-encoder (Ha, pg., 9, sub-sec. A.2, fig. 3 (left), “We trained a Convolutional Variational Autoencoder (ConvVAE) model as our agent's V.”).
Regarding claim 15, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8, wherein the temporal prediction network comprises: a Long Short-Term Memory (LSTM) layer; and a Mixture Density Network (Ha, pg., 10, sub-sec. A.3, fig. 3 (Right) “To implement M, we use an LSTM…recurrent neural network combined with a Mixture Density Network…as the output layer.”).
Regarding claim 16, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 11, wherein the controller is configured to output an action distribution, and wherein sampled actions from the action distribution maximize an expected reward on the first task (Ha, pgs., 2-3, sec. 2, figure 2, “C is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout of the environment… C is a simple single layer linear model that maps $z_t$ and $h_t$ directly to action $a_t$ at each time step: $a_t = W_c [z_t \; h_t] + b_c$. In this linear model, $W_c$ and $b_c$ are the parameters that map the concatenated input vector $[z_t \; h_t]$ to the output action vector $a_t$.”). 
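For illustration only, the linear controller quoted from Ha can be sketched as a matrix-vector product over the concatenated vector of z_t and h_t. The dimensions and numeric values below are assumptions chosen for the example and do not come from Ha; the function name `controller_action` is likewise hypothetical.

```python
def controller_action(z_t, h_t, W_c, b_c):
    """Compute a_t = W_c [z_t h_t] + b_c for plain Python lists."""
    x = list(z_t) + list(h_t)  # concatenated input vector [z_t h_t]
    if len(W_c) != len(b_c):
        raise ValueError("W_c and b_c must have one row/entry per action dimension")
    if any(len(row) != len(x) for row in W_c):
        raise ValueError("each row of W_c must have len(z_t) + len(h_t) entries")
    # One dot product plus bias per action dimension.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W_c, b_c)]


# Assumed toy sizes: 2-dimensional z_t, 1-dimensional h_t, 2-dimensional action.
z_t = [1.0, 2.0]
h_t = [3.0]
W_c = [[1.0, 0.0, 1.0],
       [0.0, 1.0, -1.0]]
b_c = [0.5, 0.0]
a_t = controller_action(z_t, h_t, W_c, b_c)
```

With these assumed values the first action component is 1.0 + 3.0 + 0.5 and the second is 2.0 - 3.0, so the hidden state h_t contributes to every action alongside z_t.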
training a temporal prediction network to perform a 1-time-step prediction on a first set of samples from an environment of the system during performance of a first task (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.
” Note: It is being interpreted that the generator G represents the temporal prediction network & see Shin pg., 6, sec. 4.2 Learning new domains, As Figure 3(a) details, training of the generative replay model on the first task, i.e., MNIST, lasted for 5 iterations. Note: It is being interpreted that 5 iterations represents 1-time-step); training a controller to generate an action distribution based on the first set of samples (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.
” Note: It is being interpreted that the solver S represents the controller & see Shin pgs. 7-8, fig. 6, fig. 7, “In Figure 6, we divided MNIST dataset into 5 disjoint subsets, each of which contains samples from only 2 classes… [w]hen both the input and output distributions are reconstructed, generative replay evoked previously learnt classes, and the model was able to discriminate all encountered classes.” Note: It is being interpreted that the output distribution represents the action distribution) by the temporal prediction network (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.
” Note: It is being interpreted that the generator G represents the temporal prediction network) wherein sampled actions of the action distribution (Shin pg., 3, sec. 3 Generative Replay, “In our continual learning framework, we define the sequence of tasks to be solved as a task sequence $T = (T_1, T_2, \ldots, T_N)$ of N Tasks. A task $T_i$ is to optimize a model towards an objective on data distribution $D_i$ from which the training examples $(x_i, y_i)$’s are drawn…[a] scholar H is a tuple $\langle G, S \rangle$ where a generator G is a generative model that produces real-like samples and a solver S is a task solving model parameterized by $\theta$.
” Note: It is being interpreted that the solver S represents the controller & see Shin pgs. 7-8, fig. 6, fig. 7, “In Figure 6, we divided MNIST dataset into 5 disjoint subsets, each of which contains samples from only 2 classes… [w]hen both the input and output distributions are reconstructed, generative replay evoked previously learnt classes, and the model was able to discriminate all encountered classes.” Note: It is being interpreted that the output distribution represents the action distribution) on the first task (Shin pg., 7, sec. 4.3 Learning new classes, As Figure 6 details, for the first task the GR model achieved 1.0 accuracy within 5 iterations); preserving the temporal prediction network and the controller as a preserved copy of the temporal prediction network and a preserved copy of the controller, respectively (Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(b) details, “A new generator is trained to mimic a mixed data distribution of real samples x and replayed inputs $x'$ from previous generator.” & see Shin pg., 4, sec. 3.1 Proposed Method, As Figure 1(c) details, “A new solver learns from real input-target pairs (x, y) and replayed input-target pairs $(x', y')$ where replayed response $y'$ is obtained by feeding generated inputs into previous solver” Note: It is being interpreted that the previous solver represents a preserved copy of the controller); from the preserved copy of the temporal prediction network and the preserved copy of the controller (Shin pg., 4, sec. 3.1 Proposed Method, fig. 1 (elements a, b, c), “Training the [new] scholar model from… [the old] scholar [model] involves two independent procedures of training the generator and the solver of [the new scholar model] …[r]eal and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.” Note: It is being interpreted that the old scholar model represents the preserved copy of the temporal prediction network and the preserved copy of the controller, and that the replayed samples represent simulated rollouts); and interleaving the simulated rollouts with a second set of samples from the environment during performance of a second task to preserve knowledge of the temporal prediction network for performing the first task (Shin, pg., 4, sub-sec. 3.1, fig. 1 (elements a, b, c), “Training the scholar model from another scholar involves two independent procedures of training the generator and the solver. First, the new generator receives current task input x and replayed inputs $x'$ from previous tasks. Real and replayed samples are mixed at a ratio that depends on the desired importance of a new task compared to the older tasks.”). 
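For illustration only, the mixing of real and replayed samples at a chosen ratio, as quoted from Shin, can be sketched as follows. The function name `interleave_batch`, the `real_ratio` parameter, and the sample labels are assumptions made for this example; Shin does not prescribe this particular batching scheme.

```python
import random


def interleave_batch(real_samples, replayed_samples, batch_size, real_ratio, rng):
    """Build one training batch that mixes current-task and replayed samples.

    real_ratio is the assumed fraction of the batch drawn from the current task;
    the remainder is drawn from samples replayed by the preserved (old) model.
    """
    if not 0.0 <= real_ratio <= 1.0:
        raise ValueError("real_ratio must lie in [0, 1]")
    n_real = round(batch_size * real_ratio)
    n_replay = batch_size - n_real
    batch = [("real", rng.choice(real_samples)) for _ in range(n_real)]
    batch += [("replay", rng.choice(replayed_samples)) for _ in range(n_replay)]
    rng.shuffle(batch)  # interleave so the two sources are not presented in blocks
    return batch


rng = random.Random(0)  # fixed seed so the example is repeatable
batch = interleave_batch(
    real_samples=["x1", "x2", "x3"],      # samples from the second (current) task
    replayed_samples=["x1'", "x2'"],      # samples generated by the preserved copy
    batch_size=8,
    real_ratio=0.25,
    rng=rng,
)
```

A lower `real_ratio` weights the batch toward replayed samples and so toward retaining the earlier task, which corresponds to the "desired importance" trade-off in the quoted passage.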
Shin does not teach: and a hidden state output; and maximize an expected reward.
However, Ha teaches: and a hidden state output (Ha, pg., 10, sec. A.4 Controller, “In the Car Racing task, this hidden state is the output vector h…”); maximize an expected reward (Ha, “C is responsible for determining the course of actions to take in order to maximize the expected cumulative reward…”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s method in view of Ha to teach: and a hidden state output; and maximize an expected reward. The motivation to do so would be to have a controller that is able to take better actions using both known and hidden variables (Ha, pg., 2, sec. 2 Agent Model, “While V's role is to compress what the agent sees at each time frame, we also want to compress what happens over time.”). 
Shin also does not teach: generating temporally consistent simulated rollouts. 
However, Ha teaches: generating temporally consistent simulated rollouts (Ha, pg., 5, sec. 4.1 Experiment Setup, “We first build an OpenAI Gym environment interface by wrapping a gym.Env…interface over our M as if it were a real Gym environment, and then train our agent inside of this virtual environment instead of using the actual environment. Thus in our simulation…our agent will therefore only train entirely in a more efficient latent space environment…Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers. By learning only from raw image data collected from random episodes, it learns how to simulate the essential aspects of the game…After training, our controller learns to navigate around the virtual environment and escape from deadly fireballs launched by monsters generated by M. Our agent achieved an average score of 918 time steps in the virtual environment. We then took the agent trained inside of the virtual environment and tested its performance on the original VizDoom environment. The agent obtained an average score of 1092 time steps, far beyond the required score of 750 time steps….” Note: It is being interpreted that the virtual environment, which simulates the real environment and in which both M and C are first trained, represents temporally consistent simulated rollouts for the agent interacting in the real environment thereafter).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin’s method in view of Ha to teach: generating temporally consistent simulated rollouts. The motivation to do so would be to train reinforcement learning agents in a simulated environment, using different permutations of the simulated environment to construct various agents with different real world policies based on previous experiences learned in the virtual environment (Ha, pg., 8, sec. 6 Discussion, “We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment. Agents that are trained incrementally to simulate reality may prove to be useful for transferring policies back to the real world. Our approach may complement sim2real approaches….” & see also Ha, pg., 1, sec. 1 Introduction, “In fact, our M will be a large RNN that learns to predict the future given the past in an unsupervised manner. M's internal representations of memories of past observations and actions are perceived and exploited.”).
Regarding claim 19, Shin as modified in view of Ha teaches the method of claim 17, further comprising embedding, with a convolutional auto-encoder, the first set of samples collected during performance of the first task into a latent space (Ha, pg. 2, sec. 2, fig. 2 (diagram showing how V, M and C interact with the environment), “The environment provides our agent with a high dimensional input observation at each time step. This input is usually a 2D image frame that is part of a video sequence. The role of V is to learn an abstract, compressed representation of each observed input frame. Here, we use a Variational Autoencoder (VAE)…as V to compress each image frame into a latent vector z.”).
Regarding claim 21, Shin as modified in view of Ha teaches the method of claim 17, wherein the temporal prediction network comprises: a Long Short-Term Memory (LSTM) layer; and a Mixture Density Network (Ha, pg., 10, sub-sec. A.3, fig. 3 (Right), “To implement M, we use an LSTM…recurrent neural network combined with a Mixture Density Network…as the output layer.”).
Claims 4, 11-13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shin, Hanul, et al. "Continual learning with deep generative replay." Advances in Neural Information Processing Systems. 2017 (“Shin”) in view of Ha, David et al., "Recurrent world models facilitate policy evolution." Advances in Neural Information Processing Systems. (2018) (“Ha”) and in view of Rusu, Andrei A., et al. "Policy distillation." arXiv preprint arXiv:1511.06295 (2015) (“Rusu”). 
Regarding claim 4, Shin as modified in view of Ha teaches the system of claim 1, but does not teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller. 
However, Rusu teaches wherein the controller is a stochastic gradient-descent based reinforcement learning controller (Rusu, pg., 10, sec. A, Table A1, “We used the…variation of minibatch stochastic gradient descent to train student networks.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s system in view of Shin and in view of Rusu to teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller. The motivation to do so would be to implement policy distillation, which decreases the number of parameters in the neural network and also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Regarding claim 11, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8, but does not teach wherein training the controller utilizes policy distillation including a cross-entropy loss function with a specific temperature. 
However, Rusu teaches wherein training the controller utilizes policy distillation (Rusu, pg., 5, sub-sec., 4.1,  fig., 2(a), “Single task policy distillation is a process of data generation by the teacher network (a trained DQN agent) and supervised training by the student network…”) including a cross-entropy loss function (Rusu, pg., 4, sub-sec. 3.2, “In the third case, we adopt the distillation setup…and use the Kullback-Leibler divergence (KL) with temperature             
                τ
            
        :             
                
                    
                        L
                    
                    
                        K
                        L
                    
                
                
                    
                        
                            
                                D
                            
                            
                                T
                            
                        
                        ,
                         
                        
                            
                                θ
                            
                            
                                S
                            
                        
                    
                
                =
                
                    
                        ∑
                        
                            i
                            =
                            1
                        
                        
                            |
                            D
                            |
                        
                    
                    
                        s
                        o
                        f
                        t
                        m
                        a
                        x
                        
                            
                                
                                    
                                        
                                            
                                                q
                                            
                                            
                                                i
                                            
                                            
                                                T
                                            
                                        
                                    
                                    
                                        τ
                                    
                                
                            
                        
                        l
                        n
                        
                            
                                s
                                o
                                f
                                t
                                m
                                a
                                x
                                (
                                
                                    
                                        
                                            
                                                q
                                            
                                            
                                                i
                                            
                                            
                                                T
                                            
                                        
                                    
                                    
                                        τ
                                    
                                
                                )
                            
                            
                                s
                                o
                                f
                                t
                                m
                                a
                                x
                                (
                                
                                    
                                        q
                                    
                                    
                                        i
                                    
                                    
                                        S
                                    
                                
                                )
                            
                        
                    
                
            
.” Note: Since the dataset is given as $D^T$, the entropy $H(D^T)$ is being interpreted as a fixed constant and thus, minimizing the cross-entropy loss function is equivalent to minimizing the KL divergence.) with a specific temperature (Rusu pg., 6, sub-sec. 4.2, “Passing Q-values through a softmax function with a temperature parameter and minimizing the KL divergence cost strikes a convenient balance between these two extremes. We determine empirically that a low temperature τ = 0.01 is best suited for distillation in this domain.”).
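Note: The equivalence relied on in the note above can be sketched as follows. The shorthand $p_i = \mathrm{softmax}(q_i^T/\tau)$ and $s_i = \mathrm{softmax}(q_i^S)$, and the index $a$ over actions, are introduced here for illustration only and are not Rusu's notation; the entropy term corresponds to what the note writes as $H(D^T)$.

```latex
\begin{align*}
L_{KL}(D^T, \theta_S)
  &= \sum_{i=1}^{|D|} \sum_{a} p_i(a) \ln \frac{p_i(a)}{s_i(a)} \\
  &= \underbrace{-\sum_{i=1}^{|D|} \sum_{a} p_i(a) \ln s_i(a)}_{\text{cross-entropy}}
   \;-\; \underbrace{\Bigl(-\sum_{i=1}^{|D|} \sum_{a} p_i(a) \ln p_i(a)\Bigr)}_{\text{entropy of the teacher outputs}}
\end{align*}
```

Under this reading, $p_i$ depends only on the fixed teacher outputs and on $\tau$, not on the student parameters $\theta_S$, so the entropy term is constant with respect to $\theta_S$ and the cross-entropy and the KL divergence share the same minimizer.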
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s system in view of Shin and in view of Rusu to teach wherein training the controller utilizes policy distillation including a cross-entropy loss function with a specific temperature. The motivation to do so would be to implement policy distillation which decreases the number of parameters in the neural network, but also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Regarding claim 12, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8, but does not teach wherein the specific temperature is 0.01.
However, Rusu teaches wherein the specific temperature is 0.01 (Rusu pg., 6, sub-sec. 4.2, “Passing Q-values through a softmax function with a temperature parameter and minimizing the KL divergence cost strikes a convenient balance between these two extremes. We determine empirically that a low temperature τ = 0.01 is best suited for distillation in this domain.”).
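Note: The effect of a low temperature on the softmax of Q-values can be illustrated with a minimal sketch. The function name `softmax_with_temperature` and the sample Q-values are hypothetical and are not taken from Rusu; only the value τ = 0.01 comes from the cited passage.

```python
import math


def softmax_with_temperature(q_values, tau):
    """Softmax of q_values / tau, shifted by the maximum for numerical stability."""
    scaled = [q / tau for q in q_values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Q-values for three actions in a single state (illustration only).
q_values = [1.0, 1.1, 0.9]

# tau = 1.0 leaves the distribution nearly flat across the three actions.
soft = softmax_with_temperature(q_values, tau=1.0)

# tau = 0.01 concentrates almost all probability mass on the highest Q-value.
sharp = softmax_with_temperature(q_values, tau=0.01)
```

In this sketch the highest-valued action is the same under both temperatures; the low temperature only sharpens the distribution around it.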
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s non-transitory computer-readable storage medium in view of Shin and in view of Rusu to teach wherein the specific temperature is 0.01. The motivation to do so would be to implement policy distillation which decreases the number of parameters in the neural network, but also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Regarding claim 13, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8, but does not teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller. 
However, Rusu teaches wherein the controller is a stochastic gradient-descent based reinforcement learning controller (Rusu, pg., 10, sec. A, Table A1, “We used the…variation of minibatch stochastic gradient descent to train student networks.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s non-transitory computer-readable storage medium in view of Shin and in view of Rusu to teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller. The motivation to do so would be to implement policy distillation which decreases the number of parameters in the neural network, but also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Regarding claim 18, Shin as modified in view of Ha teaches the method of claim 17, but does not teach wherein training the controller utilizes policy distillation including a cross-entropy loss function wherein the specific temperature is 0.01. 
However, Rusu teaches wherein training the controller utilizes policy distillation (Rusu, pg., 5, sub-sec., 4.1, fig., 2(a), “Single task policy distillation is a process of data generation by the teacher network (a trained DQN agent) and supervised training by the student network…”) including a cross-entropy loss function (Rusu, pg., 4, sub-sec. 3.2, “In the third case, we adopt the distillation setup of Hinton et al. (2014) and use the Kullback-Leibler divergence (KL) with temperature τ: $L_{KL}(D^T, \theta_S) = \sum_{i=1}^{|D|} \mathrm{softmax}\left(\frac{q_i^T}{\tau}\right) \ln \frac{\mathrm{softmax}\left(\frac{q_i^T}{\tau}\right)}{\mathrm{softmax}(q_i^S)}$.” Note: Since the dataset is given as $D^T$, the entropy $H(D^T)$ is being interpreted as a fixed constant and thus, minimizing the cross-entropy loss function is equivalent to minimizing the KL divergence.) wherein the specific temperature is 0.01 (Rusu pg., 6, sub-sec. 4.2, “Passing Q-values through a softmax function with a temperature parameter and minimizing the KL divergence cost strikes a convenient balance between these two extremes. We determine empirically that a low temperature τ = 0.01 is best suited for distillation in this domain.”).
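Note: The loss quoted from Rusu above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Rusu's implementation: the function names, the list-of-lists input layout, and the sample Q-values in the usage note are hypothetical, and only the form of the loss and the default τ = 0.01 follow the cited passages.

```python
import math


def softmax(values, temperature=1.0):
    """Softmax of values / temperature, shifted by the maximum for stability."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_q, student_q, temperature=0.01):
    """Sum over states of KL(softmax(q_T / tau) || softmax(q_S)).

    teacher_q and student_q are lists of per-state Q-value lists
    (a hypothetical layout chosen for this sketch).
    """
    loss = 0.0
    for q_t, q_s in zip(teacher_q, student_q):
        p = softmax(q_t, temperature)  # sharpened teacher distribution
        s = softmax(q_s)               # student distribution, no temperature
        # Terms with p == 0 contribute nothing to the KL sum and are skipped.
        loss += sum(pi * math.log(pi / si) for pi, si in zip(p, s) if pi > 0.0)
    return loss
```

For example, `kl_distillation_loss([[1.0, 2.0, 0.5]], [[0.2, 0.1, 0.3]])` evaluates the loss for one state with hypothetical Q-values; the result is positive because the student distribution differs from the sharpened teacher distribution.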
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s method in view of Shin and in view of Rusu to teach wherein training the controller utilizes policy distillation including a cross-entropy loss function wherein the specific temperature is 0.01. The motivation to do so would be to implement policy distillation which decreases the number of parameters in the neural network, but also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Shin, Hanul, et al. "Continual learning with deep generative replay." Advances in Neural Information Processing Systems. 2017 (“Shin”) in view of Ha, David et al., "Recurrent world models facilitate policy evolution." Advances in Neural Information Processing Systems. (2018) (“Ha”) and in view of Oh, Junhyuk, et al. "Self-imitation learning." arXiv preprint arXiv:1806.05635 (2018) (“Oh”).
Regarding claim 5, Shin as modified in view of Ha teaches the system of claim 1 but does not teach wherein the controller comprises an A2C algorithm.
However, Oh teaches wherein the controller comprises an A2C algorithm (Oh, pg., 3, sec. 3, Algorithm 1(Eq.4), “In this paper, we focus on the combination of advantage actor-critic (A2C)… and self-imitation learning (A2C+SIL)...The objective of A2C (La2c) is given by [equations (4), (5), and (6)].”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s system in view of Shin and in view of Oh to teach wherein the controller comprises an A2C algorithm. The motivation to do so would be to use the A2C algorithm as a baseline algorithm to combine it with experience replay (Oh, pg., 2, sec.2 “In fact, actor-critic framework can also utilize experience replay [which allows for good past experiences to be experienced].”).
Regarding claim 14, Shin as modified in view of Ha teaches the non-transitory computer-readable storage medium of claim 8 but does not teach: wherein the controller comprises an A2C algorithm.
However, Oh teaches wherein the controller comprises an A2C algorithm (Oh, pg., 3, sec. 3, Algorithm 1(Eq.4), “In this paper, we focus on the combination of advantage actor-critic (A2C)… and self-imitation learning (A2C+SIL)…The objective of A2C (La2c) is given by [equations (4), (5), and (6)].”). 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s non-transitory computer-readable storage medium in view of Shin and in view of Oh to teach wherein the controller comprises an A2C algorithm. The motivation to do so would be to use the A2C algorithm as a baseline algorithm to combine it with experience replay (Oh, pg., 2, sec.2 “In fact, actor-critic framework can also utilize experience replay [which allows for good past experiences to be experienced].”).
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Shin, Hanul, et al. "Continual learning with deep generative replay." Advances in Neural Information Processing Systems. 2017 (“Shin”) in view of Ha, David et al., "Recurrent world models facilitate policy evolution." Advances in Neural Information Processing Systems. (2018) (“Ha”) and in view of Rusu, Andrei A., et al. "Policy distillation." arXiv preprint arXiv:1511.06295 (2015) (“Rusu”) and further in view of Oh, Junhyuk, et al. "Self-imitation learning." arXiv preprint arXiv:1806.05635 (2018) (“Oh”).
Regarding claim 20, Shin as modified in view of Ha teaches the method of claim 17, but does not teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller comprising an A2C algorithm. 
However, Rusu teaches wherein the controller is a stochastic gradient-descent based reinforcement learning controller (Rusu, pg., 10, sec. A, Table A1, “We used the …variation of minibatch stochastic gradient descent to train student networks.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s method in view of Shin and in view of Rusu to teach wherein the controller is a stochastic gradient-descent based reinforcement learning controller. The motivation to do so would be to implement policy distillation which decreases the number of parameters in the neural network, but also allows multiple policies learned for different tasks to be combined into one single multi-task policy (Rusu, pg.,1, sec. I, “The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts….”).
Further, Oh teaches wherein the controller comprises an A2C algorithm (Oh, pg., 3, sec. 3, Algorithm 1(Eq.4), “In this paper, we focus on the combination of advantage actor-critic (A2C)… and self-imitation learning (A2C+SIL)...The objective of A2C (La2c) is given by [equations (4), (5), and (6)].”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ha’s method in view of Shin and in view of Oh to teach wherein the controller comprises an A2C algorithm. The motivation to do so would be to use the A2C algorithm as a baseline algorithm to combine it with experience replay (Oh, pg., 2, sec.2 “In fact, actor-critic framework can also utilize experience replay [which allows for good past experiences to be experienced].”).
Response to Arguments
Applicant's arguments filed 04/09/2021 have been fully considered but they are not persuasive. 
Applicant argues that the combination does not teach the amended limitation of generate/generating temporally consistent simulated rollouts. Examiner respectfully disagrees. The rejection relies on Ha to teach generate/generating temporally consistent simulated rollouts as discussed above.
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM CLARK STANDKE whose telephone number is (571)270-1806.  The examiner can normally be reached on 7:00-5:00 M-Th.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to 






/ADAM C STANDKE/Examiner, Art Unit 2122                                                                                                                                                                                                        
/ERIC NILSSON/Primary Examiner, Art Unit 2122