DETAILED ACTION
Status of Claims
This is a non-final office action on the merits in response to the arguments and amendments filed on 16 August 2022 and the request for continued examination filed on 16 August 2022. 
Claim 21 is new. Claims 1 and 18 were amended. Claims 1-14, 16-19, and 21 are currently pending and have been examined. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 9-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims not listed below are rejected for dependency.

Claim 9 recites “the image data” and “the proprioceptive data.” There is no antecedent basis for these terms in the claims. The lack of antecedent basis makes it unclear what these terms are referencing, rendering the claim indefinite. 



Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 18, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Nair et al. (“Overcoming Exploration in Reinforcement Learning with Demonstrations”). 

Regarding Claims 1 and 18: Nair discloses a computer-implemented method of training a neural network to generate commands for controlling an agent to perform a task in an environment, the method comprising:
obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task (“We limit ourselves to 100 human demonstrations collected via teleoperation in virtual reality. Using these demonstrations, we are able to solve a complex robotics task in simulation that is beyond the capability of both reinforcement learning and imitation learning.” See at least Page 1. Also: “We recorded 100 demonstrations to stack 6 blocks, and use subsets of these demonstrations as demonstrations for stacking fewer blocks.” See at least Page 5).
for each task stage in a plurality of task stages, generating a respective set of demonstration states using states defined in one or more of the datasets characterizing the corresponding performance of the task, wherein each task stage defines one of a plurality of portions of the task (“To show that our method can solve more complex tasks with longer horizon and sparser reward, we study the task of block stacking in a simulated environment as shown in Fig. 1 with the same physical properties as the previous experiments. Our experiments show that our approach can solve the task in full and learn a policy to stack 6 blocks with demonstrations and RL. To measure and communicate various properties of our method, we also show experiments on stacking fewer blocks, a subset of the full task. We initialize the task with blocks at 6 random locations x1...x6. We also provide 6 goal locations g1...g6. To form a tower of blocks, we let g1 = x1 and gi = gi−1 + (0, 0, 5cm) for i ∈ 2, 3, 4, 5. By stacking N blocks, we mean N blocks reach their target locations. Since the target locations are always on top of x1, we start with the first block already in position. So stacking N blocks involves N −1 pick-and-place actions. To solve stacking N, we allow the agent 50∗(N −1) timesteps. This means that to stack 6 blocks, the robot executes 250 actions or 5 seconds. We recorded 100 demonstrations to stack 6 blocks, and use subsets of these demonstrations as demonstrations for stacking fewer blocks.” See at least Page 5). 
using the dataset, training a neural network to generate commands for controlling the agent based on agent data characterizing the environment and the agent; wherein the training the neural network comprises:  using the neural network to generate a plurality of sets of one or more commands (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. Also: “To train our models, we use Adam [37] as the optimizer with learning rate 10−3 . We use N = 1024, ND = 128, λ1 = 10−3 , λ2 = 1.0/ND. The discount factor γ is 0.98. We use 100 demonstrations to initialize RD. The function approximators π and Q are deep neural networks with ReLU activations.” See at least Page 4), comprising: 
randomly selecting a respective initial stage from the respective set of demonstration states generated for the task stage using states defined in the one or more of the datasets characterizing the corresponding performance of the task, and providing, as input to the neural network, data characterizing the environment and the agent at the respective initial stage (“We introduce … a method of resetting from demonstration states that significantly improves and speeds up training policies.” See at least Page 1. Also: “To overcome the problem of sparse rewards in very long horizon tasks, we reset some training episodes using states and goals from demonstration episodes. Restarts from within demonstrations expose the agent to higher reward states during training.” See at least Page 4. Also: “To reset to a demonstration state, we first sample a demonstration D = (x0, u0, x1, u1, ...xN , uN ) from the set of demonstrations. We then uniformly sample a state xi from D. As in HER, we use the final state achieved in the demonstration as the goal. We roll out the trajectory with the given initial state and goal for the usual number of timesteps.” See at least Page 4. Also: “We find that initializing rollouts from within demonstration states greatly helps to learn to stack 5 and 6 blocks but hurts training with fewer blocks, as shown in Fig. 5. Note that even where resets from demonstration states helps the final success rate, learning takes off faster when this technique is not used. However, since stacking the tower higher is risky, the agent learns the safer behavior of stopping after achieving a certain reward. Resetting from demonstration states alleviates this problem because the agent regularly experiences higher rewards.” See at least Page 7). 
for each set of commands generating at least one corresponding reward value indicative of how successfully the task is carried out upon implementation of the set of commands by the agent (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. “To enable solving the longer horizon tasks of stacking 4 or more blocks, we use the “step” reward : 
[Equation image: media_image1.png]
Note the step reward is still very sparse; the robot only sees the reward change when it moves a block into its target location.” See at least Page 6); and
adjusting one or more parameters of the neural network based on the dataset, the sets of commands and the corresponding reward values (“DDPG maintains an actor function π(s) with parameters θπ, a critic function Q(s, a) with parameters θQ, and a replay buffer R as a set of tuples (st, at, rt, st+1) for each transition experienced. DDPG alternates between running the policy to collect experience and updating the parameters.” See at least Page 3. Also: “During each training step, DDPG samples a minibatch consisting of N tuples from R to update the actor and critic networks. DDPG minimizes the following loss L w.r.t. θQ to update the critic” See at least page 3). 
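
For illustration only, the reset-from-demonstration procedure quoted from Nair above (Page 4) and the episode length quoted from Nair (Page 5) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not Nair's implementation: the function names `sample_reset` and `episode_horizon`, and the representation of a demonstration as a list of (state, action) pairs, are hypothetical and do not appear in the reference.

```python
import random


def sample_reset(demonstrations, rng=random):
    """Pick an initial state and a goal from recorded demonstrations.

    Assumption: each demonstration is a list of (state, action) pairs
    [(x0, u0), (x1, u1), ..., (xN, uN)]. This layout is hypothetical.
    """
    demo = rng.choice(demonstrations)   # sample a demonstration D
    index = rng.randrange(len(demo))    # uniformly sample a state x_i from D
    initial_state = demo[index][0]
    goal = demo[-1][0]                  # final demonstrated state serves as the goal
    return initial_state, goal


def episode_horizon(num_blocks, steps_per_pick_and_place=50):
    """Timesteps allowed for stacking N blocks: 50 * (N - 1), per the quoted passage."""
    return steps_per_pick_and_place * (num_blocks - 1)
```

Under these assumptions, `episode_horizon(6)` evaluates to 250, which matches the 250 actions the quoted passage gives for stacking 6 blocks.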

Regarding Claim 19: Nair discloses the above limitations. Nair discloses an agent operative to perform commands generated by the neural network; at least one image capture device operative to capture images of an environment and generate image data encoding the images; and at least one device operative to capture proprioceptive data comprising the one or more variables describing configurations of the agent (“Fig. 1: We present a method using reinforcement learning to solve the task of block stacking shown above. The robot starts with 6 blocks labelled A through F on a table in random positions and a target position for each block. The task is to move each block to its target position. The targets are marked in the above visualization with red spheres which do not interact with the environment. These targets are placed in order on top of block A so that the robot forms a tower of blocks. This is a complex, multi-step task where the agent needs to learn to successfully manage multiple contacts to succeed. Frames from rollouts of the learned policy are shown. A video of our experiments can be found at: http://ashvin.me/demoddpg-website.” See at least Page 2 and Fig. 1). 

Claims 16 and 17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. (“Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs”).

Regarding Claim 16: Li discloses a method of performing a task, the method comprising: training a neural network to generate commands for controlling an agent to perform the task in an environment, by the method according to claim 1 (See rejection of claim 1 above); and a plurality of times performing the steps of: (i) capturing images of an environment and generating image data encoding the images; (ii) capturing proprioceptive data comprising one or more variables at least describing positions of one or more components of the agent; (iii) transmitting the image data and the proprioceptive data to the neural network, the neural network generating at least one command based on the image data and the proprioceptive data; and (iv) transmitting the command to the agent, the agent being operative to perform the command within the environment; whereby the neural network successively generates a sequence of commands to control the agent to perform the task (“The Open Racing Car Simulator (TORCS, Wymann et al. (2000)) is a popular simulator environment for research in autonomous vehicles. We packaged it into a client-server framework with APIs similar to OpenAI Gym (Brockman et al., 2016). Our framework produces a realistic dashboard view and driving related information, and communicates with the policy (client) through TCP packets, so that the policy can be written in languages other than C++. In particular, we implemented our policy using the TensorFlow Python API (Abadi et al., 2016). This framework and the code for reproducing the experiments are available at https://github.com/YunzhuLi/InfoGAIL. All of our experiments are conducted in the TORCS environment. The demonstrations are collected from human experts, by manually driving along the race track, and demonstrate typical behaviors like staying within lanes, avoiding collisions with other cars, and surpassing other cars. 
The policy accepts raw visual inputs as the only external inputs for the state, and produces a three-dimensional action that consists of steering, acceleration, and braking” See at least Page 5 and Fig. 1. Also: “We conduct a series of ablation experiments to demonstrate that our proposed techniques are indeed crucial for learning an effective policy. The experiments consider a long-term setting: our policy drives a car on the race track along with other cars, whereas the human expert provides trajectories by trying to drive as fast as possible without collision. Reward augmentation is performed by adding a reward that encourages the car to drive faster to the imitation learning objective. The performance of the policy is determined by the average distance. Therefore, a longer average rollout distance indicates a better policy” See at least Page 7. Also: “Figure 6. Visual inputs for our policy in the long-term experiment. During the rollout, our policy is able to pass several other cars. A video is included in the supplementary material.” See at least Fig. 6. Also: “our policy requires certain auxiliary information as internal input to serve as a short-term memory. These auxiliary information can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) velocity at time t, which is a three dimensional vector; 2) actions at time t−1 and t−2, which are both three dimensional vectors; 3) damage of the car, which is a real value” See at least Pages 5 and 6).

Regarding Claim 17: Li discloses the above limitations. Additionally, Li discloses in which the step of obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task, is performed by controlling the agent to perform the task a plurality of times, and for each performance generating a respective dataset characterizing the performance (“The demonstrations are collected from human experts, by manually driving along the race track, and demonstrate typical behaviors like staying within lanes, avoiding collisions with other cars, and surpassing other cars” See at least Page 5). 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 2-10, 12, 14, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Nair et al. (“Overcoming Exploration in Reinforcement Learning with Demonstrations”) in view of Li et al. (“Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs”).

Regarding Claim 2: Nair discloses the above limitations. Nair does not appear to disclose adjusting the neural network based on a hybrid energy function, the hybrid energy function including both an imitation reward value derived using the datasets and the generated sets of commands, and task reward term calculated using the generated reward values.
	However, Li teaches adjusting the one or more parameters of the neural network comprises adjusting the neural network based on a hybrid energy function, the hybrid energy function including both an imitation reward value derived using the datasets and the generated sets of commands, and task reward term calculated using the generated reward values (“Take a policy step from θi to θi+1, using the TRPO update rule with the following objective … (with reward augmentation): Eˆ χi [Dωi+1 (s, a)] − λ0η(πθi ) − λ1LI (πθi , Qψi+1 )” See at least Page 5. Also: “This motivates the introduction of reward augmentation, a general framework to incorporate prior knowledge in imitation learning by providing additional incentives to the agent without interfering with the imitation learning process. We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4).
	Nair provides a reinforcement learning system for training a robot agent to perform a task, upon which the claimed invention’s hybrid demonstration and reward function can be seen as an improvement. However, Li demonstrates that the prior art already knew of training an agent according to a hybrid demonstration and reward function. One of ordinary skill in the art could have easily applied the techniques of Li to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Li would have resulted in an improved system that could learn to outperform the demonstration data (Li, Page 8). As such, the application of Li and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li.
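
For illustration only, the reward-augmented objective quoted from Li (Pages 4 and 5) can be sketched as a single scalar that combines a discriminator-based imitation term with a surrogate task reward weighted by λ0. This is a minimal Python sketch under stated assumptions, not Li's implementation: the function name `hybrid_objective` and its arguments are hypothetical, the expectations are approximated by sample means, and the λ1 LI term and the entropy term of the quoted objective are omitted for brevity.

```python
def hybrid_objective(discriminator_scores, surrogate_rewards, lambda0):
    """Scalar to be minimized with respect to the policy parameters.

    discriminator_scores: samples of D(s, a) on policy rollouts (imitation signal).
    surrogate_rewards:    samples of r(s) on policy rollouts (task signal).
    lambda0:              positive weight on the surrogate reward.
    All three names are hypothetical and used only for this sketch.
    """
    if lambda0 <= 0:
        raise ValueError("lambda0 must be positive")
    imitation_term = sum(discriminator_scores) / len(discriminator_scores)
    task_term = sum(surrogate_rewards) / len(surrogate_rewards)
    # Imitation term minus the weighted task reward, as in the quoted expression
    # E[D(s, a)] - lambda0 * eta(pi), with eta(pi) = E[r(s)].
    return imitation_term - lambda0 * task_term
```

Because the objective is minimized, a higher surrogate reward lowers it. This is the sense in which the quoted passage describes part of the reinforcement signal as coming from the surrogate reward and part from the discriminator.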

Regarding Claim 3: Nair in view of Li teaches the above limitations. Additionally, Li teaches using the datasets to train a discriminator neural network (“To determine a reasonable distance metric, a discriminator is jointly trained to distinguish expert trajectories from ones produced by the policy” See at least Page 1. Also: “the neural network is a discriminator that tries to differentiate the two distributions.” See at least Page 2), and deriving the imitation reward value using the discriminator neural network and the sets of one or more commands (“the neural network is a discriminator that tries to differentiate the two distributions. The formal GAIL objective is denoted as minθ maxω V (θ, ω), where V (θ, ω) is Eπθ [log Dω(s, a)] + EπE [log(1 − Dω(s, a))] − λH(πθ)” See at least Page 2. Also: “We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert” See at least Page 4). The motivation to combine Nair and Li is the same as explained under claim 2 above, and is incorporated herein.

Regarding Claim 4: Nair in view of Li teaches the above limitations. Additionally, Li teaches wherein input data to the discriminator neural network includes object-centric data that specifies at least the positions of other objects in the environment, wherein the other objects do not include the agent (“the discriminator network Dω(s, a)” See at least Page 4. Also: “The discriminator Dω accepts three elements as input: a resized image with lower resolution, the auxiliary information, and the current action” See at least page 6. Also: See Fig 2, Noting the green car in Input Image). The motivation to combine Nair and Li is the same as explained under claim 2 above, and is incorporated herein.

Regarding Claim 5: Nair discloses the above limitations. Additionally, Nair discloses in which the reward value is generated by computationally simulating a process carried out by the agent in the environment based on the corresponding set of commands to generate a final state of the environment (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. “To enable solving the longer horizon tasks of stacking 4 or more blocks, we use the “step” reward : 
[Equation image: media_image1.png]
Note the step reward is still very sparse; the robot only sees the reward change when it moves a block into its target location.” See at least Page 6).

Nair does not appear to disclose calculating an initial reward value based at least on the final state of the environment. However, Li teaches calculating an initial reward value based at least on the final state of the environment (“The formal GAIL objective is denoted as minθ maxω V (θ, ω), where V (θ, ω) is
[Equation image: media_image2.png; the same passage is quoted in text form elsewhere in this action as Eπθ [log Dω(s, a)] + EπE [log(1 − Dω(s, a))] − λH(πθ)]
and πθ (usually a neural network parameterized by θ) is the policy that we wish to imitate πE with, Dω is a discriminator network which tries to distinguish state-action pairs from the trajectories of πθ and πE, Eπ[f(s, a)] denotes the expectation of f over the state-action pairs generated by π.” See at least page 2. Also: “This motivates the introduction of reward augmentation, a general framework to incorporate prior knowledge in imitation learning by providing additional incentives to the agent without interfering with the imitation learning process. We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4. Also: “We conduct a series of ablation experiments to demonstrate that our proposed techniques are indeed crucial for learning an effective policy. The experiments consider a long-term setting: our policy drives a car on the race track along with other cars, whereas the human expert provides trajectories by trying to drive as fast as possible without collision. Reward augmentation is performed by adding a reward that encourages the car to drive faster to the imitation learning objective. The performance of the policy is determined by the average distance. Therefore, a longer average rollout distance indicates a better policy.” See at least Page 7).
Nair provides a reinforcement learning system for training a robot agent to perform a task, upon which the claimed invention’s hybrid demonstration and reward function can be seen as an improvement. However, Li demonstrates that the prior art already knew of training an agent according to a hybrid demonstration and reward function. One of ordinary skill in the art could have easily applied the techniques of Li to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Li would have resulted in an improved system that could learn to outperform the demonstration data (Li, Page 8). As such, the application of Li and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li.

Regarding Claim 6: Nair in view of Li teaches the above limitations. Additionally, Nair discloses in which updates to the neural network are calculated using an activation function estimator obtained by subtracting a value function from the initial reward value (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. “To enable solving the longer horizon tasks of stacking 4 or more blocks, we use the “step” reward : 
[Equation image: media_image1.png]
Note the step reward is still very sparse; the robot only sees the reward change when it moves a block into its target location.” See at least Page 6). Additionally, Li teaches the initial reward value is calculated according to a task reward function based on the final state of the environment (“We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy.” See at least Page 4). The motivation to combine Nair and Li is the same as explained under claim 5 above, and is incorporated herein.

Regarding Claim 7: Nair in view of Li teaches the above limitations. Additionally, Nair discloses in which the value function is calculated using data characterizing the positions of objects in the environment (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. “To enable solving the longer horizon tasks of stacking 4 or more blocks, we use the “step” reward : 
[Equation image: media_image1.png]
Note the step reward is still very sparse; the robot only sees the reward change when it moves a block into its target location.” See at least Page 6).

Regarding Claim 8: Nair in view of Li teaches the above limitations. Additionally, Nair discloses in which the value function is calculated by an adaptive model (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. “To enable solving the longer horizon tasks of stacking 4 or more blocks, we use the “step” reward : 
[Equation image: media_image1.png]
Note the step reward is still very sparse; the robot only sees the reward change when it moves a block into its target location.” See at least Page 6).

Regarding Claim 9: Nair discloses the above limitations. Nair does not appear to disclose in which the neural network comprises a convolutional neural network which receives the image data and from it generates convolved data, the neural network further comprising at least one adaptive component which receives the output of the convolutional neural network and the proprioceptive data. Li teaches in which the neural network comprises a convolutional neural network which receives the image data and from it generates convolved data, the neural network further comprising at least one adaptive component which receives the output of the convolutional neural network and the proprioceptive data (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector” See at least Page 6). 
	Nair provides a reinforcement learning system for training an agent, upon which the claimed invention’s use of image data and a convolutional neural network can be seen as an improvement. However, Li demonstrates that the prior art already knew of using convolutional neural networks to collect environmental data for reinforcement learning agents. One of ordinary skill in the art could have easily applied the techniques of Li to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Li would have resulted in an improved system which could use visual information as an input. As such, the application of Li and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li.

Regarding Claim 10: Nair in view of Li teaches the above limitations. Additionally, Li teaches in which the adaptive component is a perceptron (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1). The motivation to combine Nair and Li is the same as explained under claim 9 above, and is incorporated herein.
 
Regarding Claim 12: Nair in view of Li teaches the above limitations. Additionally, Li teaches defining at least one auxiliary task, and training the convolutional network as part of an adaptive system which is trained to perform the auxiliary task based on image data (“The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4. Also: “Algorithm 1 InfoGAIL”, “Input: Expert trajectories”, “Output: Learned policy πθ” See at least Page 5. Also: “Take a policy step from θi to θi+1, using the TRPO update rule with the following objective … (with reward augmentation): Eˆ χi [Dωi+1 (s, a)] − λ0η(πθi ) − λ1LI (πθi , Qψi+1 )” See at least Page 5. Also: “For the policy network, input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector and (in the case of InfoGAIL) the latent code c.” See at least Page 6).

Regarding Claim 14: Nair discloses the above limitations. Additionally, Nair discloses in which the step of using the neural network to generate a plurality of sets of commands is performed at least once by supplying to the neural network proprioceptive data which characterizes a state associated with one of the performances of the task (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. Also: “To train our models, we use Adam [37] as the optimizer with learning rate 10−3 . We use N = 1024, ND = 128, λ1 = 10−3 , λ2 = 1.0/ND. The discount factor γ is 0.98. We use 100 demonstrations to initialize RD. The function approximators π and Q are deep neural networks with ReLU activations.” See at least Page 4). Nair does not appear to disclose image data.
	However, Li teaches supplying to the neural network image data and proprioceptive data which characterizes a state associated with one of the performances of the task (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “The policy accepts raw visual inputs as the only external inputs for the state” See at least Page 5. Also: “our policy requires certain auxiliary information as internal input to serve as a short-term memory. These auxiliary information can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) velocity at time t, which is a three dimensional vector; 2) actions at time t−1 and t−2, which are both three dimensional vectors; 3) damage of the car, which is a real value” See at least Pages 5 and 6). 
	Nair provides a reinforcement learning system for training an agent, upon which the claimed invention’s use of image data as an input can be seen as an improvement. However, Li demonstrates that the prior art already knew of using image data as an input representing environmental data for reinforcement learning agents. One of ordinary skill in the art could have easily applied the techniques of Li to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Li would have resulted in an improved system which could use visual information as an input. As such, the application of Li and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li.

Regarding Claim 21: Nair discloses the above limitations. Additionally, Nair discloses wherein the data characterizing the environment and the agent comprises image data encoding proprioceptive data comprising one or more variables describing positions of one or more mechanical components of the agent (“We consider the standard Markov Decision Process framework for picking optimal actions to maximize rewards over discrete timesteps in an environment E. We assume that the environment is fully observable. At every timestep t, an agent is in a state xt, takes an action at, receives a reward rt, and E evolves to state xt+1. In reinforcement learning, the agent must learn a policy at = π(xt) to maximize expected returns.” See at least Page 3. Also: “To train our models, we use Adam [37] as the optimizer with learning rate 10−3 . We use N = 1024, ND = 128, λ1 = 10−3 , λ2 = 1.0/ND. The discount factor γ is 0.98. We use 100 demonstrations to initialize RD. The function approximators π and Q are deep neural networks with ReLU activations.” See at least Page 4). Nair does not appear to disclose captured images of the environment. However, Li teaches using captured images of the environment in the training of an agent (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “The policy accepts raw visual inputs as the only external inputs for the state” See at least Page 5. Also: “our policy requires certain auxiliary information as internal input to serve as a short-term memory. These auxiliary information can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) velocity at time t, which is a three dimensional vector; 2) actions at time t−1 and t−2, which are both three dimensional vectors; 3) damage of the car, which is a real value” See at least Pages 5 and 6).
	Nair provides a reinforcement learning system for training an agent, upon which the claimed invention’s use of image data can be seen as an improvement. However, Li demonstrates that the prior art already knew of using image data to describe environmental data for reinforcement learning agents. One of ordinary skill in the art could have easily applied the techniques of Li to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Li would have resulted in an improved system which could use visual information as an input. As such, the application of Li and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Nair et al. (“Overcoming Exploration in Reinforcement Learning with Demonstrations”) in view of Li et al. (“Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs”), and further in view of Bui et al. (“Using Grayscale Images for Object Recognition with Convolutional-Recursive Neural Network”). 

Regarding Claim 11: Nair in view of Li teaches the above limitations. As previously noted in combination with Nair, Li teaches in which the neural network further comprises a neural network, which receives input data generated both from the image data and the proprioceptive data (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector” See at least Page 6). Nair does not appear to disclose a recursive neural network. Bui teaches applying a recursive neural network subsequent to a convolutional neural network (“Recursive Neural Networks (RNNs) recently proposed by Socher et al. [21] showed a very good capability for analyzing recursively structured data. Generally, low-level patterns such as edges and textons can be found repetitively over the whole image. Whilst CNNs are effective in detecting such features, often a significant amount of data is produced due to the nature of 2D convolution. RNN, being intrinsically a recursive structure, can consolidate those features into a compact representation.” See at least Page 2. Also: “In the proposed approach, the recognizer consists of a CNN layer to firstly map an input image to a feature space, and a RNN structure to subsequently map features into a lower-dimensional space that is more suitable for input into the classifier.” See at least Page 2 and Fig. 2). 
	Nair and Li provide a system for training an agent based on image data, upon which the claimed invention’s use of a recursive neural network can be seen as an improvement. However, Bui demonstrates that the prior art already knew of using recursive neural networks subsequent to a convolutional neural network to process the CNN outputs. One of ordinary skill in the art could have easily applied the techniques of Bui to the output of Li’s convolutional neural network. Further, one of ordinary skill in the art would have recognized that such an application of Bui would have resulted in the system more accurately classifying the image data. As such, the application of Bui and the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Li and Bui. 

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Nair et al. (“Overcoming Exploration in Reinforcement Learning with Demonstrations”) in view of Nair et al. (“Massively Parallel Methods for Deep Reinforcement Learning”) [hereafter referenced as “Srinivasan”]. 

Regarding Claim 13: Nair discloses the above limitations. As previously noted, Nair discloses the adjustment of the parameters of the neural network being additionally based on reward values indicative of how successfully the task is carried out by simulated agents based on sets of commands generated by the neural networks (“DDPG maintains an actor function π(s) with parameters θπ, a critic function Q(s, a) with parameters θQ, and a replay buffer R as a set of tuples (st, at, rt, st+1) for each transition experienced. DDPG alternates between running the policy to collect experience and updating the parameters.” See at least Page 3. Also: “During each training step, DDPG samples a minibatch consisting of N tuples from R to update the actor and critic networks. DDPG minimizes the following loss L w.r.t. θQ to update the critic” See at least Page 3). Nair does not appear to disclose in which the training of the neural network is performed in parallel with the training of a plurality of additional instances of the neural network by respective workers. Srinivasan teaches in which the training of the neural network is performed in parallel with the training of a plurality of additional instances of the neural network by respective workers (“This architecture consists of four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience” See at least Page 1. Also: “In order to generate more data, we deploy multiple agents running in parallel that interact with multiple instances of the same environment. Each such actor can store its own record of past experience, effectively providing a distributed experience replay memory with vastly increased capacity compared to a single machine implementation. Alternatively this experience can be explicitly aggregated into a distributed database. In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy.” See at least Pages 1 and 2).
	Nair provides a system for training an agent, upon which the claimed invention’s multi-agent learning can be seen as an improvement. However, Srinivasan demonstrates that the prior art already knew of using multiple agents to generate experience data for training the system. One of ordinary skill in the art could have easily applied the techniques of Srinivasan to the system of Nair. Further, one of ordinary skill in the art would have recognized that such an application of Srinivasan would have predictably resulted in an improved system which would be able to train on a greater amount of experiential data and thus produce a superior agent. As such, the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Nair and the teachings of Srinivasan. 

Response to Arguments
Applicant’s Argument Regarding 112(b) Rejections of claim 15: Applicant has canceled claim 15. 
Examiner’s Response: Applicant's arguments filed 16 August 2022 have been fully considered, and they resolve the identified issue. 

Applicant’s Argument Regarding 102 and 103 Rejections of claims 1-19: Applicant has amended independent claims 1 and 18. Applicant submits that the applied references, either independently or in combination, fail to teach or suggest, at least [the amended limitations]. 
Examiner’s Response: Applicant's arguments filed 16 August 2022 have been fully considered, but they are rendered moot by the amendment of claims 1 and 18.

Additional Considerations
The prior art made of record and not relied upon that is considered pertinent to applicant’s disclosure can be found in the PTO-892 of the prior office action dated 9 November 2021.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Bion A Shelden whose telephone number is (571)270-0515. The examiner can normally be reached M-F, 12pm-10pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hajime S Rojas, can be reached at (571)270-5491. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Bion A Shelden/
Examiner, Art Unit 3681
2022-11-19