DETAILED ACTION
Status of Claims
This is a first office action on the merits in response to the application filed on 29 October 2018 and the preliminary amendments filed 22 January 2019. 
Claim 20 was canceled. Claims 5, 8, 9, 11-15 and 18 were amended. Claims 1-19 are currently pending and have been examined. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
This application claims priority to US Provisional Application No. 62/578368 filed on 27 October 2017. Applicant’s claim for the benefit of this prior-filed application is acknowledged.

Examiner’s Note
Claims 12-15 are marked up as “original” but appear to include amendments in the preliminary amendments. These claims will be treated as amended. 

Information Disclosure Statement
The information disclosure statements (IDS(s)) submitted on 29 January 2019, 11 July 2019, and 9 April 2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
		
Subject Matter Eligibility
Per MPEP 2106.04(a)(1): “Some claims are not directed to an abstract idea because they do not recite an abstract idea, although it may be apparent that at some level they are based on or involve an abstract idea. Because these claims do not recite an abstract idea (or other judicial exception), they are eligible at Step 2A Prong One (Pathway B).” This portion of the MPEP also provides an example of a method of training a neural network for facial detection as an example of a claim which does not recite an abstract idea. 
The claim is similar to the neural network example provided in MPEP 2106.04(a)(1), and the claims 
Thus under the standards of the 2019 Patent Eligibility Update, Claims 1, 16, and 18 do not reasonably recite an abstract idea. As such, the claims are determined to recite eligible subject matter.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-12, 14, and 16-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. (Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs).

Regarding Claims 1 and 18: Li discloses a computer-implemented method of training a neural network to generate commands for controlling an agent to perform a task in an environment, the method comprising:
obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task (“The demonstrations are collected from human experts, by manually driving along the race track, and demonstrate typical behaviors like staying within lanes, avoiding collisions with other cars, and surpassing other cars” See at least Page 5. Also: “The expert demonstrations τE are a set of trajectories generated using policy πE, each of which consists of a sequence of state-action pairs” See at least Page 2). 
using the dataset, training a neural network to generate commands for controlling the agent based on image data encoding captured images of the environment and proprioceptive data comprising one or more variables describing configurations of the agent (“Algorithm 1 InfoGAIL”, “Input: Expert trajectories”, “Output: Learned policy πθ” See at least Page 5. Also: “The training process of GAIL can 
wherein the training the neural network comprises: using the neural network to generate a plurality of sets of one or more commands, for each set of commands generating at least one corresponding reward value indicative of how successfully the task is carried out upon implementation of the set of commands by the agent (“we have three networks to update in the InfoGAIL framework: the discriminator network Dω(s, a), the policy network πθ(a|s, c), and the posterior estimator network Qψ(c|s, a). We update Dω using RMSprop (as suggested in the original WGAN paper), and update Qψ and πθ using Adam and TRPO respectively” See at least Page 4. Also: “the final layer, which is just a one dimensional output indicates the expected accumulated future rewards” See at least Page 6). 
adjusting one or more parameters of the neural network based on the datasets, the sets of commands and the corresponding reward values (“Update ωi by ascending with gradients”, “Update ψi+1 by descending with gradients” See at least Page 5). 

Regarding Claim 2: Li discloses the above limitations. Additionally, Li discloses adjusting the one or more parameters of the neural network comprises adjusting the neural network based on a hybrid energy function, the hybrid energy function including both an imitation reward value derived using the datasets and the generated sets of commands, and task reward term calculated using the generated reward values (“Take a policy step from θi to θi+1, using the TRPO update rule with the following objective … (with reward augmentation): Eˆ χi [Dωi+1 (s, a)] − λ0η(πθi ) − λ1LI (πθi , Qψi+1 )” See at least Page 5. Also: “This motivates the introduction of reward augmentation, a general framework to incorporate prior knowledge in ∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4). 
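For illustration only, the hybrid signal described in the passage cited above can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Li: the function names, the clamping constant, the default weight, and the collision penalty value are hypothetical, and the sign convention (the policy is rewarded when the discriminator assigns a low probability that a state-action pair came from the policy) is assumed from the quoted objective, in which the policy minimizes the log of the discriminator output.

```python
import math


def imitation_reward(d_policy_prob: float) -> float:
    """Imitation term derived from a discriminator output in (0, 1).

    Assumption: d_policy_prob is the discriminator's probability that the
    state-action pair was produced by the policy rather than the expert.
    """
    eps = 1e-8  # hypothetical clamp to avoid log(0)
    p = min(max(d_policy_prob, eps), 1.0 - eps)
    return -math.log(p)


def collision_penalty(collided: bool) -> float:
    """Hypothetical surrogate task reward r(s): a penalty on collision."""
    return -1.0 if collided else 0.0


def hybrid_reward(d_policy_prob: float, task_reward: float, lam0: float = 0.5) -> float:
    """Combine the imitation term with the surrogate task reward, weighted by lam0 > 0."""
    return imitation_reward(d_policy_prob) + lam0 * task_reward
```

Under these assumptions, a step on which the agent collides receives a lower combined reward than an otherwise identical step without a collision, which corresponds to the collision-penalty example in the cited passage.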

Regarding Claim 3: Li discloses the above limitations. Additionally, Li discloses using the datasets to generate a discriminator network (“To determine a reasonable distance metric, a discriminator is jointly trained to distinguish expert trajectories from ones produced by the policy” See at least Page 1. Also: “the neural network is a discriminator that tries to differentiate the two distributions.” See at least Page 2), and deriving the imitation reward value using the discriminator network and the sets of one or more commands (“the neural network is a discriminator that tries to differentiate the two distributions. The formal GAIL objective is denoted as minθ maxω V (θ, ω), where V (θ, ω) is Eπθ [log Dω(s, a)] + EπE [log(1 − Dω(s, a))] − λH(πθ)” See at least Page 2. Also: “We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert” See at least Page 4).
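For readability, the two objectives quoted from Li above can be typeset as follows. This is a transcription of the quoted expressions only, using the symbols as they appear in the quotations; no terms have been added.

```latex
% GAIL objective as quoted from Li, Page 2
\min_{\theta} \max_{\omega} V(\theta, \omega), \qquad
V(\theta, \omega) = \mathbb{E}_{\pi_\theta}\left[\log D_\omega(s, a)\right]
  + \mathbb{E}_{\pi_E}\left[\log\left(1 - D_\omega(s, a)\right)\right]
  - \lambda H(\pi_\theta)

% Reward augmentation as quoted from Li, Page 4, equation (4)
\eta(\pi_\theta) = \mathbb{E}_{s \sim \pi_\theta}\left[r(s)\right], \qquad
\min_{\theta} \max_{\omega} V(\theta, \omega) - \lambda_0 \, \eta(\pi_\theta),
\quad \lambda_0 > 0
```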

Regarding Claim 4: Li discloses the above limitations. Additionally, Li discloses where the discriminator network receives data characterizing the positions of objects in the environment (“the discriminator network Dω(s, a)” See at least Page 4. Also: “The discriminator Dω accepts three elements as input: a resized image with lower resolution, the auxiliary information, and the current action” See at least Page 6).

Regarding Claim 5: Li discloses the above limitations. Additionally, Li discloses where the reward value is generated by computationally simulating a process carried out by the agent in the environment based on the corresponding set of commands to generate a final state of the environment and calculating an initial reward value based at least on the final state of the environment (“This motivates the introduction of reward augmentation, a general framework to incorporate prior knowledge in imitation learning by providing additional incentives to the agent without interfering with the imitation learning process. We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4. Also: “We conduct a series of ablation experiments to demonstrate that our proposed techniques are indeed crucial for learning an effective policy. The experiments consider a long-term setting: our policy drives a car on the race track along with other cars, whereas the human expert provides trajectories by trying to drive as fast as possible without collision. Reward augmentation is performed by adding a reward that encourages the car to drive faster to the imitation learning objective. The performance of the policy is determined by the average distance. Therefore, a longer average rollout distance indicates a better policy.” See at least Page 7).

Regarding Claim 6: Li discloses the above limitations. Additionally, Li discloses where updates to the neural network are calculated using an activation function estimator obtained by calculating a value function with the initial reward value and the initial reward value is calculated according to a task reward function based on the final state of the environment (“We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from 

Regarding Claim 7: Li discloses the above limitations. Additionally, Li discloses where the value function is calculated using data characterizing the positions of objects in the environment (“We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy.” See at least Page 4).

Regarding Claim 8: Li discloses the above limitations. Additionally, Li discloses where the value function is calculated by an adaptive model (“We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) (4) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy.” See at least Page 4).

Regarding Claim 9: Li discloses the above limitations. Additionally, Li discloses in which the neural network comprises a convolutional neural network which receives the image data and from it generates convolved data, the neural network further comprising at least one adaptive component which receives the output of the convolutional neural network and the proprioceptive data (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector” See at least Page 6). 

Regarding Claim 10: Li discloses the above limitations. Additionally, Li discloses in which the adaptive component is a perceptron (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1). 

Regarding Claim 11: Li discloses the above limitations. Additionally, Li discloses in which the neural network further comprises a recursive neural network, which receives input data generated both from the image data and the proprioceptive data (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1).

Regarding Claim 12: Li discloses the above limitations. Additionally, Li discloses defining at least one auxiliary task, and training the convolutional network as part of an adaptive system which is trained to perform the auxiliary task based on image data (“The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4. Also: “Algorithm 1 InfoGAIL”, “Input: Expert trajectories”, “Output: Learned policy πθ” See at least Page 5. Also: “Take a policy step from θi to θi+1, using the TRPO update rule with the following objective … (with reward augmentation): Eˆ χi [Dωi+1 (s, a)] − λ0η(πθi ) − λ1LI (πθi , Qψi+1 )” See at least Page 5). 

Regarding Claim 14: Li discloses the above limitations. Additionally, Li discloses in which the step of using the neural network to generate a plurality of sets of commands is performed at least once by supplying to the neural network image data and proprioceptive data which characterizes a state associated with one of the performances of the task (“Network architecture for the policy/generator πθ. conv denotes a convolutional layer, and fc denotes a fully connected layer” See at least Page 5 and Fig. 1. Also: “The policy accepts raw visual inputs as the only external inputs for the state” See at least Page 5. Also: “our policy requires certain auxiliary information as internal input to serve as a short-term memory. These auxiliary information can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) velocity at time t, which is a three dimensional vector; 2) actions at time t−1 and t−2, which are both three dimensional vectors; 3) damage of the car, which is a real value” See at least Pages 5 and 6).

Regarding Claim 16: Li discloses a method of performing a task, the method comprising: training a neural network to generate commands for controlling an agent to perform the task in an environment, by a method according to any preceding claim (See rejection of claim 1 above); and a plurality of times performing the steps of: (i) capturing images of an environment and generating image data encoding the images; (ii) capturing proprioceptive data comprising one or more variables describing configurations of the agent;
(iii) transmitting the image data and the proprioceptive data to the neural network, the neural network generating at least one command based on the image data and the proprioceptive data; and (iv) transmitting the command to the agent, the agent being operative to perform the command within the environment; whereby the neural network successively generates a sequence of commands to control the agent to perform the task (“The Open Racing Car Simulator (TORCS, Wymann et al. (2000)) is a popular simulator environment for research in autonomous vehicles. We packaged it into a client-server framework with APIs similar to OpenAI Gym (Brockman et al., 2016). Our framework produces a realistic dashboard view and driving related information, and communicates with the policy (client) through TCP packets, so that the policy can be written in languages other than C++. In particular, we implemented our policy using the TensorFlow Python API (Abadi et al., 2016). This framework and the code for reproducing the experiments are available at https://github.com/YunzhuLi/InfoGAIL. All of our experiments are conducted in the TORCS environment. The demonstrations are collected from human experts, by manually driving along the race track, and demonstrate typical behaviors like staying within lanes, avoiding collisions with other cars, and surpassing other cars. The policy accepts raw visual inputs as the only external inputs for the state, and 

Regarding Claim 17: Li discloses the above limitations. Additionally, Li discloses in which the step of obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task, is performed by controlling the agent to perform the task a plurality of times, and for each performance generating a respective dataset characterizing the performance (“The demonstrations are collected from human experts, by manually driving along the race track, and demonstrate typical behaviors like staying within lanes, avoiding collisions with other cars, and surpassing other cars” See at least Page 5). 

Regarding Claim 19: Li discloses the above limitations. Additionally, Li discloses an agent operative to perform commands generated by the neural network; at least one image capture device operative to capture images of an environment and generate image data encoding the images; and at least one device operative to capture proprioceptive data comprising the one or more variables describing configurations of the agent  (“The Open Racing Car Simulator (TORCS, Wymann et al. (2000)) is a popular simulator environment for research in autonomous vehicles. We packaged it into a client-server framework with APIs similar to OpenAI Gym (Brockman et al., 2016). Our framework produces a realistic dashboard view and driving related information, and communicates with the policy (client) through TCP packets, so that the policy can be written in languages other than C++. In particular, we implemented our policy using the .

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs) in view of Nair et al. (Massively Parallel Methods of Deep Reinforcement Learning).

Regarding Claim 13: Li discloses the above limitations. Additionally, Li discloses the adjustment of the parameters of the neural network being additionally based on reward values indicative of how successfully the task is carried out by simulated agents based on sets of commands generated by the neural network (“This motivates the introduction of reward augmentation, a general framework to incorporate prior knowledge in imitation learning by providing additional incentives to the agent without interfering with the imitation learning process. We achieve this by specifying a surrogate state-based reward η(πθ) = Es∼πθ [r(s)] that reflects our biases over the desired agent’s behavior: min θ max ω V (θ, ω) − λ0η(πθ) where λ0 > 0 is a hyper-parameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization is coming from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. The surrogate reward can also be thought of as side information provided to the generator. For example, in our autonomous driving experiment below we show that by providing the agent with a penalty if it collides with other cars, we are able to significantly reduce the collision rate of the policy” See at least Page 4. Also: “We conduct a series the training of the neural network is performed in parallel with the training of a plurality of additional instances of the neural network by respective workers. 
	However, Nair teaches the training of the neural network is performed in parallel with the training of a plurality of additional instances of the neural network by respective workers and the adjustment of the parameters of the neural network being additionally based on simulated agents based on sets of commands generated by the additional neural networks (“This architecture consists of four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience.” See at least Page 1. Also: “In order to generate more data, we deploy multiple agents running in parallel that interact with multiple instances of the same environment. Each such actor can store its own record of past experience, effectively providing a distributed experience replay memory with vastly increased capacity compared to a single machine implementation. Alternatively this experience can be explicitly aggregated into a distributed database. In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy.” See at least Pages 1 and 2).
	Li provides a system for training an agent, upon which the claimed invention’s multi-agent learning can be seen as an improvement. However, Nair demonstrates that the prior art already knew of using multiple agents to generate experience data for training the system. One of ordinary skill in the art could have easily applied the techniques of Nair to the system of Li. Further, one of ordinary skill in the art would have recognized that such an application of Nair would have predictably resulted in an improved system which would be able to train on a greater amount of experiential data and thus produce a superior agent. As such, the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Li and the teachings of Nair. 
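For illustration only, the arrangement described in the cited passages of Nair (several actors, each interacting with its own instance of the same environment under a slightly different policy, with the resulting experience aggregated) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Nair or Li: the toy one-dimensional environment, the function names, the per-actor exploration rates, and the use of a thread pool are all hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_actor(actor_id, num_steps, epsilon):
    """One actor interacting with its own instance of a toy environment.

    Hypothetical environment: the state is an integer position, the actions
    are -1 and +1, and a step to the right earns a reward of 1.0.
    """
    rng = random.Random(actor_id)  # per-actor seed keeps the run reproducible
    state = 0
    experience = []
    for _ in range(num_steps):
        # Each actor follows a slightly different policy via its own epsilon.
        action = rng.choice([-1, 1]) if rng.random() < epsilon else 1
        next_state = state + action
        reward = 1.0 if action == 1 else 0.0
        experience.append((state, action, reward, next_state))
        state = next_state
    return experience


def collect_experience(num_actors=4, num_steps=10):
    """Run the actors in parallel and aggregate their experience in one list."""
    epsilons = [0.1 * (i + 1) for i in range(num_actors)]
    with ThreadPoolExecutor(max_workers=num_actors) as pool:
        results = pool.map(run_actor, range(num_actors),
                           [num_steps] * num_actors, epsilons)
        replay = [t for actor_exp in results for t in actor_exp]
    return replay
```

In this sketch a single learner would then sample transitions from the aggregated list, so the amount of experience available for training grows with the number of actors.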

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs) in view of Singh (Transfer of Learning by Composing Solutions of Elemental Sequential Tasks).

Regarding Claim 15: Li discloses the above limitations. Additionally, Li discloses the step of using the neural network to generate a plurality of sets of commands being performed at least once by supplying to the neural network image data and proprioceptive data which characterizes one of the corresponding plurality of initial states (“Take a policy step from θi to θi+1, using the TRPO update rule with the following objective … (with reward augmentation): Eˆ χi [Dωi+1 (s, a)] − λ0η(πθi ) − λ1LI (πθi , Qψi+1 )” See at least Page 5. Also: “The policy accepts raw visual inputs as the only external inputs for the state, and produces a three-dimensional action that consists of steering, acceleration, and braking” See at least Page 5. Also: “our policy requires certain auxiliary information as internal input to serve as a short-term memory. These auxiliary information can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) velocity at time t, which is a three dimensional vector; 2) actions at time t−1 and t−2, which are both three dimensional vectors; 3) damage of the car, which is a real value” See at least Pages 5 and 6). However, Li does not appear to disclose, prior to training, defining a plurality of stages of the task, and for each stage of the task defining a respective plurality of initial states.
Singh discloses prior to training, defining a plurality of stages of the task, and for each stage of the task defining a respective plurality of initial states (“In this paper I consider a learning agent that interacts with a dynamic external environment and faces multiple sequential decision tasks. Each task requires the agent to execute a sequence of actions to control the environment, either to bring it to a desired state or to traverse a desired state trajectory over time. In addition to the environment dynamics that define state transitions, such tasks are defined by a payoff function that specifies the immediate evaluation of each action taken by the agent.” See at least Pages 1 and 2. Also: “Several different elemental tasks can be defined in the same environment. … All elemental tasks are MDTs that share the same state set S, action set A, and the same environment dynamics. The payoff function, however, can be different across the 
	Li provides a system for training an agent, upon which the claimed invention’s composite-step training techniques can be seen as an improvement. However, Singh demonstrates that the prior art already knew of defining subtasks for a learning agent to train the agent to complete composite tasks. One of ordinary skill in the art could have easily applied the techniques of Singh to the system of Li to train Li’s agent to complete composite tasks. One of ordinary skill in the art would have recognized that such an application of Singh would have predictably resulted in an improved system which could more efficiently learn complex tasks. As such, the claimed invention would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention in view of the disclosures of Li and the teachings of Singh. 

Additional Considerations
The prior art made of record and not relied upon that is considered pertinent to applicant’s disclosure can be found in the PTO-892 Notice of References Cited. 
Van Seijen et al. (US 2018/0165603 A1) discusses a hybrid reward system for reinforcement learning that relates to the claimed invention’s hybrid rewards and multi-agent techniques. 
James and Johns (3D Simulation for Robot Arm Control with Deep Q-Learning) demonstrates the applicability of deep neural networks to controlling a robot arm as discussed in narrower embodiments of the disclosure.
	
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Bion A Shelden whose telephone number is (571)270-0515. The examiner can normally be reached M-F, 12pm-10pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hajime S Rojas can be reached on (571)270-5491. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Bion A Shelden/
Examiner, Art Unit 3681
2021-11-05