DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks filed on 11/16/2021. In the current amendments, claims 1, 11, and 20 are amended. Claims 1-20 are pending and have been examined.
In response to amendments and remarks filed on 11/16/2021, the 35 U.S.C. 101 rejection of claims 1-20 put forth in the previous Office Action has been withdrawn.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 11/03/2021.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claim 20 is objected to because of the following informalities: claim 20, line 9 recites “protagonist environment environment”; it should be “protagonist environment”. Appropriate correction is required.

Applicant is reminded of 37 CFR 1.121(c)(2), which provides the following, “All claims being currently amended in an amendment paper shall be presented in the claim listing, indicate a status of "currently amended," and be submitted with markings to indicate the changes that have been made relative to the immediate prior version of the claims. The text of any added subject matter must be shown by underlining the added text” (emphasis added). The amendments submitted on 11/16/2021 did not use 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 7-9, 11, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (“Inverse Reinforcement Learning based on Critical State”) in view of Ma et al. (“Improved Robustness and Safety for Autonomous Vehicle Control with Adversarial Reinforcement Learning”) and further in view of Arel et al. (US 2017/0213150 A1).
Regarding Claim 1,
Hwang et al. teaches A computer-implemented method for obtaining a plurality of bad demonstrations, comprising (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” teaches obtaining demonstrations of bad trajectories (corresponds to bad demonstrations) from experts; Fig. 4 teaches that the simulation of demonstrations is performed on a computer):
reading...a protagonist environment (pg. 774 Section 4.1: “Fig. 4 is the simulation scenario that we test the IRLCS algorithm. The task is to navigate a car on a three-lane highway. All vehicles, except for the agent’s marked in blue, are moving at a speed, said level 1, and appear from the top of the screen randomly. The agent can drive at speeds 1 to 4, and can move one lane left or right. There are five actions, shift agent right, speed up, shift agent left, speed down, do nothing. The objectives of this task are that driving as fast as possible, but sincerely taking account of collision-avoidance and speeding-avoidance. There are eight dimensions in the state space of this task” teaches establishing (reading) the agent (protagonist)’s environment);
training...a plurality of antagonist agents to fail a task by reinforcement learning using the protagonist environment; collecting...the plurality of bad demonstrations by playing the trained antagonist agents on the protagonist environment (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach the experts (antagonist agents) provide demonstrations of bad trajectories (correspond to bad demonstrations) using the Inverse Reinforcement Learning based on Critical State (IRLCS) algorithm (corresponds to reinforcement learning) wherein the experts are trained to do a bad operation to cause failing of a mission (task) intentionally in the agent (protagonist)’s environment (for example, a car driving environment with the goal of avoiding collision) and the demonstrations of bad trajectories are obtained (collected));
and controlling a vehicle system to control a vehicle using...one or more of the plurality of bad demonstrations to avoid a collision (pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach controlling a simulated vehicle system (see Fig. 4) to control a vehicle using demonstrations of bad trajectories to avoid a collision; also see pg. 771 Section 1).
Hwang et al. does not appear to explicitly teach controlling a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision.
However, Ma et al. teaches controlling a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision (Figure 2 and pg. 1669 first full paragraph: “For vehicular control, there are many vehicle specific model parameters that significantly impact the dynamics, such as the axle distance” and pg. 1668 Section C: “We hypothesize that using the proposed robust methods will improve performance when the policy is executed on a real vehicle, despite modeling errors and noise. This hypothesis is validated using a dynamical model closer to the test vehicle provided by the SAIC Innovation Center. Figure 2 shows the test vehicle” teach controlling a vehicle system to control a vehicle using the proposed Adversarial Reinforcement Learning model; pg. 1667 second full paragraph: “1) RARL:...this approach uses neural network policies to represent each of the players and iteratively trains the protagonist and the adversary against each other, alternating at each training epoch [9]. The adversarial network learns to perturb the vehicle trajectories to maximize its reward. Then, by optimizing the protagonist against this evolving adversary, hypothetically, the protagonist learns to perform robustly over the entire disturbance space” and pg. 1666 ninth full paragraph: “Thus, the adversary would need to trade-off between the competitive and cooperative rewards. The desired disturbance distribution would aim to cause a failure while using the minimal disturbance magnitude. This disturbance is also desired for the protagonist since it is easier for the protagonist to recover. 
By adding a cooperative reward, we make the adversary adaptively adjust its disturbance magnitude such that it allows the protagonist to effectively learn, while being adversarial” teaches the proposed Adversarial Reinforcement Learning model trains a neural network policy based on demonstrations with failures (correspond to bad demonstrations) and that the goal is for the adversarial network to learn to perturb the vehicle trajectories to maximize its reward wherein maximizing reward means to avoid collisions, see pg. 1668 fourth full paragraph: “If the ego vehicle goes off the road, meaning that a failure and collision has occurred, r1;t = -5.0 and the trajectory ends”, which teaches that collisions are given negative rewards).
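For illustration only, the failure penalty quoted from Ma et al. (pg. 1668) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from the reference: the function names `protagonist_step_reward` and `trajectory_return`, the boolean `off_road` flag, and the placeholder `nominal_reward` values are hypothetical; only the -5.0 penalty and the termination of the trajectory on failure come from the quoted passage.

```python
# Penalty value taken from the quoted passage of Ma et al.; everything else is assumed.
COLLISION_REWARD = -5.0


def protagonist_step_reward(off_road, nominal_reward):
    """Return (reward, done) for one step of the protagonist.

    off_road: True when the ego vehicle has left the road (treated as a
              failure/collision, per the quoted passage).
    nominal_reward: placeholder for the per-step reward when no failure
              occurs (hypothetical; not specified in this Office action).
    """
    if off_road:
        return COLLISION_REWARD, True  # failure: penalty applied, trajectory ends
    return nominal_reward, False


def trajectory_return(steps):
    """Sum rewards over (off_road, nominal_reward) steps, stopping at the first failure."""
    total = 0.0
    for off_road, nominal_reward in steps:
        reward, done = protagonist_step_reward(off_road, nominal_reward)
        total += reward
        if done:
            break
    return total
```

Under these assumptions, a trajectory that ends in a failure accumulates a lower return than one that does not, which is the sense in which a reward-maximizing policy is driven away from collisions.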
Hwang et al. and Ma et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
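It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Ma et al. to the disclosed invention of Hwang et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage “adversarial learning methods...where the adversary...is incentivized to adjust the disturbance magnitude according to the current capability of the control policy. We show that this not only improves safety, but also improves robustness under different environment models” (Ma et al. pg. 1665 eighth full paragraph).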
Hwang et al. in view of Ma et al. does not appear to explicitly teach reading, by a processor device...training, by the processor device...collecting, by the processor device.
However, Arel et al. teaches reading, by a processor device (pg. 6 [0074] “The system obtains mentor interaction data (step 402). The mentor interaction data represents interactions by another entity, which may be referred to as the "mentor," with the environment and returns resulting from those interactions. In particular, the mentor interaction data includes, for each action performed by the mentor, a state representation for the state of the environment when the action was performed and the return resulting from the action being performed” teaches obtaining (reading) interaction data of a mentor (corresponds to protagonist) wherein the data is about the mentor’s environment (protagonist environment); pg. 7 [0088]-[0089] teach processor)...
training, by the processor device (pg. 6 [0075]: “The system trains a supervised learning model on the mentor interaction data (step 404) to adjust the values of the parameters of the supervised learning model” teaches training; pg. 7 [0088]-[0089] teach processor)...
collecting, by the processor device (pg. 7 [0088]-[0089] teach processor; as discussed above, pg. 6 [0076] teaches collecting bad demonstrations).
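Hwang et al., Ma et al., and Arel et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Arel et al. to the disclosed invention of Hwang et al. in view of Ma et al.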
One of ordinary skill in the arts would have been motivated to make this modification “to realize one or more of the following advantages. Undesirable effects associated with catastrophic forgetting during training of supervised learning models used to select actions to be performed by an agent interacting with an environment can be mitigated in a scalable manner. The supervised learning models can be trained in a scalable manner to effectively select actions in response to new state representations without adversely affecting their performance when the environment is in other states” (Arel et al. pg. 1 [0007]).
Regarding Claim 7,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
Hwang et al. further teaches further comprising learning state-dependent action constraints using the plurality of bad demonstrations and a plurality of good demonstrations (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach using demonstrations of good and bad trajectories (good and bad demonstrations) to learn state-dependent action constraints such as avoiding collision or driving fast, see pg. 774 left column, which provides the Inverse Reinforcement Learning based on Critical State algorithm, wherein line 2.6 teaches action that can be selected is based on the state).
Regarding Claim 8,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 7.
Hwang et al. further teaches further comprising training a protagonist policy by reinforcement learning using the state-dependent action constraints for exploration guidance (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach using demonstrations of good and bad trajectories (good and bad demonstrations) to learn state-dependent action constraints such as avoiding collision or driving fast; pg. 774 left column provides the Inverse Reinforcement Learning based on Critical State algorithm, wherein line 2.6 teaches action that can be selected is based on the state, which is used for exploration guidance to determine a reward function used to train a protagonist policy; also see pg. 773 Section 3.4: “Based on the above concept, we propose an algorithm, Inverse Reinforcement Learning based on Critical State (IRLCS), which is able to do self-organization and search an appropriate reward function through the good and bad demonstrations...Q-Learning algorithm is applied to search a policy π(i). The policy used to select an action in the action space is epsilon greedy. 
There are a parameter, ε, means exploration probability. In exploration, we randomly choose an action from action space. Otherwise, we take a greedy action that causes a maximum state-action value in current state”).
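For illustration only, the epsilon-greedy action selection described in the passage quoted above can be sketched as follows. This is a minimal sketch and not code from Hwang et al.: the function name `epsilon_greedy`, the Q-table layout (a dictionary keyed by (state, action) pairs, defaulting to 0.0), the state label `"s0"`, and the action labels are assumptions introduced here; the five actions are named after the list quoted from pg. 774 Section 4.1.

```python
import random


def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Select an action for `state`.

    With probability `epsilon` (the exploration probability) a random action
    is drawn from the action space; otherwise the action with the maximum
    state-action value in the current state is taken.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # exploration
    # Greedy choice; unseen (state, action) pairs are assumed to have value 0.0.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


# Hypothetical labels for the five actions listed in the quoted Section 4.1.
ACTIONS = ["shift_right", "speed_up", "shift_left", "speed_down", "do_nothing"]

# Hypothetical state-action values for a single state "s0".
q = {("s0", "speed_up"): 1.0, ("s0", "do_nothing"): 0.2}

# With epsilon = 0.0 the selection is purely greedy.
greedy_choice = epsilon_greedy(q, "s0", ACTIONS, epsilon=0.0)
```

With `epsilon=0.0` the sketch always returns the highest-valued action for the state; with `epsilon=1.0` it always explores.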
Regarding Claim 9,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
Ma et al. further teaches wherein each of the trained antagonist agents is a respective stochastic neural network policy (pg. 1667 second full paragraph: “1) RARL: As originally proposed by Pinto et al., this approach uses neural network policies to represent each of the players and iteratively trains the protagonist and the adversary against each other, alternating at each training epoch [9]. The adversarial network learns to perturb the vehicle trajectories to maximize its reward. Then, by optimizing the protagonist against this evolving adversary, hypothetically, the protagonist learns to perform robustly over the entire disturbance space. The training algorithm is shown in algorithm 1 assuming the protagonist and adversary follow stochastic policies πP and πA” teaches the trained agents are stochastic neural network policies wherein each agent can be considered “antagonistic” to the other because each is trained against another).
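Hwang et al. and Ma et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Ma et al. to the disclosed invention of Hwang et al.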
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage “adversarial learning methods...where the adversary...is incentivized to adjust the disturbance magnitude according to the current capability of the control policy. We show that this not only improves safety, but also improves robustness under different environment models” (Ma et al. pg. 1665 eighth full paragraph).
Regarding Claim 11,
Hwang et al. teaches A computer program product for obtaining a plurality of bad demonstrations (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” teaches obtaining demonstrations of bad trajectories (corresponds to bad demonstrations) from experts; Fig. 4 teaches that the simulation of demonstrations is performed through a computer program product)...
reading...a protagonist environment (pg. 774 Section 4.1: “Fig. 4 is the simulation scenario that we test the IRLCS algorithm. The task is to navigate a car on a three-lane highway. All vehicles, except for the agent’s marked in blue, are moving at a speed, said level 1, and appear from the top of the screen randomly. The agent can drive at speeds 1 to 4, and can move one lane left or right. There are five actions, shift agent right, speed up, shift agent left, speed down, do nothing. The objectives of this task are that driving as fast as possible, but sincerely taking account of collision-avoidance and speeding-avoidance. There are eight dimensions in the state space of this task” teaches establishing (reading) the agent (protagonist)’s environment);
training...a plurality of antagonist agents to fail a task by reinforcement learning using the protagonist environment; collecting...the plurality of bad demonstrations by playing the trained antagonist agents on the protagonist environment (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach the experts (antagonist agents) provide demonstrations of bad trajectories (correspond to bad demonstrations) using the Inverse Reinforcement Learning based on Critical State (IRLCS) algorithm (corresponds to reinforcement learning) wherein the experts are trained to do a bad operation to cause failing of a mission (task) intentionally in the agent (protagonist)’s environment (for example, a car driving environment with the goal of avoiding collision) and the demonstrations of bad trajectories are obtained (collected));
and controlling a vehicle system to control a vehicle using...one or more of the plurality of bad demonstrations to avoid a collision (pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach controlling a simulated vehicle system (see Fig. 4) to control a vehicle using demonstrations of bad trajectories to avoid a collision; also see pg. 771 Section 1).
Hwang et al. does not appear to explicitly teach controlling a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision.
However, Ma et al. teaches controlling a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision (Figure 2 and pg. 1669 first full paragraph: “For vehicular control, there are many vehicle specific model parameters that significantly impact the dynamics, such as the axle distance” and pg. 1668 Section C: “We hypothesize that using the proposed robust methods will improve performance when the policy is executed on a real vehicle, despite modeling errors and noise. This hypothesis is validated using a dynamical model closer to the test vehicle provided by the SAIC Innovation Center. Figure 2 shows the test vehicle” teach controlling a vehicle system to control a vehicle using the proposed Adversarial Reinforcement Learning model; pg. 1667 second full paragraph: “1) RARL:...this approach uses neural network policies to represent each of the players and iteratively trains the protagonist and the adversary against each other, alternating at each training epoch [9]. The adversarial network learns to perturb the vehicle trajectories to maximize its reward. Then, by optimizing the protagonist against this evolving adversary, hypothetically, the protagonist learns to perform robustly over the entire disturbance space” and pg. 1666 ninth full paragraph: “Thus, the adversary would need to trade-off between the competitive and cooperative rewards. The desired disturbance distribution would aim to cause a failure while using the minimal disturbance magnitude. This disturbance is also desired for the protagonist since it is easier for the protagonist to recover. 
By adding a cooperative reward, we make the adversary adaptively adjust its disturbance magnitude such that it allows the protagonist to effectively learn, while being adversarial” teaches the proposed Adversarial Reinforcement Learning model trains a neural network policy based on demonstrations with failures (correspond to bad demonstrations) and that the goal is for the adversarial network to learn to perturb the vehicle trajectories to maximize its reward wherein maximizing reward means to avoid collisions, see pg. 1668 fourth full paragraph: “If the ego vehicle goes off the road, meaning that a failure and collision has occurred, r1;t = -5.0 and the trajectory ends”, which teaches that collisions are given negative rewards).
Hwang et al. and Ma et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Ma et al. to the disclosed invention of Hwang et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage “adversarial learning methods...where the adversary...is incentivized to adjust the disturbance magnitude according to the current capability of the control policy. We show that this not only improves safety, but also improves robustness under different environment models” (Ma et al. pg. 1665 eighth full paragraph).
Hwang et al. in view of Ma et al. does not appear to explicitly teach the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: reading, by a processor device...training, by the processor device...collecting, by the processor device.
Arel et al. teaches the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising (pg. 6-7 [0084] teaches computer storage medium; pg. 7 [0088]-[0089] teach computer, processor, memory, and instructions; pg. 7 [0086] teaches code; pg. 6 [0076]: “Once the supervised learning model has been trained, the system determines estimation errors for a set of combinations of action and state representations from the mentor action data using the trained supervised learning model (step 406). In particular, the system processes each combination of action and state representation in the set using the trained supervised learning model to determine a respective value function estimate for each combination. The system then determines an estimation error for the combination from the value function estimate for the combination and the actual return identified for the combination in the mentor interaction data.” teaches determining estimation errors in the combination of action and state representation in the set (correspond to a plurality of demonstrations), thus indicating that the demonstrations can be “erroneous”, or “bad”):
reading, by a processor device (pg. 6 [0074] “The system obtains mentor interaction data (step 402). The mentor interaction data represents interactions by another entity, which may be referred to as the "mentor," with the environment and returns resulting from those interactions. In particular, the mentor interaction data includes, for each action performed by the mentor, a state representation for the state of the environment when the action was performed and the return resulting from the action being performed” teaches obtaining (reading) interaction data of a mentor (corresponds to protagonist) wherein the data is about the mentor’s environment (protagonist environment); pg. 7 [0088]-[0089] teach processor)...
training, by the processor device (pg. 6 [0075]: “The system trains a supervised learning model on the mentor interaction data (step 404) to adjust the values of the parameters of the supervised learning model” teaches training; pg. 7 [0088]-[0089] teach processor)...
collecting, by the processor device (pg. 7 [0088]-[0089] teach processor; as discussed above, pg. 6 [0076] teaches collecting bad demonstrations).
Hwang et al., Ma et al., and Arel et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Arel et al. to the disclosed invention of Hwang et al. in view of Ma et al.
One of ordinary skill in the arts would have been motivated to make this modification “to realize one or more of the following advantages. Undesirable effects associated with catastrophic forgetting during training of supervised learning models used to select actions to be performed by an agent interacting with an environment can be mitigated in a scalable manner. The supervised learning models can be trained in a scalable manner to effectively select actions in response to new state representations without adversely affecting their performance when the environment is in other states” (Arel et al. pg. 1 [0007]).
Regarding Claim 17,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer program product of claim 11.
Hwang et al. further teaches wherein the method further comprises learning state-dependent action constraints using the plurality of bad demonstrations and a plurality of good demonstrations (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach using demonstrations of good and bad trajectories (good and bad demonstrations) to learn state-dependent action constraints such as avoiding collision or driving fast, see pg. 774 left column, which provides the Inverse Reinforcement Learning based on Critical State algorithm, wherein line 2.6 teaches action that can be selected is based on the state).
Regarding Claim 18,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer program product of claim 17.
Hwang et al. further teaches wherein the method further comprises training a protagonist policy by reinforcement learning using the state-dependent action constraints for exploration guidance (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach using demonstrations of good and bad trajectories (good and bad demonstrations) to learn state-dependent action constraints such as avoiding collision or driving fast; pg. 774 left column provides the Inverse Reinforcement Learning based on Critical State algorithm, wherein line 2.6 teaches action that can be selected is based on the state, which is used for exploration guidance to determine a reward function used to train a protagonist policy; also see pg. 773 Section 3.4: “Based on the above concept, we propose an algorithm, Inverse Reinforcement Learning based on Critical State (IRLCS), which is able to do self-organization and search an appropriate reward function through the good and bad demonstrations...Q-Learning algorithm is applied to search a policy π(i). 
The policy used to select an action in the action space is epsilon greedy. There are a parameter, ε, means exploration probability. In exploration, we randomly choose an action from action space. Otherwise, we take a greedy action that causes a maximum state-action value in current state”).
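For illustration only, the epsilon-greedy selection described in the quoted passage of Hwang et al. may be sketched as follows. This is a minimal sketch and is not taken from the reference: the function name `epsilon_greedy`, the list `q_values` (one state-action value per action in the current state), and the optional `rng` argument are hypothetical names introduced here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for the current state.

    q_values: hypothetical list holding one state-action value per action.
    epsilon:  exploration probability, as named in the quoted passage.
    """
    if rng.random() < epsilon:
        # Exploration: randomly choose an action from the action space.
        return rng.randrange(len(q_values))
    # Otherwise take the greedy action with the maximum state-action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon` set to 0 the sketch always returns the greedy action; with `epsilon` set to 1 it always samples uniformly from the action space.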
Regarding Claim 19,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
Ma et al. further teaches wherein each of the trained antagonist agents is a respective stochastic neural network policy (pg. 1667 second full paragraph: “1) RARL: As originally proposed by Pinto et al., this approach uses neural network policies to represent each of the players and iteratively trains the protagonist and the adversary against each other, alternating at each training epoch [9]. The adversarial network learns to perturb the vehicle trajectories to maximize its reward. Then, by optimizing the protagonist against this evolving adversary, hypothetically, the protagonist learns to perform robustly over the entire disturbance space. The training algorithm is shown in algorithm 1 assuming the protagonist and adversary follow stochastic policies πP and πA” teaches the trained agents are stochastic neural network policies wherein each agent can be considered “antagonistic” to the other because each is trained against another).
Hwang et al. and Ma et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Ma et al. to the disclosed invention of Hwang et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage “adversarial learning methods...where the adversary...is incentivized to adjust the disturbance magnitude according to the current capability of the control policy. We show that this not only improves safety, but also improves robustness under different environment models” (Ma et al. pg. 1665 eighth full paragraph).
Regarding Claim 20,
Hwang et al. teaches A computer processing system for obtaining a plurality of bad demonstrations, comprising (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” teaches obtaining demonstrations of bad trajectories (corresponds to bad demonstrations) from experts; Fig. 4 teaches that the simulation of demonstrations is performed on a computer):
read a protagonist environment (pg. 774 Section 4.1: “Fig. 4 is the simulation scenario that we test the IRLCS algorithm. The task is to navigate a car on a three-lane highway. All vehicles, except for the agent’s marked in blue, are moving at a speed, said level 1, and appear from the top of the screen randomly. The agent can drive at speeds 1 to 4, and can move one lane left or right. There are five actions, shift agent right, speed up, shift agent left, speed down, do nothing. The objectives of this task are that driving as fast as possible, but sincerely taking account of collision-avoidance and speeding- avoidance. There are eight dimensions in the state space of this task” teaches establishing (reading) the agent (protagonist)’s environment);
train a plurality of antagonist agents to fail a task by reinforcement learning using the protagonist environment; collect the plurality of bad demonstrations by playing the trained antagonist agents on the protagonist environment environment (pg. 771 Section 1: “most IRL problems need to have many demonstrations that are demonstrated by experts. These demonstrations are viewed as correct behaviors. But, incorrect demonstrations should be viewed important information equally. The agent can learn good behaviors by incorrect demonstrations. In this paper, IRLCS algorithm is proposed to search appropriate reward indexes in whole state space by two sets of demonstrations, good trajectories and bad trajectories” and pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach the experts (antagonist agents) provide demonstrations of bad trajectories (correspond to bad demonstrations) using the Inverse Reinforcement Learning based on Critical State (IRLCS) algorithm (corresponds to reinforcement learning) wherein the experts are trained to do a bad operation to cause failing of a mission (task));
and control a vehicle system to control a vehicle...trained on one or more of the plurality of bad demonstrations to avoid a collision (pg. 773 Section 3.4: “we have to provide two example traces, the good demonstration DG and the bad demonstration DB. In the same environment, it may not only one goal. For example, in a car driving problem, sometimes we want to avoid collision, and sometimes want to drive as fast as possible. Therefore, the good demonstration is relevant to the objective of task. On the contrary, the bad demonstration we can do a random operation. Moreover, we may do a bad operation to cause the mission to fail intentionally” teach controlling a simulated vehicle system (see Fig. 4) to control a vehicle using demonstrations of bad trajectories to avoid a collision; also see pg. 771 Section 1).
Hwang et al. does not appear to explicitly teach control a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision.
However, Ma et al. teaches control a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision (Figure 2 and pg. 1669 first full paragraph: “For vehicular control, there are many vehicle specific model parameters that significantly impact the dynamics, such as the axle distance” and pg. 1668 Section C: “We hypothesize that using the proposed robust methods will improve performance when the policy is executed on a real vehicle, despite modeling errors and noise. This hypothesis is validated using a dynamical model closer to the test vehicle provided by the SAIC Innovation Center. Figure 2 shows the test vehicle” teach controlling a vehicle system to control a vehicle using the proposed Adversarial Reinforcement Learning model; pg. 1667 second full paragraph: “1) RARL:...this approach uses neural network policies to represent each of the players and iteratively trains the protagonist and the adversary against each other, alternating at each training epoch [9]. The adversarial network learns to perturb the vehicle trajectories to maximize its reward. Then, by optimizing the protagonist against this evolving adversary, hypothetically, the protagonist learns to perform robustly over the entire disturbance space” and pg. 1666 ninth full paragraph: “Thus, the adversary would need to trade-off between the competitive and cooperative rewards. The desired disturbance distribution would aim to cause a failure while using the minimal disturbance magnitude. This disturbance is also desired for the protagonist since it is easier for the protagonist to recover. By adding a cooperative reward, we make the adversary adaptively adjust its disturbance magnitude such that it allows the protagonist to effectively learn, while being adversarial” teaches the proposed Adversarial Reinforcement Learning model trains a neural network policy based on demonstrations with failures (correspond to bad demonstrations) and that the goal is for the adversarial network to learn to perturb the vehicle trajectories to maximize its reward wherein maximizing reward means to avoid collisions, see pg. 1668 fourth full paragraph: “If the ego vehicle goes off the road, meaning that a failure and collision has occurred, r1;t = -5.0 and the trajectory ends”, which teaches that collisions are given negative rewards).
Hwang et al. and Ma et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Ma et al. to the disclosed invention of Hwang et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage “adversarial learning methods...where the adversary...is incentivized to adjust the disturbance magnitude according to the current capability of the control policy. We show that this not only improves safety, but also improves robustness under different environment models” (Ma et al. pg. 1665 eighth full paragraph).
Hwang et al. in view of Ma et al. does not appear to explicitly teach a memory for storing program code; and a processor device operatively coupled to the memory for running the program code to.
However, Arel et al. teaches a memory for storing program code; and a processor device operatively coupled to the memory for running the program code to (pg. 7 [0088]-[0089] teach processor, memory, and instructions and pg. 7 [0086] teaches code for implementing operations of a reinforcement learning system).
Hwang et al., Ma et al., and Arel et al. are analogous art to the claimed invention because they are directed to reinforcement learning for vehicle control.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Arel et al. to the disclosed invention of Hwang et al. in view of Ma et al.
One of ordinary skill in the arts would have been motivated to make this modification “to realize one or more of the following advantages. Undesirable effects associated with catastrophic forgetting during training of supervised learning models used to select actions to be performed by an agent interacting with an environment can be mitigated in a scalable manner. The supervised learning models can be trained in a scalable manner to effectively select actions in response to new state representations without adversely affecting their performance when the environment is in other states” (Arel et al. pg. 1 [0007]).

Claims 6, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (“Inverse Reinforcement Learning based on Critical State”) in view of Ma et al. (“Improved Robustness and Safety for Autonomous Vehicle Control with Adversarial Reinforcement Learning”) in view of Arel et al. (US 2017/0213150 A1) and further in view of Price et al. (“Accelerating Reinforcement Learning through Implicit Imitation”).
Regarding Claim 6,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
Hwang et al. in view of Ma et al. in view of Arel et al. does not appear to explicitly teach wherein playing the trained antagonist agents on the protagonist environment comprises constructing the plurality of bad demonstrations from expert states in protagonist demonstrations and antagonistic actions leading to unrecoverable states.
However, Price et al. teaches wherein playing the trained antagonist agents on the protagonist environment comprises constructing the plurality of bad demonstrations from expert states in protagonist demonstrations and antagonistic actions leading to unrecoverable states (Figure 17 and pg. 609 second full paragraph: “Examining the graph in Figure 20, we see that both imitation agents experience an early negative dip as they are guided deep into the river by the mentor’s influence. The agent without repair eventually decides the mentor’s action is infeasible, and thereafter avoids the river (and the possibility of finding the goal). The imitator with repair also discovers the mentor’s action to be infeasible, but does not immediately dispense with the mentor’s guidance. It keeps exploring in the area of the mentor’s trajectory using a random walk, all the while accumulating a negative reward until it suddenly finds a bridge and rapidly converges on the optimal solution” teach playing trained imitation agents (antagonist) on the mentor’s (protagonist) environment includes obtaining demonstrations from expert mentor that lead to infeasible actions (correspond to bad demonstrations) and as the imitation agents (antagonist) follow the mentor’s infeasible actions, the imitation agents also take actions (correspond to antagonist actions) that are infeasible leading to unrecoverable states (see pg. 609 first full paragraph: “If we examine the value function estimate (after 1000 steps) of an imitator with feasibility testing but no repair capabilities, we see that, due to suppression by feasibility testing, the darkly shaded high-value states in Figure 19 (backed up from the goal) terminate abruptly at an infeasible transition without making it across the river”); Figure 17 teaches each of the imitation agents is played on the mentor’s environment in which the expert mentor has various states; Figure 11 and pg. 597 last paragraph: “Two expert agents with different start and goal states serve as potential mentors” teach that there can be multiple expert mentor demonstrations (protagonist demonstrations) with various expert states).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Regarding Claim 10,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
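Hwang et al. in view of Ma et al. in view of Arel et al. does not appear to explicitly teach wherein the plurality of antagonist agents are trained to maximize an expected return in at least one antagonist environment constructed from the protagonist environment, using different random seeds corresponding to different strategies.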
However, Price et al. teaches wherein the plurality of antagonist agents are trained to maximize an expected return in at least one antagonist environment constructed from the protagonist environment, using different random seeds corresponding to different strategies (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can construct an environment from observations about mentor’s environment (protagonist environment); pg. 575 last full paragraph to 576: “Typically, τ is started high so that actions are randomly explored during the early stages of learning. As the agent gains knowledge about the effects of its actions and the value of these effects, the parameter τ is decayed so that the agent spends more time exploiting actions known to be valuable and less time randomly exploring actions” teaches that the agent starts with random exploration in the early stage of learning (corresponds to using different random seeds) in which the random actions correspond to different strategies; pg. 591 last full paragraph: “Both agents attempt to learn a policy that maximizes discounted return in a 10 × 10 grid world” teaches in reinforcement learning, an agent learns to maximize an expected return).
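For illustration only, the effect of decaying the parameter τ described in the quoted passage of Price et al. may be sketched as follows. The sketch assumes that τ acts as the temperature of a Boltzmann (softmax) action distribution; the quoted excerpt does not itself state this, and the function name `boltzmann_probabilities` and the list `q_values` are hypothetical names introduced here.

```python
import math


def boltzmann_probabilities(q_values, tau):
    """Return action probabilities proportional to exp(Q / tau).

    Assumption: tau is a softmax temperature. A high tau gives nearly
    uniform (random) exploration; a low tau concentrates probability on
    the actions currently estimated to be most valuable.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this assumption, starting τ high and decaying it over the course of learning shifts the agent from random exploration toward exploiting the higher-valued actions, consistent with the behavior the quoted passage describes.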

It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Regarding Claim 16,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer program product of claim 11.
Hwang et al. in view of Ma et al. in view of Arel et al. does not appear to explicitly teach wherein playing the trained antagonist agents on the protagonist environment comprises constructing the plurality of bad demonstrations from expert states in protagonist demonstrations and antagonistic actions leading to unrecoverable states.
However, Price et al. teaches wherein playing the trained antagonist agents on the protagonist environment comprises constructing the plurality of bad demonstrations from expert states in protagonist demonstrations and antagonistic actions leading to unrecoverable states (Figure 17 and pg. 609 second full paragraph: “Examining the graph in Figure 20, we see that both imitation agents experience an early negative dip as they are guided deep into the river by the mentor’s influence. The agent without repair eventually decides the mentor’s action is infeasible, and thereafter avoids the river (and the possibility of finding the goal). The imitator with repair also discovers the mentor’s action to be infeasible, but does not immediately dispense with the mentor’s guidance. It keeps exploring in the area of the mentor’s trajectory using a random walk, all the while accumulating a negative reward until it suddenly finds a bridge and rapidly converges on the optimal solution” teach playing trained imitation agents (antagonist) on the mentor’s (protagonist) environment includes obtaining demonstrations from expert mentor that lead to infeasible actions (correspond to bad demonstrations) and as the imitation agents (antagonist) follow the mentor’s infeasible actions, the imitation agents also take actions (correspond to antagonist actions) that are infeasible leading to unrecoverable states (see pg. 609 first full paragraph: “If we examine the value function estimate (after 1000 steps) of an imitator with feasibility testing but no repair capabilities, we see that, due to suppression by feasibility testing, the darkly shaded high-value states in Figure 19 (backed up from the goal) terminate abruptly at an infeasible transition without making it across the river”); Figure 17 teaches each of the imitation agents is played on the mentor’s environment in which the expert mentor has various states; Figure 11 and pg. 597 last paragraph: “Two expert agents with different start and goal states serve as potential mentors” teach that there can be multiple expert mentor demonstrations (protagonist demonstrations) with various expert states).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
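One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).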

Claims 2-5 and 12-15 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (“Inverse Reinforcement Learning based on Critical State”) in view of Ma et al. (“Improved Robustness and Safety for Autonomous Vehicle Control with Adversarial Reinforcement Learning”) in view of Arel et al. (US 2017/0213150 A1) in view of Price et al. (“Accelerating Reinforcement Learning through Implicit Imitation”) and further in view of Bansal et al. (“EMERGENT COMPLEXITY VIA MULTI-AGENT COMPETITION”).
Regarding Claim 2,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer-implemented method of claim 1.
Hwang et al. in view of Ma et al. in view of Arel et al. does not appear to explicitly teach wherein training the plurality of antagonist agents comprises: resetting a plurality of antagonist environments using the protagonist environment.
However, Price et al. teaches wherein training the plurality of antagonist agents comprises: resetting a plurality of antagonist environments using the protagonist environment (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can improve (reset) its environment from observations about mentor’s environment (protagonist environment); pg. 609 second full paragraph teaches multiple observer/imitation agents (plurality of antagonist)).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. does not appear to explicitly teach training the plurality of antagonist agents on a plurality of instances of each of the plurality of antagonist environments.
However, Bansal et al. teaches training the plurality of antagonist agents on a plurality of instances of each of the plurality of antagonist environments (pg. 3 first to third full paragraphs: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup...Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down” teaches training at least two agents on at least two antagonist environments; because the agents compete in 1-vs-1 games, each agent can be considered the antagonist agent relative to the other agent; pg. 6 second full paragraph: “The co-efficient t in eq. 1 for the exploration reward is annealed to 0 in 500 iterations for all the environments except for kick-and-defend in which it is annealed in 1000 iterations” teaches training takes many iterations (plurality of instances in which training is done in an environment)).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).

Regarding Claim 3,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer-implemented method of claim 2.
Price et al. further teaches wherein resetting the plurality of antagonist environments comprises resetting the plurality of antagonist environments to a visited expert state in a protagonist demonstration (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can improve (reset) its environment from observations about mentor’s environment (protagonist environment) in which the set of states visited by the mentor used by an observer as heuristic guidance correspond to expert states visited in the mentor’s environment; pg. 609 second full paragraph teaches multiple observer/imitation agents (plurality of antagonist)).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.

Regarding Claim 4,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer-implemented method of claim 2.
Price et al. further teaches wherein the protagonist environment includes state and action transition functions and reward structure information, and wherein resetting the plurality of antagonist environments comprises using, for each of the plurality of antagonist environments, (i) a same one of the state and action transition functions as the protagonist environment,...and (iii) using visited states of the protagonist environment as an initial state (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can obtain mentor (protagonist) environment information, including pg. 583 last full paragraph to pg. 584:
[Image: media_image1.png (greyscale)]
 teaches observer (antagonist) can obtain state transition function of the mentor’s environment; pg. 584 first full paragraph:

[Image: media_image2.png (greyscale)]
teaches observer (antagonist) can obtain action transition function of the mentor’s environment, and that the observer can duplicate mentor’s policy including the state and action transitions; duplicating 
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Bansal et al. further teaches (ii) a reward structure derived from that of the protagonist environment (pg. 3 first full paragraph: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup” teaches that agents compete in 1-vs-1 games, thus rendering each agent can be the antagonist agent compared to the other agent; pg. 3 second to third full paragraph: “Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down. If the blocker is successful in preventing the opponent from reaching the goal and is standing at the end of episode then it gets +1000 reward, if it is not standing then it gets 0 reward, and the opponent gets -1000 reward. If the opponent is successful in reaching it’s goal then it gets +1000 reward and the blocker gets -1000 reward” teaches deriving one agent’s (an antagonist agent) reward from another agent’s (a protagonist agent) reward; for example, when one agent gets a +1000 reward for reaching a goal, the other agent gets a -1000 reward).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).
Regarding Claim 5,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer-implemented method of claim 4.
Bansal et al. further teaches wherein the reward structure derived from that of the protagonist environment includes a negative of a protagonist reward (pg. 3 first full paragraph: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup” teaches that agents compete in 1-vs-1 games, thus rendering each agent can be the antagonist agent compared to the other agent; pg. 3 second to third full paragraph: “Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down. If the blocker is successful in preventing the opponent from reaching the goal and is standing at the end of episode then it gets +1000 reward, if it is not standing then it gets 0 reward, and the opponent gets -1000 reward. If the opponent is successful in reaching it’s goal then it gets +1000 reward and the blocker gets -1000 reward” teaches deriving one agent’s (an antagonist agent) reward from another agent’s (a protagonist agent) reward; for example, when one agent gets a +1000 reward for reaching a goal, the other agent gets a -1000 reward (corresponds to a negative of a protagonist reward)).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).


Regarding Claim 12,
Hwang et al. in view of Ma et al. in view of Arel et al. teaches the computer program product of claim 11.
Hwang et al. in view of Ma et al. in view of Arel et al. does not appear to explicitly teach wherein training the plurality of antagonist agents comprises: resetting a plurality of antagonist environments using the protagonist environment.
However, Price et al. teaches wherein training the plurality of antagonist agents comprises: resetting a plurality of antagonist environments using the protagonist environment (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can improve (reset) its environment from observations about mentor’s environment (protagonist environment); pg. 609 second full paragraph teaches multiple observer/imitation agents (plurality of antagonist)).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).

Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. does not appear to explicitly teach training the plurality of antagonist agents on a plurality of instances of each of the plurality of antagonist environments.
However, Bansal et al. teaches training the plurality of antagonist agents on a plurality of instances of each of the plurality of antagonist environments (pg. 3 first to third full paragraphs: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup...Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down” teaches training at least two agents on at least two antagonist environments; since agents compete in 1-vs-1 games, thus rendering each agent can be the antagonist agent compared to the other agent; pg. 6 second full paragraph: “The co-efficient t in eq. 1 for the exploration reward is annealed to 0 in 500 iterations for all the environments except for kick-and-defend in which it is annealed in 1000 iterations” teaches training takes many iterations (plurality of instances in which training is done in an environment)).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).
Regarding Claim 13,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer program product of claim 12.
Price et al. further teaches wherein resetting the plurality of antagonist environments comprises resetting the plurality of antagonist environments to a visited expert state in a protagonist demonstration (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can improve (reset) its environment from observations about mentor’s environment (protagonist environment) in which the set of states visited by the mentor used by an observer as heuristic guidance correspond to expert states visited in the mentor’s environment; pg. 609 second full paragraph teaches multiple observer/imitation agents (plurality of antagonist)).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Regarding Claim 14,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer program product of claim 12.
Price et al. further teaches wherein the protagonist environment includes state and action transition functions and reward structure information, and wherein resetting the plurality of antagonist environments comprises using, for each of the plurality of antagonist environments, (i) a same one of the state and action transition functions as the protagonist environment,...and (iii) using visited states of the protagonist environment as an initial state (pg. 582 second & third paragraphs: “Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pro(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model...In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space” teaches the observer agent (antagonist) can obtain mentor (protagonist) environment information, including information about the set of states and “information about the relevance or irrelevance of regions of the state space for certain classes of reward functions” (correspond to reward structure information); pg. 583 last full paragraph to pg. 584:
[media_image1.png: excerpt of Price et al., pg. 583 last full paragraph to pg. 584]
 teaches observer (antagonist) can obtain state transition function of the mentor’s environment; pg. 584 first full paragraph:

[media_image2.png: excerpt of Price et al., pg. 584 first full paragraph]
teaches observer (antagonist) can obtain action transition function of the mentor’s environment, and that the observer can duplicate mentor’s policy including the state and action transitions; duplicating the mentor’s policy (including visited states) corresponds to using protagonist’s visited states as initial states).
Hwang et al., Ma et al., Arel et al., and Price et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Price et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to incorporate implicit imitation into reinforcement learning because “[t]he benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose its greedy actions to move the agent towards higher-valued regions of state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning” (Price et al. pg. 590 first full paragraph).
Bansal et al. further teaches (ii) a reward structure derived from that of the protagonist environment (pg. 3 first full paragraph: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup” teaches that agents compete in 1-vs-1 games, thus rendering each agent can be the antagonist agent compared to the other agent; pg. 3 second to third full paragraph: “Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down. If the blocker is successful in preventing the opponent from reaching the goal and is standing at the end of episode then it gets +1000 reward, if it is not standing then it gets 0 reward, and the opponent gets -1000 reward. If the opponent is successful in reaching it’s goal then it gets +1000 reward and the blocker gets -1000 reward” teaches deriving one agent’s (an antagonist agent) reward from another agent’s (a protagonist agent) reward; for example, when one agent gets a +1000 reward for reaching a goal, the other agent gets a -1000 reward).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
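One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).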
Regarding Claim 15,
Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al. in view of Bansal et al. teaches the computer program product of claim 14.
Bansal et al. further teaches wherein the reward structure derived from that of the protagonist environment includes a negative of a protagonist reward (pg. 3 first full paragraph: “We introduce four competitive environments and experiment with two types of agents. In this paper we focus on two agent worlds, that is 1-vs-1 games, though these environments can be extended to include multiple agents for a mixed competitive and co-operative setup” teaches that agents compete in 1-vs-1 games, thus rendering each agent can be the antagonist agent compared to the other agent; pg. 3 second to third full paragraph: “Run to Goal: The agents start by facing each other in a 3D world and they each have goals on the opposite side of the word (see Fig.1a). The agent that reaches its goal first wins. Reaching the goal before the opponent gives a reward of +1000 to the agent and -1000 to the opponent. If no agent reaches its goal then they both get -1000. You Shall Not Pass: This is the same world as the previous task, but one agent (the blocker) now has the objective of blocking the other agent from reaching it’s goal while not falling down. If the blocker is successful in preventing the opponent from reaching the goal and is standing at the end of episode then it gets +1000 reward, if it is not standing then it gets 0 reward, and the opponent gets -1000 reward. If the opponent is successful in reaching it’s goal then it gets +1000 reward and the blocker gets -1000 reward” teaches deriving one agent’s (an antagonist agent) reward from another agent’s (a protagonist agent) reward; for example, when one agent gets a +1000 reward for reaching a goal, the other agent gets a -1000 reward (corresponds to a negative of a protagonist reward)).
Hwang et al., Ma et al., Arel et al., Price et al., and Bansal et al. are analogous art to the claimed invention because they are directed to reinforcement learning.

It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the above limitation(s) as taught by Bansal et al. to the disclosed invention of Hwang et al. in view of Ma et al. in view of Arel et al. in view of Price et al.
One of ordinary skill in the arts would have been motivated to make this modification in order to leverage multi-agent competition in “new competitive multi-agent 3D physically simulated environments” for “the development of highly complex skills in simple environments with simple rewards” (Bansal et al. pg. 9 first full paragraph).

Response to Arguments
Applicant’s arguments filed on 11/16/2021 with respect to the 35 U.S.C. 103 rejection of claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. 
In particular, Applicant argues that Price does not teach “training, by the processor device, a plurality of antagonist agents to fail a task by reinforcement learning using the protagonist environment” in claim 1 (and analogous claims 11 and 20), and Arel does not cure this deficiency (Remarks, pg. 10). This argument is moot because the combination of Price and Arel is no longer relied upon in teaching the limitation. 
Moreover, Applicant argues that none of the cited references teaches “controlling a vehicle system to control a vehicle using a neural network trained on one or more of the plurality of bad demonstrations to avoid a collision” in claim 1 (and analogous claims 11 and 20) (Remarks, pg. 10-11). This 
Applicant relies on the arguments presented above for each of the dependent claims, therefore the above responses are applicable to the dependent claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YING YU CHEN whose telephone number is (571)270-1484. The examiner can normally be reached Monday-Friday 7:30 am-5:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YING YU CHEN/               Examiner, Art Unit 2125