DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 03/27/2020, 07/31/2020, and 11/04/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiners.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-5, 7-8, 10-15, 17-18, 20-25, 27-28, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Mguni et al. (Pub. No. US 2021/0319362), hereinafter Mguni; in view of Heess et al. (Patent No. US 11,210,585), hereinafter Heess.

Claim 1. 	Mguni discloses a learning system for multi-agent applications, the system comprising: 
one or more processors and a memory, the memory being a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions (The art teaches in Parag. [0017] that each agent 100 is further associated with a computing module comprising memory and processor circuitry), the one or more processors perform operations of: 
initializing a plurality of learning agents, the learning agents including both tactical agents and strategic agents (The art teaches in Parag. [0017] and FIG. 1 a set N={1, . . . , n} of autonomous agents 100 (of which agent 100.1, agent 100.2, and agent 100.n are shown) interact with an environment 110. In the present example, the environment 110 is a physical environment, and each agent 100 is associated with a robot having one or more sensors and one or more actuators);   
causing one or more strategic agents to take an observation from an environment and select one or more of the tactical agents to produce an action that is used to control a platform's actuators or simulated movements in the environment to complete a task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators); and 
causing one or more tactical agents to produce the action corresponding to a learned behavior to control the platform's actuators or simulated movements in the environment to complete the task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators. The art teaches in Parag. [0024] that the process by which each agent 100 learns a respective policy so that the collective behaviour of the agents 100 converges towards an equilibrium is referred to as multi-agent reinforcement learning (MARL). During the MARL process, each agent 100 iteratively updates its respective policy with the objective of maximizing an expected sum of (possibly discounted) rewards over a sequence of time steps. Eventually, the respective policies of the agents 100 will converge to fixed policies, resulting in an equilibrium in which no agent 100 can increase its cumulative discounted reward by deviating from its current respective policy). 
Mguni does not explicitly disclose that the learned behavior is a learned low-level behavior.
However, Heess discloses that the learned behavior is a learned low-level behavior (The art teaches in Col. 1 lines 43-67 that the reinforcement learning system can effectively select actions to be performed by an agent in high-dimensional action spaces in order to complete a task, i.e., by using a hierarchical control structure. The hierarchical control structure includes a high-level controller and low-level controller that differ both in their access to information contained in observations and the time scales at which they operate. This hierarchical control structure enables the low-level controller to focus on reactive control (e.g., swimming or walking) while the high-level controller directs behavior towards a task goal (e.g., reaching a specified target) by modulating these low-level controller behaviors. In addition, by using this hierarchy control structure, the reinforcement learning system may avoid re-training an agent from scratch every time a new task is encountered because the low-level controller can be re-used across a variety of related tasks. For example, the low-level controller can be trained to control the movement of the joints of an agent while performing one task. The same low-level controller can then be used for a different task, e.g., reaching a different goal or accomplishing a different robotic objective).
		It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Mguni to incorporate the teachings of Heess. The motivation would have been to enable the reinforcement learning system to select actions to be performed by the agent in a resource-efficient manner (Heess, Col. 2 lines 1-4).


Claim 2. 	Mguni in view of Heess discloses the learning system as set forth in Claim 1, 
Mguni further discloses the learning system further comprising operations of: training the learning agents to maximize a reward function returned by the environment; maintaining a fitness level for each learning agent during training, where the fitness level represents an average of a net reward obtained by the learning agent from each episode of training (The art teaches in Parag. [0019] that each time an agent 100 selects an action, causing the associated robot to perform the selected action, the agent 100 determines a reward. The reward determined at a given time step generally depends on the state of the environment 110 at the given time step, the action selected by that agent 100 at the given time step, and may depend on actions selected by one or more of the other agents 100 at the given time step. In this example, the reward is a real number, and the objective of each agent 100 is to update its respective policy, to maximise a (possibly discounted) expected cumulative reward over a predetermined (possibly infinite) number of time steps. Each agent 100 can therefore be described as self-interested or rational, as each agent 100 only seeks to maximise its own cumulative reward); and selecting one or more learning agents for additional training, based on their fitness with respect to a collective fitness of the learning agents (The art teaches in Parag. [0020] continuing or infinite horizon tasks, in which agents continue to interact with an environment for an indefinite number of time steps. For a continuing task, the number of time steps over which the agents seek to maximise an expected cumulative reward may be infinite, in which case a multiplicative discount factor is included to ensure convergence of the expected cumulative reward. Other examples involve episodic tasks, in which agents interact with an environment in a series of episodes, each episode having a finite number of time steps. 
The number of time steps for an episodic task may be predetermined, may be random, or may be dependent on the system reaching particular states (for example, an episodic task may have one or more predetermined terminal states, such that when the system reaches a terminal state, the episode ends). For episodic tasks, the initial state of the system may be different for different episodes, and may, for example, be modelled using an initial state probability distribution. The initial state probability distribution may be a priori unknown to the agents. For an episodic task, agents generally aim to maximise an expected cumulative reward over a single episode).  
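For illustration of the quantities discussed for Claim 2, the following is a minimal sketch, assuming a discrete list of per-step rewards. The function names and the below-average selection rule are hypothetical; they appear in neither the claims nor the cited art, which recite only a (possibly discounted) cumulative reward, a fitness level that averages net reward over episodes, and selection relative to a collective fitness.

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative reward of one episode with multiplicative discount factor gamma."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def fitness_level(episode_rewards):
    """Average of the net (undiscounted) reward obtained in each training episode."""
    net_rewards = [sum(episode) for episode in episode_rewards]
    return sum(net_rewards) / len(net_rewards)


def select_for_additional_training(fitness_by_agent):
    """Hypothetical rule: pick agents whose fitness is below the population mean."""
    collective = sum(fitness_by_agent.values()) / len(fitness_by_agent)
    return [agent for agent, f in fitness_by_agent.items() if f < collective]
```

Under this sketch, an agent with fitness 1.0 in a population whose mean fitness is 2.0 would be selected for additional training; other selection rules would be equally consistent with the claim language.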

Claim 3. 	Mguni in view of Heess discloses the learning system as set forth in Claim 2, 
Mguni does not explicitly disclose the learning system further comprising an operation of adapting one or more of the plurality of learning agents to perform a new task in a new domain by performing one or more operations selected from a group consisting of: re-training a high-level strategy network to produce an optimal behavior, where optimality is based on maximizing reward signals obtained from episodes in the new domain; re-training one or more low-level behavior networks to produce optimal behavior in the new domain; or adding and training new behaviors and re-training the high-level strategy network to select these new behaviors based on maximizing reward signals from the new domain. 
However, Heess discloses an operation of adapting one or more of the plurality of learning agents to perform a new task in a new domain by performing one or more operations selected from a group consisting of: re-training a high-level strategy network to produce an optimal behavior, where optimality is based on maximizing reward signals obtained from episodes in the new domain; re-training one or more low-level behavior networks to produce optimal behavior in the new domain; or adding and training new behaviors and re-training the high-level strategy network to select these new behaviors based on maximizing reward signals from the new domain (The art teaches in Col. 1 lines 43-67 that the reinforcement learning system can effectively select actions to be performed by an agent in high-dimensional action spaces in order to complete a task, i.e., by using a hierarchical control structure. The hierarchical control structure includes a high-level controller and low-level controller that differ both in their access to information contained in observations and the time scales at which they operate. This hierarchical control structure enables the low-level controller to focus on reactive control (e.g., swimming or walking) while the high-level controller directs behavior towards a task goal (e.g., reaching a specified target) by modulating these low-level controller behaviors. In addition, by using this hierarchy control structure, the reinforcement learning system may avoid re-training an agent from scratch every time a new task is encountered because the low-level controller can be re-used across a variety of related tasks. For example, the low-level controller can be trained to control the movement of the joints of an agent while performing one task. 
The same low-level controller can then be used for a different task, e.g., reaching a different goal or accomplishing a different robotic objective).
		It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Mguni to incorporate the teachings of Heess. The motivation would have been to enable the reinforcement learning system to select actions to be performed by the agent in a resource-efficient manner (Heess, Col. 2 lines 1-4).

Claim 4. 	Mguni in view of Heess discloses the learning system as set forth in Claim 2,  
Mguni further discloses wherein each learning agent is trained in an initial state space, the initial state space being a set of all possible conditions that may exist in a simulated environment at a start of a training episode (The art teaches in Parag. [0020] involving episodic tasks, in which agents interact with an environment in a series of episodes, each episode having a finite number of time steps. The number of time steps for an episodic task may be predetermined, may be random, or may be dependent on the system reaching particular states (for example, an episodic task may have one or more predetermined terminal states, such that when the system reaches a terminal state, the episode ends). For episodic tasks, the initial state of the system may be different for different episodes, and may, for example, be modelled using an initial state probability distribution. The initial state probability distribution may be a priori unknown to the agents. For an episodic task, agents generally aim to maximize an expected cumulative reward over a single episode. The art teaches in Parag. [0074] that the system of agents and a meta-agent may interact with a virtual environment, for example in a computer game or simulation).  

Claim 5. 	Mguni in view of Heess discloses the learning system as set forth in Claim 4,   
Mguni further discloses wherein the initial state space is sequentially expanded after at least two of the learning agents have fitness levels within a predetermined threshold (The art teaches in Parag. [0020] involving episodic tasks, in which agents interact with an environment in a series of episodes, each episode having a finite number of time steps. The number of time steps for an episodic task may be predetermined, may be random, or may be dependent on the system reaching particular states (for example, an episodic task may have one or more predetermined terminal states, such that when the system reaches a terminal state, the episode ends). For episodic tasks, the initial state of the system may be different for different episodes, and may, for example, be modelled using an initial state probability distribution. The initial state probability distribution may be a priori unknown to the agents. For an episodic task, agents generally aim to maximize an expected cumulative reward (i.e., fitness level (i.e., representing the average of a net reward) within a threshold) over a single episode).  

Claim 7. 	Mguni in view of Heess discloses the learning system as set forth in Claim 2, 
Mguni further discloses where training of learning agents is terminated if no improvement is made for a predetermined number of episodes (The art teaches in Parag. [0020] involving episodic tasks, in which agents interact with an environment in a series of episodes, each episode having a finite number of time steps. The number of time steps for an episodic task may be predetermined, may be random, or may be dependent on the system reaching particular states (for example, an episodic task may have one or more predetermined terminal states, such that when the system reaches a terminal state, the episode ends). For episodic tasks, the initial state of the system may be different for different episodes, and may, for example, be modelled using an initial state probability distribution. The initial state probability distribution may be a priori unknown to the agents. For an episodic task, agents generally aim to maximize an expected cumulative reward over a single episode).  
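For illustration of the termination condition recited in Claim 7, the following is a minimal sketch of a patience-based stopping check. The function name, the use of a per-episode fitness history, and the definition of "improvement" as exceeding the previous best value are assumptions made for the sketch and are not taken from the claims or from Mguni.

```python
def should_terminate(fitness_history, patience):
    """Return True when the last `patience` episodes brought no improvement.

    `fitness_history` holds one fitness value per completed episode and
    `patience` is the predetermined number of episodes (hypothetical names).
    """
    if len(fitness_history) <= patience:
        # Not enough episodes yet to judge a lack of improvement.
        return False
    best_before = max(fitness_history[:-patience])
    best_recent = max(fitness_history[-patience:])
    return best_recent <= best_before
```

In this sketch, a history of 1, 2, 3, 3, 3 with a patience of 2 would terminate training, while a history of 1, 2, 3, 4, 5 would not.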

Claim 8. 	Mguni in view of Heess discloses the learning system as set forth in Claim 1,    
Mguni further discloses where different learning agents are initialized and trained with different hyperparameters (The art teaches in Parag. [0053-0054] that the routine of FIG. 2 is an inner-outer loop method, where the inner loop refers to the M iterations of MARL performed by the agents 100, and the outer loop refers to the K iterations of optimisation of the reward modifier parameter performed by the meta-agent 120. The inner-outer loop method of FIG. 2 has favourable convergence properties compared with, for example, a method that attempts to simultaneously update the respective policies of the agents along with the reward modifier parameter. The meta-agent 120 uses Bayesian optimisation to update the reward modifier parameter. Accordingly, the meta-agent 120 treats the system value J (w, π) as a random function of w having a prior distribution over the space of functions. FIG. 3 shows a routine performed by the meta-agent 120 at each of the K optimisation iterations. The meta-agent 120 loads, at S301, data corresponding to the prior distribution into working memory. In this example, the prior distribution is constructed from a predetermined Gaussian process prior, and the data corresponding to the prior distribution includes a choice of kernel (for example, a Matérn kernel) as well as hyperparameters for the resulting Gaussian process prior).  

Claim 10. 	Mguni in view of Heess discloses the learning system as set forth in Claim 1,  
Mguni further discloses wherein a function is used for reinforcement learning by the learning agents, the function is based on a Kullback-Leibler divergence between an action probability distribution selected by a strategic agent that is being trained with reinforcement learning, and an average of all probability distributions for all of other strategic agents in the population (The art teaches in Parag. [0069] that a distribution of robots desired by a meta-agent that observes, at each time step, the locations of all of the robots in the region 800, where regions bounded within the closed curves 802, 804, and 806 represent regions of decreasing desired robot density. The meta-agent determines a system reward for each episode which depends on a sum of distances from the desired distribution to the distributions observed at each time step (as measured by a Kullback-Leibler (KL) divergence in this example). Specifically, the system reward for an episode in this example is given by minus the sum of the KL divergences determined at each time step. The meta-agent seeks to maximise a system value corresponding to the expected system reward for an episode).   
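For illustration of the system reward described in Mguni Parag. [0069], the following is a minimal sketch, assuming discrete probability distributions given as equal-length lists. The function names, the small constant eps, and the argument order of the divergence (desired distribution first, observed distribution second) are assumptions of the sketch; the cited paragraph states only that the episode reward is minus the sum of the per-step KL divergences.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D_KL(p || q) in nats.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumption of this sketch) to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def episode_system_reward(desired, observed_per_step):
    """Minus the sum of KL divergences between desired and observed distributions."""
    return -sum(kl_divergence(desired, observed) for observed in observed_per_step)
```

When every observed distribution equals the desired distribution, each divergence is zero and the episode reward reaches its maximum of zero; any mismatch makes the reward negative.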

Claim 11. 	Mguni discloses a computer program product for multi-agent applications, the computer program product comprising: 
a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions by one or more processors (The art teaches in Parag. [0017] that each agent 100 is further associated with a computing module comprising memory and processor circuitry), the one or more processors perform operations of:  
initializing a plurality of learning agents, the learning agents including both tactical agents and strategic agents (The art teaches in Parag. [0017] and FIG. 1 a set N={1, . . . , n} of autonomous agents 100 (of which agent 100.1, agent 100.2, and agent 100.n are shown) interact with an environment 110. In the present example, the environment 110 is a physical environment, and each agent 100 is associated with a robot having one or more sensors and one or more actuators);   
causing one or more strategic agents to take an observation from an environment and select one or more of the tactical agents to produce an action that is used to control a platform's actuators or simulated movements in the environment to complete a task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators); and    
causing one or more tactical agents to produce the action corresponding to a learned behavior to control the platform's actuators or simulated movements in the environment to complete the task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators. The art teaches in Parag. [0024] that the process by which each agent 100 learns a respective policy so that the collective behaviour of the agents 100 converges towards an equilibrium is referred to as multi-agent reinforcement learning (MARL). During the MARL process, each agent 100 iteratively updates its respective policy with the objective of maximizing an expected sum of (possibly discounted) rewards over a sequence of time steps. Eventually, the respective policies of the agents 100 will converge to fixed policies, resulting in an equilibrium in which no agent 100 can increase its cumulative discounted reward by deviating from its current respective policy). 
Mguni does not explicitly disclose that the learned behavior is a learned low-level behavior.
However, Heess discloses that the learned behavior is a learned low-level behavior (The art teaches in Col. 1 lines 43-67 that the reinforcement learning system can effectively select actions to be performed by an agent in high-dimensional action spaces in order to complete a task, i.e., by using a hierarchical control structure. The hierarchical control structure includes a high-level controller and low-level controller that differ both in their access to information contained in observations and the time scales at which they operate. This hierarchical control structure enables the low-level controller to focus on reactive control (e.g., swimming or walking) while the high-level controller directs behavior towards a task goal (e.g., reaching a specified target) by modulating these low-level controller behaviors. In addition, by using this hierarchy control structure, the reinforcement learning system may avoid re-training an agent from scratch every time a new task is encountered because the low-level controller can be re-used across a variety of related tasks. For example, the low-level controller can be trained to control the movement of the joints of an agent while performing one task. The same low-level controller can then be used for a different task, e.g., reaching a different goal or accomplishing a different robotic objective).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Mguni to incorporate the teachings of Heess. The motivation would have been to enable the reinforcement learning system to select actions to be performed by the agent in a resource-efficient manner (Heess, Col. 2 lines 1-4).

Claims 12-15 are taught by Mguni in view of Heess as described for claims 2-5, respectively.

Claims 17, 18, and 20 are taught by Mguni in view of Heess as described for claims 7, 8, and 10, respectively.

Claim 21. 	Mguni discloses a computer implemented method for multi-agent applications, the method comprising an act of: 
causing one or more processors to execute instructions encoded on a non-transitory computer-readable medium, such that upon execution (The art teaches in Parag. [0017] that each agent 100 is further associated with a computing module comprising memory and processor circuitry), the one or more processors perform operations of:  
initializing a plurality of learning agents, the learning agents including both tactical agents and strategic agents (The art teaches in Parag. [0017] and FIG. 1 a set N={1, . . . , n} of autonomous agents 100 (of which agent 100.1, agent 100.2, and agent 100.n are shown) interact with an environment 110. In the present example, the environment 110 is a physical environment, and each agent 100 is associated with a robot having one or more sensors and one or more actuators); 
causing one or more strategic agents to take an observation from an environment and select one or more of the tactical agents to produce an action that is used to control a platform's actuators or simulated movements in the environment to complete a task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators); and  
causing one or more tactical agents to produce the action corresponding to a learned behavior to control the platform's actuators or simulated movements in the environment to complete the task (The art teaches in Parag. [0018] that each agent 100 (i.e., strategic agent) selects actions according to a respective (stochastic) policy, and sends control signals to the associated robot (i.e., tactical agents) corresponding to the selected actions, causing the associated robot to perform the selected actions on the environment 110 using the one or more actuators. The art teaches in Parag. [0024] that the process by which each agent 100 learns a respective policy so that the collective behaviour of the agents 100 converges towards an equilibrium is referred to as multi-agent reinforcement learning (MARL). During the MARL process, each agent 100 iteratively updates its respective policy with the objective of maximizing an expected sum of (possibly discounted) rewards over a sequence of time steps. Eventually, the respective policies of the agents 100 will converge to fixed policies, resulting in an equilibrium in which no agent 100 can increase its cumulative discounted reward by deviating from its current respective policy). 
Mguni does not explicitly disclose that the learned behavior is a learned low-level behavior.
However, Heess discloses that the learned behavior is a learned low-level behavior (The art teaches in Col. 1 lines 43-67 that the reinforcement learning system can effectively select actions to be performed by an agent in high-dimensional action spaces in order to complete a task, i.e., by using a hierarchical control structure. The hierarchical control structure includes a high-level controller and low-level controller that differ both in their access to information contained in observations and the time scales at which they operate. This hierarchical control structure enables the low-level controller to focus on reactive control (e.g., swimming or walking) while the high-level controller directs behavior towards a task goal (e.g., reaching a specified target) by modulating these low-level controller behaviors. In addition, by using this hierarchy control structure, the reinforcement learning system may avoid re-training an agent from scratch every time a new task is encountered because the low-level controller can be re-used across a variety of related tasks. For example, the low-level controller can be trained to control the movement of the joints of an agent while performing one task. The same low-level controller can then be used for a different task, e.g., reaching a different goal or accomplishing a different robotic objective).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Mguni to incorporate the teachings of Heess. The motivation would have been to enable the reinforcement learning system to select actions to be performed by the agent in a resource-efficient manner (Heess, Col. 2 lines 1-4).

Claims 22-25 are taught by Mguni in view of Heess as described for claims 2-5, respectively.

Claims 27, 28, and 30 are taught by Mguni in view of Heess as described for claims 7, 8, and 10, respectively.

Claims 6, 16, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Mguni et al. (Pub. No. US 2021/0319362), hereinafter Mguni; in view of Heess et al. (Patent No. US 11,210,585), hereinafter Heess, and in view of Fong (Pub. No. US 2019/0197244).

Claim 6. 	Mguni in view of Heess discloses the learning system as set forth in Claim 2,  
The combination does not explicitly disclose where a difficulty of obtaining positive rewards increases during training.  
However, Fong discloses where a difficulty of obtaining positive rewards increases during training (The art teaches in Parag. [0064] a reinforcement learning model 340 may be configured and trained by a machine learning agent 310. The agent 310 may initially define an action space 330, success or reward levels, and keep track of one or more policies which may be set by a scoring engine 320. An environment 350 may be an application, a platform, a system, or a set of applications. The art teaches in Parag. [0068] that setting the appropriate reward for “positive” and “negative” system resource usage may be difficult).
		It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to incorporate the teachings of Fong. The motivation would have been to train the system efficiently by introducing challenges during the training.

Claims 16 and 26 are taught by Mguni in view of Heess and Fong as described for claim 6.

Claims 9, 19, and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Mguni et al. (Pub. No. US 2021/0319362), hereinafter Mguni; in view of Heess et al. (Patent No. US 11,210,585), hereinafter Heess, and in view of Chai et al. (Pub. No. US 2020/0134461), hereinafter Chai.

Claim 9. 	Mguni in view of Heess discloses the learning system as set forth in Claim 1,   
The combination does not explicitly disclose wherein the low-level behavior includes a behavior selected from a group consisting of pursuit of opponents, evasion of opponents, and evasion of enemy projectiles.
However, Chai discloses wherein the low-level behavior includes a behavior selected from a group consisting of pursuit of opponents, evasion of opponents, and evasion of enemy projectiles (The art teaches in Parag. [0158-0162] that computing system 100 may prevent an adversary from learning its AI behavior, by enabling computing system 100 to self-reconfigure its learning approach (e.g., using different hyperparameters in the loss function described in Equation 16 to arrive at a new set of DNN parameters, with different bit-precision and range of values). From a cyber-security perspective, such a learning approach may be resilient to adversarial targeting. The following are example application embodiments for cyber-security: Causative attacks (e.g., attacks that are missed because they are gradual over time because AI vulnerabilities are introduced during training): the AI system can change learning method such that AI vulnerabilities are difficult to detect. An adversary may attempt to cause the AI system to classify some set of input incorrectly by manipulating the training data. For instance, if the AI system is used to detect suspicious credit card transactions, an adversary who was planning a theft using a particular type of credit card transaction may manipulate the training data such that the AI system does not recognize the particular type of credit card transaction as suspicious. However, such an attack may be less likely to succeed if there are multiple version of the neural network software architecture and which version of the neural network software architecture is deployed changes. Exploratory attacks (e.g., rule-based triggers can be inferred with sufficient sampling of system outputs, exploiting vulnerabilities after training): our AI system can reconfigure to change the underlying response and reward functions (e.g., in a reinforcement learning approach) so that its protective measures are difficult for adversaries to learn. 
For example, an adversary may be able to predict how an AI system will classify input data by observing a sufficient number of system outputs. In this example, if the AI system is designed to detect suspicious credit card transactions, the adversary may be able to identify a particular type of credit card transaction that the AI system incorrectly classifies as innocuous. Accordingly, in this example, the adversary may be able to start using the particular type of credit card transaction to commit a crime. However, by changing the response and reward functions (e.g., a different versions of a neural network software architecture are deployed as hardware architecture parameters change), it may be significantly more difficult for the adversary to identify sets of input data where the AI system misclassified the input data in a way that is favorable to the adversary. Evasion attacks (e.g., an attack signal is below detection thresholds, evading by obfuscation)—our AI system can increase detection capability (e.g., changing honeypot location and complexity) to detect enemy. Poisoning attacks (e.g., attacks in which adversaries corrupt training data which weaken the distribution of input data, which results in misclassification)). 
		It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to incorporate the teachings of Chai. Doing so would increase detection capability to detect the enemy (Parag. [0161]).
	
Claims 19 and 29 are taught by Mguni in view of Heess and Fong as described for claim 9.

Conclusion
		The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Van Seijen et al. (US 2018/0165603): related art in the area of hybrid reward architecture for reinforcement learning (Abstract: aspects provided herein are relevant to machine learning techniques, including decomposing single-agent reinforcement learning problems into simpler problems addressed by multiple agents. Actions proposed by the multiple agents are then aggregated using an aggregator, which selects an action to take with respect to an environment. Aspects provided herein are also relevant to a hybrid reward model).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELBASST TALIOUA whose telephone number is (571)272-4061.  The examiner can normally be reached on Monday-Thursday 7:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Trost, can be reached at 571-272-7872.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/A.T./Patent Examiner, Art Unit 2442

/WILLIAM G TROST IV/Supervisory Patent Examiner, Art Unit 2442