DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/12/2019 and 2/05/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Provisional Application
The present application claims benefit of provisional application No. 62/673,747 dated May 18, 2018. 

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 06/23/2020 has been entered.
 
Amendments
This action is in response to the amendments filed 06/23/2020. As per applicant's request, claims 1, 3, 10, and 12 are amended. No claims have been added or cancelled. Claims 1, 3-10, 12, and 14-22 remain pending.


Response to Arguments
Applicant’s arguments filed on 06/23/2020 with regard to the 35 U.S.C. 103 rejection of claims 1, 10, and 12 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection. Gangwani is a new prior art reference being incorporated into the rejection in order to teach the newly amended limitations.

Applicant’s arguments filed on 06/23/2020 with regard to the 35 U.S.C. 112(b) rejection of claim 3 have been fully considered and are persuasive. The rejection is withdrawn.

Applicant’s arguments filed on 06/23/2020 with regard to the 35 U.S.C. 103 rejection of the dependent claims have been fully considered but are not persuasive, as they rely upon the allowability of the independent claims.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 4-5, 10, 12, 15-16, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Egorov, M. (Egorov, M., “Multi-Agent Deep Reinforcement Learning”: hereinafter Egorov) in view of Rusu, A. A., et al. (Rusu, A. A., “POLICY DISTILLATION”: hereinafter Rusu) and Gangwani et al. (“POLICY OPTIMIZATION BY GENETIC DISTILLATION”: hereinafter Gangwani).

Regarding claim 1,
Egorov teaches a method of training a final agent policy neural network that is used to select actions to be performed by an agent interacting with an environment to perform a reinforcement learning task (p. 1, Section “Abstract”, “This work introduces a novel approach for solving reinforcement learning problems in multi-agent settings.”, lines 1 – 2, and p. 1, Section “1. Introduction”, “The goal of this work is to study multi-agent systems using deep reinforcement learning (DRL).”, left col., ¶ 1, lines 5 – 6, and “We extend the state-of-the-art approach for solving DRL problems [13] to multi-agent systems with this state representation…. We describe a number of implementation contributions that make training efficient and allow agents to learn directly from the behavior of other agents in the system.”, right col., ¶ 2, lines 9 – 17: the multi-agents are ‘candidate agent policy neural networks’),
the method comprising:
maintaining data specifying a plurality of candidate agent policy neural networks, wherein each candidate agent policy neural network is configured to process a network input to generate a policy output, and wherein the plurality of candidate agent policy neural networks includes the final agent policy neural network (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “When multiple agents are interacting in an environment, their actions may directly impact the actions of other agents. To that end, agents must be able to reason about one another in order to act intelligently….After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3, lines 1 – 12: ‘one set of agents’ is interpreted as ‘the final agent policy neural network’),
and wherein the final agent policy neural network defines an action selection policy for the agent that is more complex than an action selection policy defined by at least one other candidate agent policy neural network in the plurality of candidate agent policy neural networks (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3, lines 7 – 12: ‘one set of agents’ is interpreted as ‘the final agent policy neural network’ and this set of agents with the incrementally improved policy may be interpreted as ‘the final agent policy neural network defines an action selection policy for the agent that is more complex than an action selection policy defined by at least one other candidate agent policy neural network in the plurality of candidate agent policy neural networks’ as recited in the claim);
training the plurality of candidate agent policy neural networks jointly to perform the reinforcement learning task, comprising, at each of a plurality of training iterations (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “To that end, agents must be able to reason about one another in order to act intelligently. In order to incorporate multi-agent training, we train one agent at a time, and keep the policies of all the other agents fixed during this period. After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3):
obtaining, from the training data a reinforcement learning training network input comprising a first observation of the environment (p. 3, Figure 3, and p. 2, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “This section outlines an approach for multi-agent deep reinforcement learning (MADRL)….These assumptions allow us to represent the global system state as an image-like tensor, with each channel of the image containing agent and environment specific information (see Figure 2).”, right col., last two lines in ¶ 1 to first two lines on p.3 left col.),
generating, using the candidate agent policy neural networks …, a first combined action selection policy for controlling the agent using the reinforcement learning training network input  (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “To that end, agents must be able to reason about one another in order to act intelligently. In order to incorporate multi-agent training, we train one agent at a time, and keep the policies of all the other agents fixed during this period. After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3),
Egorov does not explicitly teach, but Rusu teaches, in an analogous system,  determining a reinforcement learning parameter update for the candidate agent policy neural networks using a reinforcement learning technique to generate combined action selection policies …(p. 2, Sec. “3 Approach: 3.1 Deep Q-Learning”, “DQN is a state-of-the-art model-free approach to reinforcement learning using deep networks, in environments with discrete action choices,..”, ¶ 1, lines 1 – 3, and p. 4, Figure 2 (b), Sec. “3.3 Multi-Task Policy Distillation”, “We use n DQN single-game experts, each trained separately. These agents produce inputs and targets, just as with single-game distillation, and the data is stored in separate memory buffers….For multi-task DQN, the approach is similar to single-game learning: the network is optimized to predict the average discounted return of each possible action given a small number of consecutive observations….Policy distillation may offer a means of combining multiple policies into a single network without the damaging interference and scaling problems.”, entire sec.: the reinforcement learning is achieved using DQN to generate an optimized action selection policy, embodied in the policy student net, from the multi-task, i.e. combined policy DQN nets),
that result in improved performance of the agent on the reinforcement learning task… (p. 7, Table 2 Performance of a distilled multi-task agent on 10 Atari games, and Sec. “4.4 MULTI-GAME POLICY DISTILLATION RESULTS”, “The distilled agent is much more stable than the DQN teacher and achieves similar or equal performance on all games (see Appendix C for additional examples of online distillation).”),
comprising determining a gradient with respect to parameters of the candidate agent policy neural networks of a reinforcement learning loss function that encourages the combined action selection policies to show improved performance on the reinforcement learning task (p. 10, Appendix A Experimental Details: Policy Distillation Training Procedure, “We used the RmsProp (Tieleman and Hinton, 2012) variation of minibatch stochastic gradient descent to train student networks.”, ¶ 2, and “Distillation Targets Using DQN outputs we have defined three types of training targets that correspond to the three distillation loss functions discussed in Section 3…. This way, performance on multiple games can be measured using the geometric mean (Fleming and Wallace, 1986).”),
obtaining, from the training data, a matching training network input comprising a second observation of the environment (p. 2, Sec. “3 Approach: 3.1 Deep Q-Learning”, “In deep Q-learning, a neural network is optimized to predict the average discounted future return of each possible action given a small number of consecutive observations…. Thus, given an environment E whose interface at timestep i comprises actions …, observations …, and rewards …,”, ¶ 1, lines 3 – 8, and p. 4, Sec. “3.3 Multi-Task Policy Distillation”, “For multi-task DQN, the approach is similar to single-game learning: the network is optimized to predict the average discounted return of each possible action given a small number of consecutive observations.”, ¶ 2, lines 2- 4 ); 
(p. 4, Figure 2 (b), Sec. “3.3 Multi-Task Policy Distillation”, “We use n DQN single-game experts, each trained separately. These agents produce inputs and targets, just as with single-game distillation, and the data is stored in separate memory buffers. The distillation agent then learns from the n data stores sequentially, switching to a different one every episode….For multi-task DQN, the approach is similar to single-game learning: the network is optimized to predict the average discounted return of each possible action given a small number of consecutive observations….”, entire sec.)
determining a matching parameter update for the candidate agent policy neural networks that encourages the candidate agent policy neural networks to generate policy outputs that are aligned with other action policy outputs that are generated by the other candidate agent policy neural networks by processing the same training network input (p. 4, Figure 2 (b), Sec. “3.3 Multi-Task Policy Distillation”, “The distillation agent then learns from the n data stores sequentially, switching to a different one every episode. Since different tasks often have different action sets, a separate output layer (called the controller layer) is trained for each task and the id of the task is used to switch to the correct output during both training and evaluation.”, ¶ 1: the switching to the correct output during training teaches ‘determining a matching parameter update for the candidate agent policy neural networks that encourages the candidate agent policy neural networks to generate policy outputs that are aligned with other action policy outputs that are generated by the other candidate agent policy neural networks’),
(p. 4, Sec. “3.3 Multi-Task Policy Distillation”, “We also experiment with both the KL and NLL distillation loss functions for multi-task learning.”, ¶ 1, and p. 10, Appendix A Experimental Details: “Policy Distillation Training Procedure… We used the RmsProp (Tieleman and Hinton, 2012) variation of minibatch stochastic gradient descent to train student networks.”, ¶ 2, lines 4 - 5);
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate determining a reinforcement learning parameter update for the candidate agent policy neural networks using a reinforcement learning technique to generate combined action selection policies utilizing a second observation of the environment, generating a respective second policy output for each candidate agent policy neural network, and determining a matching parameter update for the candidate agent policy neural networks that encourages the candidate agent policy neural networks to generate policy outputs that are aligned with other action policy outputs that are generated by the other candidate agent policy neural networks by processing the same training network input, as in Rusu, into the multi-agent deep reinforcement learning network of Egorov, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, to perform an action selection policy for a reinforcement learning task. The motivation behind the incorporation of the features from Rusu is the improvement in stability by distilling online the policy of the best performing agent (in Rusu, p. 8, Section “5 Discussion”, “In this work we have applied distillation to policy learnt in deep Q-networks. This procedure has been used for three distinct purposes: (1) to compress policies learnt on single games in smaller models, (2) to build agents that are capable of playing multiple games, (3) to improve the stability of the DQN algorithm by distilling online the policy of the best performing agent.”, ¶ 1, lines 1 - 4).
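For illustration only, a matching term of the general kind discussed above can be sketched as a KL divergence between the action distributions that two networks produce for the same input. The sketch is not taken from Rusu or from the application; the function names `softmax` and `kl_divergence`, the `temperature` parameter, and the example logits are all assumptions chosen for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw action scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical outputs of two candidate networks for one observation.
teacher_probs = softmax([2.0, 0.5, -1.0])
student_probs = softmax([1.0, 1.0, 0.0])

# The matching loss is zero when the distributions agree and positive otherwise.
matching_loss = kl_divergence(teacher_probs, student_probs)
```

In a training loop, a gradient of such a term with respect to the student's parameters would pull the student's outputs toward the teacher's; that step is omitted here.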
Egorov/Rusu does not explicitly teach, but Gangwani teaches, in an analogous system, initializing mixing data that assigns respective weights to each of the candidate agent policy neural networks that define how policy outputs generated by the candidate agent policy neural networks are combined to generate combined policy outputs that are used to select actions to be performed by the agent (Section 3.3.3, page 7, Algorithm 1, discloses a select operator that uses initial policies with random parameters, i.e. the initialized mixing data. The select function includes the KL fitness function, which finds the sum of expected returns of both parent policies, where weights are provided to the function to encourage exploration of the state space, i.e. the respective weights assigned to each of the candidate agent policy neural networks, which are the parent policies. Furthermore, these weights define how the policy outputs generated by the candidate policy neural networks are combined to generate the combined policy outputs that are used to select actions to be performed, as section 3.3.3 discloses that the select operator returns a set of policy-couples for use in the crossover step. The crossover step in section 3.3.1 discloses that the two parent policies are mixed together to produce a child policy that learns using the observation, action distribution, and trajectories of the parent policies. Furthermore, trajectories are sampled from the child policy for inclusion in the training dataset used to train the child policy; therefore, an action must have been selected by an agent, as a trajectory was sampled using the child policy.);
…during the training, repeatedly generating training data for the plurality of candidate agent policy neural networks by controlling the agent using combined policy outputs generated in accordance with the respective weights for each of the candidate agent policy neural networks in the mixing data (Section 3.3.1, Pages 5-6, “Our training dataset D is initialized with trajectories from the expert. After iteration i of training, we sample some trajectories from the current student (π_c^(i)), label the actions in these trajectories using the expert and form a dataset D_i. Training for iteration i + 1 then uses {D ∪ D_1 . . . ∪ D_i} to minimize the KL-divergence loss.”: this passage discloses generating training data after some training iterations by sampling trajectories from the current student policy, i.e. the combined policy generated by mixing the two parent policies, to produce datasets D_i, and then further training for iteration i + 1 using the generated dataset that includes the outputs generated from the combined student/child policy π_c^(i).)… in accordance with the weights in the mixing data as of the training iteration… during the training, repeatedly adjusting the weights in the mixing data to, when generating combined policy outputs that are used to control the agent during the generating of the training data, favor higher-performing candidate agent policy neural networks (Section 3.3.3, Page 7, Algorithm 1, “In the early rounds, a relatively higher weight could be provided to KL-driven fitness to encourage exploration of the state-space. The weight could be annealed with rounds of Algorithm 1 for encouraging high-performance policies”: this passage discloses annealing the weights with each round (iteration) of Algorithm 1, i.e. adjusting the weights in the mixing data, to encourage high-performance policies.)
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the policy distillation procedure of Gangwani into the reinforcement learning system of Egorov/Rusu. One of ordinary skill in the art would have been motivated to make the combination in order to produce a child policy that performs better than either parent policy from which it is produced (Gangwani, Section 3.2, page 5).
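As a purely illustrative aid to reading the limitation on repeatedly adjusting the weights to favor higher-performing candidates, the sketch below recomputes a set of mixing weights from per-candidate performance. The softmax-over-returns rule, the function name `update_mixing_weights`, and the example returns are assumptions made for this example; neither the claims as quoted above nor Gangwani is represented as using this particular rule.

```python
import math

def update_mixing_weights(avg_returns, temperature=1.0):
    """Return weights that sum to 1 and grow with each candidate's return.

    Assumed rule for illustration: a softmax over average returns.
    """
    m = max(avg_returns)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in avg_returns]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical average returns of three candidate policies.
avg_returns = [1.0, 3.0, 2.0]
weights = update_mixing_weights(avg_returns)

# Index of the candidate that currently receives the largest weight.
best = weights.index(max(weights))
```

Calling the function again after each evaluation round would shift weight toward whichever candidate is then performing best; a lower `temperature` makes that shift sharper.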

Regarding claim 4, the rejection of claim 1 is incorporated and further:
Egorov further teaches wherein the final agent policy neural network has more parameters than at least one other candidate agent policy neural network (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “To that end, agents must be able to reason about one another in order to act intelligently. In order to incorporate multi-agent training, we train one agent at a time, and keep the policies of all the other agents fixed during this period. After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3: The final one set of agents shows the incremental improvement of their policy over time because they have more parameters than at least one other candidate agent policy neural network as this final agent policy network is the outcome of the multi-agent deep reinforcement learning network as presented in Egorov).

Regarding claim 5, the rejection of claim 1 is incorporated and further:
Egorov further teaches wherein the final agent policy neural network generates outputs that define a larger action space for the agent than at least one other candidate agent policy neural network (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “When multiple agents are interacting in an environment, their actions may directly impact the actions of other agents. To that end, agents must be able to reason about one another in order to act intelligently….After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3, lines 7 – 12: the multi-agents are ‘candidate agent policy neural networks’. As the learning process progresses, this last one set of agents has incrementally improved its policy, i.e. generates outputs that define a larger action space, than the other candidate agent policy neural networks).

Regarding claim 21, the rejection of claim 1 is incorporated and further:
Rusu further teaches wherein the reinforcement learning training network input and the matching training network input are the same (p. 4, Sec. “3.3 MULTI-TASK POLICY DISTILLATION”, Figure 2(b) shows the input states to the DQN, teacher networks, and the final Policy Net network, student network, are the same).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the use of the same input as both the reinforcement learning training network input and the matching training network input, as in Rusu, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, to perform an action selection policy for a reinforcement learning task. The motivation behind the incorporation of the features from Rusu is the improvement in stability by distilling online the policy of the best performing agent (in Rusu, p. 8, Section “5 Discussion”, “In this work we have applied distillation to policy learnt in deep Q-networks. This procedure has been used for three distinct purposes: (1) to compress policies learnt on single games in smaller models, (2) to build agents that are capable of playing multiple games, (3) to improve the stability of the DQN algorithm by distilling online the policy of the best performing agent.”, ¶ 1, lines 1 - 4).

Claim 10 recites non-transitory computer readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations corresponding to the method steps recited in and addressed above in connection with claim 1. In addition, the non-transitory media are inherent in Egorov’s disclosure of simulation of motions of the agents in Section 5 (“Results”). Therefore claim 10 is rejected under rationale similar to that set forth in connection with the rejection of claim 1 above.


Claims 12, 15-16, and 22 recite a system comprising computers and storage devices for performing operations corresponding to the method steps recited in, and addressed above in connection with, claims 1, 4-5, and 21 respectively. In addition, the computers and storage devices are inherent in Egorov’s disclosure of simulation of motions of the agents in Section 5 (“Results”). Therefore claims 12, 15-16, and 22 are rejected under rationale similar to that set forth in connection with the rejection of claims 1, 4-5, and 21 respectively above.




Claims 3, 6-9, 14, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Egorov, M. (Egorov, M., “Multi-Agent Deep Reinforcement Learning”: hereinafter Egorov) in view of Rusu, A. A., et al. (Rusu, A. A., “POLICY DISTILLATION”: hereinafter Rusu) and Gangwani et al. (“POLICY OPTIMIZATION BY GENETIC DISTILLATION”: hereinafter Gangwani), and further in view of Zhou, Z-H., et al. (Zhou, Z-H., “Ensembling neural networks: Many could be better than all”: hereinafter Zhou).

Regarding claim 3, the rejection of claim 1 is incorporated and further:
Rusu further teaches the matching parameter updates as related to the agent policy neural networks (p. 4, Sec. “3.3 Multi-Task Policy Distillation”, “The distillation agent then learns from the n data stores sequentially, switching to a different one every episode. Since different tasks often have different action sets, a separate output layer (called the controller layer) is trained for each task and the id of the task is used to switch to the correct output during both training and evaluation.”, ¶ 1).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the matching parameter updates as related to the agent policy neural networks, as in Rusu, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, to perform an action selection policy for a reinforcement learning task. The motivation behind the incorporation of the features from Rusu is the improvement in stability by distilling online the policy of the best performing agent (in Rusu, p. 8, Section “5 Discussion”, “In this work we have applied distillation to policy learnt in deep Q-networks. This procedure has been used for three distinct purposes: (1) to compress policies learnt on single games in smaller models, (2) to build agents that are capable of playing multiple games, (3) to improve the stability of the DQN algorithm by distilling online the policy of the best performing agent.”, ¶ 1, lines 1 - 4).

Egorov/Rusu/Gangwani does not explicitly teach, but Zhou teaches, in an analogous system,
wherein training the candidate agent policy neural networks to generate action selection policies that are aligned comprises:
(p. 246, Section “3. Selective ensemble of neural networks”, “In this section we present a practical approach, i.e., GASEN, to find out the neural networks that should be excluded from the ensemble. The basic idea of this approach is a heuristics, i.e., assuming each neural network can be assigned a weight that could characterize the fitness of including this network in the ensemble…”, ¶ 2, and “GASEN assigns a random weight to each of the available neural networks at first. Then it employs genetic algorithm to evolve those weights so that they can characterize to some extent the fitness of the neural networks in joining the ensemble. Finally it selects the networks whose weight is bigger than a pre-set threshold λ to make up the ensemble.” ¶ 5 last two lines to the first line on p. 247: as training evolves, the ensemble networks of Zhou result in a decreased impact on the final architecture of neural networks that are a poor fit to join the ensemble, while favoring other networks, i.e. those with a higher weight attached to them, to join the ensemble as they are a better fit. Of this last group, one network with the highest weight associated with it may be designated the ‘final agent policy neural network’).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to employ the use of training weighted ensemble general neural networks, as in Zhou, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, by decreasing the impact of parameter updates of the candidate agent policy neural networks. The motivation behind the incorporation of the weighted ensemble formulation from Zhou is that it showed that it may be better to ensemble many, or at least some of the networks, instead of all of the available networks to improve performance as well as reducing bias together with variance of the final network (in Zhou, p. 261, Section “6. Conclusions”, “In this paper, the relationship between the ensemble and its component neural networks is analyzed, which reveals that it may be a better choice to ensemble many instead of all the available neural networks.”, ¶ 1, lines 3 - 5, and “It seems that the success of GASEN mainly lies in that GASEN could reduce the bias as well as the variance.”, ¶ 3, lines 2 - 3).
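The final selection step quoted from Zhou above (keeping the networks whose weight is bigger than a pre-set threshold λ) can be sketched as follows. This is a minimal illustration under stated assumptions: the genetic-algorithm step that evolves the weights is omitted, and the function name `select_ensemble`, the example weights, and the threshold value are hypothetical.

```python
def select_ensemble(weights, threshold):
    """Return the indices of networks whose weight exceeds the threshold."""
    return [i for i, w in enumerate(weights) if w > threshold]

# Hypothetical weights after evolution, one per available network.
evolved_weights = [0.05, 0.40, 0.30, 0.25]

# Stands in for the pre-set threshold λ; the value is an assumption.
threshold = 0.2

# Networks 1, 2, and 3 are kept; network 0 is excluded from the ensemble.
selected = select_ensemble(evolved_weights, threshold)
```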


Regarding claim 6, the rejection of claim 1 is incorporated and further:
Egorov/Rusu/Gangwani teaches the candidate agent policy neural networks (see previous citations).
Egorov/Rusu/Gangwani does not teach, but Zhou teaches, in an analogous system, wherein generating, using the … neural networks and in accordance with the weights in the mixing data as of the training iteration, … using the training network input comprises: 
processing the training network input using each of the … neural networks to…; and combining … in accordance with the weights as of the training iteration to generate the combined…[network] (p. 246, Section “3. Selective ensemble of neural networks”, “In this section we present a practical approach, i.e., GASEN, to find out the neural networks that should be excluded from the ensemble. The basic idea of this approach is a heuristics, i.e., assuming each neural network can be assigned a weight that could characterize the fitness of including this network in the ensemble…”, ¶ 2, and “Here we explain the motivation of GASEN from the context of regression. Suppose the weight of the i th component neural network is wi , which satisfies both Eqs. (1) and (2). Then we get a weight vector w = (w1,w2, . . . ,wN). Since the optimum weights should minimize the generalization error of the ensemble,…”, ¶ 3, and “GASEN assigns a random weight to each of the available neural networks at first. Then it employs genetic algorithm to evolve those weights so that they can characterize to some extent the fitness of the neural networks in joining the ensemble.” ¶ 5 last two lines to the first line on p. 247).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to employ the use of training weighted ensemble general neural networks, as in Zhou, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, by processing the training network input using each of the candidate agent policy neural networks to generate a respective action selection policy for each candidate agent policy neural network and combining the action selection policies in accordance with the weights as of the training iteration to generate the combined action selection policy. The motivation behind the incorporation of the weighted ensemble formulation from Zhou is that it showed that it may be better to ensemble many, or at least some of the networks, instead of all of the available networks to improve performance as well as reducing bias together with variance of the final network (in Zhou, p. 261, Section “6. Conclusions”, “In this paper, the relationship between the ensemble and its component neural networks is analyzed, which reveals that it may be a better choice to ensemble many instead of all the available neural networks.”, ¶ 1, lines 3 - 5, and “It seems that the success of GASEN mainly lies in that GASEN could reduce the bias as well as the variance.”, ¶ 3, lines 2 - 3).
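For illustration only, the two steps discussed for claim 6 (processing one input with each network, then combining the results in accordance with the weights) can be sketched as a weighted average of per-network action distributions. The weighted-average rule, the function name `combine_policies`, and the example numbers are assumptions for this example and are not attributed to Zhou or to the application.

```python
def combine_policies(policy_outputs, weights):
    """Weighted average of per-network action distributions.

    policy_outputs: one probability list per network, all the same length.
    weights: one non-negative weight per network (normalized internally).
    """
    total = sum(weights)
    n_actions = len(policy_outputs[0])
    return [
        sum(w * probs[a] for w, probs in zip(weights, policy_outputs)) / total
        for a in range(n_actions)
    ]

# Hypothetical outputs of two candidate networks for the same input.
policy_outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

# Hypothetical mixing weights as of the current training iteration.
weights = [0.25, 0.75]

combined = combine_policies(policy_outputs, weights)
```

Because each input distribution sums to 1 and the weights are normalized, the combined output is itself a valid distribution from which an action could be sampled.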

Regarding claim 7, the rejection of claim 1 is incorporated and further:
Egorov teaches wherein training the plurality of candidate agent policy neural networks jointly to perform the reinforcement learning task comprises: 
training a population of combinations of candidate agent policy neural networks (p. 1, Section “Abstract”, “This work introduces a novel approach for solving reinforcement learning problems in multi-agent settings.”, lines 1 – 2, and p. 1, Section “1. Introduction”, “The goal of this work is to study multi-agent systems using deep reinforcement learning (DRL).”, left col.,  ¶ 1, lines 5 – 6, and “We extend the state-of-the-art approach for solving DRL problems [13] to multi-agent systems with this state representation…. We describe a number of implementation contributions that make training efficient and allow agents to learn directly from the behavior of other agents in the system.”, right col., ¶ 2, lines 9 – 17:  the multi-agents are ‘candidate agent policy neural networks’),
Egorov/Rusu/Gangwani does not explicitly teach, but Zhou teaches, in an analogous system,… and wherein repeatedly adjusting the weights in the mixing data to favor higher-performing candidate agent policy neural networks… during the training (p. 246, Section “3. Selective ensemble of neural networks”, “In this section we present a practical approach, i.e., GASEN, to find out the neural networks that should be excluded from the ensemble. The basic idea of this approach is a heuristics, i.e., assuming each neural network can be assigned a weight that could characterize the fitness of including this network in the ensemble…”, ¶ 2: the weights in the mixing data, i.e. weight vector in Zhou, are adjusted accordingly to de-emphasize the networks that do not have a good fit in the ensemble, i.e. lower-performing combinations, in favor of the networks that do have a better fit, i.e. higher-performing combinations. This is done in the context of the ensemble neural network learning method detailed in Zhou which is similar to the ‘population-based training technique’ cited in the claim).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to employ the use of training weighted ensemble general neural networks, as in Zhou, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, to repeatedly adjust the weights in the mixing data to favor higher-performing candidate agent policy neural networks and adjust the weights in the mixing data used by lower-performing combinations based on weights used by higher-performing combinations using a population-based training technique. The motivation behind the incorporation of the weighted ensemble formulation from Zhou is that it showed that it may be better to ensemble many, or at least some of the networks, instead of all of the available networks to improve performance as well as reducing bias together with variance of the final network (in Zhou, p. 261, Section “6. Conclusions”, “In this paper, the relationship between the ensemble and its component neural networks is analyzed, which reveals that it may be a better choice to ensemble many instead of all the available neural networks.”, ¶ 1, lines 3 - 5, and “It seems that the success of GASEN mainly lies in that GASEN could reduce the bias as well as the variance.”, ¶ 3, lines 2 - 3).

Regarding claim 8, the rejection of claim 7 is incorporated and further:
Egorov/Rusu/Gangwani does not explicitly teach, but Zhou teaches, in an analogous system, wherein a performance of a combination is based on a quality of the combined … outputs generated during the training (p. 246, Section “3. Selective ensemble of neural networks”, “GASEN assigns a random weight to each of the available neural networks at first. Then it employs genetic algorithm to evolve those weights so that they can characterize to some extent the fitness of the neural networks in joining the ensemble. Finally it selects the networks whose weight is bigger than a pre-set threshold λ to make up the ensemble.” ¶ 5 last two lines to the first line on p. 247: the performance of the combination neural networks is based on the combined outputs generated during training and depends on the quality, i.e. the goodness of fit, of each neural network in joining the ensemble.).
It would have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to employ the use of training weighted ensemble general neural networks, as in Zhou, into the multi-agent deep reinforcement learning network of Egorov/Rusu/Gangwani, to train a final agent policy neural network, from a jointly trained collection of candidate agent policy neural networks, by decreasing the impact of training the candidate agent policy neural networks to generate action selection policies that are aligned as the weight for the final agent policy neural network is increased. The motivation behind the incorporation of the weighted ensemble formulation from Zhou is that it showed that it may be better to ensemble many, or at least some of the networks, instead of all of the available networks to improve performance as well as reducing bias together with variance of the final network (in Zhou, p. 261, Section “6. Conclusions”, “In this paper, the relationship between the ensemble and its component neural networks is analyzed, which reveals that it may be a better choice to ensemble many instead of all the available neural networks.”, ¶ 1, lines 3 - 5, and “It seems that the success of GASEN mainly lies in that GASEN could reduce the bias as well as the variance.”, ¶ 3, lines 2 - 3).
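For illustration only, the three steps quoted from Zhou above (random initial weights, evolution of the weights by a genetic algorithm, and selection of the networks whose weight exceeds a threshold λ) can be sketched as follows. This is a simplified sketch and not Zhou's implementation: the function names, the mean-squared-error fitness, the crossover and mutation operators, and all parameter values are assumptions chosen for the example.

```python
import random
from typing import List, Sequence


def ensemble_error(weights: Sequence[float],
                   predictions: Sequence[Sequence[float]],
                   targets: Sequence[float]) -> float:
    """Mean squared error of the normalized weighted ensemble (assumed fitness)."""
    total = sum(weights)
    error = 0.0
    for j, target in enumerate(targets):
        combined = sum(w / total * preds[j]
                       for w, preds in zip(weights, predictions))
        error += (combined - target) ** 2
    return error / len(targets)


def evolve_weights(predictions: Sequence[Sequence[float]],
                   targets: Sequence[float],
                   generations: int = 50,
                   population_size: int = 20,
                   seed: int = 0) -> List[float]:
    """Evolve one weight per network; returns weights normalized to sum to 1."""
    rng = random.Random(seed)
    n = len(predictions)

    def fitness(w: Sequence[float]) -> float:
        return ensemble_error(w, predictions, targets)

    # Step 1: assign a random weight to each available network.
    population = [[rng.random() + 1e-9 for _ in range(n)]
                  for _ in range(population_size)]
    # Step 2: evolve the weights (simplified genetic algorithm).
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n)
            # Gaussian mutation of one weight, clamped to stay positive.
            child[i] = min(1.0, max(1e-9, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = survivors + children
    best = min(population, key=fitness)
    total = sum(best)
    return [w / total for w in best]


def select_networks(weights: Sequence[float], threshold: float) -> List[int]:
    """Step 3: keep the networks whose weight is bigger than the threshold."""
    return [i for i, w in enumerate(weights) if w > threshold]


# Hypothetical validation data: three networks, four samples.
network_predictions = [[1.0, 2.0, 3.0, 4.0],
                       [1.1, 1.9, 3.2, 3.8],
                       [5.0, 5.0, 5.0, 5.0]]
validation_targets = [1.0, 2.0, 3.0, 4.0]
evolved = evolve_weights(network_predictions, validation_targets)
selected = select_networks(evolved, threshold=0.2)
```

The threshold of 0.2 is arbitrary here; the quoted passage states only that the threshold λ is pre-set.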

Regarding claim 9, the rejection of claim 7 is incorporated and further:
Egorov teaches wherein a performance of a combination is based only on a quality of policy outputs generated by the final agent policy neural network in the combination and not on policy outputs generated by the other agent policy neural networks in the combination. (p. 3, Section “3. Methods: 3.2. Multi-Agent Deep Reinforcement Learning”, “When multiple agents are interacting in an environment, their actions may directly impact the actions of other agents. To that end, agents must be able to reason about one another in order to act intelligently….After a set number of iterations the policy learned by the training agent gets distributed to all the other agents of its type. Specifically, an agent distributes its policy to all of its allies (see Figure 3). This process, allows one set of agents to incrementally improve their policy over time.”, left col., ¶ 3, lines 7 – 12: ‘one set of agents’ is interpreted as ‘the final agent policy neural network’ and the performance is based on the policy outputs generated by it and not on policy outputs generated by the other agent policy neural networks in the combination.)

Regarding claims 14, and 17-20,
Claims 14, and 17-20 recite a system comprising computers and storage devices for performing operations corresponding to the method steps recited in and addressed above in connection with claims 3, and 6-9 respectively. In addition, the computer and storage devices are inherent in Egorov’s disclosure of simulation of motions of the agents in Section 5 (“Results”). Therefore claims 14, and 17-20 are rejected under rationale similar to that set forth in connection with the rejection of claims 3, and 6-9 respectively above.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VASYL DYKYY whose telephone number is (571)270-5019.  The examiner can normally be reached on M-F 7:30 - 4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/V.D./Examiner, Art Unit 2122

/BABOUCARR FAAL/Primary Examiner, Art Unit 2184