Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on October 17, 2022 has been entered.

Remarks
This Office Action is in response to applicant’s amendment and RCE filed on October 17, 2022, under which claims 1-4, 6-9, 11-13, 15-21, and 23-24 are pending and under consideration.
 
Response to Arguments
Applicant’s amendments have overcome the previous claim objections and the previous § 103 rejections. Therefore, the previous objections and the previous § 103 rejections have been withdrawn. However, upon further consideration, new grounds of rejection have been made in view of new reference Fujiki, as set forth below.
Applicant’s arguments with respect to the previous § 103 rejection have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 9, 11-13, 15, and 16-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In claim 9, the term “the second output” lacks antecedent basis, because “second output” was removed from claim 1 in the previous claim amendment. Therefore, “the second output” is indefinite. 
In claims 11 and 16, the two instances of “the observation” have insufficient antecedent basis or unclear antecedent. With regard to the first instance of “the observation” in these claims, the claims do not recite “an observation” prior to reciting “the observation.” With regard to the second instance of “the observation” in these claims, it is unclear whether the second instance refers to the same thing as the first instance. The context suggests that the two are different observations, but the same term is used in both situations. This part of the rejection can be overcome by amending the claims to recite “observe a first environment of the task to obtain a first observation; and generate a first output based, in part, on the first observation” and “observe a second environment of the task to obtain a second observation; and generate a second output based, in part, on the second observation.” For purposes of examination, the term “the observation” has been interpreted in the manner of the suggested revision (i.e., the observation is the result of the respective act of observing).
Dependent claims 12-13, 15, and 17-21 are rejected due to their dependency on claim 11 or 16. These dependent claims incorporate the indefinite recitations of their parent claims without curing the deficiencies thereof. 

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 11-12, 15-17, and 19-21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Fujiki et al., “Adaptive Action Selection of Body Expansion Behavior in Multi-Robot System using Communication,” in Journal of Advanced Computational Intelligence and Intelligent Informatics, January 2007 (“Fujiki”).
As to claim 11, Fujiki teaches a computer-implemented method comprising: [§ 5: “we enabled to treat communication as intention transmission action in multi-robot system and also examined its performance by computer simulations.” This description of “computer simulations” indicates that the method is performed by a computer, i.e., “computer-implemented.”]
generating a plurality of agents, each agent associated with a different aspect of a task, wherein the task defines an environment and a set of environment actions that can be taken with respect to the environment, and wherein the task comprises at least a first subtask associated with a first aspect of the task and a second subtask associated with a second aspect of the task; [Abstract: “In multi-robot system, cooperation is needed to execute tasks efficiently… We also run some computer simulations of collision avoidance as an example of cooperative task.” The robots in the simulated system correspond to agents that perform a task of avoiding collision in order to reach a goal, as described in § 4.1, paragraph 1: “There are two omni-directional mobile robots in simulation field, and the task is collision avoidance,” and in § 2.3 (“distance from the goal g(t)”). That is, Fujiki teaches an overall task of collision avoidance of two robots that are aiming to reach respective goals. The task of getting a particular robot to the goal while avoiding collision is an aspect of this overall task, and is regarded as a “subtask.” Since there are two robots, the tasks performed by the robots are respectively a first subtask and a second subtask. With respect to the limitation of “an environment and a set of environment actions that can be taken with respect to the environment,” the robots operate in the environments shown in FIG. 4 (“Overview of environment”), which is also referred to as a “simulation field” or “simulation area” in § 4.1. This environment constitutes a space in which the robots move and is represented by the state space shown in Table 2. A set of environment actions is disclosed in Table 1 in § 2.2, which includes actions of speeding up/down and changing direction within the environment.]
using a first agent of the plurality of agents that is trained to: [The two robots discussed above respectively correspond to a “first agent” and a “second agent.” Furthermore, they are “trained” for the aforementioned task using reinforcement learning, which is described in § 2.1: “Reinforcement Learning…is widely used in robotic systems to emerge robots’ actions from the interaction between the environment.” See also § 3, paragraph 3: “In this paper, we use Q-Learning algorithm for SMDP and the Q values from the RL are used as numeric values for two step action selection. The implemented algorithm for the robot is shown in Figure 3.”]
	observe a first environment of the task; [The robots operate in the environments shown in FIG. 4, which is the “simulation field” or “simulation area” that is described in § 4.1 and is represented by the state space shown in Table 2. As disclosed in Table 2, each of the two robots observes a state space. Thus, each robot perceives a respective environment represented by a respective state space.] and 
	generate a first output based, in part, on the observation and a first reward function associated with the first subtask, [§ 2.3 teaches “action selection” in general, wherein an action a is selected from a set of actions A based on the Q values for each state-action pair, and the Q values are in turn based on the reward function r and the state s (i.e., the observation of the environment). See equations 1 and 2 in § 2.1, and the “Observe State s” step in FIG. 3. § 3 describes the particular action selection algorithm shown in FIG. 2: “For this action adjustment, we introduce the algorithm which is illustrated in Figure 2. First, a robot decides whether to move itself or to make other robot move by communication. This is a selfish action selection which doesn’t consider the state of other robot.” The action selected in the “Selfish Action Selection” step (see FIG. 3) constitutes a “first output.”] wherein the first reward function is configured to cause the first agent to follow some requests by a second agent and ignore some requests by the second agent; [§ 4.1, paragraph 4: “Reward for the robots are calculated by equation (4), but in case of any collisions, r = −5 is given as punishment value.” § 3, paragraph 1: “When communication is treated as an action for intention transmission, accepting all the requested actions will only to improve other robots’ situations. However, for the whole system, it is seems that most effective way is to accept the request only when the situations of both robots can be improved. To accept such requests, there is a need for action adjustment function to compare the actions which are self determined action and a requested one by communication.” § 3, paragraph 2: “…there is a probability that the request will be refused, but whether to accept or reject the request is determined by the receiver. 
Next, a robot will determine which action to make; the selfish action that is decided at first step or a requested action by other robot.” The limitation of “to cause” is met because all decisions of the agents are, in some way, caused by the reward function, since Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward. The instant claim does not require any specific manner of causation.] 
using the second agent of the plurality of agents that is trained to: [As noted above, the two robots respectively correspond to a “first agent” and a “second agent.” The two robots have similar characteristics. Therefore, the features discussed above for the one robot (the first agent) also apply to the other robot (the second agent).]
	observe a second environment of the task; [As noted above, the robots operate in the environments shown in FIG. 4, which is the “simulation field” or “simulation area” that is described in § 4.1 and is represented by the state space shown in Table 2. Additionally, as disclosed in Table 2, the two robots have different respective state spaces, since the state space of each robot takes into account the status of the other robot. Thus, each robot perceives a respective environment represented by a different respective state space. The Examiner notes that the instant application generally teaches a single environment that both agents observe, as shown in FIGS. 2 and 3. In light of the specification embodiments, the instant claim has been interpreted in the manner that the “first environment” and “second environment” can be the same environment in the general sense. Thus, the instant limitation is taught because both robots observe the same general environment shown in FIG. 4, or alternatively because they observe different environments in the sense of different respective state spaces.] and 
	generate a second output based, in part, on the observation and a second reward function associated with the second subtask; [This limitation is taught for the reasons given for the corresponding limitation for the first agent. In regards to “a second reward function,” see also § 4.1, paragraph 4: “Reward for the robots are calculated by equation (4), but in case of any collisions, r = −5 is given as punishment value.” That is, § 2.3 teaches that each robot has a reward function defined by equation (4), which depends on the “distance from the goal g(t).” As shown in FIG. 1 (lower-right quadrant) and FIG. 4, each robot has its own goal. Thus, the first and second robots (agents) have their respective first and second reward functions.]
based on the first output, causing the first agent to select an environment action of the set of environment actions associated with the first subtask by either following a request from the second agent for the environment action or ignoring a request from the second agent for another environment action. [§ 3, paragraph 2: “Next, a robot will determine which action to make; the selfish action that is decided at first step or a requested action by other robot. By those two steps, a robot can select an action considering a request from other robot.” As also stated in this part of the document, for any particular robot that receives a request, “there is a probability that the request will be refused, but whether to accept or reject the request is determined by the receiver.” Rejecting the request in this context corresponds to “ignoring” the request. Table 1 generally shows a set of environmental actions that a robot can take (e.g., speed up/down or change direction). The right side of this table shows that the request by a robot is likewise an environment action that the other robot is requested to take. In summary, as shown in FIGS. 2-3, based on the previously determined self action, the robot (agent) then selects a final action, which may be an environment action (i.e., “an environment action of the set of environment actions”), a “Move Own Body” action as shown in FIG. 2 (top left) and Table 1 (left side). This final “Move Own Body” action may be an action requested by the other robot (Table 1, right side) that is performed by following the request of the other robot, or a selfishly determined “Move Own Body” action that is performed in the manner of ignoring the request from the other robot to perform the requested action. 
In regards to “based on the first output,” the selection of the final action to be executed aggregates the selected selfish action (the first output) and the requested action, and is therefore considered to be based on the selected selfish action.]
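For illustration only, the two-step action selection relied upon above (a selfish selection followed by a total selection that either follows or ignores a request) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Fujiki: the Boltzmann (softmax) selection rule, the `temperature` parameter, and the names `boltzmann_choice` and `select_action` are hypothetical and are used only to show how Q-values could drive both steps.

```python
import math
import random
from typing import Dict, Hashable, List, Optional, Tuple

QTable = Dict[Tuple[Hashable, str], float]


def boltzmann_choice(q_values: List[float], temperature: float,
                     rng: random.Random) -> int:
    """Return an index sampled with probability proportional to exp(Q / T).

    ASSUMPTION: softmax selection is used here for illustration only.
    """
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


def select_action(q: QTable, state: Hashable, own_actions: List[str],
                  requested: Optional[str], temperature: float = 1.0,
                  rng: Optional[random.Random] = None) -> Tuple[str, bool]:
    """Return (final_action, followed_request) for one decision step."""
    rng = rng or random.Random()

    # Step 1 (selfish selection): choose among the agent's own actions
    # without considering any request from the other agent.
    own_q = [q.get((state, a), 0.0) for a in own_actions]
    selfish = own_actions[boltzmann_choice(own_q, temperature, rng)]
    if requested is None:
        return selfish, False

    # Step 2 (total selection): weigh the selfish action against the
    # requested action; the receiver may follow or ignore the request.
    candidates = [selfish, requested]
    cand_q = [q.get((state, a), 0.0) for a in candidates]
    idx = boltzmann_choice(cand_q, temperature, rng)
    return candidates[idx], idx == 1
```

In this sketch, a request whose Q-value is high relative to the selfish action is likely to be followed, and a request whose Q-value is low is likely to be ignored, so the same learned values cause the receiver to follow some requests and ignore others.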

As to claim 12, Fujiki teaches the computer-implemented method of claim 11, wherein each output is selected from a set of outputs defined for each agent. [Fujiki, § 2.2, Table 1 teaches the set of actions that each robot can take. The outputs (first output and second output) are actions selected from among this set.]

As to claim 15, Fujiki teaches the computer-implemented method of claim 11, wherein the plurality of agents are non-cooperative. [The limitation of “non-cooperative” has been interpreted to be satisfied by the situation in which the agents have different reward functions, since paragraph 62 of the specification describes a “cooperative” task as one in which agents share the same reward function, while the category of “neither cooperative nor competitive” encompasses “mixed tasks” where the reward functions are not the same. It is also noted that the present context of the claim indicates that “non-cooperative” does not exclude cooperation between agents in the form of communication and the following of requests. Here, § 2.3 teaches that each robot has a reward function defined by equation (4), which depends on the “distance from the goal g(t).” As shown in FIG. 1 (lower-right quadrant) and FIG. 4, each robot has its own goal. Since each robot has a different respective reward function that accounts for the distance to the respective goal of the respective robot, the robots are considered to be “non-cooperative.”]

As to claim 16, this claim is directed to a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations that are substantially the same as those recited in claim 11. Therefore, the rejection made to claim 11 is applied to claim 16.
Additionally, Fujiki teaches “a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform…” because it is understood that the method of Fujiki is performed using a computer, as discussed above (see Fujiki, § 5: “we enabled to treat communication as intention transmission action in multi-robot system and also examined its performance by computer simulations.”). It is furthermore understood that such a computer executes instructions stored on a non-transitory computer readable medium to perform the method. Therefore, the instant limitations are implicitly disclosed.

As to claim 17, Fujiki teaches the non-transitory computer readable medium of claim 16, wherein the environment action is selected from a subset of the set of environment actions. [Fujiki, Table 1 in § 2.2, teaches a set of environment actions of speeding up/down and changing direction within the environment. Selection of an action from the entire set also constitutes selection from an arbitrary subset of that set.]

As to claim 19, Fujiki teaches the non-transitory computer readable medium of claim 16, wherein the instructions further cause the processor to perform the chosen environment action. [§ 3, FIG. 3: “Execute Selected Action a” step, which executes the action (the chosen environment action) selected in the previous step.] 

As to claim 20, the further limitations recited in this claim are the same or substantially the same as those recited in claim 15. Therefore, the rejection made to claim 15 is applied to claim 20. 

As to claim 21, the further limitations recited in this claim are the same or substantially the same as those recited in claim 12. Therefore, the rejection made to claim 12 is applied to claim 21. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1.	Claims 1-4 and 23-24 are rejected under 35 U.S.C. § 103 as being unpatentable over Fujiki in view of Ghavamzadeh et al., “Hierarchical multi-agent reinforcement learning,” Auton Agent Multi-Agent Sys (2006) 13: 197–229 (“Ghavamzadeh”).
As to claim 1, Fujiki teaches a method comprising: 
receiving a […] task having a set of environment actions, wherein the task comprises a first subtask associated with one aspect of the task and a second subtask associated with another aspect of the task; [Abstract: “In multi-robot system, cooperation is needed to execute tasks efficiently… We also run some computer simulations of collision avoidance as an example of cooperative task.” The robots in the system correspond to agents that seek to avoid collision in order to reach a goal, as further described in § 4.1, paragraph 1: “There are two omni-directional mobile robots in simulation field, and the task is collision avoidance,” and in § 2.3 (“distance from the goal g(t)”). That is, Fujiki teaches an overall task of collision avoidance of two robots that are aiming to reach respective goals. The task of getting a particular robot to the goal while avoiding collision is an aspect of this overall task, and is regarded as a subtask. Since there are two robots, Fujiki therefore teaches a first subtask and a second subtask. A set of environment actions is disclosed in Table 1 in § 2.2, which includes actions of speeding up/down and changing direction.]
instantiating a plurality of non-cooperating agents comprising at least a first agent assigned to a first environment and a second agent assigned to a second environment [§ 4.1, paragraph 1: “There are two omni-directional mobile robots in simulation field, and the task is collision avoidance.” Thus, the two robots discussed above respectively correspond to a “first agent” and a “second agent.” These agents are “trained” for the aforementioned task using reinforcement learning, which is described in § 2.1: “Reinforcement Learning…is widely used in robotic systems to emerge robots’ actions from the interaction between the environment.” See also § 3, paragraph 3: “In this paper, we use Q-Learning algorithm for SMDP and the Q values from the RL are used as numeric values for two step action selection. The implemented algorithm for the robot is shown in Figure 3.” With respect to the limitation of “non-cooperating,” this term has been interpreted to be satisfied by the situation in which the agents have different reward functions, since paragraph 62 of the specification describes a “cooperative” task as one in which agents share the same reward function, while the category of “neither cooperative nor competitive” encompasses “mixed tasks” where the reward functions are not the same. It is also noted that the present context of the claim indicates that “non-cooperating” does not exclude cooperation between agents in the form of communication and the following of requests. Here, § 2.3 teaches that each robot has a reward function defined by equation (4), which depends on the “distance from the goal g(t).” As shown in FIG. 1 (lower-right quadrant) and FIG. 4, each robot has its own goal. Since each robot has a different respective reward function that accounts for the distance to the respective goal of the respective robot, the robots are considered to be “non-cooperating” (or non-cooperative). 
With respect to the limitations of “assigned to a first environment” and “assigned to a second environment,” the robots operate in the environments shown in FIG. 4 (“Overview of environment”), which is also referred to as a “simulation field” or “simulation area” in § 4.1. This environment constitutes a space in which the robots move and is represented by the state space shown in Table 2. The Examiner notes that the instant application generally teaches a single environment to which both agents are assigned, as shown in FIGS. 2 and 3. Accordingly, the instant claim has been interpreted such that “first environment” and “second environment” can be the same environment in the general sense, as such would be consistent with the specification embodiments. Additionally, as disclosed in Table 2, the two robots have different state spaces, since the state space of each robot takes into account the status of the other robot. Thus, each robot perceives a respective environment represented by a different respective state space. That is, the instant limitations are taught because both robots observe the same general environment shown in FIG. 4, or alternatively because they are assigned to different environments in the sense of different respective state spaces.] wherein the first agent is trained to choose a first output of a first defined output set based at least on a first reward function [§ 2.3 teaches “action selection” in general, wherein an action a is selected from a set of actions A based on the Q values for each state-action pair, and the Q values are in turn based on the reward function r (see equations 1 and 2 in § 2.1). § 3 describes the particular action selection algorithm shown in FIG. 2: “For this action adjustment, we introduce the algorithm which is illustrated in Figure 2. First, a robot decides whether to move itself or to make other robot move by communication. 
This is a selfish action selection which doesn’t consider the state of other robot.” This set of possible actions in the selfish action selection process (see also FIG. 3), or any subset thereof, corresponds to a “first defined output set” and is illustrated in Table 1, and the selected action constitutes a “first output.”]
configuring the first reward function to cause the first agent to follow some requests by the second agent and ignore some requests by the second agent; [§ 4.1, paragraph 4: “Reward for the robots are calculated by equation (4), but in case of any collisions, r = −5 is given as punishment value.” § 3, paragraph 1: “When communication is treated as an action for intention transmission, accepting all the requested actions will only to improve other robots’ situations. However, for the whole system, it is seems that most effective way is to accept the request only when the situations of both robots can be improved. To accept such requests, there is a need for action adjustment function to compare the actions which are self determined action and a requested one by communication.” § 3, paragraph 2: “…there is a probability that the request will be refused, but whether to accept or reject the request is determined by the receiver. Next, a robot will determine which action to make; the selfish action that is decided at first step or a requested action by other robot.” The limitation of “to cause” is met because all decisions of the agents are, in some way, caused by the reward function, since Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward. The instant claim does not require any specific manner of causation.] 
based on the configured first reward function, determining the first output according to the first reward function of the first defined output set; [As noted above, the “Selfish Action Selection” step in FIG. 3, which selects a selfish action (i.e., a first output), is based on the reward function.] and 
based on the first output, causing the first agent to select an environment action of the set of environment actions associated with the first subtask by either following a request from the second agent for the environment action or ignoring a request from the second agent for another environment action. [§ 3, paragraph 2: “Next, a robot will determine which action to make; the selfish action that is decided at first step or a requested action by other robot. By those two steps, a robot can select an action considering a request from other robot.” As also stated in this part of the document, for any particular robot that receives a request, “there is a probability that the request will be refused, but whether to accept or reject the request is determined by the receiver.” Rejecting the request in this context corresponds to “ignoring” the request. Table 1 generally shows a set of environmental actions that a robot can take (e.g., speed up/down or change direction). The right side of this table shows that the request by a robot is likewise an environment action that the other robot is requested to take. In summary, as shown in FIGS. 2-3, based on the previously determined self action, the robot (agent) then selects a final action, which may be an environment action (i.e., “an environment action of the set of environment actions”), a “Move Own Body” action as shown in FIG. 2 (top left) and Table 1 (left side). This final “Move Own Body” action may be an action requested by the other robot (Table 1, right side) that is performed by following the request of the other robot, or a selfishly determined “Move Own Body” action that is performed in the manner of ignoring the request from the other robot to perform the requested action. 
In regards to “based on the first output,” the selection of the final action to be executed aggregates the selected selfish action (the first output) and the requested action, and is therefore considered to be based on the selected selfish action.] 
Fujiki does not explicitly teach the limitation that the overall “task” is specifically a “single-agent task.” While the individual robots can each be considered to be performing a part of an overall task of collision avoidance, the overall task is not explicitly formulated or otherwise described as a “single agent task.”
Ghavamzadeh, in an analogous art, teaches the above limitation. Ghavamzadeh generally relates to multi-agent reinforcement learning (title and abstract), and is therefore in the same field of endeavor as the claimed invention. 
In particular, Ghavamzadeh teaches a “single-agent task” that comprises multiple subtasks [§ 3, paragraph 1: “Our hierarchical multi-agent RL framework can be viewed as extending the existing single-agent HRL methods, including hierarchies of abstract machines.” § 3.1, paragraph 2: “the overall task is decomposed into a collection of primitive actions and temporally extended (non-primitive) subtasks that are important for solving the problem.” § 3.1, paragraph 1: “decomposing the overall task MDP M, into a finite set of subtasks.” In other words, Ghavamzadeh teaches multiple agents performing subtasks of a task that can also be regarded as a single-agent task.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki with the teachings of Ghavamzadeh by modifying the overall task in Fujiki to be a single-agent task (which then comprises subtasks). The motivation for doing so would have been to learn a single-agent task that is decomposable into multiple subtasks in a way that facilitates effective learning of skills, as suggested by Ghavamzadeh, page 201, § 3.1 (“Motivating Example”), paragraph 2: “The strength of the HRL methods (when extended to the multi-agent domains) is that they can serve as a substrate for efficiently learning all these three types of skills. In these methods, the overall task is decomposed into a collection of primitive actions and temporally extended (non-primitive) subtasks that are important for solving the problem.” 

As to claim 2, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, wherein the first defined output set of the first agent of the plurality of non-cooperating agents comprises a second output associated with a communication action of the first agent. [Fujiki, § 2.2, Table 1 teaches that the actions of a particular robot (which would be among the actions selected in the “Selfish Action Selection” process) include an action of communicating with the other robot to make a request. These actions are described in § 2.2, paragraph 2: “Considering communication as robots’ action, basic actions for robots are set as Table 1. Here, ‘Communication’ means intention transmission, which is a requesting action to other robot to make an asked action. This means that a robot can request any actions which the other robot can make.”]

As to claim 3, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, wherein the second defined output set of the second agent of the plurality of non-cooperating agents comprises only outputs associated with communication actions. [Fujiki, § 2.2, Table 1 teaches that the actions of a particular robot (which would be among the actions selected in the “Selfish Action Selection” process) include an action of communicating with the other robot to make a request. These actions are described in § 2.2, paragraph 2: “Considering communication as robots’ action, basic actions for robots are set as Table 1. Here, ‘Communication’ means intention transmission, which is a requesting action to other robot to make an asked action. This means that a robot can request any actions which the other robot can make.” That is, the other robot (i.e., the second agent) may select an action that is a “communication action.” The subset of actions in Table 1 that are communication actions may be regarded as the “second defined output set.”]

As to claim 4, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, wherein the first defined output set of the first agent of the plurality of non-cooperating agents comprises only outputs associated with environment actions. [As noted in the rejection of claim 1, Fujiki, Table 1 teaches environment actions. The subset of actions in Table 1 that are environment actions may be regarded as the “first defined output set.”]

As to claim 23, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, further comprising: 
determining the first output according to the first reward function based upon a communication reward associated with a communication action of the second agent, wherein the communication action is a request from the second agent. [Fujiki, § 2.3 and equation (3) teach that an action a is selected based on the Q-value Q(S,a) and the Q-values of the other actions (i.e., other actions represented by “a_i ∈ A” in equation 3). Therefore, when the requested action is evaluated against the selfish action in the “Total Action Selection” step in FIG. 3, the Q-value of the requested action includes a “communication reward” associated with the action requested by the second robot and, by extension, also associated with the communication action of the second robot. As shown in Table 1 of Fujiki, a communication action includes a transmission of a request from one robot to another. The Examiner notes that while specification embodiments mention a fixed additional reward for following a request, the instant claim is not so limited, and that “communication reward” has been given its broadest reasonable interpretation.]

As to claim 24, the combination of Fujiki and Ghavamzadeh teaches the method of claim 23, wherein the environment action is selected with higher dependence on the communication action of the second agent when the communication reward is higher, and wherein the environment action is selected with lower dependence on the communication action of the second agent when the communication reward is lower. [In accordance with Fujiki, § 2.3 and equation (3), an action with a higher Q-value would be selected at a greater probability. Therefore, when the robot is deciding between the selfish action and the requested action, the final selected action is more likely to be the requested action (i.e., more dependent on the communication from the other robot) when the associated Q-value is higher, and less likely to be the requested action (i.e., less dependent on the communication of the second robot) when the associated Q-value is lower.]
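
For illustration only, the relationship between Q-values and selection probability relied upon in the rejection of claim 24 can be sketched as follows. The sketch assumes a Boltzmann (softmax) form of selection; the exact form of Fujiki equation (3) is not quoted in this Office Action, and the function name, the temperature parameter, and the numeric Q-values are hypothetical.

```python
import math

def selection_probabilities(q_values, temperature=1.0):
    """Assumed Boltzmann rule: P(a) = exp(Q(S,a)/T) / sum_i exp(Q(S,a_i)/T)."""
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical Q-values for the selfish action and the requested action.
low = selection_probabilities({"selfish": 1.0, "requested": 0.5})
high = selection_probabilities({"selfish": 1.0, "requested": 2.0})
```

Under this assumption, the requested action receives a larger selection probability in `high` than in `low`, which corresponds to the higher and lower dependence on the communication action discussed above.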

2.	Claim 6 is rejected under 35 U.S.C. § 103 as being unpatentable over Fujiki in view of Ghavamzadeh and further in view of Dolgov et al., “Graphical models in local, asymmetric multi-agent Markov decision processes,” Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004., New York, NY, USA, 2004, pp. 956-963. (“Dolgov”) and Yurchenko et al. (US 2017/0154123A1) (“Yurchenko”).
As to claim 6, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, and the limitation of “non-cooperating agents” as set forth in the rejection of claim 1, but does not teach the method further comprising the additionally recited operations.
Dolgov, in an analogous art, teaches “a cyclic relationship within the plurality of non-cooperating agents” and “an acyclic relationship.” Dolgov generally relates to graphical models for multi-agent Markov decision processes (title) and is therefore in the field of machine learning, particularly reinforcement learning.
In particular, Dolgov teaches a cyclic relationship within the plurality of agents [§ 5.2: “Cyclic Dependency Graphs”] and an acyclic relationship [§ 5.1: “Acyclic Dependency Graphs”]. Dolgov teaches that cyclic and acyclic relationships are known examples of dependency relationships between agents in a multiagent Markov decision process (§ 1, especially paragraph 3). Therefore, Dolgov generally teaches that the dependency relationship of agents can be suitably represented using cyclic or acyclic graphs, depending on the particular application.
Furthermore, Dolgov teaches that the acyclic relationships have a further benefit of lower complexity (§ 7: “the complexity of solutions to be less prohibitive in some cases (acyclic dependency graphs)”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki and Ghavamzadeh with the teachings of Dolgov by implementing, for the agents of Fujiki, a cyclic relationship within the plurality of agents and an acyclic relationship, in order to utilize dependency relationships between agents in a multiagent Markov decision process that are suitable for the particular application, as suggested by Dolgov.
Yurchenko, in an analogous art, teaches “determining that there is a cyclic relationship” and “responsive to determining that there is a cyclic relationship, converting the cyclic relationship into an acyclic relationship.” Yurchenko generally relates to data processing involving graph relationships (abstract). Yurchenko is in the same field of endeavor or is pertinent to reinforcement learning.
In particular, Yurchenko teaches determining that there is a cyclic relationship [abstract: “determine a graph of nodes and edges”; see also [0027], [0032] and FIG. 3, step 210 (process the directed graph for cycles)]; and responsive to determining that there is a cyclic relationship, converting the cyclic relationship into an acyclic relationship [abstract: “converting, by the processor of the computer, the graph from a cyclic graph to an acyclic graph”; see also [0027], [0033] and FIG. 3, step 218]. Yurchenko teaches that conversion of cyclic relationships to acyclic relationships may be used to determine an object sequence for processing techniques ([0012]: “The data processing system processes this metadata to determine a sequence of the objects defined by the metadata. This object sequence can be used…for…processing techniques.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki, Ghavamzadeh, and Dolgov with the teachings of Yurchenko by performing the operations of: (1) determining that there is a cyclic relationship within the plurality of agents; and (2) responsive to determining that there is a cyclic relationship, converting the cyclic relationship into an acyclic relationship, as taught by Yurchenko, in order to obtain a sequence of agents for processing techniques, as suggested by Yurchenko ([0012], quoted above), and to reduce the complexity of the relationship of agents, as suggested by Dolgov (§ 7, quoted above).
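
For illustration only, the two operations of determining that a cyclic relationship exists and converting it into an acyclic relationship can be sketched on a directed dependency graph. Yurchenko's actual conversion procedure is not reproduced in this Office Action; the sketch below substitutes a generic depth-first search that removes back edges, and the function names and the three-agent graph are hypothetical.

```python
def find_back_edges(graph):
    """Return the edges (u, v) that close a cycle during a depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    back_edges = []

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:       # v is an ancestor on the path: cycle found
                back_edges.append((u, v))
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    for node in graph:
        if color[node] == WHITE:
            visit(node)
    return back_edges

def to_acyclic(graph):
    """Drop every back edge, leaving a directed acyclic graph."""
    back = set(find_back_edges(graph))
    return {u: [v for v in vs if (u, v) not in back] for u, vs in graph.items()}

# Hypothetical dependency graph: agent A depends on B, B on C, and C on A.
agents = {"A": ["B"], "B": ["C"], "C": ["A"]}
is_cyclic = bool(find_back_edges(agents))
acyclic_agents = to_acyclic(agents)
```

In this example the edge from C to A is identified as the edge closing the cycle and is removed, so that the remaining graph admits an ordering of the agents.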

3.	Claims 7-8 are rejected under 35 U.S.C. § 103 as being unpatentable over Fujiki in view of Ghavamzadeh, Dolgov, and Yurchenko, and further in view of Thomas et al., “Conjugate Markov decision processes.” Proceedings of the Twenty-Eighth International Conference on Machine Learning, June 28, 2011 (8 pages) (“Thomas”) (Cited by applicant in one of the information disclosure statements filed on June 29, 2017).
As to claim 7, the combination of Fujiki, Ghavamzadeh, Dolgov, and Yurchenko teaches the method of claim 6 and the limitation of “non-cooperating agents” as set forth in the rejection of claim 6, but does not teach that converting the cyclic relationship into an acyclic relationship comprises the further limitations of the instant claim.
Thomas, in an analogous art, teaches the further limitations. Thomas generally relates to Markov decision processes (see title) that involve multiple agents (see § 5) and are suitable for reinforcement learning (see § 1). Therefore, Thomas is in the field of machine learning and is pertinent to reinforcement learning.
In particular, Thomas teaches “instantiating at least two trainer agents, each trainer agent associated with an agent of the plurality of agents” [§ 5, paragraph 3, disclosing “an approximation to grouped coordinate ascent for on-line methods in which the partial optimizations each run for k steps of M. For large k, this approaches grouped coordinate ascent. When k = 1 the agents take turns training every other time step. We define k = 0 to mean that both agents train during every time step of M” (emphasis added). It is noted that the limitation of “trainer agent” is met by any process that trains an agent, such being disclosed by Thomas as part of its approximate coordinate ascent method, whose algorithm is shown in Algorithm 1 of Thomas.]. Thomas teaches that the agent training (and its approximate coordinate ascent method in general) may be used to solve multivariate optimization in a practical manner (§ 4 and § 5, second paragraph), and to search for a mapping that is used by an agent solving a Markov decision process (§ 9). Note that the agents/coagents in Thomas are analogous to the agents in Fujiki.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki, Ghavamzadeh, Dolgov, and Yurchenko with the teachings of Thomas by implementing the approximate coordinate ascent method taught in Thomas for the agents of Fujiki, so as to result in modifying the operation of “converting the cyclic relationship into the acyclic relationship” to further include “instantiating at least two trainer agents, each trainer agent associated with an agent of the plurality of agents,” in order to solve multivariate optimization in a practical manner and to search for a mapping that is used by an agent solving a Markov decision process, as suggested by Thomas (§ 4, § 5, second paragraph, and § 9).

As to claim 8, the combination of Fujiki, Ghavamzadeh, Dolgov, Yurchenko, and Thomas teaches the method of claim 7, further comprising:
pre-training agents having a trainer agent with their respective trainer agents; [Thomas, Algorithm 1 (Approximate Coordinate Ascent). In lines 13 and 16, policy-determining parameters θA and θC of agents A and C are trained/pre-trained. Note that the algorithm is iterative; therefore, training in one iteration may be regarded as pre-training in the context of a subsequent iteration.]
after pre-training, freezing weights of the pre-trained agents; [As illustrated in Algorithm 1. For example, in the case of k = 1 (i.e., the agents take turns training as described in § 5, paragraph 3) and two agents, the θ trained in the prior iteration is fixed (frozen) in the subsequent iteration for purposes of determining u’ or a’ (lines 10-11). In the case of k = 0, both θA and θC trained in the previous iteration are fixed for purposes of determining u’ and a’ in the subsequent iteration. Note that the approximate coordinate ascent algorithm described in Thomas is based on grouped coordinate ascent, in which the variables are partitioned into two disjoint subsets, one of which is fixed while the objective function is maximized over the other (see § 4, first sentence).] and
after freezing the weights, training additional agents of the plurality of non-cooperating agents. [While Algorithm 1 shows the example of two agents (agent A and coagent C), Thomas teaches “adding additional coagents” (§ 6, near the end of the section). Additional agents would be trained after freezing the weights of the two agents. Section 6, paragraphs 4-6 also disclose examples of a larger number of coagents. For example, in the case of 1500 coagents trained simultaneously using k = 0 (§ 6, paragraph 6), it is understood that additional coagents are trained after freezing θA and θC.]
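
For illustration only, the grouped coordinate ascent idea referenced above (one subset of parameters held fixed while the objective is maximized over the other) can be sketched as follows. This is a generic sketch and not Algorithm 1 of Thomas; the objective function, the gradient functions, the step size, and all names are hypothetical.

```python
def grad_a(theta_a, theta_c):
    """Partial derivative with respect to theta_a of the hypothetical objective
    J = -(theta_a - 1)^2 - (theta_c + 2)^2 - 0.5 * theta_a * theta_c."""
    return -2.0 * (theta_a - 1.0) - 0.5 * theta_c

def grad_c(theta_a, theta_c):
    """Partial derivative of the same objective with respect to theta_c."""
    return -2.0 * (theta_c + 2.0) - 0.5 * theta_a

def alternating_ascent(theta_a, theta_c, rounds=50, k=5, lr=0.1):
    """Alternate k gradient steps per parameter group, freezing the other group."""
    for _ in range(rounds):
        for _ in range(k):   # train A while theta_c is frozen
            theta_a += lr * grad_a(theta_a, theta_c)
        for _ in range(k):   # train C while theta_a is frozen
            theta_c += lr * grad_c(theta_a, theta_c)
    return theta_a, theta_c

theta_a, theta_c = alternating_ascent(0.0, 0.0)
```

In each round, the parameters trained in the preceding phase are held fixed while the other parameters are updated, which is the sense of "freezing" relied upon in the mapping above.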

4.	Claim 9 is rejected under 35 U.S.C. § 103 as being unpatentable over Fujiki in view of Ghavamzadeh and further in view of Wiering et al., "Ensemble Algorithms in Reinforcement Learning," in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 930-936, Aug. 2008, doi: 10.1109/TSMCB.2008.920231 (“Wiering”).
As to claim 9, the combination of Fujiki and Ghavamzadeh teaches the method of claim 1, but does not teach the further limitations of the instant claim.
Wiering, in an analogous art, teaches the further limitations of the claim. Wiering generally relates to algorithms for reinforcement learning (title), and is therefore in the field of machine learning.
In particular, Wiering teaches “aggregating the first output and the second output, wherein aggregating the first output and the second output comprises using a technique selected from the group consisting of: majority voting, rank voting, and Q-value generalized means maximizer” [Abstract: “The aim is to enhance learning speed and final performance by combining the chosen actions…The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms” (emphasis added). In regards to these methods, Wiering further teaches: “The MV method combines the best action of each algorithm and bases its final decision on the number of times an action is preferred by each algorithm; 2) the rank voting (RV) method lets each algorithm rank the different actions and combines these rankings to select a final action” (first page, right column, top paragraph). See also § III.1-2. Note that the “actions” that are being combined as described above are considered to be analogous to the limitations of “first output” and “second output.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki and Ghavamzadeh with the teachings of Wiering by performing the operation of “aggregating the first output and the second output, wherein aggregating the first output and the second output comprises using a technique selected from the group consisting of: majority voting, rank voting, and Q-value generalized means maximizer,” in order to aggregate different reinforcement learning algorithms that learn separate value functions and policies, as suggested by Wiering (§ III, paragraph 1), particularly in a manner that combines the best action of each individual algorithm or lets each algorithm rank the different actions (Wiering, first page, right column, top paragraph, quoted above).  
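
For illustration only, majority voting and rank voting as characterized in the quoted passages of Wiering can be sketched as follows. The sketch is a simplified, deterministic reading of those passages (Wiering's full formulation is not reproduced in this Office Action), and the function names, action names, and Q-values are hypothetical.

```python
from collections import Counter

def majority_vote(q_tables):
    """Each algorithm votes for its best action; the most-preferred action wins."""
    votes = Counter(max(q, key=q.get) for q in q_tables)
    return votes.most_common(1)[0][0]

def rank_vote(q_tables):
    """Each algorithm ranks all actions; the ranks are summed across algorithms."""
    scores = Counter()
    for q in q_tables:
        ranked = sorted(q, key=q.get)          # worst action first
        for rank, action in enumerate(ranked):
            scores[action] += rank             # better actions earn higher ranks
    return scores.most_common(1)[0][0]

# Hypothetical Q-values produced by three learners for the same state.
q_tables = [
    {"left": 0.9, "right": 0.5, "stay": 0.1},
    {"left": 0.8, "right": 0.6, "stay": 0.2},
    {"left": 0.1, "right": 0.7, "stay": 0.9},
]
mv_action = majority_vote(q_tables)
rv_action = rank_vote(q_tables)
```

Here two of the three learners prefer "left", so majority voting selects it; rank voting also selects "left" because its summed rank (4) exceeds that of "right" (3) and "stay" (2).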

5.	Claims 13 and 18 are rejected under 35 U.S.C. § 103 as being unpatentable over Fujiki in view of Wiering.
As to claim 13, Fujiki teaches the computer-implemented method of claim 11, as set forth in the rejection of claim 11, above, but does not teach the further limitation of “wherein selecting the environment action comprises using a technique selected from the group consisting of: majority voting, rank voting, and Q-value generalized means maximizer.”
Wiering, in an analogous art, teaches the above limitations. Wiering relates to algorithms for reinforcement learning (title), and is therefore in the field of machine learning.
In particular, Wiering teaches “using a technique selected from the group consisting of: majority voting, rank voting, and Q-value generalized means maximizer” [Abstract: “The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms”; see also § III.1-2]. Wiering also teaches: “The MV method combines the best action of each algorithm and bases its final decision on the number of times an action is preferred by each algorithm; 2) the rank voting (RV) method lets each algorithm rank the different actions and combines these rankings to select a final action” (first page, right column, top paragraph). See also § III.1-2.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Fujiki with the teachings of Wiering by modifying Fujiki such that selecting the environment action comprises “using a technique selected from the group consisting of: majority voting, rank voting, and Q-value generalized means maximizer,” in order to aggregate different reinforcement learning algorithms that learn separate value functions and policies, as suggested by Wiering (§ III, paragraph 1), particularly in a manner that combines the best action of each individual algorithm or lets each algorithm rank the different actions (Wiering, first page, right column, top paragraph, quoted above).  

As to claim 18, the further limitations of this claim are the same or substantially the same as those recited in claim 13. Therefore, the rejection made to claim 13 is applied to claim 18.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Brys et al., “Multi-Objectivization of Reinforcement Learning Problems by Reward Shaping,” 2014 International Joint Conference on Neural Networks (IJCNN) July 6-11, 2014, Beijing, China teaches the use of additional rewards to control reinforcement learning.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Y.D.H./Examiner, Art Unit 2124                                                                                                                                                                                                        

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124