DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 1 and 3-8 remain pending. Claims 9-13 have been added. Claim 2 has been cancelled.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 6-7 and 12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 6 depends on claim 2, but claim 2 has been cancelled. Therefore, the claim is indefinite because it is not clear to the examiner what the entire scope of the claim is. 
Claim 7 recites the limitation "the occupancy measures" in line 12.  There is insufficient antecedent basis for this limitation in the claim. It is unclear what occupancy measure applicant is referring to. Therefore, the claim is indefinite. 

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1 and 7-13 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mukadam et al. (US 20190113929 A1) (Hereinafter referred to as Mukadam).

Regarding Claim 1, Mukadam discloses a method of determining a movement policy for controlling a robot (See at least Mukadam Paragraphs 0013 and 0029-0032, the system generates a policy for determining the next action of the autonomous vehicle, which is interpreted as determining the movement policy for controlling a robot), comprising: 
acquiring sensor data representing an environment of the robot (See at least Mukadam Paragraph 0055, a sensor is used to detect the environment); 
identifying one or more objects in the environment of the robot from the sensor data (See at least Mukadam Paragraph 0055, the sensor detects other vehicles, which is interpreted as one or more objects); 
associating the robot and each of the one or more objects with a respective agent of a multiagent system of multiagent reinforcement learning (See at least Mukadam Paragraphs 0002, 0032-0034 and Figure 4b, the autonomous vehicle and the other vehicles are interpreted as agents in a multiagent system; See at least Mukadam Paragraphs 0034-0035 and 0039, the system uses a Q-learning system which implements reinforcement learning by simulating the actions of the agents in the multiagent system, which is interpreted as multiagent reinforcement learning); 
determining, for each agent of the multiagent system, a Q-function which includes a reward term for a movement action at a position (See at least Mukadam Paragraphs 0034-0036 and 0057, the system uses Q-masking and Q-learning, which is interpreted as a Q-function to simulate the actions of the other vehicles and the autonomous vehicle, which is interpreted as determining a Q-function for each agent; See at least Mukadam Paragraphs 0041-0042 and 0044, the terminal reward is interpreted as a reward term), an expectation term (See at least Mukadam Paragraphs 0042 and 0059, the expected reward is interpreted as an expectation term), and a coupling term (See at least Mukadam Paragraphs 0064 and 0068 and Figure 3a, the occupancy grid is interpreted as a coupling term), wherein the reward term and the expectation term are independent of the other agents (See at least Mukadam Paragraphs 0041-0042, the reward terms and expectation terms are goals for the respective agent, which are independent of other agents), and wherein the coupling is a function of occupancy measures of the other agents, wherein, for each agent, the occupancy measure for a position and a time denotes a likelihood of the agent being in the position at the time (See at least Mukadam Paragraphs 0037-0038 and 0064, the occupancy grid is a probabilistic grid with position information and that uses the occupancy at previous time intervals, which is interpreted as a function of occupancy measure for a position that denotes the likelihood of the position of the agent; See at least Mukadam Paragraphs 0034 and 0036, the simulations are carried out in time intervals, which is interpreted as an occupancy measure for a time);
determining the movement policy of the robot using multiagent reinforcement learning (See at least Mukadam Paragraphs 0036, 0039, and 0056-0057, the system is trained using the multiagent reinforcement learning, and determines the policy based on the remaining set of actions), wherein the movement policy selects movement actions with a higher value of the Q-function determined for the robot with higher probability than movement actions with a lower value of the Q-function (See at least Mukadam Paragraph 0072, the trajectory with the higher reward, which is interpreted as higher value of the Q-function, is selected; See at least Mukadam Paragraph 0041, the reward function scores actions that result in collision negatively, which means the higher rewards have a higher probability of no collision).

Regarding Claim 7, Mukadam discloses a robot controller configured to control a robot (See at least Mukadam Paragraphs 0030-0031 and Figure 1, the autonomous vehicle policy generation system is interpreted as the robot controller which controls an autonomous vehicle, which is interpreted as a robot), the robot controller configured to: 
determine a movement policy for controlling the robot (See at least Mukadam Paragraphs 0013 and 0029-0032, the system generates a policy for determining the next action of the autonomous vehicle, which is interpreted as determining the movement policy for controlling a robot), comprising: 
acquire sensor data representing an environment of the robot (See at least Mukadam Paragraph 0055, a sensor is used to detect the environment); 
identify one or more objects in the environment of the robot from the sensor data (See at least Mukadam Paragraph 0055, the sensor detects other vehicles, which is interpreted as one or more objects); 
associate the robot and each of the one or more objects with a respective agent of a multiagent system of multiagent reinforcement learning (See at least Mukadam Paragraphs 0002, 0032-0034 and Figure 4b, the autonomous vehicle and the other vehicles are interpreted as agents in a multiagent system; See at least Mukadam Paragraphs 0034-0035 and 0039, the system uses a Q-learning system which implements reinforcement learning by simulating the actions of the agents in the multiagent system, which is interpreted as multiagent reinforcement learning); 
determine, for each agent of the multiagent system, a Q-function which includes a reward term for a movement action at a position (See at least Mukadam Paragraphs 0034-0036 and 0057, the system uses Q-masking and Q-learning, which is interpreted as a Q-function to simulate the actions of the other vehicles and the autonomous vehicle, which is interpreted as determining a Q-function for each agent; See at least Mukadam Paragraphs 0041-0042 and 0044, the terminal reward is interpreted as a reward term), an expectation term (See at least Mukadam Paragraphs 0042 and 0059, the expected reward is interpreted as an expectation term), and a coupling term (See at least Mukadam Paragraphs 0064 and 0068 and Figure 3a, the occupancy grid is interpreted as a coupling term), wherein the reward term and the expectation term are independent of the other agents (See at least Mukadam Paragraphs 0041-0042, the reward terms and expectation terms are goals for the respective agent, which are independent of other agents), wherein, for each agent, the occupancy measure for a position and a time denotes a likelihood of the agent being in the position at the time (See at least Mukadam Paragraphs 0037-0038 and 0064, the occupancy grid is a probabilistic grid with position information and that uses the occupancy at previous time intervals, which is interpreted as a function of occupancy measure for a position that denotes the likelihood of the position of the agent; See at least Mukadam Paragraphs 0034 and 0036, the simulations are carried out in time intervals, which is interpreted as an occupancy measure for a time);
determine the movement policy of the robot using multiagent reinforcement learning (See at least Mukadam Paragraphs 0036, 0039, and 0056-0057, the system is trained using the multiagent reinforcement learning, and determines the policy based on the remaining set of actions), wherein the movement policy selects movement actions with a higher value of the Q-function determined for the robot with higher probability than movement actions with a lower value of the Q-function (See at least Mukadam Paragraph 0072, the trajectory with the higher reward, which is interpreted as higher value of the Q-function, is selected; See at least Mukadam Paragraph 0041, the reward function scores actions that result in collision negatively, which means the higher rewards have a higher probability of no collision) and control the robot according to the movement policy (See at least Mukadam Paragraph 0031, the vehicle is controlled using the movement policy).

Regarding Claim 8, Mukadam discloses a non-transitory computer-readable medium on which are stored instructions of determining a movement policy for controlling a robot (See at least Mukadam Paragraphs 0013 and 0029-0032, the system generates a policy for determining the next action of the autonomous vehicle, which is interpreted as determining the movement policy for controlling a robot; See at least Mukadam Paragraph 0073, the computer-readable medium is non-transitory and has processor-executable instructions), the instructions, when executed by a computer, causing the computer to perform the following steps:
acquiring sensor data representing an environment of the robot (See at least Mukadam Paragraph 0055, a sensor is used to detect the environment); 
identifying one or more objects in the environment of the robot from the sensor data (See at least Mukadam Paragraph 0055, the sensor detects other vehicles, which is interpreted as one or more objects); 
associating the robot and each of the one or more objects with a respective agent of a multiagent system of multiagent reinforcement learning (See at least Mukadam Paragraphs 0002, 0032-0034 and Figure 4b, the autonomous vehicle and the other vehicles are interpreted as agents in a multiagent system; See at least Mukadam Paragraphs 0034-0035 and 0039, the system uses a Q-learning system which implements reinforcement learning by simulating the actions of the agents in the multiagent system, which is interpreted as multiagent reinforcement learning); 
determining, for each agent of the multiagent system, a Q-function which includes a reward term for a movement action at a position (See at least Mukadam Paragraphs 0034-0036 and 0057, the system uses Q-masking and Q-learning, which is interpreted as a Q-function to simulate the actions of the other vehicles and the autonomous vehicle, which is interpreted as determining a Q-function for each agent; See at least Mukadam Paragraphs 0041-0042 and 0044, the terminal reward is interpreted as a reward term), an expectation term (See at least Mukadam Paragraphs 0042 and 0059, the expected reward is interpreted as an expectation term), and a coupling term (See at least Mukadam Paragraphs 0064 and 0068 and Figure 3a, the occupancy grid is interpreted as a coupling term), wherein the reward term and the expectation term are independent of the other agents (See at least Mukadam Paragraphs 0041-0042, the reward terms and expectation terms are goals for the respective agent, which are independent of other agents), and wherein the coupling is a function of occupancy measures of the other agents, wherein, for each agent, the occupancy measure for a position and a time denotes a likelihood of the agent being in the position at the time (See at least Mukadam Paragraphs 0037-0038 and 0064, the occupancy grid is a probabilistic grid with position information and that uses the occupancy at previous time intervals, which is interpreted as a function of occupancy measure for a position that denotes the likelihood of the position of the agent; See at least Mukadam Paragraphs 0034 and 0036, the simulations are carried out in time intervals, which is interpreted as an occupancy measure for a time);
determining the movement policy of the robot using multiagent reinforcement learning (See at least Mukadam Paragraphs 0036, 0039, and 0056-0057, the system is trained using the multiagent reinforcement learning, and determines the policy based on the remaining set of actions), wherein the movement policy selects movement actions with a higher value of the Q-function determined for the robot with higher probability than movement actions with a lower value of the Q-function (See at least Mukadam Paragraph 0072, the trajectory with the higher reward, which is interpreted as higher value of the Q-function, is selected; See at least Mukadam Paragraph 0041, the reward function scores actions that result in collision negatively, which means the higher rewards have a higher probability of no collision).

Regarding Claim 9, Mukadam discloses controlling the robot according to the movement policy (See at least Mukadam Paragraph 0031, the vehicle is controlled using the movement policy).

Regarding Claim 10, Mukadam discloses controlling the robot according to the movement policy (See at least Mukadam Paragraph 0031, the vehicle is controlled using the movement policy).

Regarding Claim 11, Mukadam discloses determining the movement policy of the robot using multiagent reinforcement learning includes predicting a future state of the agents using the Q-functions determined for the agents (See at least Mukadam Paragraphs 0038-0040, the actions of the agents, such as the autonomous vehicle and other vehicles, are simulated using Q-learning, which is interpreted as predicting the future state of the agents using the Q-functions). 

Regarding Claim 12, Mukadam discloses the determination of the movement policy of the robot using multiagent reinforcement learning includes predicting a future state of the agents using the Q-functions determined for the agents (See at least Mukadam Paragraphs 0038-0040, the actions of the agents, such as the autonomous vehicle and other vehicles, are simulated using Q-learning, which is interpreted as predicting the future state of the agents using the Q-functions). 

Regarding Claim 13, Mukadam discloses determining the movement policy of the robot using multiagent reinforcement learning includes predicting a future state of the agents using the Q-functions determined for the agents (See at least Mukadam Paragraphs 0038-0040, the actions of the agents, such as the autonomous vehicle and other vehicles, are simulated using Q-learning, which is interpreted as predicting the future state of the agents using the Q-functions). 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over Mukadam in view of Ostafew (US 20210237769 A1) (Hereinafter referred to as Ostafew) 

Regarding Claim 3, Mukadam discloses the determining of the Q-function includes iteratively determining the Q-functions in a plurality of iterations (See at least Mukadam Paragraphs 0034-0036 and 0056, the simulation to determine the action of the vehicle using the Q-function is repeated for additional time intervals, which is interpreted as determining the Q-function in a plurality of iterations), wherein each iteration includes a forward pass from an initial time to an end time over a plurality of time steps (See at least Mukadam Paragraph 0056, the time intervals, which are interpreted as time steps, go from an initial time until the autonomous vehicle reaches a terminal state, which is interpreted as an end time)…
Even though Mukadam discloses iteratively determining the Q-functions over a plurality of time steps, Mukadam fails to disclose each iteration includes… a backward pass from the end time to the initial time over the plurality of time steps.
However, Ostafew discloses a backward pass from the end time to the initial time over the plurality of time steps (See at least Ostafew Paragraphs 0205-0209 and Figure 16, the position of the vehicle is determined by going backwards in time from an end time (t+X) to an initial time (t), which is interpreted as a backward pass).
It would have been obvious to one of ordinary skill in the art to modify the teachings disclosed in Mukadam with Ostafew to have the iterations include a backward pass from the end time to the initial time over the plurality of time steps. By using the backward pass from an end time to an initial time, a vehicle can determine where it can clear another vehicle that is heading its way (See at least Ostafew Paragraphs 0205-0209 and Figure 16), which would increase the safety of the system by preventing collisions.

Regarding Claim 4, Mukadam discloses the coupling term is a function of the occupancy measures of the other agents (See at least Mukadam Paragraphs 0037-0038 and 0064, the occupancy grid is a probabilistic grid with position information and that uses the occupancy at previous time intervals, which is interpreted as a function of occupancy measure for a position that denotes the likelihood of the position of the agent; See at least Mukadam Paragraphs 0034 and 0036, the simulations are carried out in time intervals, which is interpreted as an occupancy measure for a time) and the forward pass includes updating, for each agent, the occupancy measure of a next time step by propagating the occupancy measure of a current time step of an agent using the policy of the agent at the current time step (See at least Mukadam Paragraphs 0037-0038 and 0064 and Figure 3A, occupancy grid from previous time intervals are used to provide a probabilistic occupancy grid, which is interpreted as updating the next time step by using the policy at the current time step).
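
For illustration only, the forward-pass limitation recited in claim 4 (propagating each agent's occupancy measure from the current time step to the next time step using that agent's policy) can be sketched as follows. This sketch is not taken from the application, Mukadam, or Ostafew. The function name propagate_occupancy, the dictionary representation of the occupancy measure, and the deterministic transition function are all assumptions made for the example.

```python
def propagate_occupancy(occupancy, policy, transition):
    """Propagate one agent's occupancy measure forward by one time step.

    occupancy:  dict mapping position -> probability of being there at time t
    policy:     dict mapping position -> {action: probability} at time t
    transition: hypothetical deterministic function (position, action) -> position
    Returns a dict mapping position -> probability at time t + 1.
    """
    next_occupancy = {}
    for position, prob_here in occupancy.items():
        for action, prob_action in policy[position].items():
            new_position = transition(position, action)
            # Probability mass flows from the current position to the new one.
            next_occupancy[new_position] = (
                next_occupancy.get(new_position, 0.0) + prob_here * prob_action
            )
    return next_occupancy


# Hypothetical 1-D example: the agent starts at position 0 and either
# stays (action 0) or moves one cell to the right (action 1).
start = {0: 1.0}
policy_t = {0: {0: 0.5, 1: 0.5}}
occupancy_next = propagate_occupancy(start, policy_t, lambda pos, a: pos + a)
```

Under these assumptions, the probability mass at each position is split according to the policy, so the propagated occupancy measure still sums to one.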

	Regarding Claim 5, Mukadam discloses the coupling term is a function of the occupancy measures of the other agents (See at least Mukadam Paragraphs 0037-0038 and 0064, the occupancy grid is a probabilistic grid with position information and that uses the occupancy at previous time intervals, which is interpreted as a function of occupancy measure for a position that denotes the likelihood of the position of the agent; See at least Mukadam Paragraphs 0034 and 0036, the simulations are carried out in time intervals, which is interpreted as an occupancy measure for a time) and…updating, for each agent, the Q-function and policy of the agent (See at least Mukadam Paragraphs 0067-0068 and Figures 2-3, the policy is determined by using the Q-function during the simulation, which is interpreted as updating the policy of the agent; See at least Mukadam Paragraphs 0064 and 0068 and Figure 3A, the Q-function is updated with the state inputs and occupancy grids from previous time intervals)…
	Even though Mukadam discloses updating the Q-function and policy of each agent, Mukadam fails to disclose the backward pass includes updating, for each agent, the Q-function and policy of the agent…at a current time step by using the occupancy measure of the other agents at a next time step.
	However, Ostafew discloses the backward pass includes updating the position and path of each agent at a current time step by using the occupancy measure of the other agents at a next time step (See at least Ostafew Paragraphs 0205-0209 and Figure 16, the backward pass updates the movement action of the agent at a current time step (t) by using the occupancy measures of the other vehicle at a next time step (t+X)).
It would have been obvious to one of ordinary skill in the art to modify the teachings disclosed in Mukadam with Ostafew to update, for each agent, the Q-function and policy of the agent at a current time step by using the occupancy measure of the other agents at a next time step. By updating the agent at a current time step by using the occupancy measure of the other agents at a next time step, a vehicle can determine where it can clear another vehicle that is heading its way (See at least Ostafew Paragraphs 0205-0209 and Figure 16), which would increase the safety of the system by preventing collisions.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Mukadam in view of Palanisamy et al. (US 20190278282 A1) (Hereinafter referred to as Palanisamy).

Regarding Claim 6, Mukadam fails to disclose the movement policy is determined such that actions for a system state of the multiagent system are distributed according to a Boltzmann distribution depending on the Q-function determined for the robot.
However, Palanisamy discloses this limitation (See at least Palanisamy Paragraphs 0057 and 0059, the path of objects is predicted and used to plan the path for the vehicle, which is interpreted as movement policy of the multiagent system; See at least Palanisamy Paragraphs 0068-0069 and 0073, the movement policy is determined for the autonomous vehicle using a Boltzmann distribution depending on the reward of the Q-function, and the predicted paths of the objects).
It would have been obvious to one of ordinary skill in the art to modify the teachings disclosed in Mukadam with Palanisamy to determine the movement policy based on the Boltzmann distribution depending on the Q-function. By using the Boltzmann distribution, the vehicle can choose the next movement policy/task for the next iteration for each of the N tasks, higher-level state, and the reward (See at least Palanisamy Paragraphs 0068-0069 and 0073). This would allow the vehicle to explore different methods of avoiding obstacles and select the easiest task (See at least Palanisamy Paragraphs 0068-0069 and 0073), which would increase the safety of the system by preventing accidents.
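
For illustration only, a policy whose actions are distributed according to a Boltzmann distribution over Q-values, as recited in claim 6, can be sketched as follows. This sketch is not taken from the application, Mukadam, or Palanisamy. The function name boltzmann_policy, the temperature parameter, and the example action names are assumptions made for the example.

```python
import math


def boltzmann_policy(q_values, temperature=1.0):
    """Map Q-values for one state to Boltzmann (softmax) action probabilities.

    q_values:    dict mapping action -> Q-value for the current state
    temperature: hypothetical positive scaling parameter; lower values
                 concentrate probability on the highest-valued action
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical Q-values for three movement actions in one system state.
probs = boltzmann_policy({"left": 0.0, "stay": 1.0, "right": 2.0})
```

Under these assumptions, an action with a higher Q-value receives a higher selection probability than an action with a lower Q-value, and actions with equal Q-values are selected with equal probability.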

Response to Arguments
Applicant’s arguments with respect to claims 1 and 7-8 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The claims are now rejected in view of Mukadam, which teaches using a Q-function to determine the movement policy of an autonomous vehicle by determining the probability of other vehicles’ actions. Therefore, the claims still stand rejected.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ESVINDER SINGH whose telephone number is (571)272-7875. The examiner can normally be reached Monday-Friday, 9 am-5 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abby Lin, can be reached at 571-270-3976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/E.S./
Examiner, Art Unit 3664

/ABBY Y LIN/
Supervisory Patent Examiner, Art Unit 3664