DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The amendment filed on 07/20/2022 has been entered and fully considered.
Claims 5 and 14 have been amended.
Claims 1-20 are pending in the instant application.

Response to Arguments
Applicant’s arguments with respect to the rejection(s) of claim(s) 1-5, 7, 9-14, 16, and 18-20 under 35 U.S.C. 103(a) have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Gendron-Bellemare et al. (USPGPub 2020/0327405).


In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103(a) are summarized as follows:
1.	Determining the scope and contents of the prior art.
2.	Ascertaining the differences between the prior art and the claims at issue.
3.	Resolving the level of ordinary skill in the pertinent art.
4.	Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4, 7, 9-13, 16, and 18-20 are rejected under 35 U.S.C. 103(a) as being unpatentable over Banino et al. (USPGPub 2020/0191574), in view of Badia et al. (USPGPub 2020/0372366), and further in view of Gendron-Bellemare et al. (USPGPub 2020/0327405).
As per claim 1, Banino discloses a computer-implemented method for implementing reward based strategies for promoting exploration comprising:
receiving data associated with an agent environment of an ego agent (Banino 2020/0191574 see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
receiving data associated with a dynamic operation of the ego agent within the agent environment (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
implementing a reward function that is associated with exploration of at least one agent state within the agent environment, wherein the reward function includes assigning at least one reward based on if the at least one agent state is a novel unexplored agent state or a previously explored agent state (see at least paragraph 0041; wherein at each time step, the action selection system 102 may receive a reward 114 based on the current state of the environment 108 and the action 104 of the agent 106 at the time step. In general, the reward 114 is a numerical value. The reward 114 may indicate whether the agent 106 has accomplished a task, or the progress of the agent 106 towards accomplishing a task. For example, if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise); and
training a neural network with the novel unexplored agent state (see at least paragraph 0067; wherein the system 102 can train the action selection network 204 using reinforcement learning training techniques. More specifically, the system 102 can iteratively adjust the values of the action selection network parameters using gradients of a reinforcement learning objective function with respect to the action selection system parameters to increase a cumulative measure of reward received by the system 102).
Banino does not explicitly disclose a target agent; and wherein at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Badia does disclose:
a target agent (see at least Figure 1, item 104; it would have been obvious to one of ordinary skill in the art to have one agent or multiple agents); and
wherein at least one simulation (see at least paragraph 0064; wherein training an agent in a simulated environment may enable the agent to learn from large amounts of simulated training data).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Badia with the teachings as in Banino. The motivation for doing so would have been to avoid risks associated with training the agent in a real world environment, e.g., damage to the agent due to performing poorly chosen actions, see Badia paragraph 0064.
Banino and Badia do not explicitly disclose that the at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Gendron-Bellemare does disclose:
processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function (see at least paragraph 0040; wherein the system determines a combined reward corresponding to the first observation from the actual reward and an exploration reward bonus (step 204). An exploration reward bonus can be used to incentivize the agent to explore the environment. For example, the exploration reward bonus can be used to encourage the agent to explore new parts of the environment by receiving new observations that have not been observed before).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Gendron-Bellemare with the teachings as in Banino and Badia. The motivation for doing so would have been to improve exploration results while requiring a smaller number of training iterations, see Gendron-Bellemare paragraph 0019.
As per claims 2 and 11, Badia discloses wherein receiving data associated with the agent environment includes receiving image data and LiDAR data from at least one of the: ego agent and the target agent, wherein the image data and the LiDAR data are fused to determine a simulated agent environment model that pertains to the agent environment at a current time step (see at least paragraph 0055; wherein the observations may also include, for example, data obtained by one or more sensor devices which sense a real-world environment; for example, sensed electronic signals such as motor current or a temperature signal; and/or image or video data for example from a camera or a LIDAR sensor, e.g., data from sensors of the agent or data from sensors that are located separately from the agent in the environment).
As per claims 3 and 12, Banino discloses wherein receiving data associated with the agent environment includes receiving dynamic data associated with the ego agent and dynamic data associated with the target agent, wherein the dynamic data is analyzed to determine the dynamic operation of the ego agent and the target agent within the agent environment at the current time step (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment, to determine the action 104 to be performed by the agent at the time step. Data characterizing the current state of the environment is referred to in this specification as an “observation” 112).
As per claims 4 and 13, Banino discloses further including evaluating the simulated agent environment model and the dynamic data and determining current agent states of the ego agent and the target agent, wherein the current agent states of the ego agent and the target agent pertain to the dynamic operation of the ego agent, the dynamic operation of the target agent, and environmental attributes that are associated with the agent environment at the current time step (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment, to determine the action 104 to be performed by the agent at the time step. Data characterizing the current state of the environment is referred to in this specification as an “observation” 112).
As per claims 7 and 16, Banino and Badia disclose wherein implementing the reward function that is associated with the exploration of the at least one agent state includes analyzing the at least one agent state using a maximum reward function, wherein a minimum reward value is assigned to at least one agent state that is determined to be the previously explored agent state and a maximum reward value is assigned to at least one agent state that is determined to be the novel unexplored agent state (see at least Banino paragraph 0041; wherein Banino discloses that if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise; see also at least Badia paragraph 0037; wherein Badia discloses that the system can use the information provided by the exploratory policies to learn a more effective “exploitative” action selection policy, i.e., that selects actions to maximize a cumulative measure of “task” rewards received by the agent rather than causing the agent to explore the environment. A task reward received by the agent may characterize a progress of the agent towards accomplishing a task. The information provided by the exploratory policies may include, e.g., information stored in the shared weights of the action selection neural network. Learning the exploratory policies enables the system to continually train the action selection neural network even if the task rewards are sparse, e.g., rarely non-zero).
As per claims 9, 18 and 20, Banino discloses further including autonomously controlling at least one of the ego agent and the target agent based on agent states that are determined with respect to a simulated agent environment model (see at least paragraph 0080; wherein the environment 108 is a simulated environment and the agent 106 is implemented as one or more computer programs interacting with the simulated environment), wherein at least one of the ego agent and the target agent are autonomously controlled to perform at least one maneuver to travel to an intended location of the agent environment (see at least paragraph 0006; wherein an action selection system implemented as computer programs on one or more computers in one or more locations that can control an agent by selecting actions to be performed by the agent that cause the agent to solve tasks that involve navigating through an environment).
As per claim 10, Banino discloses a system for implementing reward based strategies for promoting exploration comprising:
a memory storing instructions when executed by a processor cause the processor (see at least paragraph 0104; wherein a central processing unit will receive instructions and data from a read-only memory or a random access memory or both) to:
receive data associated with an agent environment of an ego agent (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
receive data associated with a dynamic operation of the ego agent within the agent environment (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
implement a reward function that is associated with exploration of at least one agent state within the agent environment, wherein the reward function includes assigning at least one reward based on if the at least one agent state is a novel unexplored agent state or a previously explored agent state (see at least paragraph 0041; wherein at each time step, the action selection system 102 may receive a reward 114 based on the current state of the environment 108 and the action 104 of the agent 106 at the time step. In general, the reward 114 is a numerical value. The reward 114 may indicate whether the agent 106 has accomplished a task, or the progress of the agent 106 towards accomplishing a task. For example, if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise); and
train a neural network with the novel unexplored agent state (see at least paragraph 0067; wherein the system 102 can train the action selection network 204 using reinforcement learning training techniques. More specifically, the system 102 can iteratively adjust the values of the action selection network parameters using gradients of a reinforcement learning objective function with respect to the action selection system parameters to increase a cumulative measure of reward received by the system 102).
Banino does not explicitly disclose a target agent; and wherein at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Badia does disclose:
a target agent (see at least Figure 1, item 104; it would have been obvious to one of ordinary skill in the art to have one agent or multiple agents); and
wherein at least one simulation (see at least paragraph 0064; wherein training an agent in a simulated environment may enable the agent to learn from large amounts of simulated training data).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Badia with the teachings as in Banino. The motivation for doing so would have been to avoid risks associated with training the agent in a real world environment, e.g., damage to the agent due to performing poorly chosen actions, see Badia paragraph 0064.
Banino and Badia do not explicitly disclose that the at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Gendron-Bellemare does disclose:
processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function (see at least paragraph 0040; wherein the system determines a combined reward corresponding to the first observation from the actual reward and an exploration reward bonus (step 204). An exploration reward bonus can be used to incentivize the agent to explore the environment. For example, the exploration reward bonus can be used to encourage the agent to explore new parts of the environment by receiving new observations that have not been observed before).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Gendron-Bellemare with the teachings as in Banino and Badia. The motivation for doing so would have been to improve exploration results while requiring a smaller number of training iterations, see Gendron-Bellemare paragraph 0019.
As per claim 19, Banino discloses a non-transitory computer readable storage medium storing instructions that when executed by a computer, which includes a processor perform a method (see at least paragraph 0007; wherein the system comprises one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a grid cell neural network), the method comprising:
receiving data associated with an agent environment of an ego agent (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
receiving data associated with a dynamic operation of the ego agent within the agent environment (see at least paragraph 0037; wherein at each of multiple time steps, the action selection system 102 processes data characterizing: (i) the current velocity 110 of the agent 106, and (ii) the current state of the environment);
implementing a reward function that is associated with exploration of at least one agent state within the agent environment, wherein the reward function includes assigning at least one reward based on if the at least one agent state is a novel unexplored agent state or a previously explored agent state (see at least paragraph 0041; wherein at each time step, the action selection system 102 may receive a reward 114 based on the current state of the environment 108 and the action 104 of the agent 106 at the time step. In general, the reward 114 is a numerical value. The reward 114 may indicate whether the agent 106 has accomplished a task, or the progress of the agent 106 towards accomplishing a task. For example, if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise); and
training a neural network with the novel unexplored agent state (see at least paragraph 0067; wherein the system 102 can train the action selection network 204 using reinforcement learning training techniques. More specifically, the system 102 can iteratively adjust the values of the action selection network parameters using gradients of a reinforcement learning objective function with respect to the action selection system parameters to increase a cumulative measure of reward received by the system 102).
Banino does not explicitly disclose a target agent; and wherein at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Badia does disclose:
a target agent (see at least Figure 1, item 104; it would have been obvious to one of ordinary skill in the art to have one agent or multiple agents); and
wherein at least one simulation (see at least paragraph 0064; wherein training an agent in a simulated environment may enable the agent to learn from large amounts of simulated training data).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Badia with the teachings as in Banino. The motivation for doing so would have been to avoid risks associated with training the agent in a real world environment, e.g., damage to the agent due to performing poorly chosen actions, see Badia paragraph 0064.
Banino and Badia do not explicitly disclose that the at least one simulation is processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function.
However, Gendron-Bellemare does disclose:
processed to determine at least one additional novel unexplored agent state based on an analysis of at least one reward of the reward function (see at least paragraph 0040; wherein the system determines a combined reward corresponding to the first observation from the actual reward and an exploration reward bonus (step 204). An exploration reward bonus can be used to incentivize the agent to explore the environment. For example, the exploration reward bonus can be used to encourage the agent to explore new parts of the environment by receiving new observations that have not been observed before).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Gendron-Bellemare with the teachings as in Banino and Badia. The motivation for doing so would have been to improve exploration results while requiring a smaller number of training iterations, see Gendron-Bellemare paragraph 0019.
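
For illustration only, the reward behavior described in the passages cited above (a positive reward for a previously unexplored location and zero otherwise, and a combined reward formed from an actual reward and an exploration reward bonus) may be sketched in Python as follows. The function names, the reward magnitude of 1.0, the use of a set of visited states, and the additive combination of the actual reward and the bonus are assumptions made for this sketch; none of them is taken from Banino, Badia, or Gendron-Bellemare.

```python
from typing import Hashable, Set


def exploration_reward(state: Hashable, visited: Set[Hashable],
                       novel_value: float = 1.0) -> float:
    """Return a positive reward the first time a state is seen, zero afterwards.

    Assumption: states are hashable and novelty is tracked with a plain set.
    """
    if state in visited:
        return 0.0           # previously explored state
    visited.add(state)       # record the state as explored
    return novel_value       # novel unexplored state


def combined_reward(actual_reward: float, exploration_bonus: float) -> float:
    """Combine an actual (task) reward with an exploration bonus.

    Assumption: a simple sum; the cited passage does not state the formula.
    """
    return actual_reward + exploration_bonus


visited_states: Set[Hashable] = set()
first_visit = exploration_reward((0, 0), visited_states)    # novel state
second_visit = exploration_reward((0, 0), visited_states)   # explored state
total = combined_reward(0.5, first_visit)
```

In this sketch the set `visited_states` is the only memory of exploration, so a state counts as novel exactly once.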

Allowable Subject Matter
Claims 6 and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten to include all of the limitations of the base claim and any intervening claims. The prior art fails to explicitly teach wherein implementing the reward function that is associated with the exploration of the at least one agent state includes analyzing the at least one agent state using an explored-or-not binary reward function, wherein a negative reward is assigned to at least one agent state that is determined to be the previously explored agent state and a zero reward is assigned to at least one agent state that is determined to be the novel unexplored agent state.
Claims 5 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten to include all of the limitations of the base claim and any intervening claims. The prior art fails to explicitly teach wherein implementing the reward function that is associated with the exploration of the at least one agent state includes predicting future agent states associated with the ego agent and the target agent that pertain to a trajectory and a speed of the ego agent and target agent during at least one future time step, wherein the current agent states and actions at the current time step are analyzed to infer future information pertaining to dynamic operations and environmental attributes for the future agent states.
Claims 8 and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten to include all of the limitations of the base claim and any intervening claims. The prior art fails to explicitly teach wherein implementing the reward function that is associated with the exploration of the at least one agent state includes analyzing the at least one agent state using a noisy explored-or-not reward function, wherein a negative mean value reward is assigned to at least one agent state that is determined to be the previously explored agent state and a zero reward is assigned to at least one agent state that is determined to be the novel unexplored agent state.
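
For illustration only, the two reward schemes recited in the objected claims (an explored-or-not binary reward, and a noisy explored-or-not reward with a negative mean value) may be sketched in Python as follows. This is a hypothetical reading of the claim language, not the applicant's implementation: the function names, the penalty of -1.0, the Gaussian noise model, and the standard deviation of 0.1 are all assumptions made for this sketch.

```python
import random
from typing import Hashable, Optional, Set


def binary_explored_reward(state: Hashable, visited: Set[Hashable],
                           penalty: float = -1.0) -> float:
    """Negative reward for a previously explored state, zero for a novel one."""
    reward = penalty if state in visited else 0.0
    visited.add(state)  # the state counts as explored from now on
    return reward


def noisy_explored_reward(state: Hashable, visited: Set[Hashable],
                          mean: float = -1.0, std: float = 0.1,
                          rng: Optional[random.Random] = None) -> float:
    """Reward drawn around a negative mean for an explored state, zero if novel.

    Assumption: the noise is Gaussian; the claim language does not name a
    distribution.
    """
    if rng is None:
        rng = random.Random()
    reward = rng.gauss(mean, std) if state in visited else 0.0
    visited.add(state)
    return reward
```

Both functions differ from the Banino-style reward cited in the rejection in sign only: novelty earns zero rather than a positive value, and revisiting is penalized rather than ignored.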

Relevant Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
USPGPub 2022/0092456 – Provides training a neural network to control an agent which operates in an environment. In particular, the agent is controlled to explore an environment, and optionally also to perform a task in the environment.
USPGPub 2022/0083869 – Provides training neural networks to perform multiple tasks, and adaptive computer systems, such as neural network systems, for performing multiple tasks.
USPGPub 2020/0200556 – Provides information related to landmarks during travel. Example methods may include determining a set of landmark options based at least in part on an input indicative of locations, the set of landmark options comprising a landmark option; determining that the landmark option is selected by a user; determining a tour route based on the landmark option, wherein the tour route includes at least one landmark; and determining information to be provided to the user when the user is within a distance of the at least one landmark.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAHMOUD S ISMAIL whose telephone number is (571) 272-1326. The examiner can normally be reached M-F, 9:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jelani Smith, can be reached at 571-270-3969. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MAHMOUD S ISMAIL/
Primary Examiner, Art Unit 3662