DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-6 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Li et al., U.S. Patent Application Publication No. 2019/0250568.
	
As per claims 1, 5 and 6, and as best understood, these claims recite a computer program, method or device that executes a process that: determines a first action on a control target using a controller; performs first reinforcement learning within a first action range around the first action to acquire a first policy, wherein the first action range is smaller than a limit action range for the control target; determines a second action on the control target using the first policy; and updates the first policy to a second policy by performing second reinforcement learning within a second action range around the second action, wherein the second action range is smaller than the limit action range.
Examiner Notes: The “limit action range” for the control target is interpreted to correspond to the totality of potential movements for a control target. For example, in a game of Pong, the racquet can move either up or down, to a “limit” in both directions. Therefore, there exist many incremental moves within this “range” between maximum movement UP and maximum movement DOWN, and the maximum in either direction defines the limit of travel for the action range, per se.

All that being said, and as best understood, claims 1, 5 and 6 appear to set forth features of determining a first action on a control target, then performing reinforcement learning to generate a first policy, whereby a subsequent action that is within a smaller range than the limit action range for the target is utilized for the learning, and then a further action, which is also within a smaller range than the overall action range, is utilized to perform further reinforcement learning to update the first policy to a second policy.
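For illustration only, the process as interpreted above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from Li et al.: the one-dimensional control target, the `reward` function, the constants `LIMIT_RANGE` and `STEP_RANGE`, the offset set `OFFSETS`, and the helper `learn_within_range` are all hypothetical, and a simple epsilon-greedy value update stands in for whatever reinforcement learning the claims contemplate.

```python
import random

LIMIT_RANGE = 10.0   # hypothetical limit action range of the control target: [-10, 10]
STEP_RANGE = 1.0     # hypothetical per-stage action range, smaller than LIMIT_RANGE
OFFSETS = [-1.0, -0.5, 0.0, 0.5, 1.0]   # candidate offsets, all within STEP_RANGE


def reward(action, optimum=2.0):
    """Toy reward: larger when the action is closer to an optimum unknown to the learner."""
    return -(action - optimum) ** 2


def clamp(action):
    """Keep an action inside the limit action range."""
    return max(-LIMIT_RANGE, min(LIMIT_RANGE, action))


def learn_within_range(center_policy, episodes=200, epsilon=0.2, alpha=0.5, seed=0):
    """Learn the best offset around the action chosen by center_policy.

    Exploration never leaves [center - STEP_RANGE, center + STEP_RANGE].
    Returns a new policy (a zero-argument callable returning an action).
    """
    rng = random.Random(seed)
    q = {off: 0.0 for off in OFFSETS}     # value estimate per offset
    center = center_policy()
    for _ in range(episodes):
        if rng.random() < epsilon:
            off = rng.choice(OFFSETS)     # explore
        else:
            off = max(q, key=q.get)       # exploit the current estimate
        q[off] += alpha * (reward(clamp(center + off)) - q[off])
    best = max(q, key=q.get)
    return lambda: clamp(center + best)


baseline = lambda: 0.0                    # hypothetical controller giving the first action
policy1 = learn_within_range(baseline)    # first learning, around the first action
policy2 = learn_within_range(policy1)     # second learning, around the second action
print(policy1(), policy2())
```

In this sketch each stage can move the action by at most `STEP_RANGE`, so the learner approaches the optimum in bounded increments rather than exploring the full limit action range at once.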

As per claims 1, 5 and 6, Li et al. discloses a control agent sensing an object's location within a state space, and based on the current location, the agent selects an available action for the object to execute. The agent is trained to maximize a statistically expected value of a reward over a finite, or infinite, number of temporal steps. The training of an agent includes the determination, generation, and/or updating of the agent's policy, which governs the agent's selected actions within the environment, via Q-learning (e.g., see [0026]).

Li et al. further discloses that the training of an agent, or corresponding policy, includes iterative exploration of state-action pairs, and updating them such that the network more accurately predicts returns, and thus the policy eventually returns actions that tend to work towards achieving the goal associated with the task (e.g., see [0048]).
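For illustration only, iterative exploration of state-action pairs with value updates can be sketched with textbook tabular Q-learning. This is a minimal sketch and is not Li et al.'s implementation: the five-state corridor, the `step`, `train` and `greedy_policy` helpers, and all parameter values are hypothetical, and a lookup table stands in for the network referred to above.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Deterministic toy dynamics; reward 1.0 only on reaching the goal."""
    nxt = max(0, min(N_STATES - 1, state + action))
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=300, alpha=0.5, gamma=0.9, max_steps=200, seed=0):
    """Tabular Q-learning with a uniformly random behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)   # pure exploration; Q-learning is off-policy
            nxt, r, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next-state value.
            target = r if done else r + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q):
    """Policy that returns the highest-valued action in each non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


q_table = train()
print(greedy_policy(q_table))
```

As the value estimates improve, the greedy policy derived from them tends to select actions that lead toward the goal, which is the behavior the cited paragraph describes in general terms.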

As per claim 2, as best understood, this claim appears to simply add a feature whereby a next action is repeatedly executed and the policy is updated. This is 

As per claims 3 and 4, the recited first controller using a first policy, second controller using a second policy, and third controller implemented by integrating the first and second controllers are features that appear to be adequately anticipated by the utilization of a learning agent that utilizes a learning signal and a supervisor signal generated by a supervisor agent. A supervisor coefficient weighs the signals so that a pioneer agent is updated to include a learning policy of the trained learning agent; the supervisor coefficient is then reduced until the pioneer agent is terminated, and the learning agent is then updated to include a pioneer policy of the trained pioneer agent (e.g., see [0008]).

References Cited But Not Relied Upon
Prior art that has been cited but not relied upon in the rejection of the claims is as follows:
Chebotar et al., U.S. Patent Application Publication No. 2020/0276703 which discloses a robotic agent and reinforcement learning control architecture and method for accomplishing learning via demonstration;
Beckman et al., U.S. Patent No. 10,792,810 which discloses an artificial intelligence learning system for robotic control policies;
Dulac-Arnold et al., U.S. Patent No. 10,885,432 which discloses a system for selecting actions from large discrete action sets using reinforcement learning;
Kalakrishnan et al., U.S. Patent No. 10,960,539 which discloses control policies for robotic agents;
Ueda et al., U.S. Patent Application Publication No. 2010/0114807 which discloses a reinforcement learning system for robotic applications;
Coenen, U.S. Patent Application Publication No. 2014/0277744 which discloses a robotic training apparatus and method that utilizes reinforcement learning techniques and methodologies;
Gu et al., U.S. Patent Application Publication No. 2017/0228662 which discloses reinforcement learning using advantage estimates and Q based learning techniques;
Nishi, U.S. Patent Application Publication No. 2018/0009445 which discloses an online learning and vehicle reinforcement methodology without active exploration for vehicle navigation and control;
Wright et al., U.S. Patent Application Publication No. 2018/0012137 which discloses an approximate value iteration with complex returns by bounding for a control system;
Izhikevich et al., U.S. Patent Application Publication No. 2014/0277718 which discloses an adaptive predictor apparatus and method for robotic movement control.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RONALD D HARTMAN JR whose telephone number is (571)272-3684. The examiner can normally be reached M-F 8:30 - 4:30 EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mohammad Ali, can be reached at (571) 272-4105. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 



/RONALD D HARTMAN JR/
Primary Patent Examiner, Art Unit 2119
November 20, 2021
/RDH/