DETAILED ACTION
Claims 1-20 are presented for examination.
Information Disclosure Statement
The information disclosure statements (IDS) submitted are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the Examiner.
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with James Barta on 06/23/2022.
The application has been amended as follows:
(Currently Amended) An apparatus comprising:
a memory storing a reinforcement learning policy having an optimization component and a data collection component;
a regularization component configured to apply regularization selectively between the optimization component of the reinforcement learning policy and the data collection component of the reinforcement learning policy;
a processor configured to carry out a reinforcement learning process by:
	triggering execution of an agent according to the reinforcement learning policy and with respect to a first task;
	observing values of variables comprising: an observation space of the agent, an action of the agent; and
	updating the reinforcement learning policy using reinforcement learning according to the observed values and taking into account the regularization, wherein updating the reinforcement learning policy comprises computing a loss function, and wherein the processor is configured to trigger execution of the agent according to the updated reinforcement learning policy and with respect to a second task, wherein the second task is different from the first task.
(Canceled) 
(Original) The apparatus of claim 1 wherein the memory is configured to store the reinforcement learning policy such that the optimization component and the data collection component are separate.
(Original) The apparatus of claim 1 wherein the regularization component is configured to apply no regularization to the data collection component and apply regularization to the optimization component.
(Original) The apparatus of claim 1 wherein the regularization component is configured to apply more regularization to the optimization component than to the data collection component. 
(Currently Amended) The apparatus of claim 1 wherein the reinforcement learning policy is computed using a machine learning model and wherein the regularization component is configured to restrict a capacity of the machine learning model. 
(Currently Amended) The apparatus of claim 1 wherein the reinforcement learning policy is computed using a machine learning model and wherein the regularization component is configured to use one or more of the following regularization methods: selecting an architecture of the machine learning model, stochastic regularization whereby noise is added to the machine learning model.
(Original) The apparatus of claim 1 wherein the reinforcement learning process comprises computing a loss of the policy $L_{AC}^{IB}$ plus a first weight $\lambda_{V}$ times a loss $L_{AC}^{V}$ of the critic minus a second weight $\lambda_{H}$ times a heuristic entropy bonus of the policy plus a Lagrangian multiplier hyperparameter $\beta$ times a regularization term.
(Original) The apparatus of claim 8 wherein the reinforcement learning process is an actor-critic process. 
(Currently Amended) The apparatus of claim 1 wherein the second task comprises operation of the physical entity in a second physical environment, which is different from the first physical environment.
(Currently Amended) The apparatus of claim 1 wherein the agent is a chat bot and wherein the first task is to apply a skill in a first situation and [[a]] the second task is to apply a skill in a second situation, which is different from the first situation. 
(Original) The apparatus of claim 1 wherein the agent is a player in a computer game and wherein the first task is a task in the computer game. 
(Original) The apparatus of claim 1 wherein the agent is any of: a robotic vacuum cleaner, a manufacturing robot arm, a chat bot, an avatar in a video game. 
(Currently Amended) A computer-implemented method comprising:
storing, at a memory, a reinforcement learning policy having an optimization component and a data collection component;
applying regularization selectively between the optimization component of the reinforcement learning policy and the data collection component of the reinforcement learning policy;
carrying out a reinforcement learning process by:
	triggering execution of an agent according to the reinforcement learning policy and with respect to a first task;
	observing values of variables comprising: an observation space of the agent, an action of the agent;
	updating the reinforcement learning policy using reinforcement learning according to the observed values and taking into account the regularization, wherein updating the reinforcement learning policy comprises computing a loss function; and
	triggering execution of the agent according to the updated reinforcement learning policy and with respect to a second task, wherein the second task is different from the first task.
(Currently Amended) The method of claim 14 which is carried out in a cloud and wherein the agent is a physical agent or a digital agent.
(Canceled)
(Original) The method of claim 14 comprising applying no regularization to the data collection component and applying regularization to the optimization component.
(Original) The method of claim 14 comprising applying more regularization to the optimization component than to the data collection component.
(Currently Amended) The method of claim 14 comprising computing the reinforcement learning policy using a machine learning model and using one or more of the following regularization methods: selecting an architecture of the machine learning model, stochastic regularization whereby noise is added to the machine learning model.
(Currently Amended) One or more computer storage 
storing, at a memory, a reinforcement learning policy having an optimization component and a data collection component;
applying a first amount of regularization to the optimization component of the reinforcement learning policy and a second amount of regularization to the data collection component of the reinforcement learning policy, where the first and second amounts are different;
carrying out a reinforcement learning process by:
	triggering execution of an agent according to the reinforcement learning policy and with respect to a first task; 
	observing values of variables comprising: an observation space of the agent, an action of the agent; 
	updating the reinforcement learning policy using reinforcement learning according to the observed values and taking into account the regularization, wherein updating the reinforcement learning policy comprises computing a loss function; and
	triggering execution of the agent according to the updated reinforcement learning policy and with respect to a second task, wherein the second task is different from the first task.
(New) The one or more computer storage media of claim 20, wherein more regularization is applied to the optimization component than to the data collection component.
(New) The one or more computer storage media of claim 20, wherein the reinforcement learning process comprises computing a loss of the policy $L_{AC}^{IB}$ plus a first weight $\lambda_{V}$ times a loss $L_{AC}^{V}$ of the critic minus a second weight $\lambda_{H}$ times a heuristic entropy bonus of the policy plus a Lagrangian multiplier hyperparameter $\beta$ times a regularization term.
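
For illustration only, the combined loss recited in the two loss-function claims above can be read as a weighted sum of four terms. The following is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `total_loss`, its parameter names, and the default weight values are hypothetical and do not appear in the record.

```python
def total_loss(policy_loss, critic_loss, entropy_bonus, regularization_term,
               lambda_v=0.5, lambda_h=0.01, beta=0.001):
    """Combine the four terms in the order the claim language recites them.

    policy_loss          -- loss of the policy (L_AC^IB in the claim)
    critic_loss          -- loss of the critic (L_AC^V in the claim)
    entropy_bonus        -- heuristic entropy bonus of the policy
    regularization_term  -- the regularization term scaled by beta
    lambda_v, lambda_h, beta -- weights; the defaults here are assumed
                                values chosen only for this example
    """
    return (policy_loss
            + lambda_v * critic_loss      # plus first weight times critic loss
            - lambda_h * entropy_bonus    # minus second weight times entropy bonus
            + beta * regularization_term) # plus beta times regularization term


# Usage with made-up numbers: each argument is a scalar already computed elsewhere.
example = total_loss(1.0, 2.0, 0.5, 10.0, lambda_v=0.5, lambda_h=0.1, beta=0.01)
print(example)
```

In this reading, setting `beta` to zero removes the regularization term from the update entirely, which is one way the amount of regularization taken into account could be varied.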

Allowable Subject Matter
Claims 1, 3-15, and 17-22 are allowed. 
	The following is an Examiner’s statement of reasons for allowance:
	Diggle discloses an apparatus for training a behavior of an agent in a physical or digital environment. The apparatus comprises a memory storing the location of at least one reward token in the environment, the location having been specified by a user. At least one processor executes the agent in the environment according to a behavior policy. The processor is configured to observe values of variables comprising: an observation of the agent, an action of the agent and any reward resulting from the reward token. The processor is configured to update the behavior policy using reinforcement learning according to the observed values.
	Baughman discloses a method that includes taking an action by a learning agent. The method further includes observing a new state of the environment and calculating a reward for the action taken by the learning agent. The method also includes determining whether a policy related to the learning agent should be changed. The determination is conducted by a teaching agent that inputs the state of the environment and the reward as features. The method can also include changing the policy related to the learning agent upon a determination that a label outputted by the teaching agent exceeds a reward threshold.
	VAN SEIJEN discloses decomposing single-agent reinforcement learning problems into simpler problems addressed by multiple agents. Actions proposed by the multiple agents are then aggregated using an aggregator, which selects an action to take with respect to an environment. Aspects provided therein are also relevant to a hybrid reward model.
	However, none of the cited prior art of record discloses the following limitations recited in claim 1, or the similar limitations recited in claims 14 and 20: “… a memory storing a reinforcement learning policy having an optimization component and a data collection component; a regularization component configured to apply regularization selectively between the optimization component of the reinforcement learning policy and the data collection component of the reinforcement learning policy; a processor configured to carry out a reinforcement learning process by: triggering execution of an agent according to the policy and with respect to a first task; observing values of variables comprising: an observation space of the agent, an action of the agent; and updating the policy using reinforcement learning according to the observed values and taking into account the regularization, wherein updating the policy comprises computing a loss function, and wherein the processor is configured to trigger execution of the agent according to the updated policy and with respect to a second task, wherein the second task is different from the first task.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
	US 20180361577 A1 - a method and system for a robotic device comprising a propulsion mechanism, an orientation sensor, a stored digital map of a service area, a sensor for sensing objects, a navigation and orientation system, and a processing facility comprising a processor and a memory, the processing facility causing the robotic device to determine and store a pose position of the robotic device at a plurality of sequential locations as the robotic device is guided by a user along a path from a start location to an end location through the service area, and as commanded by the user and utilizing the navigation and orientation system, re-trace the path from the start location to the end location replicating the stored pose position of the robotic device at the plurality of sequential locations.
	US 20170021497 A1 - One or more robots that make exploration and path planning decisions in a previously unknown or unmapped environment based on a map and localization data at least partially generated by human transported perception unit(s).
Inquiries 
Any inquiry concerning this communication or earlier communications from the Examiner should be directed to PAKEE FANG whose telephone number is (571)270-3633.  The Examiner can normally be reached on Mon-Fri 9:00AM-5:00PM.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, PÉREZ-GUTIÉRREZ RAFAEL can be reached on 571-272-7915.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAKEE FANG/
Primary Examiner, Art Unit 2642