DETAILED ACTION
Introduction
This Final Office Action is in response to amendments and remarks filed on October 3, 2022, for the application with serial number 16/802,764.

Claim 2 is amended.
Claim 1 is canceled.
Claims 2-10 are pending.

Response to Remarks/Amendments
35 USC §101 Rejections
In light of the Applicant’s amendments, the rejections under 35 USC §101 are withdrawn.
35 USC §102/103 Rejections
The Applicant traverses the rejection of independent claim 1 as being obvious over Hermann, contending that Hermann does not teach two different artificial intelligence modules, as claimed.  In response, the Examiner points out that a software module merely amounts to a segment of code that could be part of an entire computer program.  As indicated in Hermann, multiple neural network modules may be used and combined.  See Hermann ¶[0039]-[0040].  In the Non-Final Office Action, ¶[0039]-[0040] of Hermann were cited as teaching this element of claim 1 (now canceled):

the method comprising: performing, by a first artificial intelligence module, reinforcement learning of a first artificial neural network using a second artificial neural network of a second artificial intelligence module as the virtual environment

See Non-Final Office Action (6/3/2022), p. 3.  Cited ¶[0039]-[0040], particularly ¶[0039], describes how multiple neural network modules may be used in a reinforcement learning technique to determine shared weights of one of the models.  Therefore, a full reading of Hermann and the rejection of claims 1 and 2 demonstrates that Hermann anticipates the claimed reinforcement learning technique that uses at least two neural network modules.  Specifically, in Hermann, the reinforcement learning training module trains the environment neural network module, the task neural network module, and the policy defining neural network module.  
The Applicant further submits that Hermann does not teach “pre-stored measurement data” because the observations of Hermann are “current observations.”  See Remarks pp. 10-11.  In response, the Examiner submits that, once Hermann’s “current observations” are captured from sensors, the observations become “pre-stored measurement data.”  Moreover, no functional distinction exists between a historical observation and a current observation.
The Applicant makes additional Remarks that echo the Remarks made on pp. 7-9.  Specifically, the Applicant submits that Hermann does not teach the use of a second neural network to train a first neural network.  See Remarks pp. 11-12.  Again, the Examiner points to cited ¶[0039].  That passage explicitly teaches a reinforcement training module to train several neural network modules.  In Hermann, this technique is used to characterize the state of an environment, which creates a ‘virtual environment.’  See Hermann abstract.  Also, Hermann teaches that the environment may be a simulated environment.  See ¶[0073].
The rejection of the dependent claims stands or falls with the rejection of the independent claims.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 2-6 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20210110115 A1 to Hermann et al. (hereinafter ‘HERMANN’).


Claim 2 (Currently Amended)
HERMANN discloses a method for reinforcement learning (see ¶[0001]-[0003]; a reinforcement learning system) using a virtual environment (see ¶[0073]; a simulated environment) generated by deep learning (see ¶[0004]; deep neural networks), the method performed by a computing device comprising a hardware processor and a memory storing one or more programs to be executed by the hardware processor (see abstract; computer programs encoded on a computer storage medium), the method comprising: receiving, by a second artificial intelligence module, pre-stored actual measurement data from a control environment (see ¶[0072] and [0097]; observations derived from data that is captured from sensors); 
learning, by the second artificial intelligence module, the second artificial neural network by determining a weight of the second artificial neural network (see ¶[0039]; train a neural network model using weights shared with one or more of the environment neural network model, the task neural network model, and the policy defining neural network model) including a multi layered perceptron based on the actual measurement data (see ¶[0078]; a feedforward neural network (e.g., a multi-layer perceptron)); 
performing, after the second artificial intelligence module learns the second artificial neural network, by a first artificial intelligence module, reinforcement learning of a first artificial neural network using the second artificial neural network as the virtual environment to determine a policy for maximizing an expected value  of the sum of rewards corresponding to action information (see ¶[0021]-[0022]; perform an iteration of a reinforcement learning technique to optimize a task-specific objective using the current reward and the current action selection output); 
determining, after the reinforcement learning of the first artificial neural network is completed, by the first artificial intelligence module, a control command by applying sensing information received from a sensor of the control environment to the first artificial neural network (see ¶[0055]; an electronic controller trained by reinforcement learning to control a system having multiple states, and, for each state, a set of actions to move from one of the states to next the [sic] state.  See also ¶[0061]; observations may be obtained from laser sensors); and 
providing, by the first artificial intelligence module, the control command to an actuator so that the actuator of the control environment is able to control a control target of the control environment according to the control command (see abstract and ¶[0061] & [0072]; an agent interacting with an environment that may be a robot or autonomous vehicle.  The actions may be control inputs to control the robot or autonomous vehicle using observations derived from sensors of the agent), 
wherein performing the reinforcement learning of the first artificial neural network determines the policy for maximizing the expected value of the sum of the rewards based on either of a Q-learning method and policy gradient (see ¶[0020] and [0107]; a task-specific update is determined by backpropagating gradients of the task-specific objective.  The reinforcement learning technique may be a policy gradient technique or an n-step Q learning technique).

Claim 3 (Original)
HERMANN discloses the method as set forth in claim 2.
HERMANN further discloses wherein the second artificial neural network comprises a plurality of nodes connected to each other in a matrix form (see ¶[0113]; the temporal autoencoder neural network may determine an intermediate representation using parameter matrices), and comprises an input layer to which learning data included in the actual measurement data is input, a hidden layer for applying the weight to the learning data input to the input layer, and an output layer for determining a value output from the hidden layer (see ¶[0004]; predict an output for a received input using deep neural networks that include hidden layers) as a control environment state prediction result (see ¶[0024]-[0025] and [0041]-[0042]; predict a next observation that includes a future state of the environment input data).

Claim 4 (Original)
HERMANN discloses the method as set forth in claim 3.
HERMANN further discloses wherein the learning data comprises the sensing information generated by sensing a control environment state of the control target at a specific point in time (see ¶[0060]; use perceptual feedback from the observations available at each time step) and the control command applied to each control target corresponding to the sensing information (see ¶[0055] and [0061]; an electronic controller is trained by reinforcement learning to control a system according to a state-action policy.  Observations can be obtained from laser sensors).

Claim 5 (Original)
HERMANN discloses the method as set forth in claim 4.
HERMANN further discloses wherein the actual measurement data further comprises label data, and wherein the label data comprises state information of the control environment measured after a predetermined time elapses after the control command is applied to the control target at the specific point in time (see abstract and ¶[0007] & [0060]; an observation characterizing the state of the environment.  Feedback is received from observations available at each time step).

Claim 6 (Original)
HERMANN discloses the method as set forth in claim 2.
HERMANN further discloses wherein learning the second artificial neural network comprises: performing, by the second artificial intelligence module, a forward propagation process for generating a control environment state prediction result based on learning data included in the actual measurement data (see ¶[0078]; determine a current combined embedding and select an action using a feedforward neural network.  See also ¶[0086]; predict modeling aspects of the environment); and 
performing a back propagation process for correcting the weight of the second artificial neural network based on an error value that is a difference between the control environment state prediction result generated through the forward propagation process and label data included in the actual measurement data (see ¶[0020], [0084], and [0118]-[0120]; backpropagate gradients of task-specific objectives.  Minimize a loss between the actual next observation and the predicted observation that is a mean-squared-error loss).


Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over US 20210110115 A1 to HERMANN et al. in view of US 20190279293 A1 to Tang et al. (hereinafter ‘TANG’).

Claim 7 (Original)
HERMANN discloses the method as set forth in claim 6.
HERMANN does not specifically disclose, but TANG discloses, when the control environment state prediction result is compared with the label data and then the difference between the control environment state prediction result and the label data is larger than a threshold value, performing the back propagation process performs the back propagation process for correcting the weight so that the difference converges within the threshold value (see ¶[0124]; when the attribute identification function includes a neural network, training can include updating weights and offsets of the neural network using a backpropagation algorithm.  Train the attribute function until a cost function based on prediction error is below a threshold value).
HERMANN discloses selecting actions using multi-modal inputs that includes machine learning (see ¶[0004]) with loss minimization (see ¶[0025]) including backpropagating gradients (see ¶[0020]).  TANG discloses machine learning with output estimation that includes a backpropagation algorithm that trains a cost function to be below a threshold error value.  It would have been obvious to minimize error as taught by TANG in the system executing the method of HERMANN with the motivation to reduce loss and improve accuracy.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over US 20210110115 A1 to HERMANN et al. in view of US 20150100530 A1 to Mnih et al. (hereinafter ‘MNIH’).

Claim 8 (Original)
HERMANN discloses the method as set forth in claim 2.
HERMANN further discloses wherein performing the reinforcement learning of the first artificial neural network comprises: providing, by the first artificial intelligence module, action information according to a policy to the second artificial intelligence module (see ¶[0038]-[0039]; the system includes a policy defining neural network module to define a policy for the computing system.  The policy defines the action to perform in the environment responsive to the environment input data and to the task input data); 
calculating, by the second artificial intelligence module, a next state and rewards for the action information by applying the action information to the second artificial neural network (see ¶[0020]-[0021] & [0039]; receive a current reward as a result of the agent performing the current action.  The training module trains the modules in response to reward data representing performance of one or more tasks); 
providing, by the second artificial intelligence module, the next state and the rewards to the first artificial intelligence module (see ¶[0083]; train the system iteratively over multiple training iterations.  See also ¶[0040]; combine first and second neural network modules to combine input data and output representation data).
HERMANN does not specifically disclose, but MNIH discloses, determining, by the first artificial intelligence module, through a Markov decision process, a policy in which the expected value of the sum of the rewards is maximized (see ¶[0049]; a large but finite Markov decision process in which each sequence is a distinct state).
HERMANN discloses selecting actions using multi-modal inputs that includes reinforcement learning.  MNIH discloses reinforcement learning that includes a Markov decision process to emulate processes and consider sequences of actions and observations.  It would have been obvious to use a Markov decision process as taught by MNIH in the system executing the method of HERMANN with the motivation to implement reinforcement learning with distinct states for actions and observations.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over US 20210110115 A1 to HERMANN et al. in view of US 20190014488 A1 to Tan et al. (hereinafter ‘TAN’).

Claim 9 (Original)
HERMANN discloses the method as set forth in claim 2.
HERMANN does not specifically disclose, but TAN discloses, wherein the Q-learning method may be either of Deep Q-Networks and Deep Double Q-networks (DDQN) (see ¶[0041]; the reinforcement learning system may use other learning networks, such as a deep Q-network).
HERMANN discloses selecting actions using multi-modal inputs that includes reinforcement learning with an n-step Q learning technique (see ¶[0107]).  TAN discloses a system and method for deep learning and wireless network optimizing using deep learning that includes reinforcement learning with a deep Q-network.  It would have been obvious to include the deep Q-network as taught by TAN in the system executing the method of HERMANN with the motivation to implement reinforcement learning using a Q learning technique.  

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over US 20210110115 A1 to HERMANN et al. in view of US 10133275 B1 to Kobilarov et al. (hereinafter ‘KOBILAROV’).

Claim 10 (Original)
HERMANN discloses the method as set forth in claim 2.
HERMANN does not specifically disclose, but KOBILAROV discloses, wherein the policy gradient is any one of Deep Deterministic Policy Gradient, Trust Region Policy Optimization, and Proximal Policy Optimization (PPO) (see col 20, ln 33-41; deep reinforcement learning may include Deep Q-networks and deep deterministic policy gradients).
HERMANN discloses selecting actions using multi-modal inputs that includes reinforcement learning with an n-step Q learning technique (see ¶[0107]).  KOBILAROV discloses trajectory generation using temporal logic that includes reinforcement learning with Deep Q-networks and deterministic policy gradients.  It would have been obvious to include the deep deterministic policy gradients as taught by KOBILAROV in the system executing the method of HERMANN with the motivation to implement reinforcement learning. 
 
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHARD N SCHEUNEMANN whose telephone number is (571)270-7947. The examiner can normally be reached M-F 9am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Patricia Munson, can be reached at 571-270-5396. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RICHARD N SCHEUNEMANN/Primary Examiner, Art Unit 3624