Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This office action is responsive to Applicant’s reply filed on 10/17/2022.
Claims 1 – 20 have been examined; wherein claims 1, 3 – 7, 9, 10, 12 – 16, 19, and 20 have been amended.
Claims 1 – 20 are being finally rejected.

Response to Amendment
The objection to the abstract is withdrawn in view of Applicant’s amendments.
The objections to claims 1 – 20 are withdrawn in view of Applicant’s amendments.

Response to Arguments
Applicant’s arguments with respect to claims 1, 9, and 10 have been considered but are moot in view of the new ground(s) of rejection necessitated by the amendments.  See Siao-Lei Ma et al. (CN 106205126-A).

Claim Objections
Claims 3 – 7, 12 – 16, 19, and 20 are objected to because of the following informalities:
Claim 3
	Line 4; change “each respective action” to --the respective action--.
Claim 4
	The claim depends from claim 3; thus, it inherits the issues of claim 3.
Claim 5
	Last line; change “selects” to --select--
Claim 6
	Line 2; change “batch of transition tuples” to --transition tuple batch--
	Line 6; change “the transition tuples” to --the respective transition tuples--
Claim 7
	Line 2; change “a respective transition tuple” to --the respective transition tuple--
	Line 7; change “a subsequent observation” and “a reward” to --the subsequent observation-- and --the reward-- respectively
Claim 12
	Line 4; change “each respective action” to --the respective action--.
Claim 13
	The claim depends from claim 12; thus, it inherits the issues of claim 12.
Claim 14
	Last line; change “selects” to --select--
Claim 15
	Line 2; change “batch of transition tuples” to --transition tuple batch--
	Line 6; change “the transition tuples” to --the respective transition tuples--
Claim 16
	Line 2; change “a respective transition tuple” to --the respective transition tuple--
	Line 7; change “a subsequent observation” and “a reward” to --the subsequent observation-- and --the reward-- respectively
Claim 19
	Line 4; change “each respective action” to --the respective action--.
Claim 20
	The claim depends from claim 19; thus, it inherits the issues of claim 19.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1 – 7, 9 – 16, and 18 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yuandong Tian et al. (NPL “ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games” version 1; hereinafter Tian) in view of Siao-Lei Ma et al. (CN 106205126-A; hereinafter Ma).
CN 106205126-A is not in the English language.  It was translated into English using the Espacenet website, and the translated CN 106205126-A is attached.

Claim 1
Tian teaches a method of training an action selection neural network to select actions to be performed by an agent interacting with an environment, wherein the action selection neural network has a plurality of network parameters and is configured to receive an input observation and to process the input observation in accordance with the network parameters to generate a network output that defines an action to be performed by the agent in response to the input observation (Tian; p. 1: Abstract, In this paper, we propose ELF, an Extensive, Lightweight and Flexible platform for fundamental reinforcement learning research…our platform is flexible in terms of environment-agent communication topologies, choices of RL methods, changes in game parameters…;  Fig. 1, p. 3: first half paragraph – last full paragraph, …During training, the consumers use the batch in various ways. For example, the actor takes the batch and returns the probabilities of actions (and values), then the actions are sampled and sent back…Each consumer knows the game environment identity from received batches, and typically contains one neural network model…We can assign one model to each game environment, or one-to-one (e.g., vanilla A3C [19]), in which each agent follows and updates its own copy of the model. Similarly, multiple environments can be assigned to a single model, or many-to-one (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…), and wherein the method comprises: 
obtaining a plurality of current observations, each current observation characterizing a current state of a respective one of a plurality of environment replicas (Tian; Fig. 1, p. 2: last half paragraph, …The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…); 
processing the current observations in parallel using the action selection neural network in accordance with current values of the network parameters to generate an action batch that includes, for each environment replica, a respective action to be performed by the agent in response to the current observation characterizing the current state of the environment replica (Tian; Fig. 1, p. 2: last half paragraph, …The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: last full paragraph, …Similarly, multiple environments can be assigned to a single model, or many-to-one (parallel) (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…); 
obtaining a transition tuple batch comprising a respective transition tuple for each of the environment replicas (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…), the respective transition tuple for each environment replica comprising: 
(i) a subsequent observation characterizing a subsequent state that the environment replica transitioned into as a result of the agent performing the respective action in the action batch for the environment replica (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…Before the training (or evaluation) starts, different consumers register themselves for batches with different history length…The batch received by the optimizer already contains the sampled actions from the previous steps, and can be used to drive reinforcement learning algorithms such as A3C…), and 
(ii) a reward generated as a result of the environment replica transitioning into the subsequent state (Tian; p. 6: last full paragraph, …For Mini-RTS, the agent only receives a reward when the game ends (±1 for win/loss). An average game of Mini-RTS lasts for around 4000 ticks, which results in 80 decisions for a frame skip of 50, showing that the game is indeed delayed in reward. For Capturing the Flag, we give intermediate rewards when the flag moves towards player’s own base (one score when the flag “touches down”)…); and 
training the action selection neural network on the transition tuple batch to update the current values of the network parameters using a reinforcement learning technique (Tian; p. 2: last half paragraph – p. 3: first half paragraph, …The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 4: first half paragraph, Reinforcement Learning backend. We propose a Python-based RL backend. It has a flexible design that decouples RL methods from models. Multiple baseline methods (e.g., A3C [19], Policy Gradient [28], Q-learning, Trust Region Policy Optimization [24], etc.) are implemented, mostly with very few lines of Python codes.)
But, Tian does not explicitly teach generating an observation batch as a single tensor that combines the plurality of current observations; processing the current observations in the single tensor in parallel using the action selection neural network in accordance with current values of the network parameters to generate an action batch.
However, Ma teaches 
generating an observation batch as a single tensor that combines the plurality of current observations (Ma; [0033] S1 Collect the GPS data of the vehicle, and extract the vehicle operation data of each road section at each moment, and generate a matrix M (tensor) according to the obtained vehicle operation data…);
processing the current observations in the single tensor in parallel using the action selection neural network in accordance with current values of the network parameters to generate an action batch (Ma; [0038] S2. Generate a space-time heat map for at least one day according to the matrix M…; [0040] S3. On the spatiotemporal heat map, a data set (X, Y) is generated by window sliding. [0043 – 0044] S4, construct a convolutional neural network model, and utilize the data set (X, Y) to train the convolutional neural network model…; [0067 – 0068] The invention adopts the convolutional neural network CNN to learn the heatmap, and abstracts the speed heatmap into a single vector v through the convolution process and the pooling process of the convolutional neural network, and the vehicle speed information of the future traffic network can be predicted through the vector v…)
Tian and Ma are analogous art because they are in the same field of endeavor, namely collecting and processing data.  Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Ma’s teachings into Tian’s invention to improve learning efficiency for large data by utilizing the TensorFlow platform, as suggested by Ma ([0024]).
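For illustration only, the data flow recited in claim 1 (combining the current observations of several environment replicas into one batch, selecting all actions in a single batched pass, and assembling a transition tuple batch) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Tian, Ma, or the application: `ToyEnv`, `policy`, and `collect_step` are hypothetical names, a nested Python list stands in for the single tensor, a linear scoring rule stands in for the action selection neural network, and the training step is omitted.

```python
from typing import List, Tuple


class ToyEnv:
    """Hypothetical environment replica whose state is an integer counter."""

    def __init__(self, start: int) -> None:
        self.state = start

    def observe(self) -> List[float]:
        # Observation characterizing the current state of this replica.
        return [float(self.state), 1.0]

    def step(self, action: int) -> Tuple[List[float], float]:
        # Apply the action, then return (subsequent observation, reward).
        self.state += action
        reward = 1.0 if action == 1 else 0.0
        return self.observe(), reward


def policy(obs_batch: List[List[float]], weights: List[float]) -> List[int]:
    """Stand-in for a batched forward pass: one call covers every row."""
    actions = []
    for row in obs_batch:
        score = sum(w * x for w, x in zip(weights, row))
        actions.append(1 if score >= 0 else 0)
    return actions


def collect_step(envs: List[ToyEnv], weights: List[float]):
    # Observation batch: one row per replica (stand-in for a single tensor).
    obs_batch = [env.observe() for env in envs]
    # Action batch: one action per replica, produced by a single policy call.
    action_batch = policy(obs_batch, weights)
    # Transition tuple batch: (subsequent observation, reward) per replica.
    transitions = [env.step(a) for env, a in zip(envs, action_batch)]
    return obs_batch, action_batch, transitions
```

In this sketch the policy is called once per step regardless of the number of replicas, which is the property the single-tensor limitation is directed to.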

Claim 2
Tian also teaches each environment replica is maintained inside of a separate process (Tian; p. 2: last half paragraph, Fig. 1 shows the simple architecture of ELF, which follows a canonical producer-consumer model. The producer plays N games, each in a single C thread…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, a game can also be played in a separate process.

Claim 3
Tian also teaches 
providing each respective action in the action batch to the separate process that maintains the environment replica corresponding to the respective action to cause the environment replica to transition into the subsequent state in parallel (Tian; Fig. 1, p. 2: last half paragraph, … The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: last full paragraph, …Similarly, multiple environments can be assigned to a single model, or many-to-one (parallel) (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, a game can also be played in a separate process; and 
obtaining, from each of the separate processes, the subsequent observation and the reward for the environment replica maintained inside of the separate process (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…Before the training (or evaluation) starts, different consumers register themselves for batches with different history length…The batch received by the optimizer already contains the sampled actions from the previous steps, and can be used to drive reinforcement learning algorithms such as A3C…)

Claim 4
Tian also teaches after the subsequent observation and the reward have been obtained from all of the separate processes, generating the transition tuple batch from the data obtained from the separate processes (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, a game can also be played in a separate process.

Claim 5
Tian teaches executing an inference, wherein the inference performs batched inference for the action selection neural network on the current observations in the single tensor to generate a respective network output for each current observation and selects a respective action from the respective network output (Tian; p. 3: last full paragraph, … Each consumer knows the game environment identity from received batches, and typically contains one neural network model. The models of different consumers may or may not share parameters, might update the weights, might reside in different processes or even on different machines. This architecture offers flexibility for switching topologies between game environments and models… Similarly, multiple environments can be assigned to a single model, or many-to-one (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction (inference) to better utilize GPUs…)
Ma teaches an inference subgraph of a computation graph (Ma; [0038] S2. Generate a space-time heat map for at least one day according to the matrix M…; [0040] S3. On the spatiotemporal heat map, a data set (X, Y) is generated by window sliding. [0043 – 0044] S4, construct a convolutional neural network model, and utilize the data set (X, Y) to train the convolutional neural network model…; 
[0086] The fourth step is to build a convolutional neural network model.
Keras is a deep learning framework that can be based on Theano and TensorFlow. Keras is very simple to build a deep learning model by superimposing training layers, and Keras can call the system GPU for model calculation through Theano or TensorFlow, so Keras is selected as the building block in the example…)  The TensorFlow platform builds data flow graphs to define how data moves through the graph.  The motivation for incorporating Ma into Tian is the same as the motivation in claim 1.

Claim 6
Tian teaches executing a training that takes as input the transition tuple batch and the current values of the network parameters and applies the reinforcement learning technique to the transition tuples in the transition tuple batch to generate update values of the network parameters (Tian; p. 2: last half paragraph – p. 3: first half paragraph, …The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 4: first half paragraph, Reinforcement Learning backend. We propose a Python-based RL backend. It has a flexible design that decouples RL methods from models. Multiple baseline methods (e.g., A3C [19], Policy Gradient [28], Q-learning, Trust Region Policy Optimization [24], etc.) are implemented, mostly with very few lines of Python codes.)
Ma teaches a training subgraph of the computation graph (Ma; [0038] S2. Generate a space-time heat map for at least one day according to the matrix M…; [0040] S3. On the spatiotemporal heat map, a data set (X, Y) is generated by window sliding. [0043 – 0044] S4, construct a convolutional neural network model, and utilize the data set (X, Y) to train the convolutional neural network model…; 
[0086] The fourth step is to build a convolutional neural network model.
Keras is a deep learning framework that can be based on Theano and TensorFlow. Keras is very simple to build a deep learning model by superimposing training layers, and Keras can call the system GPU for model calculation through Theano or TensorFlow, so Keras is selected as the building block in the example…)  The TensorFlow platform builds data flow graphs to define how data moves through the graph.  The motivation for incorporating Ma into Tian is the same as the motivation in claim 1.

Claim 7
Tian also teaches 
issuing respective calls in parallel to each of the separate processes with the actions for the environment replicas (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, a game can also be played in a separate process; 
waiting until a subsequent observation and a reward are obtained from each of the separate processes in response to the respective calls (Tian; p. 6: last full paragraph, Rewards. For Mini-RTS, the agent only receives a reward when the game ends (±1 for win/loss)…; p. 7: third full paragraph, History length. History length T affects the convergence speed, as well as the final performance of A3C (Fig. 5). While Vanilla A3C [19] uses T = 5 for Atari games, the reward in Mini-RTS is more delayed…); and 
after determining that a subsequent observation and a reward have been obtained from each of the separate processes, generating the transition tuple batch using the obtained subsequent observations and rewards (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…)
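For illustration only, the three steps recited in claim 7 (issuing the calls in parallel, waiting until every replica has replied, and only then assembling the transition tuple batch) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Tian or the application: `step_replica` and `step_all` are hypothetical names, and a thread pool from the Python standard library stands in for the separate processes recited in the claim.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import List, Tuple


def step_replica(state: int, action: int) -> Tuple[int, float]:
    """Hypothetical replica step returning (subsequent observation, reward)."""
    next_state = state + action
    reward = float(action)
    return next_state, reward


def step_all(states: List[int], actions: List[int]) -> List[Tuple[int, float]]:
    # Worker threads stand in for the separate processes (an assumption).
    with ThreadPoolExecutor(max_workers=max(1, len(states))) as pool:
        # Issue one call per replica without waiting for earlier calls.
        futures = [pool.submit(step_replica, s, a)
                   for s, a in zip(states, actions)]
        # Block until a reply has been obtained from every replica.
        wait(futures)
        # Only now assemble the transition tuple batch, in replica order.
        return [f.result() for f in futures]
```

Collecting results from the list of futures, rather than in completion order, keeps each transition tuple aligned with the replica that produced it.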

Claim 9
This is a system version of the method rejected in claim 1; therefore, it is rejected for the same reasons.  Furthermore, Tian also teaches a system comprising one or more computers and one or more storage devices storing instructions (Tian; p. 6: first full paragraph, We run ELF on a single server with a different number of CPU cores to test the efficiency of parallelism…)

Claim 10
This is a non-transitory computer-readable storage media version of the method rejected in claim 1; therefore, it is rejected for the same reasons. Furthermore, Tian also teaches one or more non-transitory computer-readable storage media storing instructions (Tian; p. 6: first full paragraph, We run ELF on a single server with a different number of CPU cores to test the efficiency of parallelism…)

Claim 11
This limitation is already discussed in claim 2; therefore, it is rejected for the same reasons.

Claim 12
This limitation is already discussed in claim 3; therefore, it is rejected for the same reasons.

Claim 13
This limitation is already discussed in claim 4; therefore, it is rejected for the same reasons.

Claim 14
This limitation is already discussed in claim 5; therefore, it is rejected for the same reasons.

Claim 15
This limitation is already discussed in claim 6; therefore, it is rejected for the same reasons.

Claim 16
This limitation is already discussed in claim 7; therefore, it is rejected for the same reasons.

Claim 18
This limitation is already discussed in claim 2; therefore, it is rejected for the same reasons.

Claim 19
This limitation is already discussed in claim 3; therefore, it is rejected for the same reasons.

Claim 20
This limitation is already discussed in claim 4; therefore, it is rejected for the same reasons.

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Tian and Ma as applied to claims 1 and 9 above, and further in view of Lei Tai et al. (NPL “A Survey of Deep Network Solutions for Learning Control in Robotics: From Reinforcement to Imitation”; hereinafter Tai).

Claim 8
Tian and Ma do not explicitly teach the reinforcement learning technique is a proximal policy optimization (PPO) algorithm.
However, Tai teaches the reinforcement learning technique is a proximal policy optimization (PPO) algorithm (Tai; P. 4: right column, first paragraph in section “C. DRL Algorithms”, … In the following, we cover the most influential DRL algorithms. 
p. 7: left column, bullet point #8; 8) PPO (Schulman et al., 2017): Instead of reformulating a hard constraint problem as in TRPO (Eq. 51 and 52), PPO solves the original soft constraint optimization (Eq. 49) with 1st-order SGD, adapting C according to the KL divergence. Since it is much simpler implementation-wise compared to TRPO and gives a good performance, PPO has become the default DRL algorithm at OpenAI. A distributed version of PPO has also been proposed…); that is, PPO is one of the deep reinforcement learning (DRL) algorithms.
Tian, Ma, and Tai are analogous art because they are in the same field of endeavor, namely utilization of reinforcement learning for tasks.  Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Tai’s teachings into the Tian/Ma invention to also utilize proximal policy optimization (PPO) for a training session, where PPO offers better performance, as suggested by Tai (p. 7: left column, bullet point #8).

Claim 17
This limitation is already discussed in claim 8; therefore, it is rejected for the same reasons.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CUONG V LUU whose telephone number is (571)270-1733. The examiner can normally be reached 6:30 AM - 3:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hyung S. Sough can be reached on (571) 272-6799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CUONG V LUU/ Examiner, Art Unit 2192
/S. Sough/ SPE, AU 2192/2194