Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
An effective filing date of 11/26/2019 is acknowledged.
Claims 1 – 20 are pending as per preliminary amendments dated 11/26/2019.

Specification
The abstract of the disclosure is objected to because it includes legal phraseology, i.e., “comprising”. Such form and legal phraseology should be avoided.

Claim Objections
Claims 1 – 20 are objected to because of the following informalities:  
Claim 1
	Lines 18 – 19; change “the action for the environment replica in the action batch” to --the respective action in the action batch for the environment replica--.
	Line 22; change “the batch of transition tuples” to --the transition tuple batch--.  
Claims 2 and 8
	These claims depend from claim 1; therefore, they inherit the issues of claim 1.
Claim 3
	Lines 1 – 2; change “obtaining a transition tuple batch comprising a respective transition tuple” to --obtaining the transition tuple batch comprising the respective transition tuple --.
	Line 3; “the respective actions” lacks antecedent basis.
	Line 5; “the respective subsequent states” lacks antecedent basis.
	Line 6; “the processes” lacks antecedent basis.
Claim 4
	Last line; remove “the” in front of “data”.
	All “the processes” lack antecedent basis.
Claim 5
	Line 3; insert --the-- before “current values”.  And, change “an action batch” to --the action batch--.
	Last line; change “each network output” to --the respective network output--.
Claim 6
	Line 2; change “a” to --the--.
	Lines 4 – 5; “the training tuple batch” lacks antecedent basis.
	Line 6; “the training tuples” and “the batch” lack antecedent basis.
Claim 7
	Line 1; change “a” to --the--.
	Line 4; “external processes” lacks antecedent basis.
	All “the processes” lack antecedent basis.
Claim 9
	Line 9; change “the method” to --the operations--.
	Lines 21 – 22; change “the action for the environment replica in the action batch” to --the respective action in the action batch for the environment replica--.
	Line 25; change “the batch of transition tuples” to --the transition tuple batch--.  
Claim 10
	Line 9; change “the method” to --the operations--.
	Lines 21 – 22; change “the action for the environment replica in the action batch” to --the respective action in the action batch for the environment replica--.
	Line 25; change “the batch of transition tuples” to --the transition tuple batch--. 
Claims 11 and 17
	These claims depend from claim 9; therefore, they inherit the issues of claim 9.
Claim 12
	Lines 1 – 2; change “obtaining a transition tuple batch comprising a respective transition tuple” to --obtaining the transition tuple batch comprising the respective transition tuple --.
	Line 3; “the respective actions” lacks antecedent basis.
	Line 5; “the respective subsequent states” lacks antecedent basis.
	Line 6; “the processes” lacks antecedent basis.
Claim 13
	Last line; remove “the” in front of “data”.
	All “the processes” lack antecedent basis.
Claim 14
	Lines 2 – 3; insert --the-- before “current values”.  And, change “an action batch” to --the action batch--.
	Last line; change “each network output” to --the respective network output--.
Claim 15
	Line 2; change “a” to --the--.
	Lines 4 – 5; “the training tuple batch” lacks antecedent basis.
	Line 6; “the training tuples” and “the batch” lack antecedent basis.
Claim 16
	Line 1; change “a” to --the--.
	Line 3; “external processes” lacks antecedent basis.
	All “the processes” lack antecedent basis.
Claim 18
	This claim depends from claim 10; therefore, it inherits the issues of claim 10.
Claim 19
	Line 2; change “obtaining a transition tuple batch comprising a respective transition tuple” to --obtaining the transition tuple batch comprising the respective transition tuple --.
	Line 4; “the respective actions” lacks antecedent basis.
	Line 6; “the respective subsequent states” lacks antecedent basis.
	Line 7; “the processes” lacks antecedent basis.
Claim 20
	Last line; remove “the” in front of “data”.
	All “the processes” lack antecedent basis.
Appropriate correction is required.

Claim Rejections - 35 USC § 102
	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1 – 4, 7, 9 – 13, 16, and 18 – 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yuandong Tian et al. (NPL “ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games” version 1; hereinafter Tian).

Claim 1
Tian teaches a method of training an action selection neural network to select actions to be performed by an agent interacting with an environment, wherein the action selection neural network has a plurality of network parameters and is configured to receive an input observation and to process the input observation in accordance with the network parameters to generate a network output that defines an action to be performed by the agent in response to the input observation (Tian; p. 1: Abstract, In this paper, we propose ELF, an Extensive, Lightweight and Flexible platform for fundamental reinforcement learning research…our platform is flexible in terms of environment-agent communication topologies, choices of RL methods, changes in game parameters…;  Fig. 1, p. 3: first half paragraph – last full paragraph, …During training, the consumers use the batch in various ways. For example, the actor takes the batch and returns the probabilities of actions (and values), then the actions are sampled and sent back…Each consumer knows the game environment identity from received batches, and typically contains one neural network model…We can assign one model to each game environment, or one-to-one (e.g., vanilla A3C [19]), in which each agent follows and updates its own copy of the model. Similarly, multiple environments can be assigned to a single model, or many-to-one (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…), and wherein the method comprises: 
obtaining an observation batch comprising a plurality of current observations, each current observation characterizing a current state of a respective one of a plurality of environment replicas (Tian; Fig. 1, p. 2: last half paragraph, …The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…); 
processing the current observations in the observation batch in parallel using the action selection neural network in accordance with current values of the network parameters to generate an action batch that includes, for each environment replica, a respective action to be performed by the agent in response to the current observation characterizing the current state of the environment replica (Tian; Fig. 1, p. 2: last half paragraph, …The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: last full paragraph, …Similarly, multiple environments can be assigned to a single model, or many-to-one (parallel) (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…); 
obtaining a transition tuple batch comprising a respective transition tuple for each of the environment replicas (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…), the respective transition tuple for each environment replica comprising: 
(i) a subsequent observation characterizing a subsequent state that the environment replica transitioned into as a result of the agent performing the action for the environment replica in the action batch (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…Before the training (or evaluation) starts, different consumers register themselves for batches with different history length…The batch received by the optimizer already contains the sampled actions from the previous steps, and can be used to drive reinforcement learning algorithms such as A3C…), and 
(ii) a reward generated as a result of the environment replica transitioning into the subsequent state (Tian; p. 6: last full paragraph, …For Mini-RTS, the agent only receives a reward when the game ends (±1 for win/loss). An average game of Mini-RTS lasts for around 4000 ticks, which results in 80 decisions for a frame skip of 50, showing that the game is indeed delayed in reward. For Capturing the Flag, we give intermediate rewards when the flag moves towards player’s own base (one score when the flag “touches down”)…); and
training the action selection neural network on the batch of transition tuples to update the current values of the network parameters using a reinforcement learning technique (Tian; p. 2: last half paragraph – p. 3: first half paragraph, …The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 4: first half paragraph, Reinforcement Learning backend. We propose a Python-based RL backend. It has a flexible design that decouples RL methods from models. Multiple baseline methods (e.g., A3C [19], Policy Gradient [28], Q-learning, Trust Region Policy Optimization [24], etc.) are implemented, mostly with very few lines of Python codes.)
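For illustration only, the data flow recited in claim 1 (observation batch, batched action selection, per-replica step, transition tuple batch) can be sketched as follows. This is a minimal sketch under stated assumptions and is neither the applicant's nor Tian's implementation: `ToyReplica`, `select_actions`, `collect_transition_batch`, and the one-layer greedy policy are hypothetical names introduced here, and the parameter update step is omitted.

```python
import numpy as np


class ToyReplica:
    """Hypothetical environment replica: scalar state, reward is -|state|."""

    def __init__(self, seed):
        rng = np.random.default_rng(seed)
        self.state = float(rng.normal())

    def observe(self):
        # Returns a fresh array so earlier observations are not mutated.
        return np.array([self.state], dtype=np.float64)

    def step(self, action):
        # Action 1 moves the state up by 0.1, action 0 moves it down by 0.1.
        self.state += 0.1 if action == 1 else -0.1
        reward = -abs(self.state)
        return self.observe(), reward


def select_actions(obs_batch, weights):
    """One batched forward pass of a toy linear policy; one action per row."""
    logits = obs_batch @ weights        # shape (N, 2)
    return np.argmax(logits, axis=1)    # greedy choice, shape (N,)


def collect_transition_batch(replicas, weights):
    """Build (obs, action, next_obs, reward) arrays, one row per replica."""
    obs_batch = np.stack([r.observe() for r in replicas])       # (N, 1)
    action_batch = select_actions(obs_batch, weights)           # (N,)
    stepped = [r.step(int(a)) for r, a in zip(replicas, action_batch)]
    next_obs = np.stack([s[0] for s in stepped])                 # (N, 1)
    rewards = np.array([s[1] for s in stepped])                  # (N,)
    return obs_batch, action_batch, next_obs, rewards
```

In this sketch every row of each returned array belongs to the same replica, so row i of the four arrays together plays the role of one transition tuple.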

Claim 2
Tian also teaches that each environment replica is maintained inside of a separate process (Tian; p. 2: last half paragraph, Fig. 1 shows the simple architecture of ELF, which follows a canonical producer-consumer model. The producer plays N games, each in a single C thread…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, each game can also be played in a separate process.

Claim 3
Tian also teaches 
providing each of the respective actions in the action batch to the process that maintains the environment replica corresponding to the action to cause the environment replicas to transition into the respective subsequent states in parallel (Tian; Fig. 1, p. 2: last half paragraph, … The producer plays N games, each in a single C thread. When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: last full paragraph, …Similarly, multiple environments can be assigned to a single model, or many-to-one (parallel) (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction to better utilize GPUs…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…), that is, each game can also be played in a separate process; and
obtaining, from each of the processes, the subsequent observation and the reward for the environment replica maintained inside of the process (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…Before the training (or evaluation) starts, different consumers register themselves for batches with different history length…The batch received by the optimizer already contains the sampled actions from the previous steps, and can be used to drive reinforcement learning algorithms such as A3C…)

Claim 4
Tian also teaches, after the subsequent observation and the reward have been obtained from all of the processes, generating the transition tuple batch from the data obtained from the processes (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…); that is, each game can also be played in a separate process.

Claim 7
Tian also teaches 
issuing respective calls in parallel to each of the external processes with the actions for the environment replicas (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 3: first full paragraph, …Process-level parallelism will also introduce extra data exchange overhead between processes and increase complexity to framework design…), that is, each game can also be played in a separate process;
waiting until a subsequent observation and a reward are obtained from each of the processes in response to the respective calls (Tian; p. 6: last full paragraph, Rewards. For Mini-RTS, the agent only receives a reward when the game ends (±1 for win/loss)…; p. 7: third full paragraph, History length. History length T affects the convergence speed, as well as the final performance of A3C (Fig. 5). While Vanilla A3C [19] uses T = 5 for Atari games, the reward in Mini-RTS is more delayed…); and 
after determining that a subsequent observation and a reward have been obtained from each of the processes, generating the transition tuple batch using the obtained subsequent observations and rewards (Tian; Fig. 1, p. 2: last half paragraph – p. 3: first half paragraph, …When a batch of M current game states are ready (M < N), the corresponding games are blocked and the batch are sent to the Python side via the daemon. The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…)
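For illustration only, the issue-calls, wait, then-batch pattern recited in claim 7 can be sketched as follows. This is a minimal sketch under stated assumptions: worker threads stand in for the external processes, and `step_replica` and `step_all` are hypothetical names with made-up return values; none of this is taken from Tian or from the application.

```python
from concurrent.futures import ALL_COMPLETED, ThreadPoolExecutor, wait


def step_replica(replica_id, action):
    """Hypothetical stand-in for a call into the worker owning one replica."""
    next_obs = replica_id * 10 + action      # toy "subsequent observation"
    reward = 1.0 if action == 1 else 0.0     # toy reward
    return next_obs, reward


def step_all(actions):
    """Issue one call per replica, block until all return, then batch."""
    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        # Respective calls are submitted for every replica before any wait.
        futures = [pool.submit(step_replica, i, a)
                   for i, a in enumerate(actions)]
        # Block until every replica has produced an observation and reward.
        wait(futures, return_when=ALL_COMPLETED)
        # Futures stay in submission order, so row i belongs to replica i.
        return [f.result() for f in futures]
```

Calling `step_all([1, 0, 1])` returns one `(next_obs, reward)` pair per replica only after all three calls have completed, which is the point at which a transition tuple batch could be assembled.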

Claim 9
This claim is the system counterpart of rejected method claim 1; therefore, it is rejected for the same reasons. Furthermore, Tian also teaches a system comprising one or more computers and one or more storage devices storing instructions (Tian; p. 6: first full paragraph, We run ELF on a single server with a different number of CPU cores to test the efficiency of parallelism…)

Claim 10
This claim is the non-transitory computer-readable storage media counterpart of rejected method claim 1; therefore, it is rejected for the same reasons. Furthermore, Tian also teaches one or more non-transitory computer-readable storage media storing instructions (Tian; p. 6: first full paragraph, We run ELF on a single server with a different number of CPU cores to test the efficiency of parallelism…)

Claim 11
This limitation is already discussed in claim 2; therefore, it is rejected for the same reasons.

Claim 12
This limitation is already discussed in claim 3; therefore, it is rejected for the same reasons.

Claim 13
This limitation is already discussed in claim 4; therefore, it is rejected for the same reasons.

Claim 16
This limitation is already discussed in claim 7; therefore, it is rejected for the same reasons.

Claim 18
This limitation is already discussed in claim 2; therefore, it is rejected for the same reasons.

Claim 19
This limitation is already discussed in claim 3; therefore, it is rejected for the same reasons.

Claim 20
This limitation is already discussed in claim 4; therefore, it is rejected for the same reasons.

Claim Rejections - 35 USC § 103
	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 5, 6, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Tian as applied to claims 2 and 11 above, and further in view of Mohammad Babaeizadeh (NPL “Reinforcement Learning Through Asynchronous Advantage Actor-Critic on a GPU”; hereinafter Mohammad; IDS filed on 03/16/2020.)

Claim 5
Tian teaches executing an inference, wherein the inference performs batched inference for the action selection neural network on the current observations in the observation batch to generate a respective network output for each current observation and selects a respective action from each network output (Tian; p. 3: last full paragraph, … Each consumer knows the game environment identity from received batches, and typically contains one neural network model. The models of different consumers may or may not share parameters, might update the weights, might reside in different processes or even on different machines. This architecture offers flexibility for switching topologies between game environments and models… Similarly, multiple environments can be assigned to a single model, or many-to-one (e.g., BatchA3C [33] or GA3C [1]), where the model can perform batched forward prediction (inference) to better utilize GPUs…)
But, Tian does not explicitly teach an inference subgraph of a computation graph.
However, Mohammad teaches an inference subgraph of a computation graph (Mohammad; p. 1: first full paragraph, … Deep Neural Networks (DNNs) as function approximators for value and policy functions, unleashing a rapid series of advancements…; p. 1: last half paragraph, To systematically investigate these issues, we implement both CPU and GPU versions of A3C in TensorFlow (TF) (Abadi et al., 2015), optimizing each for efficient system utilization and to approximately replicate published scores in the Atari 2600 environment…; also, see abstract.)
Tian and Mohammad are analogous art because they are in the same field of endeavor, namely the use of reinforcement learning for tasks. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Mohammad's teachings into Tian's invention to also utilize TensorFlow to optimize system utilization, as suggested by Mohammad (abstract and p. 1: last half paragraph).

Claim 6
Tian teaches executing a training that takes as input the training tuple batch and the current values of the network parameters and applies the reinforcement learning technique to the training tuples in the batch to generate update values of the network parameters (Tian; p. 2: last half paragraph – p. 3: first half paragraph, …The consumers (e.g., actor, optimizer or others) get batched experience with history information via a Python/C++ interface and send back the replies to the blocked batch of the games, which are waiting for the next action and/or values, so that they can proceed…; p. 4: first half paragraph, Reinforcement Learning backend. We propose a Python-based RL backend. It has a flexible design that decouples RL methods from models. Multiple baseline methods (e.g., A3C [19], Policy Gradient [28], Q-learning, Trust Region Policy Optimization [24], etc.) are implemented, mostly with very few lines of Python codes.)
Mohammad teaches a training subgraph of the computation graph (Mohammad; p. 1: first full paragraph, … Deep Neural Networks (DNNs) as function approximators for value and policy functions, unleashing a rapid series of advancements…; p. 1: last half paragraph, To systematically investigate these issues, we implement both CPU and GPU versions of A3C in TensorFlow (TF) (Abadi et al., 2015), optimizing each for efficient system utilization and to approximately replicate published scores in the Atari 2600 environment…; also, see abstract.) Motivation for incorporating Mohammad into Tian is the same as motivation in claim 5.

Claim 14
This limitation is already discussed in claim 5; therefore, it is rejected for the same reasons.

Claim 15
This limitation is already discussed in claim 6; therefore, it is rejected for the same reasons.

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Tian as applied to claims 1 and 9 above, and further in view of Lei Tai et al. (NPL “A Survey of Deep Network Solutions for Learning Control in Robotics: From Reinforcement to Imitation”; hereinafter Tai.)

Claim 8
Tian does not explicitly teach the reinforcement learning technique is a proximal policy optimization (PPO) algorithm.
However, Tai teaches the reinforcement learning technique is a proximal policy optimization (PPO) algorithm (Tai; P. 4: right column, first paragraph in section “C. DRL Algorithms”, … In the following, we cover the most influential DRL algorithms. 
p. 7: left column, bullet point #8; 8) PPO (Schulman et al., 2017): Instead of reformulating a hard constraint problem as in TRPO (Eq. 51 and 52), PPO solves the original soft constraint optimization (Eq. 49) with 1st-order SGD, adapting C according to the KL divergence. Since it is much simpler implementation-wise compared to TRPO and gives a good performance, PPO has become the default DRL algorithm at OpenAI. A distributed version of PPO has also been proposed…); that is, PPO is one of the deep reinforcement learning (DRL) algorithms.
Tian and Tai are analogous art because they are in the same field of endeavor, namely the use of reinforcement learning for tasks. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Tai's teachings into Tian's invention to also utilize proximal policy optimization (PPO) for a training session, where PPO offers better performance, as suggested by Tai (p. 7: left column, bullet point #8).
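For illustration only, the step quoted from Tai above ("adapting C according to the KL divergence") can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `adapt_kl_coefficient` is hypothetical, and the 1.5 thresholds and factor of 2 are the heuristic values reported in the PPO paper by Schulman et al. (2017) for its adaptive KL penalty variant; they do not appear in the passage of Tai quoted above.

```python
def adapt_kl_coefficient(coef, measured_kl, target_kl):
    """Adjust the KL penalty coefficient after one policy update.

    coef        -- current penalty coefficient (the "C" in the quoted text)
    measured_kl -- mean KL divergence between the old and updated policies
    target_kl   -- desired KL divergence per update (assumed hyperparameter)
    """
    if measured_kl > 1.5 * target_kl:
        # The policy moved too far: penalize divergence more strongly.
        return coef * 2.0
    if measured_kl < target_kl / 1.5:
        # The policy barely moved: relax the penalty.
        return coef / 2.0
    # Within the tolerated band: leave the coefficient unchanged.
    return coef
```

The adjusted coefficient would then weight the KL term subtracted from the surrogate objective in the next update, which is optimized with first-order stochastic gradient methods.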

Claim 17
This limitation is already discussed in claim 8; therefore, it is rejected for the same reasons.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CUONG V LUU whose telephone number is (571)270-1733. The examiner can normally be reached 7:00 AM - 4:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hyung S. Sough, can be reached at (571) 272-6799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CUONG V LUU/Examiner, Art Unit 2192
/S. Sough/SPE, AU 2192/2194