DETAILED ACTION
This action is in response to the claims filed 12/28/2021 for application 16/188,123. Claims 1, 10, 16 are amended. Claims 1-20 are currently pending.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12/28/2021 has been entered.
 Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are: 
policy function approximator is configured to calculate in claim 1.
value function approximator is configured to calculate in claim 6. 
virtual environment is configured to output in claim 10. 
trained reinforcement learning agent configured to receive in claim 10.
action performance module is configured to concurrently perform in claim 10. 
display module is configured to cause in claim 10. 
trained state value function approximator is configured to calculate in claim 13.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. ("Human-level control through deep reinforcement learning", cited by Applicant in the IDS filed on 02/08/2019, hereinafter "Mnih") in view of Sharma et al. ("Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement learning", cited by Applicant in the IDS filed 02/11/2019, hereinafter "Sharma") further in view of Pandey et al. ("Reinforcement Learning by Comparing Immediate Reward", hereinafter "Pandey") and further in view of Weaver et al. ("The Optimal Reward Baseline for Gradient-Based Reinforcement Learning", hereinafter "Weaver").
Regarding claim 1, Mnih teaches A computer-implemented method for training a reinforcement learning agent to interact with an environment comprising:
instantiating a policy function approximator, wherein the policy function approximator is configured to (“We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function [image: media_image1.png]” [pg. 529, bottom left column – top right column; Action-value function would be used to calculate action probability values where s is the state of the environment.]), for each discrete action out of a plurality of discrete actions within the environment (“We consider tasks in which an agent interacts with an environment, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, ..., K}.” [pg. 534, § Algorithm, right col, ¶4, lines 1-4; note: the action value function disclosed by Mnih would calculate estimated action probabilities and these would correspond to actions that the agent selects in an environment.])
for each discrete action of the plurality of discrete actions (“At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, ..., K}.” [pg. 534, § Algorithm, right col, ¶4, lines 1-4])
when the environment is in the first state, within the environment in dependence on the initial estimated action probabilities (“A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to -0.9.” [pg. 537, Extended Data Figure 2, b.])
a baseline reward value (“we clipped all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged.” [pg. 534, § Training Details, lines 8-9]),
wherein: the values of the updated estimated action probabilities which correspond to the concurrently performed two or more of the plurality of discrete actions are greater than the respective values of the initial estimated action probabilities (“A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to -0.9. This reflects the fact that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of -1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1.” [pg. 537, Extended Data Figure 2, b.; the up action staying high would correspond to a greater value than the expected.]);
and the values of the updated estimated action probabilities which do not correspond to the concurrently performed two or more of the plurality of discrete actions are less than the respective values of the initial estimated action probabilities (“A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to -0.9. This reflects the fact that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of -1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1.” [pg. 537, Extended Data Figure 2, b.; the down action falling to -0.9 would correspond to a value less than the expected.]).
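For illustration only, the reward clipping quoted from Mnih above can be sketched in Python as follows. The function name clip_reward is hypothetical and does not appear in Mnih; the sketch shows only the quoted behavior, in which positive rewards are clipped at 1, negative rewards are clipped at -1, and a reward of 0 is left unchanged.

```python
def clip_reward(reward: float) -> float:
    """Clip a raw reward into the range [-1, 1]; a reward of 0 is unchanged."""
    return max(-1.0, min(1.0, reward))


if __name__ == "__main__":
    for raw in (5.0, -3.0, 0.0):
        print(raw, "->", clip_reward(raw))
```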
	Although Mnih teaches a plurality of discrete actions, the reference fails to go into details regarding for each discrete action out of a plurality of discrete actions concurrently performable by the reinforcement learning agent within the environment, independently calculate an estimated action probability of performing the discrete action, in dependence on a given state of the environment;
	in response to concurrently performing the two or more of the plurality of discrete actions, receiving a reward value
	Sharma teaches for each discrete action out of a plurality of discrete actions concurrently performable by the reinforcement learning agent within the environment, independently calculate an estimated action probability of performing the discrete action, in dependence on a given state of the environment (“Figure 2 visualizes a factoring of X over the complete Atari action space. Any action in the complete Atari action space can be represented as a tuple (hi, vi, fi) with vi ∈ {go up, go down, don’t move vertically}, hi ∈ {go left, go right, don’t move horizontally} and fi ∈ {fire, don’t fire}. The choice of the factors over which X is decomposed depends on the set of possible actions, for a given task. This decomposed representation of X allows the DRL agent to learn X corresponding to multiple actions simultaneously, while executing a single action. When the action a = up-right-fire is executed, the parameters corresponding to the individual factors of Xa: up, right and fire get updated. Hence Xup-fire, Xright-fire and Xup-left are also adjusted, and not just the Xa. Let S denote set of all states in an MDP and A denote the discrete action set for a DRL agent. We claim that often, action spaces A are compositional and thus allow the decomposition of any action a ∈ A into n independent action-factors [a1, a2, · · · , an] such that ai ∈ Ai, where Ai is the set of values that factor i can take. We claim that instead of modeling X over A, the agent would be better off, modeling the individual components of X over the factor-spaces Ai. These individual components of X are realized using independent output layers (referred to as factor-layers hereafter) of a neural network f1, f2,..., fn where fi corresponds to Ai and has size |Ai|.” [pg. 3-4, § 3 Factored Action Representations for Deep Reinforcement Learning, ¶1; See further: “The experiment is as follows: A trained agent is taken. With probability 1 − ε, the agent samples actions from the learned policy. With probability ε, it samples actions uniformly at random from the best k actions.” [pg. 8, § 5.1 Uniformly Random from best-K analysis, ¶1]]);
	in response to concurrently performing the two or more of the plurality of discrete actions, receiving a reward value (“Consider a DRL agent that executes the action go diagonally up and left and gets some reward corresponding to this action. The key insight in this work is that this feedback can be used to learn not only about the go diagonally up and left action but also the actions go up and go left. Hence, every time a diagonal step is executed, it is possible to learn about the individual action factors as well.” [pg. 2, § Introduction, ¶2])
	Mnih and Sharma are both in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s teachings to train a reinforcement learning agent to perform discrete actions simultaneously as taught by Sharma. One would have been motivated to make this modification in order to allow an agent to learn multiple actions simultaneously. [Abstract, Sharma]
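For illustration only, the factoring of the Atari action space quoted from Sharma above can be sketched in Python as follows. The constant and function names are hypothetical and are not taken from Sharma; the factor values are those recited in the quoted passage. The sketch shows that three horizontal, three vertical, and two fire choices compose into 18 composite actions, and that each composite action decomposes into one index per factor.

```python
from itertools import product

# Factor values as recited in the quoted passage; identifiers are hypothetical.
HORIZONTAL = ("go left", "go right", "don't move horizontally")
VERTICAL = ("go up", "go down", "don't move vertically")
FIRE = ("fire", "don't fire")


def composite_actions():
    """Enumerate every (h, v, f) tuple in the factored action space."""
    return list(product(HORIZONTAL, VERTICAL, FIRE))


def factor_indices(action):
    """Decompose a composite action into one index per factor."""
    h, v, f = action
    return (HORIZONTAL.index(h), VERTICAL.index(v), FIRE.index(f))


if __name__ == "__main__":
    actions = composite_actions()
    print(len(actions))
    print(factor_indices(("go right", "go up", "fire")))
```

Under these assumptions, a single executed composite action such as ("go right", "go up", "fire") touches one entry in each of the three factors.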
	Mnih/Sharma fails to explicitly teach in response to the received reward value being greater than a baseline reward value, updating the policy function approximator, such that the updated policy function approximator is configured to calculate updated estimated action probabilities in dependence on the first state of the environment
Pandey teaches in response to the received reward value being greater than a baseline reward value (“The immediate reward is based upon the action or move taken by an agent to reach the defined goal in each episode. The total discounted reward can maximize in less number of episode if we select the higher immediate reward signal from previous.” [pg. 4, top left column; note Mnih teaches the baseline reward value as cited above.]).
Mnih, Sharma, and Pandey are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Mnih discloses a baseline reward value and Pandey discloses comparing received rewards. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Pandey to compare the received reward values disclosed by Pandey with the baseline reward value disclosed by Mnih. One would have been motivated to make this modification in order to reach an optimal Q-value by selecting actions which have corresponding rewards higher than previous ones. [Abstract, Pandey]
Mnih/Sharma/Pandey fails to explicitly teach updating the policy function approximator, such that the updated policy function approximator is configured to calculate updated estimated action probabilities in dependence on the first state of the environment
Weaver teaches updating the policy function approximator based on a baseline reward, such that the updated policy function approximator is configured to calculate updated estimated action probabilities in dependence on the first state of the environment (“[image: media_image2.png]” [pg. 544, left col, ¶1; See further GARB algorithm on pg. 541]).
Mnih, Sharma, Pandey and Weaver are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s teachings by updating action probabilities as taught by Weaver. Updating policy functions by using rewards is well-known in reinforcement learning and thus would yield predictable results. 
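For illustration only, a generic baseline-adjusted policy update of the kind discussed above can be sketched in Python as follows. This is not the GARB algorithm of Weaver and is not code from any cited reference. It assumes, hypothetically, that each discrete action has its own logit passed through a sigmoid, and the names update_logits, performed, and lr are illustrative. Under these assumptions, when the received reward exceeds the baseline, the probabilities of the performed actions increase and the probabilities of the remaining actions decrease.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def update_logits(logits, performed, reward, baseline, lr=0.5):
    """One gradient-ascent step on (reward - baseline) * log-probability.

    `performed` holds 1 for each action that was performed and 0 otherwise.
    For an independent Bernoulli action with probability p = sigmoid(z),
    the gradient of the log-probability with respect to z is (a - p).
    """
    advantage = reward - baseline
    return [z + lr * advantage * (a - sigmoid(z))
            for z, a in zip(logits, performed)]


if __name__ == "__main__":
    before = [0.0, 0.0, 0.0]
    after = update_logits(before, performed=[1, 1, 0], reward=1.0, baseline=0.2)
    print([round(sigmoid(z), 3) for z in before])
    print([round(sigmoid(z), 3) for z in after])
```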

Regarding claim 2, Mnih/Sharma/Pandey/Weaver teaches The method of claim 1, where Mnih further teaches wherein the policy function approximator comprises a policy neural network (“More formally, we use a deep convolutional neural network to approximate the optimal action-value function.” [pg. 529, right col, ¶1; note action-values are calculated using a policy π which would correspond to a policy neural network]).

Regarding claim 3, Mnih/Sharma/Pandey/Weaver teaches The method of claim 2, where Mnih further teaches wherein the output layer of the policy neural network comprises a plurality of nodes which each independently calculates a probability for a respective one of the discrete actions (“The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.” [pg. 534, § Model architecture, ¶2; See Fig. 1 for plurality of nodes, the output layer outputs predicted Q-values for each valid action.]).

Regarding claim 9, Mnih/Sharma/Pandey/Weaver teaches The method of claim 1, where Mnih further teaches wherein the policy function approximator is updated in dependence on the received reward value and one or more expert state-action pairs, wherein each expert state-action pair comprises (“[image: media_image3.png]” [pg. 534, § Algorithm, right column, ¶4; State action pair would be Q(s,a) where s is the state and a is the action.]): 
a state data item based on a state of the environment (“The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships.” [pg. 532, Figure 4; enemy ships would correspond to a state data item.]); 
and an expert action data item based on one or more actions taken by an expert when the environment was in the respective state (“10 Hz is about the fastest that a human player can select the ‘fire’ button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.” [pg. 534, § Evaluation procedure, ¶1; a human player selecting the ‘fire’ button would correspond to an expert action in a state (a frame would correspond to a state).]).

Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma, Pandey, and Weaver as applied to claim 1 above, and further in view of Khan et al. ("Training an Agent for FPS Doom Game using Visual Reinforcement Learning and VizDoom", hereinafter "Khan").

Regarding claim 4, Mnih/Sharma/Pandey/Weaver teaches The method of claim 2, however fails to explicitly teach wherein updating the policy function approximator comprises calculating a policy gradient, wherein calculating the policy gradient comprises calculating a cross-entropy in dependence on the action probabilities of the initial estimated action probabilities which correspond to the concurrently performed two or more actions
Khan teaches wherein updating the policy function approximator comprises calculating a policy gradient, wherein calculating the policy gradient comprises calculating a cross-entropy in dependence on the action probabilities of the initial estimated action probabilities which correspond to the concurrently performed two or more actions (“[image: media_image4.png]” [pg. 36, left column, equation 6; note: This policy gradient would be used to calculate the updated policy function by using the actions performed in the training.]).
Mnih, Sharma, Pandey, Weaver and Khan are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. Khan discloses training a competitive agent to play in a 3D FPS game. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s/Weaver’s teachings with Khan’s policy gradient in order to update the policy function approximator. One would be motivated to make this modification in order to find the optimal Q-function. [pg. 36, left column, Khan]

Regarding claim 5, Mnih/Sharma/Pandey/Weaver teaches The method of claim 2, however fails to explicitly teach wherein updating the policy function approximator comprises calculating a policy gradient, wherein calculating the policy gradient comprises calculating a cross-entropy in dependence on the action probabilities of the initial plurality of estimated action probabilities which do not correspond to the concurrently performed two or more actions
Khan teaches wherein updating the policy function approximator comprises calculating a policy gradient, wherein calculating the policy gradient comprises calculating a cross-entropy in dependence on the action probabilities of the initial plurality of estimated action probabilities which do not correspond to the concurrently performed two or more actions (“[image: media_image4.png]” [pg. 36, left column, equation 6; note: This policy gradient would be used to calculate the updated policy function by using the actions performed in the training.]).
Mnih, Sharma, Pandey, Weaver and Khan are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. Khan discloses training a competitive agent to play in a 3D FPS game. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s/Weaver’s teachings with Khan’s policy gradient in order to update the policy function approximator. One would be motivated to make this modification in order to find the optimal Q-function. [pg. 36, left column, Khan]
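For illustration only, a cross-entropy computed over independently estimated action probabilities can be sketched in Python as follows. This sketch does not reproduce equation 6 of Khan, whose content is not restated in this action. It assumes, hypothetically, a binary cross-entropy between each estimated action probability and a 0/1 indicator of whether that action was performed, and the names cross_entropy_terms and eps are illustrative. The first returned term depends only on the probabilities of the performed actions, and the second only on the probabilities of the actions that were not performed.

```python
import math


def cross_entropy_terms(probs, performed, eps=1e-12):
    """Split a binary cross-entropy into performed and not-performed terms.

    `probs` are per-action probabilities in [0, 1]; `performed` holds 1 for
    each action that was performed and 0 otherwise. `eps` guards against log(0).
    """
    performed_term = -sum(math.log(max(p, eps))
                          for p, a in zip(probs, performed) if a == 1)
    not_performed_term = -sum(math.log(max(1.0 - p, eps))
                              for p, a in zip(probs, performed) if a == 0)
    return performed_term, not_performed_term


if __name__ == "__main__":
    print(cross_entropy_terms([0.5, 0.5, 0.5], [1, 1, 0]))
```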

Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma, Pandey and Weaver as applied to claim 1 above, and further in view of Arulkumaran et al. ("Deep Reinforcement Learning A brief survey", hereinafter "Arulkumaran").

Regarding claim 6, Mnih/Sharma/Pandey/Weaver teaches The method of claim 1, where Pandey further teaches in response to the received reward value being greater than a baseline reward value (“The immediate reward is based upon the action or move taken by an agent to reach the defined goal in each episode. The total discounted reward can maximize in less number of episode if we select the higher immediate reward signal from previous.” [pg. 4, top left column; note Mnih teaches the baseline reward value as cited above in claim 1.]).
However Mnih/Sharma/Pandey/Weaver fails to explicitly teach further comprising:
 instantiating a state value approximator, wherein the value function approximator is configured to calculate an estimated state value in dependence on a given state of the environment; 
calculating, using the state value approximator, an initial estimated state value for a first state of the environment in dependence on the first state of the environment; and 
updating the state value approximator such that the updated state value approximator is configured to calculate an updated estimated state value in dependence on the first state of the environment, wherein the updated estimated state value is greater than the initial estimated state value.
Arulkumaran teaches further comprising: instantiating a state value approximator, wherein the value function approximator is configured to calculate an estimated state value in dependence on a given state of the environment (“Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function Vπ(s) is the expected return when starting in state s and following π subsequently:” [pg. 29, § Value functions, ¶1]); 
calculating, using the state value approximator, an initial estimated state value for a first state of the environment in dependence on the first state of the environment (“[image: media_image5.png]” [pg. 29, § Value functions, ¶1; This state value function would be used to calculate an initial state value for a first state (i.e. starting in state s)]); and 
updating the state value approximator such that the updated state value approximator is configured to calculate an updated estimated state value in dependence on the first state of the environment, wherein the updated estimated state value is greater than the initial estimated state value (“[image: media_image6.png]” [pg. 29, § Value functions, ¶2-3; note: It is implicit that by choosing a greedy policy to maximize the policy, the updated state value would need to be greater than the initial state value for this to happen.]).
Mnih, Sharma, Pandey, Weaver and Arulkumaran are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. Arulkumaran discloses a deep reinforcement learning survey that teaches the generic concepts of reinforcement learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s/Weaver’s teachings with Arulkumaran’s state value function. State value functions are well-known and understood in the field of deep reinforcement learning and one would be motivated to use one to determine the optimal policy. [Arulkumaran, § Value functions]
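For illustration only, the state-value concept quoted from Arulkumaran above (the expected return when starting in a state s and following a policy) can be sketched in Python as follows. The function names and the discount factor gamma are hypothetical and are not taken from Arulkumaran. The sketch estimates a state value by averaging discounted returns over sample episodes that all start in the same state, which is one common way to approximate an expectation.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards along one episode, each discounted by gamma per step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_state_value(episodes, gamma=0.99):
    """Average the discounted returns of episodes that start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(estimate_state_value([[1.0], [3.0]]))
```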

Regarding claim 7, Mnih/Sharma/Pandey/Weaver/Arulkumaran teaches The method of claim 6, where Arulkumaran further teaches wherein the state value approximator comprises a value neural network (“In general, DRL is based on training deep neural networks to approximate the optimal policy π* and/or the optimal value functions V*, Q*, and A*.” [pg. 31, § The rise of DRL; note: Training neural network to approximate V* disclosed by Arulkumaran would correspond to a value neural network.]).
Mnih, Sharma, Pandey, Weaver and Arulkumaran are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. Arulkumaran discloses a deep reinforcement learning survey that teaches the generic concepts of reinforcement learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s/Weaver’s teachings with Arulkumaran’s state value function. State value functions are well-known and understood in the field of deep reinforcement learning and one would be motivated to use one to determine the optimal policy. [Arulkumaran, § Value functions]

Regarding claim 8, Mnih/Sharma/Pandey/Weaver/Arulkumaran teaches The method of claim 6, wherein Arulkumaran further teaches the policy function approximator is updated in dependence on the initial estimated state value and the received reward value (“At the same time, the critic (value function) receives the state and reward resulting from the previous interaction. The critic uses the TD error calculated from this information to update itself and the actor.” [pg. 31, Figure 4, actor-critic setup; note: actor would correspond to the policy function approximator]).
Mnih, Sharma, Pandey, Weaver and Arulkumaran are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Pandey discloses a reinforcement learning algorithm that compares rewards. Weaver discloses using a reward baseline to update policy gradients. Arulkumaran discloses a deep reinforcement learning survey that teaches the generic concepts of reinforcement learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s/Pandey’s/Weaver’s teachings with Arulkumaran’s state value function. State value functions are well-known and understood in the field of deep reinforcement learning and one would be motivated to use one to determine the optimal policy. [Arulkumaran, § Value functions]
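For illustration only, the actor-critic update quoted from Arulkumaran for claim 8 (the critic computes a TD error from the state and reward, then uses it to update itself and the actor) may be sketched as follows. The tabular representation, the function name `actor_critic_step`, the softmax actor, and the step sizes `alpha_v`, `alpha_pi` and discount `gamma` are assumptions introduced for this sketch and are not taken from Arulkumaran or from the claims.

```python
import math


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(value, prefs, state, action, reward, next_state,
                      gamma=0.99, alpha_v=0.1, alpha_pi=0.1, terminal=False):
    """One tabular actor-critic update (hypothetical helper); returns the TD error.

    value: dict mapping state -> estimated state value (the critic)
    prefs: dict mapping state -> list of action preferences (the actor)
    """
    # TD error: received reward plus discounted next-state value,
    # minus the initial estimated value of the current state.
    bootstrap = 0.0 if terminal else value[next_state]
    td_error = reward + gamma * bootstrap - value[state]

    # The critic updates itself with the TD error.
    value[state] += alpha_v * td_error

    # The actor is updated with the same TD error, using the
    # gradient of the log-softmax with respect to each preference.
    probs = softmax(prefs[state])
    for a in range(len(prefs[state])):
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += alpha_pi * td_error * grad
    return td_error
```

In this sketch the policy update depends on both the initial estimated state value (`value[state]`) and the received reward, which is the dependence recited in claim 8.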

Claims 10, 11, 15, 16, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma.

Regarding claim 10, Mnih teaches A system comprising: 
a virtual environment, wherein the virtual environment is configured to output a first visual representation of a state of the environment comprising first pixel data and output a second visual representation of the state of the environment comprising second pixel data (“
[Image: media_image7.png]
” [pg. 537, Extended Data Figure 2, b.; The game ‘Pong’ would correspond to a virtual environment. The ball at a certain position would correspond to a first visual representation of a state and the ball at a new position would correspond to a second visual representation of the state. Additionally, the ball would be pixel data as it is displayed on a screen.])
a trained reinforcement learning agent configured to receive the first visual representation of the environment, wherein the trained reinforcement learning agent comprises (“At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen” [pg. 537, Extended Data Figure 2, b.; agent would receive the visual representation in order to perform any action]):
a trained policy function approximator, wherein the trained policy function approximator is trained to, for each discrete action out of a plurality of discrete actions (“At each time-step the agent selects an action a_t from the set of legal game actions, A = {1,... K}.” [pg. 534, § Algorithm, right col, ¶4, lines 1-4])
and a display module, wherein the display module is configured to cause the second visual representation of the state of the environment to be displayed to a user (“Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once.” [pg. 534, § Preprocessing, ¶1; note: because Mnih discloses experimentation involving a user (see pg. 534, right column, ¶3), it would be implicit that the second visual representation of the state of the environment would be displayed to the user.]);
a user-controlled agent controlled by the user, wherein the user-controlled agent is configured to concurrently perform two or more actions within the environment in dependence on two or more inputs provided by the user. (“The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game” [pg. 534, § Evaluation procedure, ¶2; Mnih discloses a human tester (i.e. user). A sensory input would be pressing the ‘fire’ button in a game. See Evaluation procedure, ¶1])
However, Mnih fails to explicitly teach an action performance module, wherein the action performance module is configured to concurrently perform two or more actions within the environment in dependence on the estimated action probabilities;
independently calculate an estimated action probability in dependence on the first visual representation of the state of the environment, wherein an estimated action probability is a probability of performing, by the trained reinforcement learning agent, one discrete action of the plurality of discrete actions within the environment;
Sharma teaches wherein the action performance module is configured to concurrently perform two or more actions within the environment in dependence on the estimated action probabilities (“Any action in the complete Atari action space can be represented as a tuple (hi, vi , fi) with vi ∈ {go up, go down, don’t move vertically}, hi ∈ {go left, go right, don’t move horizontally} and fi ∈ {fire, don’t fire}.” [pg. 3, § 3 Factored Action Representations for Deep Reinforcement Learning, ¶1]);
independently calculate an estimated action probability in dependence on the first visual representation of the state of the environment, wherein an estimated action probability is a probability of performing, by the trained reinforcement learning agent, one discrete action of the plurality of discrete actions within the environment (“Figure 2 visualizes a factoring of X over the complete Atari action space. Any action in the complete Atari action space can be represented as a tuple (hi, vi , fi) with vi ∈ {go up, go down, don’t move vertically}, hi ∈ {go left, go right, don’t move horizontally} and fi ∈ {fire, don’t fire}. The choice of the factors over which X is decomposed depends on the set of possible actions, for a given task. This decomposed representation of X allows the DRL agent to learn X corresponding to multiple actions simultaneously, while executing a single action. When the action a = up-right-fire is executed, the parameters corresponding to the individual factors of Xa: up, right and fire get updated. Hence Xup-fire, Xright-fire and Xup-left are also adjusted, and not just the Xa. Let S denote set of all states in an MDP and A denote the discrete action set for a DRL agent. We claim that often, action spaces A are compositional and thus allow the decomposition of any action a ∈ A into n independent action-factors [a1, a2, · · · , an] such that ai ∈ Ai, where Ai is the set of values that factor i can take. We claim that instead of modeling X over A, the agent would be better off, modeling the individual components of X over the factor-spaces Ai. These individual components of X are realized using independent output layers (referred to as factor-layers hereafter) of a neural network f1, f2,..., fn where fi corresponds to Ai and has size |Ai|.” [pg. 3-4,§ 3 Factored Action Representations for Deep Reinforcement Learning, ¶1; See further: “The experiment is as follows: A trained agent is taken. 
With probability 1 − ε, the agent samples actions from the learned policy. With probability ε, it samples actions uniformly at random from the best k actions.” [pg. 8, § 5.1 Uniformly Random from best-K analysis, ¶1]]);
	Mnih and Sharma are both in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s teachings to train a reinforcement learning agent to perform discrete actions simultaneously as taught by Sharma. One would have been motivated to make this modification in order to allow an agent to learn multiple actions simultaneously. [Abstract, Sharma]
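For illustration only, the factored action representation quoted from Sharma (an action expressed as a tuple of horizontal, vertical and fire factors, each realized by an independent output layer) may be sketched as follows. The names `FACTORS` and `sample_composite_action`, the use of a softmax over each factor-layer, and the independent sampling of each factor are assumptions introduced for this sketch; the quoted passage does not state how a composite action is selected from the factor-layers.

```python
import math
import random

# Factor spaces A_i as listed in the quoted passage of Sharma.
FACTORS = {
    "horizontal": ["left", "right", "none"],
    "vertical": ["up", "down", "none"],
    "fire": ["fire", "no_fire"],
}


def softmax(logits):
    """Convert one factor-layer's outputs into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_composite_action(factor_logits, rng):
    """Pick one value per factor, each from its own independent output layer.

    factor_logits: dict mapping factor name -> list of raw outputs,
                   one entry per value in FACTORS[name] (hypothetical layout).
    """
    action = {}
    for name, values in FACTORS.items():
        probs = softmax(factor_logits[name])
        # Assumption: each factor is sampled independently of the others.
        action[name] = rng.choices(values, weights=probs, k=1)[0]
    return action
```

A composite result such as `{"horizontal": "right", "vertical": "up", "fire": "fire"}` corresponds to the "up-right-fire" action named in the quoted passage, that is, several discrete actions selected for the same state.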

Regarding claim 11, Mnih/Sharma teaches The system of claim 10, where Mnih further teaches wherein the trained policy function approximator comprises a trained policy neural network (“More formally, we use a deep convolutional neural network to approximate the optimal action-value function.” [pg. 529, right col, ¶1; note: action-values are calculated using a policy π, which would correspond to a policy neural network]).

Regarding claim 15, Mnih/Sharma teaches The system of claim 10, where Mnih further teaches wherein the virtual environment is a video game environment (“A visualization of the learned action-value function on the game Pong.” [pg. 537, Extended Data Figure 2, b.]).

Regarding claim 16, Mnih teaches A computer-implemented method for operating a reinforcement learning agent within an environment comprising: loading a trained policy function approximator wherein the trained policy function approximator is trained to (“During learning, we apply Q-learning updates, on samples (or minibatches) of experience (s,a,r,s’) ∼ U(D), drawn uniformly at random from the pool of stored samples.” [pg. 529, right column, ¶3; storing samples of experience would implicitly mean that data would have to be loaded for the next training iteration.]), for each discrete action out of a plurality of discrete actions (“At each time-step the agent selects an action a_t from the set of legal game actions, A = {1,... K}.” [pg. 534, § Algorithm, right col, ¶4]), in dependence on a visual representation of a state of the environment comprising pixel data (“b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience.” [pg. 537, Extended Data Figure 2])
calculating, using the trained policy function approximator, a plurality of estimated action probabilities in dependence on a visual representation of a first state of the environment (“The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.” [pg. 534, § Model architecture, ¶3; predicting would be equivalent to estimating. A given state would encompass a first state of the environment.]); 
However, Mnih fails to explicitly teach and when the environment is in the first state, concurrently performing two or more of the plurality of discrete actions within the environment in dependence on the plurality of estimated action probabilities.
independently calculate an estimated action probability, wherein an estimated action probability is a probability of performing, by the reinforcement learning agent, one discrete action of the plurality of discrete actions within the environment;
Sharma teaches wherein the action performance module is configured to concurrently perform two or more actions within the environment in dependence on the estimated action probabilities (“Any action in the complete Atari action space can be represented as a tuple (hi, vi , fi) with vi ∈ {go up, go down, don’t move vertically}, hi ∈ {go left, go right, don’t move horizontally} and fi ∈ {fire, don’t fire}.” [pg. 3, § 3 Factored Action Representations for Deep Reinforcement Learning, ¶1]);
independently calculate an estimated action probability in dependence on the first visual representation of the state of the environment, wherein an estimated action probability is a probability of performing, by the trained reinforcement learning agent, one discrete action of the plurality of discrete actions within the environment (“Figure 2 visualizes a factoring of X over the complete Atari action space. Any action in the complete Atari action space can be represented as a tuple (hi, vi , fi) with vi ∈ {go up, go down, don’t move vertically}, hi ∈ {go left, go right, don’t move horizontally} and fi ∈ {fire, don’t fire}. The choice of the factors over which X is decomposed depends on the set of possible actions, for a given task. This decomposed representation of X allows the DRL agent to learn X corresponding to multiple actions simultaneously, while executing a single action. When the action a = up-right-fire is executed, the parameters corresponding to the individual factors of Xa: up, right and fire get updated. Hence Xup-fire, Xright-fire and Xup-left are also adjusted, and not just the Xa. Let S denote set of all states in an MDP and A denote the discrete action set for a DRL agent. We claim that often, action spaces A are compositional and thus allow the decomposition of any action a ∈ A into n independent action-factors [a1, a2, · · · , an] such that ai ∈ Ai, where Ai is the set of values that factor i can take. We claim that instead of modeling X over A, the agent would be better off, modeling the individual components of X over the factor-spaces Ai. These individual components of X are realized using independent output layers (referred to as factor-layers hereafter) of a neural network f1, f2,..., fn where fi corresponds to Ai and has size |Ai|.” [pg. 3-4,§ 3 Factored Action Representations for Deep Reinforcement Learning, ¶1; See further: “The experiment is as follows: A trained agent is taken. 
With probability 1 − ε, the agent samples actions from the learned policy. With probability ε, it samples actions uniformly at random from the best k actions.” [pg. 8, § 5.1 Uniformly Random from best-K analysis, ¶1]]);
	Mnih and Sharma are both in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s teachings to train a reinforcement learning agent to perform discrete actions simultaneously as taught by Sharma. One would have been motivated to make this modification in order to allow an agent to learn multiple actions simultaneously. [Abstract, Sharma]
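For illustration only, the sampling experiment quoted from Sharma § 5.1 (with probability 1 − ε sample from the learned policy; with probability ε sample uniformly at random from the best k actions) may be sketched as follows. The function name `best_k_sample`, the representation of the policy as a list of action probabilities, and the ranking of the "best k" actions by those probabilities are assumptions introduced for this sketch.

```python
import random


def best_k_sample(policy_probs, k, epsilon, rng):
    """Return an action index following the quoted best-k procedure.

    policy_probs: list of action probabilities from a trained policy
                  (hypothetical representation).
    """
    actions = list(range(len(policy_probs)))
    if rng.random() < epsilon:
        # With probability epsilon: uniform choice among the k actions
        # ranked highest by the policy (assumed meaning of "best k").
        best_k = sorted(actions, key=lambda a: policy_probs[a], reverse=True)[:k]
        return rng.choice(best_k)
    # With probability 1 - epsilon: sample from the learned policy itself.
    return rng.choices(actions, weights=policy_probs, k=1)[0]
```

In this sketch the selected action depends on the estimated action probabilities in both branches: directly when sampling from the policy, and through the ranking when sampling from the best k actions.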

Regarding claim 17, Mnih/Sharma teaches The method of claim 16, where Mnih further teaches wherein the trained policy function approximator comprises a trained policy neural network (“More formally, we use a deep convolutional neural network to approximate the optimal action-value function.” [pg. 529, right col, ¶1; note: action-values are calculated using a policy π, which would correspond to a policy neural network]).

Regarding claim 20, Mnih/Sharma teaches The method of claim 16, where Mnih further teaches wherein the environment is a video game environment (“A visualization of the learned action-value function on the game Pong.” [pg. 537, Extended Data Figure 2, b.]).

Claims 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma as applied to claim 10 above, and further in view of Silver et al. ("Reinforcement Learning of Local Shape in the Game of Go", hereinafter "Silver").
Regarding claim 12, Mnih/Sharma teaches The system of claim 11, however fails to explicitly teach wherein the output layer of the trained policy neural network is a sigmoid per action output layer. 
Silver teaches wherein the output layer of the trained policy neural network is a sigmoid per action output layer (“
[Image: media_image8.png]
” [pg. 1055, § 5. Learning algorithm, top of right column]).
Mnih, Sharma and Silver are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Silver discloses using reinforcement learning to train an AI agent in the game of Go. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Silver’s sigmoid function to keep the output of the values between 0 and 1. Using a sigmoid function is well-known and understood in the field of deep reinforcement learning and one would be motivated to make this modification in order to make the training process more efficient. [§ Learning algorithm, ¶4, Silver] 

Regarding claim 18, Mnih/Sharma teaches The method of claim 17, however fails to explicitly teach wherein the output layer of the trained policy neural network is a sigmoid per action output layer.
Silver teaches wherein the output layer of the trained policy neural network is a sigmoid per action output layer (“
[Image: media_image8.png]
” [pg. 1055, § 5. Learning algorithm, top of right column]).
Mnih, Sharma and Silver are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Silver discloses using reinforcement learning to train an AI agent in the game of Go. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Silver’s sigmoid function to keep the output of the values between 0 and 1. Using a sigmoid function is well-known and understood in the field of deep reinforcement learning and one would be motivated to make this modification in order to make the training process more efficient. [§ Learning algorithm, ¶4, Silver]
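For illustration only, a sigmoid per action output layer of the kind recited in claims 12 and 18 may be sketched as follows. The equation cited from Silver is an image that is not reproduced in this action, so the sketch assumes the standard logistic function; the function names and the 0.5 selection threshold are likewise assumptions introduced for this sketch and are not taken from Silver or from the claims.

```python
import math


def sigmoid(x):
    """Standard logistic function; output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def per_action_probabilities(logits):
    """Apply a separate sigmoid to each action's raw output.

    Unlike a softmax, the results are independent of one another
    and need not sum to 1.
    """
    return [sigmoid(z) for z in logits]


def select_concurrent_actions(logits, threshold=0.5):
    """Return the indices of all actions whose probability meets the threshold.

    Assumption: thresholding is one simple way to choose several
    actions at once from independent per-action probabilities.
    """
    probs = per_action_probabilities(logits)
    return [i for i, p in enumerate(probs) if p >= threshold]
```

Because each output is bounded between 0 and 1 independently of the others, more than one action can exceed the threshold for the same input state.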

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma as applied to claim 10 above, and further in view of Arulkumaran.

Regarding claim 13, Mnih/Sharma teaches The system of claim 10, however fails to explicitly teach further comprising: a trained state value approximator, wherein the trained state function approximator is configured to calculate an estimated state value in dependence on a given state of the environment.
Arulkumaran teaches further comprising: a trained state value approximator, wherein the trained state function approximator is configured to calculate an estimated state value in dependence on a given state of the environment (“
[Image: media_image5.png]
” [pg. 29, § Value functions, ¶1; This state value function would be used to calculate an initial state value for a first state (i.e. starting in state s)]).
Mnih, Sharma, and Arulkumaran are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Arulkumaran discloses a deep reinforcement learning survey that teaches the generic concepts of reinforcement learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Arulkumaran’s state value function. State value functions are well-known and understood in the field of deep reinforcement learning and one would be motivated to use one to determine the optimal policy. [Arulkumaran, § Value functions]

Claims 14 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih in view of Sharma as applied to claim 10 above, and further in view of Khan.
Regarding claim 14, Mnih/Sharma teaches The system of claim 10, however fails to explicitly teach wherein the trained policy neural network comprises one or more long short-term memory layers.
Khan teaches wherein the trained policy neural network comprises one or more long short-term memory layers (“Fig. 6. The proposed architecture of the model in [14]. The Convolutional layers are given an input image. The output is split into two streams produced by the convolutional layers. The first one (bottom) flattens the output (layers 3) and inputs it to LSTM, as in the DQRN model. The second one at the top directs it to an extra hidden layer “layer 4”, after then to a final layer representing each game features. While training, the game features and the Q-learning objectives are trained mutually.” [pg. 35, Fig. 6]).
Mnih, Sharma and Khan are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Khan discloses training a competitive agent to play in a 3D FPS game. Both Mnih and Khan use the generic concepts of reinforcement learning to train the agents. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Khan to use long short-term memory layers within the trained policy neural network. Using LSTMs is well-known in the field of deep learning and one would be motivated to make this modification in order to yield predictable results.

Regarding claim 19, Mnih/Sharma teaches The method of claim 17, however fails to explicitly teach wherein the trained policy neural network comprises one or more long short-term memory layers.
Khan teaches wherein the trained policy neural network comprises one or more long short-term memory layers (“Fig. 6. The proposed architecture of the model in [14]. The Convolutional layers are given an input image. The output is split into two streams produced by the convolutional layers. The first one (bottom) flattens the output (layers 3) and inputs it to LSTM, as in the DQRN model. The second one at the top directs it to an extra hidden layer “layer 4”, after then to a final layer representing each game features. While training, the game features and the Q-learning objectives are trained mutually.” [pg. 35, Fig. 6]).
Mnih, Sharma and Khan are all in the same field of endeavor of deep reinforcement learning. Mnih discloses a method of training an agent to perform diverse tasks in a virtual environment. Sharma discloses Factored Action space Representations (FAR) which allows an agent to learn multiple actions simultaneously. Khan discloses training a competitive agent to play in a 3D FPS game. Both Mnih and Khan use the generic concepts of reinforcement learning to train the agents. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mnih’s/Sharma’s teachings with Khan to use long short-term memory layers within the trained policy neural network. Using LSTMs is well-known in the field of deep learning and one would be motivated to make this modification in order to yield predictable results.

Response to Arguments
Applicant’s arguments with respect to claims 1, 10 and 16 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. 
Specifically, applicant’s arguments regarding the limitation: “independently calculate an estimated action probability of performing the discrete action…” are moot because that particular limitation is now taught by the newly presented art of Sharma.

Additionally, applicant’s arguments regarding the limitation: “updated estimated action probabilities” are moot because the limitation is now taught by the newly presented art of Weaver. Please see the updated 103 rejection above. 

Applicant’s arguments with respect to the rejections of the dependent claims have been fully considered but they are not persuasive as they rely upon the allowability of the independent claims.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/M.H.H./Examiner, Art Unit 2122 

                                                                                                                                                                                                     
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122