DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 7-11 and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al. (US Pub. No. 2019/0258938 A1) in view of Kartal et al. (US Pub. No. 2020/0143206 A1).

With respect to claim 1, Mnih teaches a reinforcement learning algorithm for an agent, the algorithm (Fig. 1, 100 discloses a reinforcement learning system, and paragraph [0006] discloses a reinforcement learning system implemented as computer programs on one or more computers in one or more locations that selects actions to be performed by an agent interacting with an environment) comprising:

using an action-value model for training a policy model, the action-value model (paragraph [0007] discloses training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of action selection policy network parameters and is used in selecting actions to be performed by an agent interacting with an environment); and
estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter (paragraph [0015] discloses an estimate of the long-term time-discounted change in the activation generated by the unit if the agent performs the possible action in response to the received observation image ...; paragraph [0016] discloses generating a predicted reward that is an estimate of a reward that will be received with a next observation; and paragraph [0039] discloses a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation ...; as another example, the policy output may identify a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation).

Mnih teaches the above elements including maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to the fresh copy and is slowly moved to match the fresh copy as learning (Fig. 2 discloses providing the observation as input [stale copy] to the action selection policy neural network and providing the intermediate output of the selection policy ..., and paragraph [0026] discloses generating more useful representations of observed data and ultimately determining more effective policy outputs to maximize cumulative extrinsic reward) and updates are performed on the fresh copy (paragraph [0050] discloses a reinforcement learning technique to adjust the values of the set of parameters ...), and to obtain observations for a sequence of time steps and the actual reward received following the last observation, generate one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, process the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward, and backpropagate gradients that the system determines based on the policy output into the action selection policy neural network (Fig. 3). Mnih failed to teach wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.
However, Kartal teaches wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model (paragraph [0082] discloses generating an expert move dataset [offline variant]; paragraph [0083] discloses per action selection to construct an expert dataset [online variant in which data is collected] ..., in an on-policy fashion to improve training efficiency for actor-critic RL in challenging domains with abundant ...; paragraph [0088] discloses an actual terminal and y0 = 0 for the initial state of each episode [stale copy], with intermediate values linearly interpolated between [0,1]; and paragraph [0089] discloses episode length ..., the most recent 100 episodes was used to ... [fresh copy]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the generation of a predicted reward that is an estimate of a reward that will be received with a next observation of Mnih with the initial dataset and intermediate values of Kartal in order to increase the number of agent-environment interactions with positive rewards for hard-exploration RL problems and to improve training efficiency (see Kartal, paragraph [0084]).


With respect to claim 2, Mnih in view of Kartal teaches elements of claim 1. Furthermore, Mnih teaches the algorithm wherein the action-value model estimates the expected future discounted reward (paragraph [0012] discloses, for each of a plurality of possible actions to be performed by the agent, an estimate of a change, more particularly an estimate of the long-term time-discounted change, in the pixels in the region if the agent performs the possible action in response to the received observation image, and paragraph [0039] discloses that the policy output may be a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation). Mnih failed to teach that the corresponding Q-value estimate of the long-term time-discounted reward is given by the model Q, as
Q(s,a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s, a, \pi \right],
where r_t is a reward received at timestep t, s is the current observation of an environment state, a is the hypothetical action, π is the policy model, and γ is a discount factor in a domain [0, 1) that defines how valued future rewards are to more immediate rewards.
However, Kartal teaches the model Q, as
Q(s,a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s, a, \pi \right],
where r_t is a reward received at timestep t, s is the current observation of an environment state, a is the hypothetical action, π is the policy model, and γ is a discount factor in a domain [0, 1) that defines how valued future rewards are to more immediate rewards (paragraph [0071] discloses that a standard reinforcement learning setting comprises an agent interacting in an environment over a discrete number of time steps. At time t, the agent in state s_t takes an action and receives a reward r_t. The discounted return is defined as ...). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the policy output that may be a Q-value that is an estimate of the long-term time-discounted reward of Mnih with the state value function for the return of discounted rewards from state s following a policy π of Kartal in order to minimize the mean-squared error using the loss function (see Kartal, paragraph [0072]).
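For illustration only, and not as a characterization of Mnih, Kartal, or the application, the quantity inside the expectation in the recited formula can be sketched for a single finite reward sequence as follows. The function name discounted_return and the sample rewards are hypothetical; Q(s,a) itself would be the average of such returns over many trajectories that start from observation s with action a and then follow the policy π.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t over t = 1..T for one finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards r_1, r_2, r_3 with discount factor 0.5.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A discount factor closer to 0 weights immediate rewards more heavily, while a value closer to 1 gives later rewards nearly equal weight.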
With respect to claim 3, Mnih in view of Kartal teaches elements of claim 1. Furthermore, Mnih teaches the algorithm wherein: the stale copy of the policy model acts as an old policy to be evaluated by the fresh copy of the action-value model critic; and the stale copy of the action-value model provides Q-values of an earlier policy model on which a fresh policy model improves (paragraph [0039] discloses that the system uses the selection of the policy ... in response to the observation 104 as input [stale copy of the policy model acts as an old policy] ... to generate a policy output that the system 100 uses to determine an action to be performed [fresh policy]; the policy output may be a probability distribution over the set of possible actions ...; the policy output may be a Q-value that is an estimate of the long-term time-discounted reward).

With respect to claim 4, Mnih in view of Kartal teaches elements of claim 1. Furthermore, Mnih teaches the algorithm wherein an output of the policy model is parameters of probability distributions over a domain of an action space (paragraph [0039] discloses generating a policy output that the system 100 uses to determine an action 110 to be performed by the agent 108 at the time step. For example, the policy output may be a probability distribution over the set of possible actions). Mnih failed to teach the corresponding output of the policy model as π(s), for a given observation (s) of an environment state.
However, Kartal teaches the output of the policy model as π(s), for a given observation (s) of an environment state (paragraph [0072] discloses that the action-value function is the expected return following policy π after taking action a from state s). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the policy output that may be a probability distribution over the set of possible actions of Mnih with the state value function for the return of discounted rewards from state s following a policy π of Kartal in order to minimize the mean-squared error using the loss function (see Kartal, paragraph [0072]).
With respect to claim 5, Mnih in view of Kartal teaches elements of claim 4. Furthermore, Mnih teaches the algorithm using an intermediate output to generate, for each of the one or more regions and for each of a plurality of possible actions to be performed by the agent ... (paragraph [0012]), and the policy output may be a probability distribution over the set of possible actions (paragraph [0039]). Mnih failed to teach wherein, when the action space is a discrete action space, the parameters outputted are probability mass values.

However, Kartal teaches wherein, when the action space is a discrete action space, the parameters outputted are probability mass values (paragraph [0065] discloses an action space filtered down by the planner, and paragraph [0071] discloses an agent interacting in an environment over a discrete number of time steps). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the policy output that may be a probability distribution over the set of possible actions of Mnih with the state value function for the return of discounted rewards from state s following a policy π of Kartal in order to minimize the mean-squared error using the loss function (see Kartal, paragraph [0072]).
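For illustration only, the notion of "probability mass values" over a discrete action space can be sketched as below. The use of a softmax over raw scores is an assumption made for this example (one common way to obtain such values); neither the claim language nor the cited paragraphs is represented here as requiring it, and the names softmax and scores are hypothetical.

```python
import math


def softmax(scores):
    """Map raw per-action scores to probability mass values that sum to 1."""
    peak = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical scores for three discrete actions.
probs = softmax([2.0, 1.0, 0.0])
print(probs)
print(sum(probs))  # equal to 1 up to floating-point rounding
```

Each output value is non-negative and the values sum to one, so the list can be read directly as the probability of selecting each discrete action.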

With respect to claim 7, Mnih in view of Kartal teaches elements of claim 1, furthermore, Mnih teaches the algorithm wherein the offline variant includes an offline algorithm comprising: 
sampling minibatches of tuples from available data (paragraph [0057] discloses that the sampling engine 116 samples sequences of experience tuples ..., observation in the sequence is non-zero are sampled with a higher probability ...);
computing a critic loss function, LQ, and an actor loss function, Lπ (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function ..., the loss function is given by the mean-squared error between the actual reward received ...);
differentiating each of the critic loss function and the actor loss function with respect to neural-net parameters (paragraph [0008] discloses that the gradients may be gradients of a policy loss function for an auxiliary control task with respect to the parameters; the policy output from an auxiliary control neural network may be used to compute a loss function for such backpropagation ..., the particular loss function depends on the auxiliary control task(s) ...);
performing a stochastic gradient-descent-based update to the neural-net parameters (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function); and
updating the stale copy toward the fresh copy by a geometric coefficient (Fig. 2, 201 discloses obtaining, by randomly sampling the replay memory, the observation at a time step, the action at the time step, and the observation at the next time step ..., and paragraph [0007] discloses that the neural network training comprises adjusting values of the action selection policy network parameters).
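For illustration only, the final recited step (moving a stale copy toward a fresh copy by a coefficient) can be sketched as below. This example assumes the "geometric coefficient" acts as an exponential-moving-average weight, here called tau; that reading, the function name, and the parameter values are hypothetical and are not drawn from Mnih, Kartal, or the application.

```python
def move_stale_toward_fresh(stale, fresh, tau):
    """Return (1 - tau) * stale + tau * fresh, element by element."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [(1.0 - tau) * s + tau * f for s, f in zip(stale, fresh)]


# The stale copy starts identical to the fresh copy.
fresh = [1.0, -2.0]
stale = list(fresh)

# Pretend a gradient update changed the fresh parameters (hypothetical values).
fresh = [2.0, 0.0]

# Each call closes a fixed fraction of the remaining gap, so the gap shrinks geometrically.
for _ in range(3):
    stale = move_stale_toward_fresh(stale, fresh, tau=0.5)

print(stale)  # [1.875, -0.25]
```

With tau = 0.5 the remaining gap halves on every call; a smaller tau would make the stale copy trail the fresh copy more slowly.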
With respect to claim 8, Mnih in view of Kartal teaches elements of claim 7. Furthermore, Mnih teaches the algorithm wherein: for a discrete-action case, a target for the critic loss function is computed exactly by marginalizing over a probability of each action selection by the stale policy model; and for a discrete-action case, a target for the actor loss is computed exactly and a cross entropy loss function is used to make the policy model match the target (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function ..., the loss function is given by the mean-squared error between the actual reward received ...).


With respect to claim 9, Mnih in view of Kartal teaches elements of claim 7. Furthermore, Mnih teaches the algorithm wherein, for a continuous-action case, targets of the critic loss function and the actor loss function are not computed exactly, where sampling from the policy model and the stale copy of the policy model are used to stochastically approximate the targets, where a variance from the sampling is smoothed by a stochastic gradient descent process (paragraph [0008] discloses that the gradients may be gradients of a policy loss function for an auxiliary control task with respect to the parameters ..., to compute a loss function for such backpropagation the particular loss function depends on the auxiliary control ..., the gradients may be determined from a reward prediction loss function ...; paragraph [0010] discloses backpropagating the computed gradients; and paragraph [0016] discloses determining gradients based on the predicted reward generated by the reward prediction neural network).
With respect to claim 10, Mnih in view of Kartal teaches elements of claim 7. Furthermore, Mnih teaches the algorithm wherein a target of each of the critic loss function and the actor loss function is an optimal solution that minimizing the respective critic loss function and the actor loss function would produce (paragraph [0056] discloses that the system 100 trains the reward prediction neural network 122 to generate a predicted reward that minimizes a reward prediction loss, the reward prediction loss being a mean-squared error loss between the predicted reward that is received with a next observation ...).
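For illustration only, stochastic approximation of a target by sampling can be sketched as below. The sketch assumes the target has the form of a reward plus a discounted expected Q-value, as in the formula recited for claim 11, and it draws actions from a single policy; the toy Gaussian policy, the quadratic critic, and all names are hypothetical stand-ins rather than anything disclosed in Mnih or Kartal.

```python
import random


def sampled_critic_target(reward, gamma, sample_action, q_value, next_obs, n_samples, rng):
    """Monte Carlo estimate of reward + gamma * E[Q(next_obs, a')] with a' drawn from a policy."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    total = 0.0
    for _ in range(n_samples):
        action = sample_action(next_obs, rng)  # draw a' from the policy
        total += q_value(next_obs, action)     # evaluate the critic at (s', a')
    return reward + gamma * total / n_samples


# Toy stand-ins: a unit-variance Gaussian policy centred on the observation,
# and a quadratic critic whose exact expectation under that policy is -1.
def toy_policy(obs, rng):
    return rng.gauss(obs, 1.0)


def toy_critic(obs, action):
    return -(action - obs) ** 2


estimate = sampled_critic_target(1.0, 0.9, toy_policy, toy_critic, 0.0, 20000, random.Random(0))
print(estimate)  # close to 1.0 + 0.9 * (-1.0) = 0.1, up to sampling noise
```

With fewer samples each individual target is noisier; averaging many such noisy targets across gradient steps is one way the variance can be smoothed during training.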

With respect to claim 11, Mnih in view of Kartal teaches elements of claim 7. Furthermore, Mnih teaches the algorithm wherein a target (TQ) of the critic loss function for a given reward (paragraph [0082] discloses that the loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received with the observation following the last observation in the sequence). Mnih failed to teach that the target of the corresponding loss function for the resulting observation is a scalar value defined by the formula.
However, Kartal teaches that the target for the resulting observation is a scalar value defined by the formula
T_Q(r, s') = r + \gamma \, \mathbb{E}_{a' \sim \pi(s'; \phi')}\left[ Q(s', a', \theta') \right]
(paragraph [0072] discloses using the parameters, and then updating the parameters to minimize the mean-squared error using the loss function ...). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the loss function given by the mean-squared error of Mnih with the DQN algorithm approximating the action-value function of Kartal in order to minimize the mean-squared error (see Kartal, paragraph [0072]).
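For illustration only, the recited target formula can be evaluated exactly when the action space is discrete, because the expectation reduces to a probability-weighted sum over actions. The function name and the numeric values below are hypothetical; the probabilities stand in for the stale policy output at s' and the Q-values for the stale action-value model at s'.

```python
def exact_critic_target(reward, gamma, action_probs, q_values):
    """Return r + gamma * sum over a' of pi(a' | s') * Q(s', a') for a discrete action space."""
    if len(action_probs) != len(q_values):
        raise ValueError("one probability and one Q-value are needed per action")
    expected_q = sum(p * q for p, q in zip(action_probs, q_values))
    return reward + gamma * expected_q


# Hypothetical two-action case: pi(s') = [0.25, 0.75], Q(s', .) = [4.0, 8.0].
print(exact_critic_target(1.0, 0.5, [0.25, 0.75], [4.0, 8.0]))  # 1 + 0.5 * (1 + 6) = 4.5
```

The result is a single scalar per (reward, next observation) pair, which is what a mean-squared-error critic loss would regress toward.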

With respect to claim 14, Mnih in view of Kartal teaches elements of claim 1. Furthermore, Mnih teaches the algorithm wherein the agent is a previously trained agent and the action-value model is only used to train the agent over a predetermined number of initial steps (paragraph [0036] discloses that the reinforcement learning system 100 selects actions 110 to be performed by an agent 108 [trained] interacting with an environment 102 at each of multiple time steps [a predetermined number of initial steps]).
With respect to claim 15, Mnih teaches a method of training a policy model and an action-value model of an agent, comprising:


estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter (paragraph [0015] discloses an estimate of the long-term time-discounted change in the activation generated by the unit if the agent performs the possible action in response to the received observation image ...; paragraph [0016] discloses generating a predicted reward that is an estimate of a reward that will be received with a next observation; and paragraph [0039] discloses a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation ...; as another example, the policy output may identify a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation);
the stale copy of the policy model acts as an old policy to be evaluated by the fresh copy of the action-value model critic; and the stale copy of the action-value model provides Q-values of an earlier policy model on which a fresh policy model improves (paragraph [0039] discloses that the system uses the selection of the policy ... in response to the observation 104 as input [stale copy of the policy model acts as an old policy] ... to generate a policy output that the system 100 uses to determine an action to be performed [fresh policy]; the policy output may be a probability distribution over the set of possible actions ...; the policy output may be a Q-value that is an estimate of the long-term time-discounted reward).
Mnih teaches the above elements including wherein the action-value model estimates the expected future discounted reward (paragraph [0012] discloses, for each of a plurality of possible actions to be performed by the agent, an estimate of a change, more particularly an estimate of the long-term time-discounted change, in the pixels in the region if the agent performs the possible action in response to the received observation image, and paragraph [0039] discloses that the policy output may be a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation) and maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to the fresh copy and is slowly moved to match the fresh copy as learning (Fig. 2 discloses providing the observation as input [stale copy] to the action selection policy neural network and providing the intermediate output of the selection policy ..., and paragraph [0026] discloses generating more useful representations of observed data and ultimately determining more effective policy outputs to maximize cumulative extrinsic reward) and updates are performed on the fresh copy (paragraph [0050] discloses a reinforcement learning technique to adjust the values of the set of parameters ...), and to obtain observations for a sequence of time steps and the actual reward received following the last observation, generate one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, process the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward, and backpropagate gradients that the system determines based on the policy output into the action selection policy neural network (Fig. 3).

Mnih failed to teach the model Q, as
Q(s,a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s, a, \pi \right],
where r_t is a reward received at timestep t, s is the current observation of an environment state, a is the hypothetical action, π is the policy model, and γ is a discount factor in a domain [0, 1) that defines how valued future rewards are to more immediate rewards; and
wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.
However, Kartal teaches the model Q, as
Q(s,a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s, a, \pi \right],
where r_t is a reward received at timestep t, s is the current observation of an environment state, a is the hypothetical action, π is the policy model, and γ is a discount factor in a domain [0, 1) that defines how valued future rewards are to more immediate rewards (paragraph [0071] discloses that a standard reinforcement learning setting comprises an agent interacting in an environment over a discrete number of time steps. At time t, the agent in state s_t takes an action and receives a reward r_t. The discounted return is defined as ...); and
wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model (paragraph [0082] discloses generating an expert move dataset [offline variant]; paragraph [0083] discloses per action selection to construct an expert dataset [online variant in which data is collected] ..., in an on-policy fashion to improve training efficiency for actor-critic RL in challenging domains with abundant ...; paragraph [0088] discloses an actual terminal and y0 = 0 for the initial state of each episode [stale copy], with intermediate values linearly interpolated between [0,1]; and paragraph [0089] discloses episode length ..., the most recent 100 episodes was used to ... [fresh copy]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the policy output that may be a Q-value that is an estimate of the long-term time-discounted reward of Mnih with the state value function for the return of discounted rewards from state s following a policy π of Kartal in order to minimize the mean-squared error using the loss function (see Kartal, paragraph [0072]).

With respect to claim 16, Mnih in view of Kartal teaches elements of claim 1. Furthermore, Mnih teaches the method wherein an output of the policy model is parameters of probability distributions over a domain of an action space (paragraph [0039] discloses generating a policy output that the system 100 uses to determine an action 110 to be performed by the agent 108 at the time step. For example, the policy output may be a probability distribution over the set of possible actions). Mnih failed to teach the corresponding output of the policy model as π(s), for a given observation (s) of an environment state.
However, Kartal teaches the output of the policy model as π(s), for a given observation (s) of an environment state (paragraph [0072] discloses that the action-value function is the expected return following policy π after taking action a from state s). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the policy output that may be a probability distribution over the set of possible actions of Mnih with the state value function for the return of discounted rewards from state s following a policy π of Kartal in order to minimize the mean-squared error using the loss function (see Kartal, paragraph [0072]).
With respect to claim 17, Mnih in view of Kartal teaches elements of claim 15, furthermore, Mnih teaches the method wherein the offline variant includes an offline algorithm comprising: 
sampling minibatches of tuples from available data (paragraph [0057] discloses that the sampling engine 116 samples sequences of experience tuples ..., observation in the sequence is non-zero are sampled with a higher probability ...);
computing a critic loss function, LQ, and an actor loss function, Lπ (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function ..., the loss function is given by the mean-squared error between the actual reward received ...);
differentiating each of the critic loss function and the actor loss function with respect to neural-net parameters (paragraph [0008] discloses that the gradients may be gradients of a policy loss function for an auxiliary control task with respect to the parameters; the policy output from an auxiliary control neural network may be used to compute a loss function for such backpropagation ..., the particular loss function depends on the auxiliary control task(s) ...);
performing a stochastic gradient-descent-based update to the neural-net parameters (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function); and
updating the stale copy toward the fresh copy by a geometric coefficient (Fig. 2, 201 discloses obtaining, by randomly sampling the replay memory, the observation at a time step, the action at the time step, and the observation at the next time step ..., and paragraph [0007] discloses that the neural network training comprises adjusting values of the action selection policy network parameters).
With respect to claim 18, Mnih in view of Kartal teaches elements of claim 17. Furthermore, Mnih teaches the method wherein: for a discrete-action case, (1) a target for the critic loss function is computed exactly by marginalizing over a probability of each action selection by the stale policy model; and (2) a target for the actor loss is computed exactly and a cross entropy loss function is used to make the policy model match the target (paragraph [0082] discloses that the system backpropagates gradients to minimize a loss function ..., the loss function is given by the mean-squared error between the actual reward received ...); and
wherein, for a continuous-action case, targets of the critic loss function and the actor loss function are not computed exactly, where sampling from the policy model and the stale copy of the policy model are used to stochastically approximate the targets, where a variance from the sampling is smoothed by a stochastic gradient descent process (paragraph [0008] discloses that the gradients may be gradients of a policy loss function for an auxiliary control task with respect to the parameters ..., to compute a loss function for such backpropagation the particular loss function depends on the auxiliary control ..., the gradients may be determined from a reward prediction loss function ...; paragraph [0010] discloses backpropagating the computed gradients; and paragraph [0016] discloses determining gradients based on the predicted reward generated by the reward prediction neural network).




With respect to claim 19, Mnih teaches a non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program (Fig. 1, 100 discloses a reinforcement learning system, and paragraph [0006] discloses a reinforcement learning system implemented as computer programs on one or more computers in one or more locations that selects actions to be performed by an agent interacting with an environment) instructs one or more processors to perform the following steps:
using an action-value model for training a policy model, the action-value model (paragraph [0007] discloses training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of action selection policy network parameters and is used in selecting actions to be performed by an agent interacting with an environment); and
estimating, within one or more processors of the agent, an expected future discounted reward that would be received if a hypothetical action was selected under a current observation of the agent and the agent's behavior was followed thereafter (paragraph [0015] discloses an estimate of the long-term time-discounted change in the activation generated by the unit if the agent performs the possible action in response to the received observation image ...; paragraph [0016] discloses generating a predicted reward that is an estimate of a reward that will be received with a next observation; and paragraph [0039] discloses a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation ...; as another example, the policy output may identify a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation).

Mnih teaches the above elements including maintaining a stale copy of both the action-value model and the policy model, wherein the stale copy is initialized identically to the fresh copy and is slowly moved to match the fresh copy as learning (Fig. 2 discloses providing the observation as input [stale copy] to the action selection policy neural network and providing the intermediate output of the selection policy ..., and paragraph [0026] discloses generating more useful representations of observed data and ultimately determining more effective policy outputs to maximize cumulative extrinsic reward) and updates are performed on the fresh copy (paragraph [0050] discloses a reinforcement learning technique to adjust the values of the set of parameters ...), and to obtain observations for a sequence of time steps and the actual reward received following the last observation, generate one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, process the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward, and backpropagate gradients that the system determines based on the policy output into the action selection policy neural network (Fig. 3). Mnih failed to teach wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model.
However, Kartal teaches wherein the algorithm has both an offline variant, in which the algorithm is trained using previously collected data, and an online variant, in which data is collected as the algorithm trains the policy model (paragraph [0082] discloses generating an expert move dataset [offline variant]; paragraph [0083] discloses per action selection to construct an expert dataset [online variant in which data is collected] ..., in an on-policy fashion to improve training efficiency for actor-critic RL in challenging domains with abundant ...; paragraph [0088] discloses an actual terminal and y0 = 0 for the initial state of each episode [stale copy], with intermediate values linearly interpolated between [0,1]; and paragraph [0089] discloses episode length ..., the most recent 100 episodes was used to ... [fresh copy]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the generation of a predicted reward that is an estimate of a reward that will be received with a next observation of Mnih with the initial dataset and intermediate values of Kartal in order to increase the number of agent-environment interactions with positive rewards for hard-exploration RL problems and to improve training efficiency (see Kartal, paragraph [0084]).


With respect to claim 20, Mnih in view of Kartal teaches elements of claim 19, furthermore, Mnih teaches the non-transitory computer-readable medium wherein: the stale copy of the policy model acts as an old policy to be evaluated by the fresh copy of the action-value model critic; and the stale copy of the action-value model provides Q-values of an earlier policy model on which a fresh policy model improves (paragraph [0039], discloses the system uses the selection of the policy .., in response to the observation 104 as input [stale copy of the policy model acts as an old policy] .., generates a policy output that the system 100 uses to determine an action to be performed [fresh policy], the policy output may be a probability distribution over the set of possible actions…, the policy output may be a Q-value that is an estimate of the long-term time-discounted reward).
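For illustration only, and not drawn from Mnih or Kartal: the claimed relationship between a stale copy and a fresh copy resembles what is commonly called a soft (Polyak) target update. The sketch below is a minimal example under stated assumptions. The parameter representation (a flat list of floats), the function names `init_stale` and `soft_update`, and the mixing rate `rate` are hypothetical and do not appear in the claims or the cited references.

```python
import copy


def init_stale(fresh):
    """Initialize the stale copy identically to the fresh copy."""
    return copy.deepcopy(fresh)


def soft_update(stale, fresh, rate=0.01):
    """Move each stale parameter a small step toward its fresh counterpart.

    `rate` is an assumed mixing coefficient in (0, 1]; smaller values make
    the stale copy track the fresh copy more slowly.
    """
    return [(1.0 - rate) * s + rate * f for s, f in zip(stale, fresh)]


if __name__ == "__main__":
    fresh = [0.5, -1.0]
    stale = init_stale(fresh)   # identical at initialization
    fresh = [1.5, 0.0]          # stand-in for a learning update on the fresh copy
    for step in range(3):
        stale = soft_update(stale, fresh, rate=0.1)
        print(step, stale)      # stale drifts toward fresh without jumping to it
```

In this sketch the learning update touches only `fresh`, while `stale` lags behind, which is one plausible reading of "slowly moved to match the fresh copy".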




Claim(s) 6, 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Mnih et al (US Pub No., 2019/0258938 A1) in view of Kartal et al (US Pub., 2020/0143206 A1) and further in view of Clifton et al (US Pub., 2015/0227837 A1).
With respect to claim 6, Mnih in view of Kartal teaches elements of claim 4, furthermore, Mnih teaches the algorithm wherein, when the action space is a continuous n-dimensional action space, the parameters outputted (paragraphs [0069]-[0070], discloses a pixel control neural network, then the policy outputs may be an Nact tensor Q, where Nact is the number of possible actions that can be performed by the agent and Q(a,i,j) is an estimate of the long-term time-discounted change in the pixels in the (i,j)th region of the n*n nonoverlapping grid placed over the observation image) and Kartal teaches RL can explore on the action space filtered down by the planner outperforming … (paragraph [0065]). Mnih fails to teach that the corresponding n*n region, and Kartal fails to teach that the corresponding action spaces, are a mean and a covariance of a multivariate Gaussian distribution over the action space.
However, Clifton teaches action space or region are a mean and a covariance of a multivariate Gaussian distribution over the action space (Fig. 3, 33, b and paragraph [0019], discloses a Gaussian process which fits the functions of the training set bases and paragraph [0040], discloses the Gaussian process is completely derived by the three variance hyperparameters .., with mean function ..). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the n*n region of Mnih and the action spaces of Kartal with the Gaussian process of Clifton in order to maximize the joint likelihood of all the training time-series, but other estimation techniques such as Bayesian estimation or use of a mixture of Gaussian Processes (see Clifton paragraph [0040]).

With respect to claim 12, Mnih in view of Kartal teaches elements of claim 7, furthermore, Mnih teaches the algorithm wherein a target (Tπ) of the actor loss function is a probability distribution over the Q-values from the stale copy of the action-value model (paragraph [0056], discloses a reward prediction loss is a mean-squared error loss between the predicted reward and the actual reward received with a next observation that follows the last observation in the sequence [actor loss] and paragraph [0082], discloses the system backpropagates gradients to minimize a loss function .., the loss function is given by the mean-squared error between the actual reward received ..) and Kartal teaches the default weight parameter may be employed i.e., for actor loss 0.5 for value loss and 0.01 for entropy loss (paragraph [0161]).
The combination of Mnih and Kartal fails to teach that the corresponding actor loss is based on a density defined for each action and an actor loss function
[Equation image: media_image1.png]

wherein τ is a temperature hyperparameter that defines how greedy a target distribution is towards a highest scoring Q-value, wherein as the temperature hyperparameter approaches zero, the probability distribution is more greedy and as the temperature hyperparameter approaches infinity, the probability distribution becomes more uniform.
However, Clifton teaches a density for each action is defined

[Equation image: media_image1.png]

wherein τ is a temperature hyperparameter that defines how greedy a target distribution is towards a highest scoring Q-value, wherein as the temperature hyperparameter approaches zero, the probability distribution is more greedy and as the temperature hyperparameter approaches infinity, the probability distribution becomes more uniform (paragraphs [0038]-[0040], discloses the Gaussian Process is completely defined by the three variance hyperparameters …, probability density functions fn(s) which indicate the probability density value y of any point in the n-dimensional (24-dimensional in this example) function space, each point in that space defining a function consisting of 24 successive x values, and the parameters of the Gaussian process which give the best fit for functions of the training set are obtained …). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the loss function of Mnih and Kartal with the Gaussian process of Clifton in order to maximize the joint likelihood of all the training time-series, but other estimation techniques such as Bayesian estimation or use of a mixture of Gaussian Processes (see Clifton paragraph [0040]).
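For illustration only: the equation in media_image1.png is not reproduced in this text, so the sketch below does not purport to be the claimed formula. It assumes the target distribution has the common Boltzmann (softmax) form over Q-values, which exhibits the limiting behavior recited for τ. The function name `boltzmann` and the example Q-values are hypothetical.

```python
import math


def boltzmann(q_values, tau):
    """Assumed softmax over Q-values with temperature tau > 0.

    Subtracting the maximum Q-value before exponentiating avoids overflow
    and does not change the resulting probabilities.
    """
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    # Small tau: mass concentrates on the highest scoring Q-value (greedy).
    print(boltzmann(q, 0.01))
    # Large tau: mass spreads out toward a uniform distribution.
    print(boltzmann(q, 1000.0))
```

Under this assumed form, a small τ yields a nearly deterministic choice of the highest scoring action, and a large τ yields nearly equal probabilities across actions.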

With respect to claim 13, Mnih in view of Kartal teaches elements of claim 12, furthermore, Mnih teaches the algorithm wherein the probability distribution prevents the policy model from overfitting the Q-value estimate by (1) preventing the policy model from becoming deterministic, which would hinder exploration when used in the environment, and (2) preventing optimization of the policy model by exploiting a relatively small error in the Q-value estimate that overestimates a suboptimal action selection (paragraph [0039], discloses the policy output may be a probability distribution over the set of possible actions. As another example, the policy output may be a Q-value that is an estimate of the long-term time-discounted reward that would be received if the agent 108 performs a particular action in response to the observation. As another example, the policy output may identify a particular action that is predicted to yield the highest long-term time-discounted reward [within the scope of preventing policy and optimization of policy]).
The following prior art is of record:
Mnih et al (US Pub No., 2019/0258938 A1) discloses a computer system and method for extending parallelized asynchronous reinforcement learning for training a neural network is described in various embodiments, through coordinated

Kartal et al (US Pub., 2020/0143206 A1)  discloses methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a reinforcement learning system.

Clifton et al  (US Pub., 2015/0227837 A1) discloses a method of monitoring a system such as a machine, industrial system, or human or animal patient, to classify the system as normal or abnormal, in which a time-series of measurements of the system are regarded as a function to be compared to a model of normality for such functions.



Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SABA DAGNEW whose telephone number is (571)270-3271. The examiner can normally be reached 9-6:45.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Waseem Ashraf, can be reached on (571) 270-3948. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SABA DAGNEW/Primary Examiner, Art Unit 3682