DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
The present application is being examined under the claims filed on 06/01/2021.
Claims 21 and 22 are new.
Claims 2, 4, 12, 14, 17, and 19 are canceled.
Claims 1, 3, 11, 13, 16, and 18 are amended.
Claims 1, 3, 5-11, 13, 15, 16, 18, and 20 are rejected.
Claims 21 and 22 are objected to.
Claims 1, 3, 5-11, 13, 15, 16, 18, and 20-22 are pending.

Drawings
The Drawings filed on 11/22/2017 are acceptable for examination purposes.

Specification
The Specification filed on 11/22/2017 is acceptable for examination purposes.

Computer Readable Storage Medium(s) Positive Statement
The computer readable storage medium(s) has been interpreted to exclude carrier waves, propagated signals or the like based on the reading of ¶ [0089] of the instant specification. Examiner interprets computer readable storage medium(s) to include only non-transitory embodiments of 

Response to Arguments
In reference to rejections under 35 USC § 102 and 35 USC § 103
Applicant asserts that the recitation of the cited reference for Boltzmann exploration appears to be a distinct kind of learning, with no immediate relevance to the other cited types of exploration. Thus, there is no reason to think, based on the cited references, that the cited temperature parameter might be used to calculate an exploration term ε.
Examiner respectfully disagrees. Both Hosu et al. and Stadie et al. disclose epsilon-greedy (ε-greedy) methods. Epsilon-greedy Boltzmann methods are known in the art; in support, Examiner has added to the PTO-892 the reference Michel Tokic - "Adaptive e-greedy Exploration in Reinforcement Learning Based on Value Differences" (not relied upon for the rejection). The rejection relies on Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”. Examiner notes that a greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. In other words, the Boltzmann exploration term ε is based on the current value and a temperature parameter.
Applicant's arguments filed 06/01/2021 have been fully considered but they are not persuasive. 
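
For illustration only, the Boltzmann exploration quoted from Stadie § 5 can be sketched as a softmax over action values scaled by a temperature parameter. The sketch is not taken from Hosu or Stadie: the function name `boltzmann_probabilities` and its arguments are hypothetical, and the softmax form is the conventional textbook formulation, assumed here.

```python
import math


def boltzmann_probabilities(q_values, temperature):
    """Return softmax probabilities over actions (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum for numerical stability; the result is unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this assumed formulation, every action receives a positive probability, actions with higher expected utility receive higher probability, and a larger temperature flattens the distribution toward uniform.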

Allowable Subject Matter
Claim 21 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim 22 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-11, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hosu et al. (hereinafter Hosu) “Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay” in view of Stadie et al. (hereinafter Stadie) “INCENTIVIZING EXPLORATION IN REINFORCEMENT LEARNING WITH DEEP PREDICTIVE MODELS”.
In reference to claim 1. Hosu teaches a computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a computer (Hosu discloses an environment for deep reinforcement learning. Examiner notes that one of ordinary skill in the art would know to implement the deep reinforcement learning in a computer program product including a computer readable storage medium storing program instructions that are executable by a computer) to cause the computer to perform operations comprising:
“inputting a current time frame of an action and observation sequence sequentially into a neural network including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“approximating a value function using the neural network based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network);
“updating an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”),
“training the neural network by updating the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
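
For illustration only, the quantity described in the quoted passage of Hosu § 4.1, a sum of rewards discounted at each time step, can be sketched as follows. The code is not taken from Hosu; the function name `discounted_return` and the parameter name `gamma` for the discount factor are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards r_t, r_{t+1}, ... with r_{t+k} weighted by gamma**k."""
    total = 0.0
    # Iterate backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With three unit rewards and a discount factor of 0.5, the sketch evaluates 1 + 0.5 + 0.25 = 1.75.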

Hosu does not explicitly disclose:
“including calculating an exploration term ε based on the current value and a temperature parameter”; and
However, Stadie discloses:
“including calculating an exploration term ε based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”. Examiner notes that a greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. In other words, the Boltzmann exploration term ε is based on the current value and a temperature parameter); and
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have motivation to combine Hosu and Stadie because the results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).

In reference to claim 5. Hosu and Stadie teach the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 6. Hosu and Stadie teach the computer program product of claim 1 (as mentioned above), wherein the inputting of the current time frame includes:
Hosu further discloses:
“selecting an action according to the action selection policy with which to proceed from a current time frame of the action and observation sequence to a subsequent time frame of the action and observation sequence” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”),
“causing the selected action to be performed” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”), and
“obtaining an observation of the subsequent time frame of the action and observation sequence” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 7. Hosu and Stadie teach the computer program product of claim 6 (as mentioned above), wherein:
Hosu does not explicitly disclose:
“the selecting an action includes selecting one of a random action and a greedy action, wherein the random action has a probability of being selected equal to the exploration term”.
However, Stadie discloses:
“the selecting an action includes selecting one of a random action and a greedy action, wherein the random action has a probability of being selected equal to the exploration term” (Stadie in at least § 4.1 “We compared two separate methodologies for capturing these images. 1. Static AE: A random agent plays for enough time to collect the required images. The auto-encoder is trained offline before the policy learning algorithm begins. 2. Dynamic AE: Initialize with an epsilon-greedy strategy and collect images and actions while the agent acts under the policy learning algorithm. After 5 epochs, train the auto encoder from this data. Continue to collect data and periodically retrain the auto encoder in parallel with the policy training algorithm”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep 
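
For illustration only, the claimed selection between a random action and a greedy action, where the random action is chosen with probability equal to the exploration term, can be sketched as follows. The code is not taken from Hosu or Stadie; the function name `select_action` and its parameters are hypothetical.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Exploration: uniform choice among all actions.
        return rng.randrange(len(q_values))
    # Exploitation: index of the largest action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In this sketch an exploration term of 0 always yields the greedy action, and an exploration term of 1 always yields a uniformly random action.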

In reference to claim 8. Hosu and Stadie teach the computer program product of claim 7 (as mentioned above), wherein:
Stadie further discloses:
“the selecting of the greedy action includes evaluating each reward probability of a plurality of possible actions according to a probability function based on the value function, and the selected greedy action among the plurality of possible actions yields the largest reward probability from the probability function” (Stadie in at least § 3, § 4.1, and § 6 disclose the selecting of the greedy action includes evaluating each reward probability of a plurality of possible actions according to a probability function based on the value function, and the selected greedy action among the plurality of possible actions yields the largest reward probability from the probability function).

In reference to claim 9. Hosu and Stadie teach the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the approximating the value function includes acquiring the current value from an evaluation of the value function in consideration of an actual reward” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 10. Hosu and Stadie teach the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the updating the plurality of parameters includes storing at least one experience in a replay memory, each experience including action values, observation values, and reward values of a previous time frame and observation values of a current time frame, and sampling at least one experience transition, each experience transition including two consecutive experiences” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
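
For illustration only, storing experiences in a replay memory and drawing a minibatch at random, as described in the quoted passage of Hosu § 4.1, can be sketched as follows. The code is not taken from Hosu; the class name `ReplayMemory`, its method names, and the tuple layout are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of experiences (hypothetical illustration)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item once it is full.
        self._buffer = deque(maxlen=capacity)

    def store(self, observation, action, reward, next_observation):
        self._buffer.append((observation, action, reward, next_observation))

    def sample(self, batch_size, rng=random):
        # Draw a minibatch uniformly at random, without replacement.
        return rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Each stored tuple in this sketch pairs the observation, action, and reward of one time frame with the observation of the following time frame.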

In reference to claim 11. Hosu teaches a method comprising:
“inputting a current time frame of an action and observation sequence sequentially into a neural network including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“approximating a value function using the neural network based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network);
“updating an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”),
“training the neural network by updating the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
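
For illustration only, applying Q-learning updates to a minibatch of experience can be sketched with a lookup table in place of the convolutional neural network described in Hosu § 4.1. The tabular form is a deliberate simplification and is not Hosu's method; the function name `q_learning_update`, the step size `alpha`, and the discount factor `gamma` are hypothetical, and the sketch omits the target network and terminal-state handling.

```python
def q_learning_update(q_table, batch, alpha, gamma):
    """Apply one tabular Q-learning step per (s, a, r, s_next) in batch."""
    for state, action, reward, next_state in batch:
        # Bootstrapped target: reward plus discounted best next-state value.
        target = reward + gamma * max(q_table[next_state])
        q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table
```

In a network-based setting the same target would instead drive a gradient step on the plurality of parameters, which this sketch does not attempt to show.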

Hosu does not explicitly disclose:
“including calculating an exploration term ε based on the current value and a temperature parameter”; and
However, Stadie discloses:
“including calculating an exploration term ε based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”. Examiner notes that a greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. In other words, the Boltzmann exploration term ε is based on the current value and a temperature parameter); and
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzman exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have motivation to combine Hosu and Stadie because the results show that Boltzman exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).
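For illustration only, the following minimal Python sketch contrasts the two exploration rules discussed above: Boltzmann (softmax) selection, in which every action receives a positive probability determined by its estimated value and a temperature parameter, and ε-greedy selection. The function names, the `q_values` list, and the `rng` argument are assumptions introduced here. The sketch is not taken from Hosu, Stadie, or the claims, and it does not purport to show how the claimed exploration term ε is calculated.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Return softmax probabilities over action values for a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the maximum keeps exp() from overflowing and leaves the result unchanged.
    highest = max(q_values)
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In this sketch a low temperature concentrates probability on the highest-valued action, while a high temperature approaches uniform random selection; ε-greedy, by contrast, ignores the relative values of the non-greedy actions when it explores.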

In reference to claim 15. Hosu and Stadie teach the method of claim 11 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
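For illustration only, the quantity described in the quoted passage (a sum of rewards r_t discounted by γ at each time step t) can be sketched in Python as follows. The function name `discounted_return` and the sample reward list are assumptions introduced here and do not appear in Hosu.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Working backwards applies one extra factor of gamma per elapsed time step.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], 0.5)
```

An action-value function assigns to each observation and action pair the expected value of this sum when the policy is followed thereafter.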

In reference to claim 16. Hosu teaches an apparatus comprising:
“a hardware processor” (Hosu discloses an environment for deep reinforcement learning. Examiner notes that one of ordinary skill in the art would know to implement the deep reinforcement learning in a computer program product including a computer readable storage medium storing program instructions that are executable by a computer); and
“a memory that stores a computer program product, which, when executed by the hardware processor” (Hosu discloses an environment for deep reinforcement learning. Examiner notes that one of ordinary skill in the art would know to implement the deep reinforcement learning in a computer program product including a computer readable storage medium storing program instructions that are executable by a computer), causes the hardware processor to:
“input a current time frame of an action and observation sequence sequentially into a neural network including a plurality of parameters, the action and observation sequence […]” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“approximate a value function using the neural network based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]
“update an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”),
“train the neural network with an update of the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

Hosu does not explicitly disclose:
“including calculating an exploration term ε based on the current value and a temperature parameter”; and
However, Stadie discloses:
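“including calculating an exploration term ε based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”. Examiner notes that a greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. In other words, the Boltzmann exploration term ε is based on the current value and a temperature parameter).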
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have been motivated to combine Hosu and Stadie because Stadie's results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).
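For illustration only, the training step described in the quoted Hosu § 4.1 passage (Q-learning updates applied to minibatches drawn at random from a replay memory, with a separate target network) can be sketched in Python as follows. The sketch substitutes plain lookup tables for the online and target neural networks, and every name in it (`q_learning_minibatch_update`, `q`, `target_q`, `replay_memory`, `alpha`) is an assumption introduced here rather than anything disclosed in Hosu.

```python
import random


def q_learning_minibatch_update(q, target_q, replay_memory, batch_size,
                                alpha, gamma, rng=random):
    """Apply one-step Q-learning updates to `q` on a random minibatch.

    q, target_q: dicts mapping a state to a list of action values; they stand in
    for the online network and the target network respectively.
    replay_memory: list of (state, action, reward, next_state, done) tuples.
    """
    batch = rng.sample(replay_memory, min(batch_size, len(replay_memory)))
    for state, action, reward, next_state, done in batch:
        # Bootstrap from the frozen target table, not from the table being updated.
        bootstrap = 0.0 if done else gamma * max(target_q[next_state])
        td_target = reward + bootstrap
        q[state][action] += alpha * (td_target - q[state][action])
    return q
```

Sampling at random from stored experience breaks the correlation between consecutive transitions, and holding the target table fixed between periodic copies keeps the bootstrap target from shifting with every update; these are the stabilizing roles that the quoted passage attributes to experience replay and the target network.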

In reference to claim 20. Hosu and Stadie teach the apparatus of claim 16 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

Claims 3, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hosu et al. (hereinafter Hosu) “Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay” in view of Stadie et al. (hereinafter Stadie) “INCENTIVIZING EXPLORATION IN REINFORCEMENT LEARNING WITH DEEP PREDICTIVE MODELS” in view of Carmel et al. (hereinafter Carmel) “Exploration Strategies for Model-based Learning in Multi-agent Systems”.
In reference to claim 3. Hosu and Stadie teach the computer program product of claim 1 (as mentioned above), wherein:
Hosu and Stadie do not explicitly disclose:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating the action selection policy includes updating the temperature parameter using the exploration term and the current value).
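It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives, to improve its knowledge for better decisions in the future (Carmel § Abstract).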

In reference to claim 13. Hosu and Stadie teach the method of claim 11 (as mentioned above), wherein:
Hosu and Stadie do not explicitly disclose:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating the action selection policy includes updating the temperature parameter using the exploration term and the current value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives, to improve its knowledge for better decisions in the future (Carmel § Abstract).

In reference to claim 18. Hosu and Stadie teach the apparatus of claim 16 (as mentioned above), wherein the computer program product further causes the hardware processor to:
Hosu and Stadie do not explicitly disclose:
“update the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“update the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating the action selection policy includes updating the temperature parameter using the exploration term and the current value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives, to improve its knowledge for better decisions in the future (Carmel § Abstract).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Viker A. Lamardo whose telephone number is (571)270-5871.  The examiner can normally be reached on Mon. - Fri. 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J. Lo, can be reached on (571)272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact 






/VIKER A LAMARDO/Primary Examiner, Art Unit 2126