DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
The present application is being examined under the claims filed on 11/22/2017.
Claims 1-20 are rejected.
Claims 1-20 are pending.

Drawings
The Drawings filed on 11/22/2017 are acceptable for examination purposes.

Specification
The Specification filed on 11/22/2017 is acceptable for examination purposes.

Computer Readable Storage Medium(s) Positive Statement
The computer readable storage medium(s) has been interpreted to exclude carrier waves, propagated signals or the like based on the reading of ¶ [0089] of the instant specification. Examiner interprets computer readable storage medium(s) to include only non-transitory embodiments of computer readable mediums. Therefore, claims 1-11 are not being rejected under 35 USC § 101 for signal(s) per se. If the applicant disagrees with Examiner's interpretation of the specification, the applicant should indicate this on the record.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “approximating a value function using the function approximator based on the current time frame to acquire a current value”, “updating an action selection policy through exploration based on an ε-greedy strategy using the current value”, and “updating the plurality of parameters”, which are directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The generic computer elements are the “computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a computer”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites additional limitations of “inputting a current time frame of an action and observation sequence sequentially into a function approximator including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values”. The additional limitations are directed to receiving or transmitting data over a network, which are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).

Claim 3 recites an additional step of “the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 4 recites an additional step of “the function approximator is a neural network”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 5 recites an additional step of “the value function is an action-value function”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 6 recites an additional step of “selecting an action according to the action selection policy with which to proceed from a current time frame of the action and observation sequence to a subsequent time frame of the action and observation sequence”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 7 recites an additional step of “the selecting an action includes selecting one of a random action and a greedy action, wherein the random action has a probability of being selected equal to the exploration term”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
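For illustration only, the selection recited in claim 7 can be sketched as follows. This is a minimal sketch and not a characterization of the applicant's implementation: the function name `select_action`, the list `q_values` of per-action value estimates, and the parameter `epsilon` (standing in for the recited exploration term) are assumptions introduced here.

```python
import random
from typing import Optional, Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return an action index under an epsilon-greedy rule (illustrative only)."""
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        # Random (exploratory) action, taken with probability epsilon.
        return rng.randrange(len(q_values))
    # Greedy action: the index with the largest estimated value.
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With `epsilon` set to 0 the sketch always returns the greedy action; with `epsilon` set to 1 it always returns a uniformly random action.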

Claim 9 recites an additional step of “the approximating the value function includes acquiring the current value from an evaluation of the value function in consideration of an actual reward”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 10 recites an additional step of “the updating the plurality of parameters includes storing at least one experience in a replay memory, each experience including action values, observation values, and reward values of a previous time frame and observation values of a current time frame, and sampling at least one experience transition, each experience transition including two consecutive experiences”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
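For illustration only, the storing and sampling recited in claim 10 can be sketched as follows. This is a minimal sketch under stated assumptions: the names `Experience`, `ReplayMemory`, `store`, and `sample_transition`, the fixed capacity, and the uniform sampling are hypothetical choices made here and are not taken from the specification.

```python
import random
from collections import deque
from typing import Deque, NamedTuple, Optional, Tuple


class Experience(NamedTuple):
    prev_action: int         # action values of the previous time frame
    prev_observation: tuple  # observation values of the previous time frame
    reward: float            # reward values of the previous time frame
    observation: tuple       # observation values of the current time frame


class ReplayMemory:
    """Bounded store of experiences (capacity is an assumption of this sketch)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # A deque with maxlen discards the oldest experience once full.
        self._buffer: Deque[Experience] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._buffer)

    def store(self, experience: Experience) -> None:
        self._buffer.append(experience)

    def sample_transition(self) -> Tuple[Experience, Experience]:
        """Sample one experience transition: two consecutive experiences."""
        if len(self._buffer) < 2:
            raise ValueError("need at least two stored experiences")
        i = self._rng.randrange(len(self._buffer) - 1)
        return self._buffer[i], self._buffer[i + 1]
```

In this sketch, consecutive entries in the buffer correspond to consecutive time frames, so a sampled pair shares one observation between its two experiences.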

Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “approximating a value function using the function approximator based on the current time frame to acquire a current value”, “updating an action selection policy through exploration based on an ε-greedy strategy using the current value”, and “updating the plurality of parameters” which are directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements, the generically 
Claim 12 recites an additional step of “the updating of the action selection policy includes calculating an exploration term based on the current value and a temperature parameter”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 13 recites an additional step of “the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 14 recites an additional step of “the function approximator is a neural network”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 15 recites an additional step of “the value function is an action-value function”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.

Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “approximate a value function using the function approximator based on the current time frame to acquire a current value”, “update an action selection policy through exploration based on an ε-greedy strategy using the current value”, and “update the plurality of parameters”, which are directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The generic computer elements are the “inputting section”, “approximating section”, “action selection policy updating section”, and “parameter updating section”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites additional limitations of “input a current time frame of an action and observation sequence sequentially into a function approximator including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values”. The additional limitations are directed to receiving or transmitting data over a network, which are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Claim 17 recites an additional step of “calculate an exploration term based on the current value and a temperature parameter”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper. The additional limitation of the calculating section is directed to an abstract idea with additional generic 
Claim 18 recites an additional step of “update the temperature parameter using the exploration term and the current value”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper. The additional limitation of the temperature updating section is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer.
Claim 19 recites an additional step of “the function approximator is a neural network”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.
Claim 20 recites an additional step of “the value function is an action-value function”. The additional step does not amount to significantly more because it is directed to an abstract idea of performing the step(s) mentally with the aid of a pen and paper.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) are:
“an inputting section configured to […]” in claim 16.
“an approximating section configured to […]” in claim 16.
“an action selection policy updating section configured to […]” in claim 16.
“a parameter updating section configured to […]” in claim 16.
“a calculating section configured to […]” in claim 17.
“a temperature updating section configured to […]” in claim 18.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 16-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim limitations “an inputting section configured to […]”, “an approximating section configured to […]”, “an action selection policy updating section configured to […]”, “a parameter updating section configured to […]”, “a calculating section configured to […]”, and “a temperature updating section configured to […]” invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The disclosure is devoid of any structure that performs the function in the claim. Therefore, claims 16-20 are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:

(b)	Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c)	Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)	Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b)	Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory 
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 4-6, 9-11, 14-16, 19, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hosu et al. (hereinafter Hosu), “Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay”.
In reference to claim 1. Hosu teaches a computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a computer (Hosu discloses an environment for deep reinforcement learning. Examiner notes that one of ordinary skill in the art would know to implement the deep reinforcement learning in a computer program product including a computer readable storage medium storing program instructions that are executable by a computer) to cause the computer to perform operations comprising:
“inputting a current time frame of an action and observation sequence sequentially into a function approximator including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“approximating a value function using the function approximator based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network);
“updating an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”); and
“updating the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
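For illustration only, the quantities named in the quoted passage of Hosu § 4.1 (a discounted sum of rewards and a Q-learning update) can be sketched as follows. This is a minimal sketch of the standard textbook forms, not Hosu's code: the function names, the discount factor `gamma`, and the one-step target form `reward + gamma * max(next_q_values)` are assumptions introduced here.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards r_t, each discounted by gamma per elapsed time step."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def q_learning_target(reward: float, next_q_values: Sequence[float],
                      gamma: float, terminal: bool = False) -> float:
    """One-step Q-learning target for a single sampled experience."""
    if terminal:
        # No future value is bootstrapped after a terminal time frame.
        return reward
    return reward + gamma * max(next_q_values)
```

In a training loop of the kind the passage describes, a target such as this would be computed for each experience in a minibatch, and the parameters of the function approximator would be adjusted to reduce the difference between the target and the current value.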

In reference to claim 4. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the function approximator is a neural network” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network).

In reference to claim 5. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 6. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein the inputting of the current time frame includes:
Hosu further discloses:
“selecting an action according to the action selection policy with which to proceed from a current time frame of the action and observation sequence to a subsequent time frame of the action and observation sequence” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”),
“causing the selected action to be performed” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”), and
“obtaining an observation of the subsequent time frame of the action and observation sequence” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 9. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the approximating the value function includes acquiring the current value from an evaluation of the value function in consideration of an actual reward” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
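For illustration only, the "sum of rewards r_t discounted by γ at each time step t" described in the quoted passage can be sketched as below. The function name `discounted_return`, the sample reward list, and the value γ = 0.5 are assumptions chosen for this example; they do not appear in Hosu or in the claims, and the sketch does not reproduce the elided action-value formula.

```python
def discounted_return(rewards, gamma):
    """Return the sum of rewards r_t, each weighted by gamma**t.

    `rewards` is a hypothetical list [r_0, r_1, ...] and `gamma` is the
    discount factor; both are assumptions made for this illustration.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three unit rewards discounted by 0.5 per step: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor below 1, later rewards contribute less to the computed value than earlier ones.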

In reference to claim 10. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein:
Hosu further discloses:
“the updating the plurality of parameters includes storing at least one experience in a replay memory, each experience including action values, observation values, and reward values of a previous time frame and observation values of a current time frame, and sampling at least […]” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
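For illustration only, a replay memory of the general kind referred to in the quoted passage (experiences stored, then minibatches "drawn at random") might be sketched as follows. The class name `ReplayMemory`, its method names, the tuple layout, and the capacity are hypothetical choices for this example and are not taken from Hosu or from the claims.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of experiences (hypothetical sketch)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, prev_obs, action, reward, obs):
        # One experience: previous observation, action, reward, and the
        # observation of the current time frame.
        self.buffer.append((prev_obs, action, reward, obs))

    def sample(self, batch_size):
        # Draw a minibatch uniformly at random, without replacement.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    memory.store(prev_obs=step, action=step % 2, reward=1.0, obs=step + 1)
batch = memory.sample(batch_size=2)
```

Sampling at random from stored experiences, rather than using only the most recent one, is the property the quoted passage associates with stabilizing training.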

In reference to claim 11. Hosu teaches a method comprising:
“inputting a current time frame of an action and observation sequence sequentially into a function approximator including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“approximating a value function using the function approximator based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network);
“updating an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”); and
“updating the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).
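For illustration only, an ε-greedy action selection of the general kind named in the limitation above can be sketched as follows. The function name `epsilon_greedy`, the list of action values, and the `rng` parameter are assumptions made for this example; neither Hosu nor the claims specify this code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability `epsilon` a uniformly random action is explored;
    otherwise the action with the largest current value is exploited.
    All names here are hypothetical.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# epsilon = 0 always exploits; epsilon = 1 always explores.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)
```

The sketch shows why the current value matters to the policy: the greedy branch depends entirely on the values produced by the function approximator.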

In reference to claim 14. Hosu teaches the method of claim 11 (as mentioned above), wherein:
Hosu further discloses:
“the function approximator is a neural network” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network).

In reference to claim 15. Hosu teaches the method of claim 11 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 16. Hosu teaches an apparatus comprising:
“an inputting section configured to input a current time frame of an action and observation sequence sequentially into a function approximator including a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”);
“an approximating section configured to approximate a value function using the function approximator based on the current time frame to acquire a current value” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network);
“an action selection policy updating section configured to update an action selection policy through exploration based on an ε-greedy strategy using the current value” (Hosu in at least § 3.1 “All current approaches using deep reinforcement learning fail to learn any successful control policies for Montezuma’s Revenge. This happens mostly due to the ε-greedy strategy failing to explore the game in a consistent and efficient manner”); and
“a parameter updating section configured to update the plurality of parameters” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

In reference to claim 19. Hosu teaches the apparatus of claim 16 (as mentioned above), wherein:
Hosu further discloses:
“the function approximator is a neural network” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”. In at least § 4.1 the function approximator is a neural network).

In reference to claim 20. Hosu teaches the apparatus of claim 16 (as mentioned above), wherein:
Hosu further discloses:
“the value function is an action-value function” (Hosu in at least § 4.1 “Deep reinforcement learning [15, 16] was the first method able to learn successful control policies directly from high-dimensional visual input on the Atari domain. It consists of a convolutional neural network that extracts features from the game frames and approximates the following action-value function […] The computed value represents the sum of rewards r_t discounted by γ at each time step t, using a policy π for the observation s and action a. To solve the instability issue that reinforcement learning presents when a neural network is used to approximate the state-value function, experience replay [14] is used, as well as a target network [16]. In order to train the network, Q-learning updates are applied on minibatches of experience, drawn at random from the replay memory”).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2, 7, 8, 12, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Hosu et al. (hereinafter Hosu) “Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay” in view of Stadie et al. (hereinafter Stadie) “INCENTIVIZING EXPLORATION IN REINFORCEMENT LEARNING WITH DEEP PREDICTIVE MODELS”.
In reference to claim 2. Hosu teaches the computer program product of claim 1 (as mentioned above), wherein:
Hosu does not explicitly disclose:
“the updating of the action selection policy includes calculating an exploration term based on the current value and a temperature parameter”.
However, Stadie discloses:
“the updating of the action selection policy includes calculating an exploration term based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have been motivated to combine Hosu and Stadie because the results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).
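For illustration only, Boltzmann exploration as characterized in the Stadie quotation (a positive probability for every action, set by its expected utility and a temperature parameter) is commonly written as a softmax over action values. The sketch below assumes that common form; the function name `boltzmann_probabilities` and the sample values are hypothetical and are not taken from Stadie, Hosu, or the claims.

```python
import math


def boltzmann_probabilities(q_values, temperature):
    """Softmax over action values scaled by a temperature (assumed form).

    Every action receives a strictly positive probability. A higher
    temperature flattens the distribution toward uniform; a lower one
    concentrates it on the highest-valued action.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


probs_cool = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=0.5)
probs_hot = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=100.0)
```

In this form the temperature controls how much exploration occurs, which is the role the rejection attributes to the claimed temperature parameter.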

In reference to claim 7. Hosu teaches the computer program product of claim 6 (as mentioned above), wherein:
Hosu does not explicitly disclose:
“the selecting an action includes selecting one of a random action and a greedy action, wherein the random action has a probability of being selected equal to the exploration term”.
However, Stadie discloses:
“the selecting an action includes selecting one of a random action and a greedy action, wherein the random action has a probability of being selected equal to the exploration term” (Stadie in at least § 4.1 “We compared two separate methodologies for capturing these images. 1. Static AE: A random agent plays for enough time to collect the required images. The auto-encoder is trained offline before the policy learning algorithm begins. 2. Dynamic AE: Initialize with an epsilon-greedy strategy and collect images and actions while the agent acts under the policy learning algorithm. After 5 epochs, train the auto encoder from this data. Continue to collect data and periodically retrain the auto encoder in parallel with the policy training algorithm”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have been motivated to combine Hosu and Stadie because the results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).

In reference to claim 8. Hosu and Stadie teach the computer program product of claim 7 (as mentioned above), wherein:
Stadie further discloses:
“the selecting of the greedy action includes evaluating each reward probability of a plurality of possible actions according to a probability function based on the value function, and the selected greedy action among the plurality of possible actions yields the largest reward 

In reference to claim 12. Hosu teaches the method of claim 11 (as mentioned above), wherein:
Hosu does not explicitly disclose:
“the updating of the action selection policy includes calculating an exploration term based on the current value and a temperature parameter”.
However, Stadie discloses:
“the updating of the action selection policy includes calculating an exploration term based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have been motivated to combine Hosu and Stadie because the results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).

In reference to claim 17. Hosu teaches the apparatus of claim 16 (as mentioned above), wherein:
Hosu does not explicitly disclose:
“the action selection policy updating section includes a calculating section configured to calculate an exploration term based on the current value and a temperature parameter”.
However, Stadie discloses:
“the action selection policy updating section includes a calculating section configured to calculate an exploration term based on the current value and a temperature parameter” (Stadie in at least § 5 “In Boltzman exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu and Stadie. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. One of ordinary skill would have been motivated to combine Hosu and Stadie because the results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy (Stadie § Introduction).

Claims 3, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hosu et al. (hereinafter Hosu) “Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay” in view of Stadie et al. (hereinafter Stadie) “INCENTIVIZING EXPLORATION IN REINFORCEMENT LEARNING WITH DEEP PREDICTIVE MODELS” in view of Carmel et al. (hereinafter Carmel) “Exploration Strategies for Model-based Learning in Multi-agent Systems”.
In reference to claim 3. Hosu and Stadie teach the computer program product of claim 2 (as mentioned above), wherein:
Hosu and Stadie do not explicitly disclose:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating of the action selection policy includes updating the temperature parameter using the exploration term and the current value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because, as Carmel explains, the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives to improve its knowledge for better decisions in the future (Carmel § Abstract).

In reference to claim 13. Hosu and Stadie teach the method of claim 12 (as mentioned above), wherein:
Hosu and Stadie do not explicitly disclose:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“the updating of the action selection policy further includes updating the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating of the action selection policy includes updating the temperature parameter using the exploration term and the current value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because, as Carmel explains, the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives to improve its knowledge for better decisions in the future (Carmel § Abstract).

In reference to claim 18. Hosu and Stadie teach the apparatus of claim 17 (as mentioned above), wherein:
Hosu and Stadie do not explicitly disclose:
“the action selection policy updating section includes a temperature updating section configured to update the temperature parameter using the exploration term and the current value”.
However, Carmel discloses:
“the action selection policy updating section includes a temperature updating section configured to update the temperature parameter using the exploration term and the current value” (Carmel in at least § 1, § 3.2.1, and § 5.1 discloses that updating of the action selection policy includes updating the temperature parameter using the exploration term and the current value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Hosu, Stadie, and Carmel. Hosu teaches a novel method using deep reinforcement learning, called human checkpoint replay, which was designed for some of the most difficult Atari 2600 games from the Arcade Learning Environment. Stadie teaches that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. Carmel teaches exploration strategies for a model-based learning agent to handle its encounters with other agents in a common environment. One of ordinary skill would have been motivated to combine Hosu, Stadie, and Carmel because, as Carmel explains, the agent must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives to improve its knowledge for better decisions in the future (Carmel § Abstract).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Viker A. Lamardo whose telephone number is (571)270-5871.  The examiner can normally be reached on Mon. - Fri. 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/VIKER A LAMARDO/Examiner, Art Unit 2126