DETAILED ACTION
This action is in response to the claims filed 08/23/2022 for application 16/303,256. Claims 1, 4, 10, 11, 15, 19, and 20 have been amended, claims 2-3 have been canceled, and claim 21 is new. Thus, claims 1 and 4-21 are currently pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1 and 4-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, 
Step 1 Analysis: Claim 1 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 1 recites, in part, maintaining return data that maps each of a plurality of observation-action pairs to a respective return, receiving a current observation characterizing a current state of the environment, determining whether the current observation matches any of the observations identified in the return data; and in response to determining…, determining a feature representation of the current observation, selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data; determining observation-action pairs in the return data that include the action and any one of the one or more selected observations, and determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data; and selecting the action to be performed by the agent in response to the current observation using the estimated returns. 
The foregoing limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind, which falls within the “Mental Processes” grouping of abstract ideas. Specifically:
maintaining return data that maps each of a plurality of observation-action pairs to a respective return can be considered to be an observation in the human mind, 
receiving a current observation characterizing a current state of the environment can be considered an observation in the human mind, 
determining whether the current observation matches any of the observations identified in the return data can be considered to be an evaluation in the human mind; and
in response to determining…
determining a feature representation of the current observation can be considered to be an evaluation in the human mind, 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data can be considered to be an evaluation in the human mind; 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations can be considered to be an evaluation in the human mind, and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data can be considered to be an evaluation in the human mind; and 
selecting the action to be performed by the agent in response to the current observation using the estimated returns can be considered to be an evaluation in the human mind.
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of an “agent”. This element is only generally linked to the judicial exception. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim further recites: controlling the agent based on the selected action. This limitation is an insignificant extra-solution activity. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim as a whole is directed to an abstract idea. 
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of utilizing an agent to perform the steps of the claimed process amounts to no more than generally linking the elements to the judicial exception. Additionally, the limitation of controlling the agent based on the selected action is well-understood, routine, and conventional, as evidenced by Arel et al. (“US 9536191 B1”, Background, col 1, lines 7-21). This limitation therefore remains insignificant extra-solution activity even upon reconsideration and does not amount to significantly more. Even when considered in combination, these additional elements amount to generally linking the elements to the judicial exception and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 4, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data comprises: determining the k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation, wherein k is an integer greater than one; and wherein determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs comprises: determining a respective estimated return for each of a plurality of actions in the predetermined set of actions from returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.

Regarding claim 5, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein determining a respective estimated return comprises, for each of the plurality of actions: determining an average of the returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.
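
For illustration only, and not as part of the claim language or the record, the averaging step recited in claim 5 can be sketched as follows. The function name `estimate_returns`, the dictionary layout of the return data, and the sample values are assumptions introduced here.

```python
from statistics import mean


def estimate_returns(return_data, neighbors, actions):
    """Average the stored returns for each action over the k selected observations."""
    estimates = {}
    for action in actions:
        # Collect returns only for (observation, action) pairs present in the return data.
        found = [return_data[(obs, action)] for obs in neighbors
                 if (obs, action) in return_data]
        if found:
            estimates[action] = mean(found)
    return estimates


# Hypothetical return data: two stored returns for "left", one for "right".
return_data = {("s1", "left"): 1.0, ("s2", "left"): 3.0, ("s1", "right"): 5.0}
estimates = estimate_returns(return_data, ["s1", "s2"], ["left", "right"])
```

In this sketch an action with no stored pair among the k observations simply receives no estimate; the claim language does not address that case.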

Regarding claim 6, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein selecting the action to be performed by the agent comprises: selecting an action from the plurality of actions that has the highest estimated return. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.

Regarding claim 7, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein selecting the action to be performed by the agent comprises: selecting an action from the plurality of actions that has the highest estimated return with probability 1 - ε; and selecting an action randomly from the predetermined set of actions with probability ε. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.
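
For illustration only, the selection rule recited in claim 7 resembles what is commonly called an epsilon-greedy policy. A minimal sketch follows, assuming the estimated returns are held in a dictionary; the name `select_action` and the sample values are hypothetical.

```python
import random


def select_action(estimates, actions, epsilon, rng=random):
    """With probability epsilon pick uniformly at random; otherwise pick the highest estimate."""
    if rng.random() < epsilon:
        # Exploration branch: any action from the predetermined set.
        return rng.choice(actions)
    # Greedy branch: the action with the highest estimated return.
    return max(estimates, key=estimates.get)


estimates = {"left": 2.0, "right": 5.0}
greedy = select_action(estimates, ["left", "right"], epsilon=0.0)
```

With `epsilon=0.0` the random branch is never taken, so the call reduces to the highest-return selection recited in claim 6.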

Regarding claim 8, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein determining the k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation comprises: determining the k observations identified in the return data that have feature representations that have a smallest Euclidean distance to the feature representation of the current observation. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.
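
For illustration only, the nearest-observation selection recited in claim 8 can be sketched as a sort by Euclidean distance. The name `k_nearest`, the tuple feature vectors, and the sample values are assumptions; a brute-force sort is used here for clarity and is not drawn from the application.

```python
import math


def k_nearest(features, query, k):
    """Return the k observation keys whose feature vectors are closest to the query."""
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return sorted(features, key=lambda obs: math.dist(features[obs], query))[:k]


# Hypothetical feature representations for three stored observations.
features = {"s1": (0.0, 0.0), "s2": (3.0, 4.0), "s3": (1.0, 0.0)}
nearest = k_nearest(features, (0.0, 0.0), k=2)
```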

Regarding claim 9, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein the feature representation of the current observation is the current observation. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 10, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein determining the feature representation of the current observation comprises: projecting the current observation into a space having a projected dimension smaller than the dimensionality of the current observation. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 

Regarding claim 11, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein projecting the current observation into the space comprises applying a random projection matrix to the current observation. This claim recites additional mathematical steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible. 
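
For illustration only, the projection recited in claims 10 and 11 can be sketched as a matrix-vector product with a randomly generated matrix. The Gaussian entries, the fixed seed, and the names `make_projection` and `project` are assumptions; the claims do not specify how the matrix is drawn.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix with random entries (Gaussian is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, observation):
    """Apply the matrix to the observation, yielding a lower-dimensional feature vector."""
    return [sum(w * x for w, x in zip(row, observation)) for row in matrix]


matrix = make_projection(in_dim=6, out_dim=2)
feature = project(matrix, [1.0, 0.0, 2.0, 0.0, 0.0, 3.0])
```

Here a six-dimensional observation is mapped to a two-dimensional feature representation, consistent with the projected dimension being smaller than the dimensionality of the observation.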

Regarding claim 12, the rejection of claim 4 is further incorporated, and further, the claim recites: wherein determining the feature representation of the current observation comprises: processing the current observation to generate a latent representation of the current observation; and using the latent representation of the current observation as the feature representation of the current observation. This claim recites additional mathematical steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does recite the additional element of a “variational auto-encoder model”; however, this element does not amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception, for the reasons set forth in connection with the rejection of claim 1 above. The claim is not patent eligible.

Regarding claim 13, the rejection of claim 1 is further incorporated, and further, the claim recites: receiving a new return resulting from the agent performing the selected action in response to the current observation; and updating the return data using the new return. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 14, the rejection of claim 13 is further incorporated, and further, the claim recites: when the current observation matches a first observation identified in the return data, updating the return data using the new return comprises: determining whether the new return is larger than an existing return resulting from performing the selected action in response to the first observation according to the return data; and when the new return is larger than the existing return, replacing the existing return with the new return in the return data. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.

Regarding claim 15, the rejection of claim 13 is further incorporated, and further, the claim recites: wherein, when the current observation does not match any of the observations identified in the return data, updating the return data using the new return comprises: updating the return data to map a current observation-selected action pair to the new return. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.
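
For illustration only, the update rules recited in claims 14 and 15 can be sketched together in one function. The name `update_return` and the dictionary keyed by (observation, action) tuples are assumptions introduced here.

```python
def update_return(return_data, observation, action, new_return):
    """Store the new return if the pair is unseen or the new return is larger."""
    key = (observation, action)
    existing = return_data.get(key)
    if existing is None:
        # Claim 15 case: no stored mapping, so map the new pair to the new return.
        return_data[key] = new_return
    elif new_return > existing:
        # Claim 14 case: replace the existing return only when the new one is larger.
        return_data[key] = new_return


return_data = {("s1", "left"): 2.0}
update_return(return_data, "s1", "left", 1.0)   # smaller, so the stored value is kept
update_return(return_data, "s1", "left", 3.0)   # larger, so the stored value is replaced
update_return(return_data, "s2", "left", 0.5)   # unseen pair, so a new mapping is added
```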

Regarding claim 16, the rejection of claim 1 is further incorporated, and further, the claim recites: determining that a number of mappings in the return data has reached a maximum size and, in response, removing a least recently updated mapping from the return data. This claim recites additional mental steps in addition to the judicial exception identified in the rejection of claim 1, thus recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exceptions into a practical application, nor to significantly more than the judicial exceptions. The claim is not patent eligible.
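
For illustration only, the size limit recited in claim 16 can be sketched with an ordered dictionary that keeps mappings in update order. The class name `ReturnTable` and the choice of `collections.OrderedDict` are assumptions, not details taken from the application.

```python
from collections import OrderedDict


class ReturnTable:
    """Bounded mapping that evicts the least recently updated entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = OrderedDict()

    def put(self, key, value):
        if key in self.data:
            # An update moves the entry to the most recently updated position.
            self.data.move_to_end(key)
        elif len(self.data) >= self.max_size:
            # Table is full: drop the entry at the front, the least recently updated.
            self.data.popitem(last=False)
        self.data[key] = value


table = ReturnTable(max_size=2)
table.put(("s1", "left"), 1.0)
table.put(("s2", "left"), 2.0)
table.put(("s1", "left"), 1.5)   # refreshes ("s1", "left")
table.put(("s3", "left"), 3.0)   # evicts ("s2", "left")
```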

Regarding claim 17, the rejection of claim 1 is further incorporated, and further, the claim recites: initializing the return data with initial mappings by randomly selecting actions to be performed by the agent until each action in the predetermined set of actions has been performed more than a threshold number of times. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 18, the rejection of claim 1 is further incorporated, and further, the claim recites: wherein the returns are discounted sums of rewards received by the agent in response to performing actions. This limitation amounts to more specifics of the judicial exception identified in the rejection of claim 1 above.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
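
For illustration only, a discounted sum of rewards as recited in claim 18 can be computed as shown below. The discount factor `gamma` and the sample rewards are assumptions; the claim does not state a value.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by accumulating from the end."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# 1.0 + 0.5 * 2.0 + 0.25 * 4.0
value = discounted_return([1.0, 2.0, 4.0], gamma=0.5)
```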

Regarding claim 19, 
Step 1 Analysis: Claim 19 is directed to a machine, which falls within one of the four statutory categories.
Step 2A Prong 1 Analysis: Claim 19 recites, in part, maintaining return data that maps each of a plurality of observation-action pairs to a respective return, receiving a current observation characterizing a current state of the environment, determining whether the current observation matches any of the observations identified in the return data; and in response to determining…, determining a feature representation of the current observation, selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data; determining observation-action pairs in the return data that include the action and any one of the one or more selected observations, and determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data; and selecting the action to be performed by the agent in response to the current observation using the estimated returns. 
The foregoing limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind, which falls within the “Mental Processes” grouping of abstract ideas. Specifically:
maintaining return data that maps each of a plurality of observation-action pairs to a respective return can be considered to be an observation in the human mind, 
receiving a current observation characterizing a current state of the environment can be considered an observation in the human mind, 
determining whether the current observation matches any of the observations identified in the return data can be considered to be an evaluation in the human mind; and
in response to determining…
determining a feature representation of the current observation can be considered to be an evaluation in the human mind, 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data can be considered to be an evaluation in the human mind; 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations can be considered to be an evaluation in the human mind, and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data can be considered to be an evaluation in the human mind; and 
selecting the action to be performed by the agent in response to the current observation using the estimated returns can be considered to be an evaluation in the human mind.
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of an “agent”. This element is only generally linked to the judicial exception. Additionally, the claim recites the additional elements “one or more computers” and “one or more storage devices”. These elements are recited at a high level of generality (i.e., as generic computer components performing generic computer functions) such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: controlling the agent based on the selected action. This limitation is an insignificant extra-solution activity. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim as a whole is directed to an abstract idea. 
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of utilizing an agent to perform the steps of the claimed process amounts to no more than generally linking the elements to the judicial exception. Additionally, the one or more computers and one or more storage devices amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Additionally, the limitation of controlling the agent based on the selected action is well-understood, routine, and conventional, as evidenced by Arel et al. (“US 9536191 B1”, Background, col 1, lines 7-21). This limitation therefore remains insignificant extra-solution activity even upon reconsideration and does not amount to significantly more. Even when considered in combination, these additional elements amount to generally linking the elements to the judicial exception, mere instructions to apply an exception using a generic computer component, and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 20, 
Step 1 Analysis: Claim 20 is directed to a manufacture, which falls within one of the four statutory categories.
Step 2A Prong 1 Analysis: Claim 20 recites, in part, maintaining return data that maps each of a plurality of observation-action pairs to a respective return, receiving a current observation characterizing a current state of the environment, determining whether the current observation matches any of the observations identified in the return data; and in response to determining…, determining a feature representation of the current observation, selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data; determining observation-action pairs in the return data that include the action and any one of the one or more selected observations, and determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data; and selecting the action to be performed by the agent in response to the current observation using the estimated returns. 
The foregoing limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind, which falls within the “Mental Processes” grouping of abstract ideas. Specifically:
maintaining return data that maps each of a plurality of observation-action pairs to a respective return can be considered to be an observation in the human mind, 
receiving a current observation characterizing a current state of the environment can be considered an observation in the human mind, 
determining whether the current observation matches any of the observations identified in the return data can be considered to be an evaluation in the human mind; and
in response to determining…
determining a feature representation of the current observation can be considered to be an evaluation in the human mind, 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data can be considered to be an evaluation in the human mind; 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations can be considered to be an evaluation in the human mind, and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data can be considered to be an evaluation in the human mind; and 
selecting the action to be performed by the agent in response to the current observation using the estimated returns can be considered to be an evaluation in the human mind.
Accordingly, the claim recites an abstract idea.

Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of an “agent”. This element is only generally linked to the judicial exception. Additionally, the claim recites the additional elements “one or more computers” and “non-transitory computer storage medium”. These elements are recited at a high level of generality (i.e., as generic computer components performing generic computer functions) such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim further recites: controlling the agent based on the selected action. This limitation is an insignificant extra-solution activity. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim as a whole is directed to an abstract idea. 
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of utilizing an agent to perform the steps of the claimed process amounts to no more than generally linking the elements to the judicial exception. Additionally, the one or more computers and non-transitory computer storage medium amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Additionally, the limitation of controlling the agent based on the selected action is well-understood, routine, and conventional, as evidenced by Arel et al. (“US 9536191 B1”, Background, col 1, lines 7-21). This limitation therefore remains insignificant extra-solution activity even upon reconsideration and does not amount to significantly more. Even when considered in combination, these additional elements amount to generally linking the elements to the judicial exception, mere instructions to apply an exception using a generic computer component, and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.


Regarding claim 21, 
Step 1 Analysis: Claim 21 is directed to a process, which falls within one of the four statutory categories. 
Step 2A Prong 1 Analysis: Claim 21 recites, in part, maintaining return data that maps each of a plurality of observation-action pairs to a respective return, receiving a current observation characterizing a current state of the environment, determining whether the current observation matches any of the observations identified in the return data; and in response to determining…, determining a feature representation of the current observation, determining k observations in the return data…, determining a respective return for each of a plurality of actions…, and selecting the action to be performed by the agent. These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind, which falls within the “Mental Processes” grouping of abstract ideas. Specifically:
maintaining return data that maps each of a plurality of observation-action pairs to a respective return can be considered to be an observation in the human mind, 
receiving a current observation characterizing a current state of the environment can be considered an observation in the human mind, 
determining whether the current observation matches any of the observations identified in the return data can be considered to be an evaluation in the human mind; and 
in response to determining… 
determining a feature representation of the current observation can be considered to be an evaluation in the human mind, 
determining k observations in the return data… can be considered to be an evaluation in the human mind, 
determining a respective return for each of a plurality of actions… can be considered to be an evaluation in the human mind, and 
selecting the action to be performed by the agent can be considered to be an evaluation in the human mind.
Accordingly, the claim recites an abstract idea.
Step 2A Prong 2 Analysis: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element – “agent”. This element that is recited is only generally linked to the judicial exception. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
The claim further recites: controlling the agent based on the selected action. This limitation is an insignificant extra-solution activity. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim as a whole is directed to an abstract idea. 
Step 2B Analysis: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of utilizing an agent to perform the steps of the claimed process amounts to no more than generally linking the elements to the judicial exception. Additionally, the limitation of controlling the agent based on the selected action is well-understood, routine, and conventional, as evidenced by Arel et al. (“US 9536191 B1”, Background, col 1, lines 7-21). This limitation therefore remains insignificant extra-solution activity even upon reconsideration, and does not amount to significantly more. Even when considered in combination, these additional elements amount to generally linking the elements to the judicial exception and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 4-10, and 13-21 are rejected under 35 U.S.C. 103 as being unpatentable over Arel et al. (US 9536191 B1, hereinafter "Arel") in view of Taylor et al. ("Metric Learning for Reinforcement Learning Agents", hereinafter "Taylor") and further in view of Kingma et al. ("Semi-supervised Learning with Deep Generative Models", cited by Applicant in the IDS filed 12/10/2018, hereinafter "Kingma1").

Regarding claim 1, Arel teaches A method for selecting an action from a predetermined set of actions to be performed by an agent interacting with an environment (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 8-12]), the method comprising: 
maintaining return data that maps each of a plurality of observation-action pairs to a respective return (“As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 9-14; See further: “To determine the value function estimate for a given action in implementations where the value function representation is a tabular representation, the system identifies the value function estimate that is mapped to by the combination of the state representation and the given action in the tabular representation.” [col 7, lines 20-26]]), 
wherein the action in each observation-action pair is an action that was performed by the agent in response to the observation in the observation-action pair (“In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for selecting an action to be performed by an agent that interacts with an environment by performing actions selected from a set of actions.” [col 1, lines 44-48]), and 
wherein the respective return mapped to by each of the observation-action pairs is a return that resulted from the agent performing the action in the observation-action pair in response to the observation in the observation-action pair (“determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation, wherein the return is a function of future rewards received in response to the agent performing actions to interact with the environment” [col 1, lines 51-59]);
receiving a current observation characterizing a current state of the environment (“The methods include the actions of receiving a current observation, the current observation being data that characterizes a current state of the environment” [col 1, lines 48-51]); 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data (“In particular, in these implementations, the recurrent neural network is configured to receive an observation and to combine the observation with the current internal state of the recurrent neural network to generate the state representation and to process the state representation and an action to generate the value function estimate and to update the internal state of the recurrent neural network. In yet other implementations, the reinforcement learning system 100 combines the current observation with one or more recent observations to generate the state representation. For example, the state representation can be a stack of the observation and a number of most recent observations in the order in which they were received by the reinforcement learning system 100 or a compressed representation of the observation and the most recent observations.” [col 5, line 56 – col 6, line 4; Arel’s system appears to inherently select observations from a tabular representation/table (see col 6, lines 5-9)]);
for each action of a plurality of actions in the predetermined set of actions (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 7-12]), 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations (“The system determines a respective confidence score for each action when the environment is in the current state (step 206). The confidence score for a given state-action pair is a measure of confidence that the value function estimate for the action is an accurate estimate of the return that will result from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 27-34]), and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data (“The system determines a respective value function estimate for each action in the set of actions (step 204) when the environment is in the current state in accordance with the value function representation. As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 6-14]);
selecting the action to be performed by the agent in response to the current observation using the estimated returns (“selecting an action to be performed by the agent in response to the current observation using the respective adjusted value function estimates.” [col 2, lines 2-4]); and
controlling the agent based on the selected action (“In some other implementations, the environment is a real-world environment. For example, the agent may be a robot attempting to complete a specified task and the environment may be the surroundings of the robot as characterized by data captured by one or more sensory input devices of the robot” [col 3, line 66 – col 4, line 4]).
However, Arel fails to explicitly teach determining whether the current observation matches any of the observations identified in the return data; and in response to determining that the current observation does not match any of the observations identified in the return data.
Taylor teaches determining whether the current observation matches any of the observations identified in the return data and in response to determining that the current observation does not match any of the observations identified in the return data (“Algorithm 1 reasons about pairs of vectors, where these vectors describe transitions in the state space: s → s′. Algorithm 2 calculates the similarity of two vectors, given the current distance metric, where the relatedness of two vectors is at most 1.0 (if they are identical in direction and magnitude). This similarity will be used in the next section to calculate the distance metric under the assumption that states that have similar transitions (for the same action) should be closer in the state space than states that have dissimilar transitions.” [pg. 779, § 3.2 Transition Similarity, ¶1; Taylor measures similarity/dissimilarity between states (i.e., observations), and therefore would be able to determine a “match” or “does not match” between two observations.]).
Arel and Taylor are both in the same field of endeavor of reinforcement learning and thus are analogous. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s reinforcement learning algorithm by matching previous and current observations as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]
However, Arel/Taylor fails to explicitly teach determining a feature representation of the current observation.
Kingma1 teaches determining a feature representation of the current observation (“[image: media_image1.png]” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶1]); 
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 4, Arel/Taylor/Kingma1 teaches The method of claim 1, where Arel further teaches wherein selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data comprises (“In particular, in these implementations, the recurrent neural network is configured to receive an observation and to combine the observation with the current internal state of the recurrent neural network to generate the state representation and to process the state representation and an action to generate the value function estimate and to update the internal state of the recurrent neural network. In yet other implementations, the reinforcement learning system 100 combines the current observation with one or more recent observations to generate the state representation. For example, the state representation can be a stack of the observation and a number of most recent observations in the order in which they were received by the reinforcement learning system 100 or a compressed representation of the observation and the most recent observations.” [col 5, line 56 – col 6, line 4; Arel’s system appears to inherently select observations from a tabular representation/table (see col 6, lines 5-9)])
wherein determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs comprises: 
determining a respective estimated return for each of a plurality of actions in the predetermined set of actions from returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations (“for each action in the set of actions: determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation” [col 1, lines 51-57])
Kingma1 teaches determining the k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation, wherein k is an integer greater than one (“A commonly used approach is to construct a model that provides an embedding or feature representation of the data. Using these features, a separate classifier is thereafter trained. The embeddings allow for a clustering of related observations in a latent feature space that allows for accurate classification, even with a limited number of labels. Instead of a linear embedding, or features obtained from a regular auto-encoder, we construct a deep generative model of the data that is able to provide a more robust set of latent features.” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶2; clustering of related observations would be equivalent to finding feature representations “closest” to the feature representation of the current observation.]); 
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 5, Arel/Taylor/Kingma1 teaches The method of claim 4, where Arel teaches wherein determining a respective estimated return comprises, for each of the plurality of actions: determining an average of the returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations (“While the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize the expected return. Generally, the expected return is a function of the rewards anticipated to be received over time in response to future actions performed by the agent. That is, the return is a function of future rewards received starting from the immediate reward received in response to the agent performing the selected action. For example, possible definitions of return that the reinforcement learning system attempts to maximize may include a sum of the future rewards, a discounted sum of the future rewards, or an average of the future rewards.” [col 4, lines 37-49]).

Regarding claim 6, Arel/Taylor/Kingma1 teaches The method of claim 4, where Arel teaches wherein selecting the action to be performed by the agent comprises: selecting an action from the plurality of actions that has the highest estimated return (“For example, in implementations where the action selection policy is a greedy policy, the system selects the action having the highest adjusted value function estimate as the action to be performed by the agent.” [col 8, lines 8-12]).

Regarding claim 7, Arel/Taylor/Kingma1 teaches The method of claim 4, where Arel teaches wherein selecting the action to be performed by the agent comprises: selecting an action from the plurality of actions that has the highest estimated return with probability 1 - ε; and selecting an action randomly from the predetermined set of actions with probability ε (“As another example, in implementations where the action selection policy is an ε-greedy policy, the system selects an action randomly from the set of actions with probability ε and selects the action having the highest adjusted value function estimate with probability 1−ε, where ε is a constant between zero and one.” [col 8, lines 13-18]).

Regarding claim 8, Arel/Taylor/Kingma1 teaches The method of claim 4, wherein determining the k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation comprises: 
Taylor teaches determining the k observations identified in the return data that have feature representations that have a smallest Euclidian distance to the feature representation of the current observation (“Examples include constraints of the form “points x and y should have a small/large distance” or “points v and w should have a smaller distance than points v and x.” Metric learning algorithms typically attempt to construct a transformation of the data (either linear or non-linear) such that the constraints are satisfied after applying a standard distance function such as the Euclidean distance to the transformed data.” [pg. 778, § 2.3 Distance Metric Learning, ¶1; See further: “This similarity will be used in the next section to calculate the distance metric under the assumption that states that have similar transitions (for the same action) should be closer in the state space than states that have dissimilar transitions.” [pg. 779, § 3.2 Transition Similarity, ¶1]])
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Kingma1’s teachings with the distance metric learning method as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]

Regarding claim 9, Arel/Taylor/Kingma1 teaches The method of claim 4, where Kingma1 teaches wherein the feature representation of the current observation is the current observation (“[image: media_image1.png] …A commonly used approach is to construct a model that provides an embedding or feature representation of the data. Using these features, a separate classifier is thereafter trained. The embeddings allow for a clustering of related observations in a latent feature space that allows for accurate classification, even with a limited number of labels.” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶1-2; feature representation of the current observation being the current observation is inherent.]).
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 10, Arel/Taylor/Kingma1 teaches The method of claim 4, where Kingma1 teaches wherein determining the feature representation of the current observation comprises: projecting the current observation into a space having a projected dimension smaller than the dimensionality of the current observation (“Using this approach, we can now perform classification in a lower dimensional space since we typically use latent variables whose dimensionality is much less than that of the observations. These low dimensional embeddings should now also be more easily separable since we make use of independent latent Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data. This simple approach results in improved performance for SVMs, and we demonstrate this in section 4.” [pg. 3, top para; note: Examiner is interpreting lower dimensional space to be equivalent to a smaller dimensional space.]).
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 13, Arel/Taylor/Kingma1 teaches The method of claim 1, further comprising: 
Arel teaches receiving a new return resulting from the agent performing the selected action in response to the current observation (“a current value function estimate that is an estimate of a return that will result from the agent performing the current action in response to the current observation” [col 2, lines 43-45]); and 
updating the return data using the new return (“Once the agent 110 has performed the selected action 104, the reinforcement learning system 100 identifies a reward 106 resulting from the agent 110 performing the selected action 104. The reward 106 is an immediate actual reward resulting from the agent 110 performing the selected action 104 in response to the observation 102. The reinforcement learning system 100 uses the reward 106 and the confidence function representation 140 to update the value function representation 130. The reinforcement learning system 100 then updates the confidence function representation 140 to reflect the change in the measure of confidence in the value function estimates resulting from the agent 110 having performed the selected action 104 in response to the observation 102. Updating the value function representation and the confidence function representation is described in more detail below with reference to FIGS. 3 and 4.” [col 6, lines 46-61]).

Regarding claim 14, Arel/Taylor/Kingma1 teaches The method of claim 13, wherein, when the current observation matches a first observation identified in the return data, updating the return data using the new return comprises: 
Arel teaches determining whether the new return is larger than an existing return resulting from performing the selected action in response to the first observation according to the return data (“The degree to which the previous confidence score is increased depends on the current confidence score, i.e., so that the previous confidence score is increased to a greater degree when the measure of confidence that the current value function estimate is an accurate estimate of the return that will result from the agent performing the current action in response to the current observation is higher” [col 11, lines 37-43]); and 
when the new return is larger than the existing return, replacing the existing return with the new return in the return data (“The system adjusts the confidence function representation (step 416). Generally, the system adjusts the representation to increase a previous confidence score, i.e., the confidence score that is a measure of confidence that the previous value function estimate is an accurate estimate of the return resulting from the agent performing the previous action in response to the previous observation” [col 11, lines 28-34]).

Regarding claim 15, Arel/Taylor/Kingma1 teaches The method of claim 13, where Taylor teaches wherein, when the current observation does not match any of the observations identified in the return data (See pg. 779, § 3.2), 
Arel teaches updating the return data using the new return comprises:
updating the return data to map a current observation - selected action pair to the new return (“In implementations where the value function representation is a tabular representation, the system can add the value function update to the value function estimate that is mapped to by the combination of the previous state representation and the previous action in the tabular representation to generate an adjusted value function estimate.” [col 11, lines 16-21]).
Arel and Taylor are both in the same field of endeavor of reinforcement learning and thus are analogous. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s reinforcement learning algorithm by matching previous and current observations as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]

Regarding claim 16, Arel/Taylor/Kingma1 teaches The method of claim 1, further comprising: 
Taylor teaches determining that a number of mappings in the return data has reached a maximum size (“As the state space grows, using a table becomes impractical, or impossible if the state space is continuous. Agents in such tasks typically factor the state using state variables (or features), so that s = ⟨x1, x2, . . . , xn⟩. In such cases, RL methods use function approximators, such as artificial neural networks or tile coding, where parameterized functions representing π or Q are tuned via supervised learning methods. The parameterization and bias of the function approximator define the state space abstraction, allowing observed data to update a region of state-action values rather than a single state/action value.” [pg. 778, left col, ¶2])
Arel teaches and, in response, removing a least recently updated mapping from the return data (“Generally, which actions are in the set of actions are fixed prior to any given action selection performed by the reinforcement learning system. Thus, in response to any given observation, the system selects the action to be performed by the agent in response to the observation from a predetermined set of actions. In some cases, however, which actions are in the set of actions may be adjusted before the system processes a given observation, e.g., to add a new action to the set or to remove an existing action from the set.” [col 4, lines 23-31]).
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Kingma1’s teachings with the distance metric learning method as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]

Regarding claim 17, Arel/Taylor/Kingma1 teaches The method of claim 1, further comprising: 
Taylor teaches initializing the return data with initial mappings by randomly selecting actions to be performed by the agent until each action in the predetermined set of actions has been performed more than a threshold number of times (“Tasks are often episodic: the agent executes actions in the environment until it reaches a terminal or goal state, at which point the agent is returned to a starting state. The set A describes the actions available to the agent, although not every action may be possible in every state.” [pg. 778, left col, top para; note: Taylor also discloses actions can be randomly selected as evidenced by “In order for HOLLER to learn a distance metric, it must have data recorded from the task. To record this data, we allowed the agent to explore the task (with a fully random policy) for different numbers of episodes. The more episodes used for learning the metric, the more likely it will be accurate. However, the episodes spent collecting data will count against the agent’s performance (as discussed further in Section 4.3). After trying 6 different values, we decided to experiment with 1, 5, and 10 episodes of data for HOLLER, affecting Algorithm 1, lines 33 and 34.” [pg. 780, § 4.2 Experimental Procedure, ¶3]]).

Regarding claim 18, Arel/Taylor/Kingma1 teaches The method of claim 1, where Arel teaches wherein the returns are discounted sums of rewards received by the agent in response to performing actions (“That is, the return is a function of future rewards received starting from the immediate reward received in response to the agent performing the selected action. For example, possible definitions of return that the reinforcement learning system attempts to maximize may include a sum of the future rewards, a discounted sum of the future rewards, or an average of the future rewards.” [col 4, lines 43-49]).
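As a point of terminology only, a "discounted sum of rewards" is conventionally computed as r_0 + γ·r_1 + γ²·r_2 + …, where γ is a discount factor between 0 and 1. The sketch below assumes a finite list of rewards; the function name `discounted_return` and the parameter `gamma` are illustrative and are not taken from Arel.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```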

Regarding claim 19, Arel teaches A system comprising one or more computers and one or more storage devices storing instructions, when executed by the one or more computers (“FIG. 1 shows an example reinforcement learning system 100. The reinforcement learning system 100 is an example of a system implemented as one or more computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.” [col 5, lines 4-9]), cause the one or more computers to perform operations for selecting an action from a predetermined set of actions to be performed by an agent interacting with an environment (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 8-12]), the operations comprising: 
maintaining return data that maps each of a plurality of observation-action pairs to a respective return (“As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 9-14; See further: “To determine the value function estimate for a given action in implementations where the value function representation is a tabular representation, the system identifies the value function estimate that is mapped to by the combination of the state representation and the given action in the tabular representation.” [col 7, lines 20-26]]), 
wherein the action in each observation-action pair is an action that was performed by the agent in response to the observation in the observation-action pair (“In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for selecting an action to be performed by an agent that interacts with an environment by performing actions selected from a set of actions.” [col 1, lines 44-48]), and 
wherein the respective return mapped to by each of the observation-action pairs is a return that resulted from the agent performing the action in the observation-action pair in response to the observation in the observation-action pair (“determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation, wherein the return is a function of future rewards received in response to the agent performing actions to interact with the environment” [col 1, lines 51-59]);
receiving a current observation characterizing a current state of the environment (“The methods include the actions of receiving a current observation, the current observation being data that characterizes a current state of the environment” [col 1, lines 48-51]); 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data (“In particular, in these implementations, the recurrent neural network is configured to receive an observation and to combine the observation with the current internal state of the recurrent neural network to generate the state representation and to process the state representation and an action to generate the value function estimate and to update the internal state of the recurrent neural network. In yet other implementations, the reinforcement learning system 100 combines the current observation with one or more recent observations to generate the state representation. For example, the state representation can be a stack of the observation and a number of most recent observations in the order in which they were received by the reinforcement learning system 100 or a compressed representation of the observation and the most recent observations.” [col 5, line 56 – col 6 line 4; Arel’s system appears to inherently select observations from a tabular representation/table (see col 6, lines 5-9)]);
for each action of a plurality of actions in the predetermined set of actions (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 7-12]), 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations (“The system determines a respective confidence score for each action when the environment is in the current state (step 206). The confidence score for a given state-action pair is a measure of confidence that the value function estimate for the action is an accurate estimate of the return that will result from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 27-34]), and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data (“The system determines a respective value function estimate for each action in the set of actions (step 204) when the environment is in the current state in accordance with the value function representation. As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 6-14]);
selecting the action to be performed by the agent in response to the current observation using the estimated returns (“selecting an action to be performed by the agent in response to the current observation using the respective adjusted value function estimates.” [col 2, lines 2-4]); and
controlling the agent based on the selected action (“In some other implementations, the environment is a real-world environment. For example, the agent may be a robot attempting to complete a specified task and the environment may be the surroundings of the robot as characterized by data captured by one or more sensory input devices of the robot” [col 3, line 66 – col 4, line 4]).
However Arel fails to explicitly teach determining whether the current observation matches any of the observations identified in the return data
in response to determining that the current observation does not match any of the observations identified in the return data,
Taylor teaches determining whether the current observation matches any of the observations identified in the return data and in response to determining that the current observation does not match any of the observations identified in the return data (“Algorithm 1 reasons about pairs of vectors, where these vectors describe transitions in the state space: s → s′. Algorithm 2 calculates the similarity of two vectors, given the current distance metric, where the relatedness of two vectors is at most 1.0 (if they are identical in direction and magnitude). This similarity will be used in the next section to calculate the distance metric under the assumption that states that have similar transitions (for the same action) should be closer in the state space than states that have dissimilar transitions.” [pg. 779, § 3.2 Transition Similarity, ¶1; Taylor measures similarity/dissimilarity between states (i.e., observations) and therefore would be able to determine a “match” or “does not match” between two observations.]).
Arel and Taylor are both in the same field of endeavor of reinforcement learning and thus are analogous. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s reinforcement algorithm by matching previous and current observations as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]
However Arel/Taylor fails to explicitly teach determining a feature representation of the current observation
Kingma1 teaches determining a feature representation of the current observation (“[excerpt reproduced as image media_image1.png in the original action]” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶1]); 
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 20, Arel teaches A computer storage medium encoded with instructions that (See col 13, lines 48-52), when executed by one or more computers, cause the one or more computers to perform operations for selecting an action from a predetermined set of actions to be performed by an agent interacting with an environment (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 8-12]), the operations comprising:
maintaining return data that maps each of a plurality of observation-action pairs to a respective return (“As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 9-14; See further: “To determine the value function estimate for a given action in implementations where the value function representation is a tabular representation, the system identifies the value function estimate that is mapped to by the combination of the state representation and the given action in the tabular representation.” [col 7, lines 20-26]]), 
wherein the action in each observation-action pair is an action that was performed by the agent in response to the observation in the observation-action pair (“In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for selecting an action to be performed by an agent that interacts with an environment by performing actions selected from a set of actions.” [col 1, lines 44-48]), and 
wherein the respective return mapped to by each of the observation-action pairs is a return that resulted from the agent performing the action in the observation-action pair in response to the observation in the observation-action pair (“determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation, wherein the return is a function of future rewards received in response to the agent performing actions to interact with the environment” [col 1, lines 51-59]);
receiving a current observation characterizing a current state of the environment (“The methods include the actions of receiving a current observation, the current observation being data that characterizes a current state of the environment” [col 1, lines 48-51]); 
selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data (“In particular, in these implementations, the recurrent neural network is configured to receive an observation and to combine the observation with the current internal state of the recurrent neural network to generate the state representation and to process the state representation and an action to generate the value function estimate and to update the internal state of the recurrent neural network. In yet other implementations, the reinforcement learning system 100 combines the current observation with one or more recent observations to generate the state representation. For example, the state representation can be a stack of the observation and a number of most recent observations in the order in which they were received by the reinforcement learning system 100 or a compressed representation of the observation and the most recent observations.” [col 5, line 56 – col 6 line 4; Arel’s system appears to inherently select observations from a tabular representation/table (see col 6, lines 5-9)]);
for each action of a plurality of actions in the predetermined set of actions (“In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions.” [col 1, lines 7-12]), 
determining observation-action pairs in the return data that include the action and any one of the one or more selected observations (“The system determines a respective confidence score for each action when the environment is in the current state (step 206). The confidence score for a given state-action pair is a measure of confidence that the value function estimate for the action is an accurate estimate of the return that will result from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 27-34]), and 
determining a respective estimated return for each action based on returns mapped to by the determined observation-action pairs in the return data (“The system determines a respective value function estimate for each action in the set of actions (step 204) when the environment is in the current state in accordance with the value function representation. As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 6-14]);
selecting the action to be performed by the agent in response to the current observation using the estimated returns (“selecting an action to be performed by the agent in response to the current observation using the respective adjusted value function estimates.” [col 2, lines 2-4]); and
controlling the agent based on the selected action (“In some other implementations, the environment is a real-world environment. For example, the agent may be a robot attempting to complete a specified task and the environment may be the surroundings of the robot as characterized by data captured by one or more sensory input devices of the robot” [col 3, line 66 – col 4, line 4]).
However Arel fails to explicitly teach determining whether the current observation matches any of the observations identified in the return data
in response to determining that the current observation does not match any of the observations identified in the return data,
Taylor teaches determining whether the current observation matches any of the observations identified in the return data and in response to determining that the current observation does not match any of the observations identified in the return data (“Algorithm 1 reasons about pairs of vectors, where these vectors describe transitions in the state space: s → s′. Algorithm 2 calculates the similarity of two vectors, given the current distance metric, where the relatedness of two vectors is at most 1.0 (if they are identical in direction and magnitude). This similarity will be used in the next section to calculate the distance metric under the assumption that states that have similar transitions (for the same action) should be closer in the state space than states that have dissimilar transitions.” [pg. 779, § 3.2 Transition Similarity, ¶1; Taylor measures similarity/dissimilarity between states (i.e., observations) and therefore would be able to determine a “match” or “does not match” between two observations.]).
Arel and Taylor are both in the same field of endeavor of reinforcement learning and thus are analogous. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s reinforcement algorithm by matching previous and current observations as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]
However Arel/Taylor fails to explicitly teach determining a feature representation of the current observation
Kingma1 teaches determining a feature representation of the current observation (“[excerpt reproduced as image media_image1.png in the original action]” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶1]); 
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]

Regarding claim 21, Arel teaches A method for selecting an action from a predetermined set of actions to be performed by an agent interacting with an environment, the method comprising: 
maintaining return data that maps each of a plurality of observation-action pairs to a respective return (“As described above, the value function estimate for a given state-action pair is an estimate of the return resulting from the agent performing the given action in response to an observation characterizing the given state.” [col 7, lines 9-14; See further: “To determine the value function estimate for a given action in implementations where the value function representation is a tabular representation, the system identifies the value function estimate that is mapped to by the combination of the state representation and the given action in the tabular representation.” [col 7, lines 20-26]]), 
wherein the action in each observation-action pair is an action that was performed by the agent in response to the observation in the observation-action pair (“In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for selecting an action to be performed by an agent that interacts with an environment by performing actions selected from a set of actions.” [col 1, lines 44-48]), and
wherein the respective return mapped to by each of the observation-action pairs is a return that resulted from the agent performing the action in the observation-action pair in response to the observation in the observation-action pair (“determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation, wherein the return is a function of future rewards received in response to the agent performing actions to interact with the environment” [col 1, lines 51-59]);
receiving a current observation characterizing a current state of the environment (“The methods include the actions of receiving a current observation, the current observation being data that characterizes a current state of the environment” [col 1, lines 48-51]); 
determining a respective estimated return for each of a plurality of actions in the predetermined set of actions from returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations (“for each action in the set of actions: determining, in accordance with a value function representation and from the action and a current state representation for the current state derived from the current observation, a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation” [col 1, lines 51-57]); and 
selecting the action to be performed by the agent in response to the current observation using the estimated returns (“selecting an action to be performed by the agent in response to the current observation using the respective adjusted value function estimates.” [col 2, lines 2-4]); and 
controlling the agent based on the selected action (“In some other implementations, the environment is a real-world environment. For example, the agent may be a robot attempting to complete a specified task and the environment may be the surroundings of the robot as characterized by data captured by one or more sensory input devices of the robot” [col 3, line 66 – col 4, line 4]).
However Arel fails to explicitly teach determining whether the current observation matches any of the observations identified in the return data
in response to determining that the current observation does not match any of the observations identified in the return data,
Taylor teaches determining whether the current observation matches any of the observations identified in the return data and in response to determining that the current observation does not match any of the observations identified in the return data (“Algorithm 1 reasons about pairs of vectors, where these vectors describe transitions in the state space: s → s′. Algorithm 2 calculates the similarity of two vectors, given the current distance metric, where the relatedness of two vectors is at most 1.0 (if they are identical in direction and magnitude). This similarity will be used in the next section to calculate the distance metric under the assumption that states that have similar transitions (for the same action) should be closer in the state space than states that have dissimilar transitions.” [pg. 779, § 3.2 Transition Similarity, ¶1; Taylor measures similarity/dissimilarity between states (i.e., observations) and therefore would be able to determine a “match” or “does not match” between two observations.]).
Arel and Taylor are both in the same field of endeavor of reinforcement learning and thus are analogous. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s reinforcement algorithm by matching previous and current observations as taught by Taylor. One would have been motivated to find the smallest distance between states in order to determine effective state representations. [pg. 777, § 1. Introduction, ¶5, Taylor]
However Arel/Taylor fails to explicitly teach determining a feature representation of the current observation
determining k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation, wherein k is an integer greater than one;
Kingma1 teaches determining a feature representation of the current observation (“[excerpt reproduced as image media_image1.png in the original action]” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶1]); 
determining k observations identified in the return data that have feature representations that are closest to the feature representation of the current observation, wherein k is an integer greater than one (“A commonly used approach is to construct a model that provides an embedding or feature representation of the data. Using these features, a separate classifier is thereafter trained. The embeddings allow for a clustering of related observations in a latent feature space that allows for accurate classification, even with a limited number of labels. Instead of a linear embedding, or features obtained from a regular auto-encoder, we construct a deep generative model of the data that is able to provide a more robust set of latent features.” [pg. 2, § 2. Deep Generative Models for Semi-supervised Learning, ¶2; clustering of related observations would be equivalent to finding feature representations “closest” to the feature representation of the current observation.]); 
Arel, Taylor, and Kingma1 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s teachings with semi-supervised learning and clustering methods as taught by Kingma1. One would have been motivated to make this modification in order to obtain more accurate predictions. [pg. 1, § Introduction, ¶1, Kingma1]
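For ease of reference, the sequence of steps recited in claim 21 (match check, selection of the k closest feature representations, per-action estimated return, action selection) can be sketched as follows. This is a minimal illustration under stated assumptions and is not drawn from Arel, Taylor, Kingma1, or Applicant's specification: the Euclidean distance, the averaging of neighbor returns, and all names (`select_action`, `return_data`, `features`) are assumptions made only for this sketch.

```python
import math


def select_action(return_data, features, current_obs, current_feat, actions, k):
    """Estimate a return per action from stored returns and pick the best action.

    return_data: dict mapping (observation_id, action) -> return
    features:    dict mapping observation_id -> feature vector (tuple of floats)
    """
    if current_obs in features:
        # The current observation matches a stored one: use it directly.
        neighbours = [current_obs]
    else:
        # No match: take the k stored observations whose feature
        # representations are closest (Euclidean distance, an assumption).
        neighbours = sorted(
            features, key=lambda o: math.dist(features[o], current_feat)
        )[:k]

    estimates = {}
    for action in actions:
        returns = [
            return_data[(o, action)] for o in neighbours if (o, action) in return_data
        ]
        if returns:
            # Averaging is one possible way to combine returns (an assumption).
            estimates[action] = sum(returns) / len(returns)
    # Raises ValueError if no action has any stored return among the neighbours.
    return max(estimates, key=estimates.get), estimates


features = {"o1": (0.0, 0.0), "o2": (1.0, 0.0), "o3": (5.0, 5.0)}
return_data = {
    ("o1", "left"): 1.0, ("o1", "right"): 0.0,
    ("o2", "left"): 3.0, ("o2", "right"): 1.0,
    ("o3", "right"): 10.0,
}
chosen, estimates = select_action(
    return_data, features, "new", (0.4, 0.0), ["left", "right"], k=2
)
```

In this toy data the two nearest stored observations are `o1` and `o2`, so the distant `o3` and its large return do not influence the estimate.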

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Arel in view of Taylor and Kingma1 and further in view of Wookey et al. ("Regularized feature selection in reinforcement learning", hereinafter "Wookey").

Regarding claim 11, Arel/Taylor/Kingma1 teaches The method of claim 10, however fails to explicitly teach wherein projecting the current observation into the smaller-dimensional space comprises applying a random projection matrix to the current observation.
Wookey teaches wherein projecting the current observation into the space comprises applying a random projection matrix to the current observation (“LSTD-RP deals with large feature sets differently. Instead of selecting features so that the set used in the fitting step is computationally manageable, LSTD-RP projects the entire set of features using a random projection matrix onto a low dimensional space. The low dimensional space results in a significant reduction in the size of the matrices used in LSTD, making it possible to compute value functions despite the curse of dimensionality.” [pg. 664, § 3.3 ROMP-TD and LSTD-RP, ¶3]).
Arel, Taylor, Kingma1, and Wookey are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. Wookey discloses regularized feature selection in reinforcement learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s/Kingma1’s teachings by applying a random projection matrix to an observation as taught by Wookey. One would have been motivated to make this modification in order to improve performance and reduce overfitting. [pg. 670, § 5 Future work and conclusions, Wookey]
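By way of illustration only, "applying a random projection matrix to the current observation" generally denotes multiplying the observation vector by a fixed matrix with randomly drawn entries so as to obtain a lower-dimensional vector. The sketch below assumes Gaussian entries scaled by 1/sqrt(out_dim); it is a generic example and is not the LSTD-RP procedure discussed in Wookey. The names `random_projection_matrix` and `project` are hypothetical.

```python
import random


def random_projection_matrix(out_dim, in_dim, seed=0):
    """Build an out_dim x in_dim matrix with i.i.d. Gaussian entries."""
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5  # common scaling choice (an assumption)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(matrix, observation):
    """Matrix-vector product: maps an in_dim observation to out_dim features."""
    return [sum(w * x for w, x in zip(row, observation)) for row in matrix]


matrix = random_projection_matrix(out_dim=4, in_dim=16)
feature = project(matrix, [1.0] * 16)  # 16-dimensional input, 4-dimensional output
```

The same matrix would be reused for every observation so that distances between projected observations remain comparable.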

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Arel in view of Taylor and Kingma1 and further in view of Kingma et al. ("Auto-Encoding Variational Bayes", cited by Applicant in the IDS filed 12/10/2018, hereinafter "Kingma2").

Regarding claim 12, Arel/Taylor/Kingma1 teaches The method of claim 4, where Kingma1 teaches wherein determining the feature representation of the current observation comprises: 
processing the current observation using a model to generate a latent representation of the current observation (“[excerpt reproduced as image media_image1.png in the original action]” [pg. 2, § 2 Deep Generative Models for Semi-supervised Learning, ¶1]); and 
using the latent representation of the current observation as the feature representation of the current observation (“We can combine these two approaches by first learning a new latent representation z1 using the generative model from M1, and subsequently learning a generative semi-supervised model M2, using embeddings from z1 instead of the raw data x” [pg. 3, § Stacked generative semi-supervised model (M1+M2)]).
However Arel/Taylor/Kingma1 fails to explicitly teach using a variational auto-encoder model
Kingma2 teaches using a variational auto-encoder model (“For the case of an i.i.d. dataset and continuous latent variables per datapoint, we propose the AutoEncoding VB (AEVB) algorithm. In the AEVB algorithm we make inference and learning especially efficient by using the SGVB estimator to optimize a recognition model that allows us to perform very efficient approximate posterior inference using simple ancestral sampling, which in turn allows us to efficiently learn the model parameters, without the need of expensive iterative inference schemes (such as MCMC) per datapoint. The learned approximate posterior inference model can also be used for a host of tasks such as recognition, denoising, representation and visualization purposes. When a neural network is used for the recognition model, we arrive at the variational auto-encoder.” [pg. 1, § 1 Introduction, ¶2])
Arel, Taylor, Kingma1, and Kingma2 are all in the same field of endeavor of machine learning. Arel discloses reinforcement learning using confidence scores. Taylor discloses distance metric learning for reinforcement learning agents. Kingma1 discloses semi-supervised learning with deep generative models. Kingma2 discloses using a variational auto-encoder. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Arel’s/Taylor’s/Kingma1’s teachings by using the variational auto-encoder taught by Kingma2 to generate and process the latent representations. One would have been motivated to make this modification in order to make inference and learning more efficient. [pg. 1, § 1 Introduction, ¶2, Kingma2]

Response to Arguments
Applicant's arguments filed 08/23/2022 have been fully considered but they are not persuasive. 

Regarding the 35 U.S.C. §101 Rejection:
Applicant appears to argue that the claimed subject matter of claim 1 reflects a practical implementation of the abstract idea because it reflects an improvement in the functioning of computational technology or a technical field. Examiner respectfully disagrees. As noted above, claim 1 currently recites steps such as “maintaining…,” “determining…,” and “selecting…,” all of which under BRI can be considered to be mental steps. The claims seem to be directed to an improvement of an abstract idea. Improvements to an abstract idea are still considered to be abstract ideas. The computer elements in the claims are merely used as a tool to perform the abstract idea. The claims do not recite any details of the training/learning process that would amount to an integration of the abstract idea into a practical application. Further details of the training process and/or specific actions that the agent is performing would be helpful in overcoming the rejection. Therefore, applicant’s arguments are not persuasive. 

Regarding the 35 U.S.C. §103 Rejection:
Applicant appears to argue that the prior art of Kingma1 fails to explicitly teach “selecting one or more observations identified in the return data based on the feature representation of the current observation and feature representations of the observations identified in the return data.” However, this limitation was not previously recited in independent claim 1 or dependent claim 4. As noted in the updated 103 rejection above, the examiner is relying upon the primary reference of Arel to teach this newly added limitation. Please see the updated 103 rejection above.

Applicant appears to argue that the examiner completely disregards the detailed claim language of claim 4. Examiner respectfully disagrees with this assertion. There appears to be a copy-and-paste error on pgs. 34-35 of the OA mailed 04/06/2022, where the examiner mistakenly notes:

Arel/Taylor fails to explicitly teach: “determining a respective estimated return for each of a plurality of actions in the predetermined set of actions from returns mapped to by observation-action pairs in the return data that include the action and any one of the k observations” and “selecting the action to be performed by the agent in response to the current observation using the estimated returns”
However, the prior art of Arel teaches these limitations.
Examiner would like to direct the applicant to pg. 34 of the OA, where the examiner relies on the prior art of Arel, with sufficient mapping from that reference, to teach those particular limitations. The remaining limitations of the original claim were taught by the prior art of Kingma1. Therefore, all claim limitations were considered and were provided with sufficient mapping.

Applicant’s arguments with respect to the rejections of the dependent claims have been fully considered but they are not persuasive as they rely upon the allowability of the independent claims.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./Examiner, Art Unit 2122

/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122