DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/30/2021 has been entered.
Status of Claims
The following claims are pending in this office action: 1-20
The following claims are amended: 1, 7-8, and 15
The following claims are new: None
The following claims are cancelled: None
The following claims are rejected: 1-20
Response to Arguments
Applicant’s arguments filed on 11/30/2021 addressing the 35 U.S.C. 112(f) interpretation have been considered. The 35 U.S.C. 112(f) interpretation is maintained because the claims include one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language.
Applicant’s arguments with respect to the prior art rejections of claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) are:
“a state encoder for encoding…”
“a previous state estimator (PSE) for estimating…”
“an accuracy evaluator for determining…”
“a reward modifier for adjusting…”
in claim 1
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
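For illustration only, the three-prong test of MPEP § 2181 quoted above can be sketched as a simple predicate. The class name, field names, and function name below are hypothetical and are not drawn from the MPEP or from the application. The flag values shown for the “state encoder” limitation merely restate the interpretation set forth above; they are not an independent finding.

```python
from dataclasses import dataclass


@dataclass
class ClaimLimitation:
    """Hypothetical record of how a claim limitation is worded."""
    text: str
    uses_means_or_step: bool         # literal "means" or "step"
    uses_generic_placeholder: bool   # nonce term substituting for "means"
    has_functional_language: bool    # e.g. "for ...", "configured to ..."
    has_sufficient_structure: bool   # structure, material, or acts recited


def interpreted_under_112f(lim: ClaimLimitation) -> bool:
    """Return True only when prongs (A), (B), and (C) are all met."""
    prong_a = lim.uses_means_or_step or lim.uses_generic_placeholder
    prong_b = lim.has_functional_language
    prong_c = not lim.has_sufficient_structure
    return prong_a and prong_b and prong_c


# Flags restate the interpretation given in this action for claim 1.
state_encoder = ClaimLimitation(
    text="a state encoder for encoding ...",
    uses_means_or_step=False,
    uses_generic_placeholder=True,
    has_functional_language=True,
    has_sufficient_structure=False,
)
print(interpreted_under_112f(state_encoder))
```

Under this sketch, setting has_sufficient_structure to True (option (1) or (2) above) causes prong (C) to fail, so the limitation is no longer treated under 35 U.S.C. 112(f).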
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over JP Patent No. JP4811997B2 to Morimoto, et al. (hereinafter, “Morimoto”), in view of “Recall Traces: Backtracking Models for Efficient Reinforcement Learning” to Goyal, et al. (hereinafter, “Goyal”), further in view of “RLBS: An Adaptive Backtracking Strategy Based on Reinforcement Learning for Combinatorial Optimization” to Bachiri, et al. (hereinafter, “Bachiri”).
As per claim 1, Morimoto teaches a reinforcement learning system comprising:
a state encoder for encoding a spatial and temporal representation of an observed state of an environment; (Morimoto, Para. [0025] discloses “In the state estimation method of the present invention configured as described above, the observation target 1 is operated under control based on a control command from the control device 2, and an observation result obtained by observing the state of the observation target 1 is input to the enhancement learning state estimation device 3.” (Observing the state of the observation target encompasses encoding both a spatial and temporal representation))
an accuracy evaluator for determining a difference between [[the estimated previous state and the previous state]]; (Morimoto, Para. [0008] discloses “An enhanced learning module for calculating a feedback value based on a measure of state estimation in the enhanced learning module using an estimated observation result and a difference in observation and a difference in observed observation results and observed results…”)
and a reward modifier for adjusting a size of a reward based on the difference between [[the estimated previous state and the previous state]] (Morimoto, Para. [0008] discloses “…and a means for calculating a reward value based on a difference between the estimated observation result and the observed result and a feedback value; An updating means for updating a measure of the reinforcing learning module using the calculated reward value” (Updating measure of reinforcing learning module encompasses adjusting reward))
Morimoto fails to explicitly teach:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state
the estimated previous state and the previous state
However, Goyal (Goyal addresses backtracking in reinforcement learning) teaches:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.”)
the estimated previous state and the previous state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (st, at, st+1, rt)” (Replay buffer stores all the states))
wherein the inaccurate estimation by the previous state estimator (PSE), indicating an efficient state representation, [[generates a higher reward]] than an accurate estimation, indicating an inefficient state representation by the previous state estimator (PSE) (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (st, at, st+1, rt)” (Replay buffer stores all the states))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to use the backtracking model for generating previous states as disclosed by Goyal. The combination would have been obvious because a person of ordinary skill in the art would be motivated to use “a backtracking model, which can easily be integrated with existing on- and off policy techniques for reducing sample complexity, i.e. for faster learning” (Goyal, section 2.1)
Morimoto fails to explicitly teach:
generates a higher reward
However, Bachiri teaches:
generates a higher reward (Bachiri, Page 3, 1st Col, Last Para. discloses “Each time we reach a leaf/solution, we need to select the node to backtrack to… Once we select a node, the search continues from that point until we reach a new leaf/solution. The difference between the quality of this new solution and the best solution so far is the reward we get for performing the previous action….This is an opportunity to identify the actions that pay the most (that is, nodes that are more likely to lead to interesting leaves/solutions)”)
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to determine higher reward states in backtracking as disclosed by Bachiri. The combination would have been obvious because a person of ordinary skill in the art would be motivated to “…discover new nodes/actions, in addition to giving us a reward” (Bachiri, Page 3, 2nd Col. 2nd Para.)
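For illustration only, the accuracy evaluator and reward modifier limitations of claim 1, as mapped above, can be sketched as follows. The Euclidean distance metric, the linear bonus, the scale parameter, and all function names are assumptions made for this sketch; they are not taken from the specification, Morimoto, Goyal, or Bachiri.

```python
import math
from typing import Sequence


def state_difference(estimated_prev: Sequence[float],
                     actual_prev: Sequence[float]) -> float:
    """Accuracy evaluator: Euclidean distance between the estimated
    previous state and the actual previous state (assumed metric)."""
    if len(estimated_prev) != len(actual_prev):
        raise ValueError("state vectors must have the same length")
    return math.sqrt(sum((e - a) ** 2
                         for e, a in zip(estimated_prev, actual_prev)))


def adjust_reward(base_reward: float, difference: float,
                  scale: float = 1.0) -> float:
    """Reward modifier: a larger estimation error yields a larger
    reward (assumed linear bonus)."""
    return base_reward + scale * difference


actual_prev = [1.0, 2.0]
accurate_estimate = [1.0, 2.0]      # PSE recovers the previous state
inaccurate_estimate = [4.0, 6.0]    # PSE misses the previous state

r_accurate = adjust_reward(1.0, state_difference(accurate_estimate, actual_prev))
r_inaccurate = adjust_reward(1.0, state_difference(inaccurate_estimate, actual_prev))
print(r_accurate, r_inaccurate)
```

In this sketch the inaccurate estimation produces the higher reward, consistent with the "wherein" limitation as recited in claim 1.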

As per claim 2, the combination of Morimoto, Goyal and Bachiri as shown above teaches the reinforcement learning system of claim 1, Morimoto further teaches:
wherein the state encoder is reconfigured based on the adjusted reward to reduce an effective size of the spatial and temporal representation (Morimoto, Para. [0060] discloses “a measure of the reinforcement learning module 3 b is updated using the calculated reward value (Step S 5)” and Para. [0013] discloses “According to the present invention, by updating the learning parameters using the reinforcing learning, it is possible to reduce the estimation error” (Learning module being updated encompasses reconfiguring the state encoding and reducing estimation error means that the temporal and spatial representation is more compact))

As per claim 8, Morimoto teaches a method comprising:
encoding a spatial and temporal representation of an observed state of an environment; (Morimoto, Para. [0025] discloses “In the state estimation method of the present invention configured as described above, the observation target 1 is operated under control based on a control command from the control device 2, and an observation result obtained by observing the state of the observation target 1 is input to the enhancement learning state estimation device 3.” (Observing the state of the observation target encompasses encoding both a spatial and temporal representation))
adjusting a size of a reward based on a difference between [[the estimated previous state and the previous state]] (Morimoto, Para. [0008] discloses “…and a means for calculating a reward value based on a difference between the estimated observation result and the observed result and a feedback value; An updating means for updating a measure of the reinforcing learning module using the calculated reward value” (Updating measure of reinforcing learning module encompasses adjusting reward))
Morimoto fails to explicitly teach:
generating an estimated previous state that estimates a previous state of a given state
the estimated previous state and the previous state
However, Goyal teaches:
generating an estimated previous state that estimates a previous state of a given state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.”)
the estimated previous state and the previous state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (st, at, st+1, rt)” (Replay buffer stores all the states))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto with the teachings of Goyal for at least the same reasons as discussed above in claim 1.
Morimoto fails to explicitly teach:
generates a higher reward
However, Bachiri teaches:
generates a higher reward (Bachiri, Page 3, 1st Col, Last Para. discloses “Each time we reach a leaf/solution, we need to select the node to backtrack to… Once we select a node, the search continues from that point until we reach a new leaf/solution. The difference between the quality of this new solution and the best solution so far is the reward we get for performing the previous action….This is an opportunity to identify the actions that pay the most (that is, nodes that are more likely to lead to interesting leaves/solutions)”)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto with the teachings of Bachiri for at least the same reasons as discussed above in claim 1.



As per claim 9, the combination of Morimoto, Goyal and Bachiri as shown above teaches the method of claim 8, Morimoto further teaches:
further comprising reconfiguring a state encoder based on the adjusted reward to reduce an effective size of the spatial and temporal representation (Morimoto, Para. [0060] discloses “a measure of the reinforcement learning module 3 b is updated using the calculated reward value (Step S 5)” and Para. [0013] discloses “According to the present invention, by updating the learning parameters using the reinforcing learning, it is possible to reduce the estimation error” (Learning module being updated encompasses reconfiguring the state encoding and reducing estimation error means that the temporal and spatial representation is more compact))

As per claim 15, Morimoto teaches a non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform operations comprising:
encoding a spatial and temporal representation of an observed state of an environment; (Morimoto, Para. [0025] discloses “In the state estimation method of the present invention configured as described above, the observation target 1 is operated under control based on a control command from the control device 2, and an observation result obtained by observing the state of the observation target 1 is input to the enhancement learning state estimation device 3.” (Observing the state of the observation target encompasses encoding both a spatial and temporal representation))
adjusting a size of a reward based on a difference between [[the estimated previous state and the previous state]] (Morimoto, Para. [0008] discloses “…and a means for calculating a reward value based on a difference between the estimated observation result and the observed result and a feedback value; An updating means for updating a measure of the reinforcing learning module using the calculated reward value” (Updating measure of reinforcing learning module encompasses adjusting reward))
Morimoto fails to explicitly teach:
generating an estimated previous state that estimates a previous state of a given state
the estimated previous state and the previous state
However, Goyal teaches:
generating an estimated previous state that estimates a previous state of a given state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.”)
the estimated previous state and the previous state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (st, at, st+1, rt)” (Replay buffer stores all the states))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto with the teachings of Goyal for at least the same reasons as discussed above in claim 1.
Morimoto fails to explicitly teach:
generates a higher reward
However, Bachiri teaches:
generates a higher reward (Bachiri, Page 3, 1st Col, Last Para. discloses “Each time we reach a leaf/solution, we need to select the node to backtrack to… Once we select a node, the search continues from that point until we reach a new leaf/solution. The difference between the quality of this new solution and the best solution so far is the reward we get for performing the previous action….This is an opportunity to identify the actions that pay the most (that is, nodes that are more likely to lead to interesting leaves/solutions)”)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto with the teachings of Bachiri for at least the same reasons as discussed above in claim 1.

As per claim 16, the combination of Morimoto, Goyal and Bachiri as shown above teaches the non-transitory computer readable medium of claim 15, Morimoto further teaches:
wherein the operations further comprise reconfiguring a state encoder based on the adjusted reward to reduce an effective size of the spatial and temporal representation (Morimoto, Para. [0060] discloses “a measure of the reinforcement learning module 3 b is updated using the calculated reward value (Step S 5)” and Para. [0013] discloses “According to the present invention, by updating the learning parameters using the reinforcing learning, it is possible to reduce the estimation error” (Learning module being updated encompasses reconfiguring the state encoding and reducing estimation error means that the temporal and spatial representation is more compact))



Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Morimoto, in view of Goyal, further in view of Bachiri, and further in view of U.S. Patent No. US 9536191 B1 to Arel, et al. (hereinafter, “Arel”).
As per claim 3, the combination of Morimoto, Goyal and Bachiri as shown above teaches the reinforcement learning system of claim 1, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising a controller for determining an action based on the spatial and temporal representation
However, Arel (Arel addresses the issue of reinforcement learning using confidence scores) teaches:
further comprising a controller for determining an action based on the spatial and temporal representation (Arel, Para. [7] discloses “selecting an action to be performed by the agent in response to the current observation…” and Fig. 1 discloses observation 102 used to produce a selected action 104 for agent 110)
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri to select an action based on an observation as disclosed by Arel. The combination would have been obvious because a person of ordinary skill in the art would be motivated to instruct an agent to perform subsequent actions such as to continue gathering observations from the agent regarding the environment in order to continue to further adjust state history of an agent. Doing so would improve learning and generalization.
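For illustration only, a controller that determines an action based on an encoded representation, as recited in claim 3, can be sketched as a greedy selection over per-action scores. The linear scoring, the action names, and the weight values are hypothetical and are not taken from Arel or from the specification.

```python
from typing import Dict, Sequence


def select_action(representation: Sequence[float],
                  action_weights: Dict[str, Sequence[float]]) -> str:
    """Return the action whose linear score against the
    representation is highest (assumed greedy policy)."""
    def score(weights: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(weights, representation))

    return max(action_weights, key=lambda name: score(action_weights[name]))


# Hypothetical two-action controller over a two-element representation.
weights = {"left": [1.0, 0.0], "right": [0.0, 1.0]}
chosen = select_action([0.2, 0.9], weights)
print(chosen)
```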

As per claim 10, the combination of Morimoto, Goyal and Bachiri as shown above teaches the method of claim 8, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising determining an action based on the spatial and temporal representation
However, Arel teaches:
further comprising determining an action based on the spatial and temporal representation (Arel, Para. [7] discloses “selecting an action to be performed by the agent in response to the current observation…” and Fig. 1 discloses observation 102 used to produce a selected action 104 for agent 110)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Arel for at least the same reasons as discussed above in claim 3.

As per claim 17, the combination of Morimoto, Goyal and Bachiri as shown above teaches the non-transitory computer readable medium of claim 15, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
the operations further comprising determining an action based on the spatial and temporal representation
However, Arel teaches:
the operations further comprising determining an action based on the spatial and temporal representation (Arel, Para. [7] discloses “selecting an action to be performed by the agent in response to the current observation…” and Fig. 1 discloses observation 102 used to produce a selected action 104 for agent 110)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Arel for at least the same reasons as discussed above in claim 3.

Claims 4-5, 11-12, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Morimoto, in view of Goyal, further in view of Bachiri, and further in view of U.S. Patent No. US 9679258 B2 to Mnih, et al. (hereinafter, “Mnih”).
As per claim 4, the combination of Morimoto, Goyal and Bachiri as shown above teaches the reinforcement learning system of claim 1, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising a model stochastic gradient descent component for training a model defining the state encoder, the first stochastic gradient descent component using a gradient based optimization algorithm
However, Mnih (Mnih addresses the issue of reinforcement learning with multiple states) teaches:
further comprising a model stochastic gradient descent component for training a model defining the state encoder, the first stochastic gradient descent component using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the state encoder. Stochastic gradient descent is used to train neural networks, and the state encoder comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri to use the stochastic gradient descent training as disclosed by Mnih. The combination would have been obvious because a person of ordinary skill in the art would be motivated to improve the accuracy of the system, as training with stochastic gradient descent minimizes loss more quickly.
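For illustration only, training by stochastic gradient descent can be sketched on a one-variable linear model. This is a generic textbook form of the algorithm, not the neural network training described in Mnih; the function name, learning rate, epoch count, and sample data are assumptions made for this sketch.

```python
import random


def sgd_fit(samples, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on the
    per-sample loss 0.5 * (w * x + b - y) ** 2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)              # visit samples in random order
        for x, y in data:
            error = (w * x + b) - y    # prediction error for one sample
            w -= lr * error * x        # gradient of the loss w.r.t. w
            b -= lr * error            # gradient of the loss w.r.t. b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
w, b = sgd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
print(round(w, 3), round(b, 3))
```

Each parameter update uses the gradient from a single sample rather than from the full data set, which is the feature that distinguishes stochastic gradient descent from batch gradient descent.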

As per claim 5, the combination of Morimoto, Goyal and Bachiri as shown above teaches the reinforcement learning system of claim 1, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising a previous state estimator stochastic gradient descent component for training the previous state estimator (PSE), the previous state estimator stochastic gradient descent component using a gradient based optimization algorithm
However, Mnih teaches:
further comprising a previous state estimator stochastic gradient descent component for training the previous state estimator (PSE), the previous state estimator stochastic gradient descent component using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the previous state estimator. Stochastic gradient descent is used to train neural networks, and the previous state estimator comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Mnih for at least the same reasons as discussed above in claim 4.

As per claim 11, the combination of Morimoto, Goyal and Bachiri as shown above teaches the method of claim 8, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising training a model defining a state encoder using a gradient based optimization algorithm
However, Mnih teaches:
further comprising training a model defining a state encoder using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the state encoder. Stochastic gradient descent is used to train neural networks, and the state encoder comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Mnih for at least the same reasons as discussed above in claim 4.

As per claim 12, the combination of Morimoto, Goyal and Bachiri as shown above teaches the method of claim 8, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
further comprising training a previous state estimator using a gradient based optimization algorithm
However, Mnih teaches:
further comprising training a previous state estimator using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the previous state estimator. Stochastic gradient descent is used to train neural networks, and the previous state estimator comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Mnih for at least the same reasons as discussed above in claim 4.

As per claim 18, the combination of Morimoto, Goyal and Bachiri as shown above teaches the non-transitory computer readable medium of claim 15, the combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
the operations further comprising training a model defining a state encoder using a gradient based optimization algorithm
However, Mnih teaches:
the operations further comprising training a model defining a state encoder using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the state encoder. Stochastic gradient descent is used to train neural networks, and the state encoder comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Mnih for at least the same reasons as discussed above in claim 4.

As per claim 19, the combination of Morimoto, Goyal and Bachiri as shown above teaches the non-transitory computer readable medium of claim 15. The combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
the operations further comprising training a previous state estimator using a gradient based optimization algorithm
However, Mnih teaches:
the operations further comprising training a previous state estimator using a gradient based optimization algorithm (Mnih, Para. [31] discloses “Conveniently the training may be implemented using stochastic gradient descent, for example using back-propagation” and Para. [61] discloses “…the method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.” (Stochastic gradient descent is used to train the state encoder. Stochastic gradient descent is used to train neural networks, and the previous state estimator comprises neural networks. Additionally, stochastic gradient descent is a gradient based optimization algorithm))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Mnih for at least the same reasons as discussed above in claim 4.
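For illustration only, the following minimal sketch shows what training with a gradient based optimization algorithm such as stochastic gradient descent can look like. It is not taken from Mnih or from the application; the one-parameter model, the function name `sgd_fit`, the learning rate, and the sample data are all assumptions chosen to keep the example small.

```python
import random


def sgd_fit(samples, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error.

    Hypothetical toy model: one weight, one sample per update step.
    """
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in random order each epoch
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # dL/dw for L = (w * x - y) ** 2
            w -= lr * grad                # step against the gradient
    return w


# Assumed sample data generated by y = 3 * x.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = sgd_fit(samples)
print(round(w, 6))
```

Each update uses the gradient of the loss for a single sample, which is the sense in which stochastic gradient descent is a gradient based optimization algorithm.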

Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Morimoto, in view of Goyal, further in view of Bachiri and further in view of “10 Stochastic Gradient Descent Optimisation Algorithms + Cheat Sheet” to Karim (hereinafter, “Karim”)
As per claim 6, the combination of Morimoto, Goyal and Bachiri as shown above teaches the reinforcement learning system of claim 1. The combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
wherein the state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator
However, Karim (Karim addresses the issue of stochastic gradient descent optimization algorithms) teaches:
wherein the state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator (Karim, Stochastic gradient descent section discloses 
   [Equation image: media_image1.png]
   where current gradient ∂L/∂w is multiplied by some factor (A negative constant value may be used in place of the factor thus essentially flipping the value of the gradient))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri to flip values of gradients as disclosed by Karim. The combination would have been obvious because a person of ordinary skill in the art would be motivated to improve prediction accuracy of the system so that a state estimator may attempt to more accurately produce predicted states from a given state.
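For illustration only, the effect of the sign of the factor applied to the gradient can be sketched as follows. This is not the equation reproduced from Karim and not the claimed state encoder; the update form `w - factor * grad`, the quadratic loss, and the names `scaled_update` and `loss_grad` are assumptions made for the example.

```python
def scaled_update(w, grad, factor):
    """Generic gradient-style update: subtract the gradient times a factor."""
    return w - factor * grad


def loss_grad(w):
    """Gradient of the assumed loss L(w) = (w - 2) ** 2, minimized at w = 2."""
    return 2.0 * (w - 2.0)


w0 = 0.0
g = loss_grad(w0)                       # -4.0 at w0 = 0
descent = scaled_update(w0, g, 0.1)     # positive factor: moves toward w = 2
ascent = scaled_update(w0, g, -0.1)     # negative factor: same step, opposite direction
print(descent, ascent)
```

With a positive factor the weight moves toward the minimum; with the factor negated, the step has the same magnitude but the opposite direction, which is the "flipping" referred to above.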

As per claim 13, the combination of Morimoto, Goyal and Bachiri as shown above teaches the method of claim 8. The combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
wherein a state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator
However, Karim teaches:
wherein a state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator (Karim, Stochastic gradient descent section discloses 
   [Equation image: media_image1.png]
   where current gradient ∂L/∂w is multiplied by some factor (A negative constant value may be used in place of the factor thus essentially flipping the value of the gradient))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Karim for at least the same reasons as discussed above in claim 6.

As per claim 20, the combination of Morimoto, Goyal and Bachiri as shown above teaches the non-transitory computer readable medium of claim 15. The combination of Morimoto, Goyal and Bachiri fails to explicitly teach:
wherein a state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator
However, Karim teaches:
wherein a state encoder is trained to multiply gradients by minus one to generate input to the previous state estimator (Karim, Stochastic gradient descent section discloses 
   [Equation image: media_image1.png]
   where current gradient ∂L/∂w is multiplied by some factor (A negative constant value may be used in place of the factor thus essentially flipping the value of the gradient))
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Morimoto/Goyal/Bachiri with the teachings of Karim for at least the same reasons as discussed above in claim 6.

Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Morimoto in view of Goyal, and further in view of “Exploration-Exploitation Trade-off in Deep Reinforcement Learning” to Rusch (hereinafter, “Rusch”)
As per claim 7, Morimoto teaches a reinforcement learning system comprising:
a state encoder for encoding a spatial and temporal representation of an observed state of an environment; (Morimoto, Para. [0025] discloses “In the state estimation method of the present invention configured as described above, the observation target 1 is operated under control based on a control command from the control device 2, and an observation result obtained by observing the state of the observation target 1 is input to the enhancement learning state estimation device 3.” (Observing the state of the observation target encompasses encoding both a spatial and temporal representation))
an accuracy evaluator for determining a difference between [[the estimated previous state and the previous state]]; (Morimoto, Para. [0008] discloses “An enhanced learning module for calculating a feedback value based on a measure of state estimation in the enhanced learning module using an estimated observation result and a difference in observation and a difference in observed observation results and observed results…”)
and a reward modifier for adjusting a size of a reward based on the difference between [[the estimated previous state and the previous state]], wherein the reward modifier adjusts the size of the reward based on (Morimoto, Para. [0008] discloses “…and a means for calculating a reward value based on a difference between the estimated observation result and the observed result and a feedback value; An updating means for updating a measure of the reinforcing learning module using the calculated reward value” (Updating measure of reinforcing learning module encompasses adjusting reward))
Morimoto fails to explicitly teach:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state
the estimated previous state and the previous state
However, Goyal (Goyal addresses backtracking in reinforcement learning) teaches:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.”)
the estimated previous state and the previous state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (s_t, a_t, s_{t+1}, r_t)” (Replay buffer stores all the states))
wherein the inaccurate estimation by the previous state estimator (PSE), indicating an efficient state representation, [[generates a higher reward]] than an accurate estimation, indicating an inefficient state representation by the previous state estimator (PSE) (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (s_t, a_t, s_{t+1}, r_t)” (Replay buffer stores all the states))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to use the backtracking model for generating previous states as disclosed by Goyal. The combination would have been obvious because a person of ordinary skill in the art would be motivated to use “a backtracking model, which can easily be integrated with existing on- and off policy techniques for reducing sample complexity, i.e. for faster learning” (Goyal, section 2.1)
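For illustration only, the following sketch shows how transitions stored as (s_t, a_t, s_{t+1}, r_t) tuples make both an estimated previous state and the actual previous state available for comparison. It is not code from Goyal or Morimoto; the class `ReplayBuffer`, the function `estimation_errors`, the Euclidean distance, and the toy estimator are all assumptions introduced for the example.

```python
import math
from collections import deque


class ReplayBuffer:
    """Hypothetical buffer holding (s_t, a_t, s_next, r_t) tuples."""

    def __init__(self, capacity=1000):
        self._items = deque(maxlen=capacity)

    def add(self, s_t, a_t, s_next, r_t):
        self._items.append((s_t, a_t, s_next, r_t))

    def transitions(self):
        return list(self._items)


def euclidean(a, b):
    """Assumed distance between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimation_errors(buffer, estimate_previous):
    """Compare the estimate of the previous state with the stored one."""
    return [
        euclidean(estimate_previous(s_next), s_t)
        for s_t, _a, s_next, _r in buffer.transitions()
    ]


# Toy estimator (assumption): the previous state is the given state minus 1.
def toy_estimator(state):
    return tuple(x - 1.0 for x in state)


buf = ReplayBuffer()
buf.add((0.0, 0.0), 0, (1.0, 1.0), 0.0)   # estimate matches the stored state
buf.add((1.0, 1.0), 1, (2.0, 3.0), 1.0)   # estimate is off by 1 in one dimension
errors = estimation_errors(buf, toy_estimator)
print(errors)
```

Because each tuple stores s_t alongside s_{t+1}, the stored s_t serves as the reference against which an estimate computed from s_{t+1} can be measured.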
Morimoto fails to explicitly teach:

   [Equation image: media_image2.png]
 where rf represents a reward given by an environment at time t, st is a state of the environment at the time t, [[H(p) is a function that is trained to estimate the previous state from the given state]], 8 is a general distance function between two state representations, and E is a predefined weight.
However, Rusch teaches:

   [Equation image: media_image2.png]
 where rf represents a reward given by an environment at time t, st is a state of the environment at the time t, [[H(p) is a function that is trained to estimate the previous state from the given state]], 8 is a general distance function between two state representations, and E is a predefined weight. (Rusch, Page 10-11 discloses “The intrinsic reward is the error of the forward model f computed in feature space
   [Equation image: media_image3.png]
Computing the error in feature space allows this method to scale to complex inputs. The feature transformation φ is trained using a self-supervised inverse dynamics model. The goal of the inverse dynamics model is to predict the action a_t from a transition s_t to s_{t+1}.” (The reward function is modified by adding a value based on the difference between the actual next state and the predicted next state. Changing the provided equation from the predicted next state to the predicted previous state is a straightforward modification))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to modify a reward based on estimated states as disclosed by Rusch. The combination would have been obvious because a person of ordinary skill in the art would be motivated to “compute a simpler state representation” (Rusch, Page 10, Last Para.)
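For illustration only, a reward adjusted by the difference between an estimated previous state and the actual previous state can be sketched as follows. The claimed equation and the Rusch equation appear in this action only as images and are not reproduced here; the additive form below, the Euclidean distance, and the names `modified_reward`, `estimate_previous`, and `weight` are assumptions and should not be read as the recited formula.

```python
import math


def distance(a, b):
    """Assumed general distance between two state representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def modified_reward(r_t, s_t, s_prev, estimate_previous, weight):
    """Environment reward plus a weighted previous-state estimation error.

    Assumed form: a larger estimation error yields a larger reward.
    """
    return r_t + weight * distance(estimate_previous(s_t), s_prev)


# Toy estimator (assumption): guesses that the state did not change.
def identity_estimator(state):
    return state


inaccurate = modified_reward(1.0, (3.0, 4.0), (0.0, 0.0), identity_estimator, 0.5)
accurate = modified_reward(1.0, (3.0, 4.0), (3.0, 4.0), identity_estimator, 0.5)
print(inaccurate, accurate)
```

Under this assumed form, the inaccurate estimate produces the higher reward, consistent with the limitation that an inaccurate estimation generates a higher reward than an accurate estimation.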
 
As per claim 14, Morimoto teaches a method comprising:
a state encoder for encoding a spatial and temporal representation of an observed state of an environment; (Morimoto, Para. [0025] discloses “In the state estimation method of the present invention configured as described above, the observation target 1 is operated under control based on a control command from the control device 2, and an observation result obtained by observing the state of the observation target 1 is input to the enhancement learning state estimation device 3.” (Observing the state of the observation target encompasses encoding both a spatial and temporal representation))
an accuracy evaluator for determining a difference between [[the estimated previous state and the previous state]]; (Morimoto, Para. [0008] discloses “An enhanced learning module for calculating a feedback value based on a measure of state estimation in the enhanced learning module using an estimated observation result and a difference in observation and a difference in observed observation results and observed results…”)
and a reward modifier for adjusting a size of a reward based on the difference between [[the estimated previous state and the previous state]], wherein the reward modifier adjusts the size of the reward based on (Morimoto, Para. [0008] discloses “…and a means for calculating a reward value based on a difference between the estimated observation result and the observed result and a feedback value; An updating means for updating a measure of the reinforcing learning module using the calculated reward value” (Updating measure of reinforcing learning module encompasses adjusting reward))
Morimoto fails to explicitly teach:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state
the estimated previous state and the previous state
However, Goyal (Goyal addresses backtracking in reinforcement learning) teaches:
a previous state estimator (PSE) for generating an estimated previous state that estimates a previous state of a given state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.”)
the estimated previous state and the previous state (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (s_t, a_t, s_{t+1}, r_t)” (Replay buffer stores all the states))
wherein the inaccurate estimation by the previous state estimator (PSE), indicating an efficient state representation, [[generates a higher reward]] than an accurate estimation, indicating an inefficient state representation by the previous state estimator (PSE) (Goyal, Abstract discloses “To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state.” And Section 2.2 discloses “The agent’s full experience is maintained in a replay buffer B, in the form of tuples of (s_t, a_t, s_{t+1}, r_t)” (Replay buffer stores all the states))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to use the backtracking model for generating previous states as disclosed by Goyal. The combination would have been obvious because a person of ordinary skill in the art would be motivated to use “a backtracking model, which can easily be integrated with existing on- and off policy techniques for reducing sample complexity, i.e. for faster learning” (Goyal, section 2.1)
Morimoto fails to explicitly teach:

   [Equation image: media_image2.png]
 where rf represents a reward given by an environment at time t, st is a state of the environment at the time t, [[H(p) is a function that is trained to estimate the previous state from the given state]], 8 is a general distance function between two state representations, and E is a predefined weight.
However, Rusch teaches:

   [Equation image: media_image2.png]
 where rf represents a reward given by an environment at time t, st is a state of the environment at the time t, [[H(p) is a function that is trained to estimate the previous state from the given state]], 8 is a general distance function between two state representations, and E is a predefined weight. (Rusch, Page 10-11 discloses “The intrinsic reward is the error of the forward model f computed in feature space
   [Equation image: media_image3.png]
Computing the error in feature space allows this method to scale to complex inputs. The feature transformation φ is trained using a self-supervised inverse dynamics model. The goal of the inverse dynamics model is to predict the action a_t from a transition s_t to s_{t+1}.” (The reward function is modified by adding a value based on the difference between the actual next state and the predicted next state. Changing the provided equation from the predicted next state to the predicted previous state is a straightforward modification))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the reinforcement learning system as disclosed by Morimoto to modify a reward based on estimated states as disclosed by Rusch. The combination would have been obvious because a person of ordinary skill in the art would be motivated to “compute a simpler state representation” (Rusch, Page 10, Last Para.)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAMZA RAZZAQ MUGHAL whose telephone number is (571)272-8833. The examiner can normally be reached M-TR 7:30-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ALEXEY SHMATOV can be reached on 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/H.R.M./Examiner, Art Unit 2123                                                                                                                                                                                                        
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145