DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 17, and 20 are rejected under 35 U.S.C. 101 because Claims 1, 17, and 20 are directed to "A neural network system [suitable] for reinforcement learning" and are construed as relating to physical entities. However, the indication "the neural network system comprising: a reinforcement learning neural network [...] and a reward function network [...]" is ambiguous as to whether such networks are devices of the claimed neural network system or abstract features such as program instructions [i.e. lacking technical character per se] comprised by the claimed neural network system, because neural networks as such relate to abstract computational models. The applicant may want to replace occurrences of "network" with e.g. "network device".

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1, 2, 16, 17, 20 and 21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In Claims 1 and 20, “to perform a task in an attempt to achieve a specified result” is unclear. It is not clear how performing an unspecified task may “achieve a specified result”.
In Claim 2, “a weighted sum of reward predictions for n future time steps, where n is an integer equal to 1” is not clear. It is unclear how “n future time steps, where n is an integer equal to 1” can yield “a weighted sum”, as two [or more] operands are necessary for summing.
In Claim 16, “a weighted sum of reward predictions for n future action steps, where n is an integer equal to 1” is not clear. It is unclear how “n future action steps, where n is an integer equal to 1” can yield “a weighted sum”, as two [or more] operands are necessary for summing.
In Claim 17, “training a reinforcement learning neural network by performing a plurality of reinforcement learning steps on input observation data characterizing a state of an environment to learn an action policy” is not clear. It is an attempt to define the claimed activities by a result to be achieved, rather than defining how a policy is learned by the neural network.
In Claim 17, the indication of "a rate of exponential decay of the weighted set of n step returns" does not have a well-recognized meaning, and the decay of a set is not defined in the claim. “A rate of exponential decay of the weighted set of n step returns with a number of steps over which the return is calculated” is not clear.
In Claim 21, “a weighted sum of reward predictions for n future action steps, where n is an integer equal to 1” is not clear. It is unclear how “n future action steps, where n is an integer equal to 1” can yield “a weighted sum”, as two [or more] operands are necessary for summing.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-18, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Theodoridis et al. (The Fuzzy Sarsa (λ) Learning Approach Applied to a Strategic Route Learning Robot Behaviour) in view of Ni et al. (Goal Representation Heuristic Dynamic Programming on Maze Navigation).

Regarding Claim 1, Theodoridis discloses a neural network system for reinforcement learning (page 1772, last para), the neural network system comprising:
a reinforcement learning (Theodoridis abstract; Fig. 2: reinforcement sarsa (λ) learning algorithm; section: II C) neural network to select actions to be performed by an agent (Theodoridis Fig. 2: agent) interacting with an environment (Theodoridis Fig. 2: environment) to perform a task in an attempt to achieve a specified result, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment (Theodoridis Fig. 2: input from environment) and at least one output for determining an action to be performed by the agent in response to the input observation (Theodoridis Fig. 2: input s(t); output a(t); section: II B).
Theodoridis may not explicitly disclose a reinforcement learning neural network to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment and at least one output for determining an action to be performed by the agent in response to the input observation; and a reward function network, coupled to the reinforcement learning neural network, having an input to receive reward data characterizing a reward provided by one or more states of the environment, and configured to determine a reward function to provide one or more target values for training the reinforcement learning neural network.
However, Ni (abstract) teaches a reinforcement learning (Ni abstract; Fig. 1: two other reinforcement learning algorithms, namely sarsa (λ) and Q-learning algorithms) neural network (Ni abstract) to select actions to be performed by an agent (Ni abstract) interacting with an environment (Ni Fig. 1: environment) to perform a task in an attempt to achieve a specified result (Ni abstract: maze navigation), the reinforcement learning neural network having at least one input to receive an input observation characterizing a state of the environment (Ni Fig. 1: input from environment: state and reward) and at least one output for determining an action to be performed by the agent in response to the input observation (Ni Fig. 1: inputs x(k); r(k); output u(k); section: II A); and
a reward function network, coupled to the reinforcement learning neural network (Ni page 2039 right hand para 1 discloses it is desirable to find a general reward function that can be able to be adaptively tuned according to the possible change of the environment/system. Here we proposed to integrate one additional network, namely goal network, into online model-free HDP design, to provide an informative internal reward/goal signal that can be adaptively tuned, and updated according the system state over time), having an input to receive reward data characterizing a reward provided by one or more states of the environment (Ni page 2040 section II para 1 discloses the agent observes the system state from the environment and provide the action based on the current state. The corresponding reward will be provided by the environment based on the performance of the action; Fig. 1: reward r(k)), and configured to determine a reward function to provide one or more target values for training the reinforcement learning neural network (Ni page 2039 right hand para 1 discloses the key objective of our method is to use the goal network to adaptively build an internal goal/reward signal to guide the system’s decision-making process. page 2040 section II para 1 discloses our proposed GrHDP [Goal representation Heuristic Dynamic Programming] design integrates a goal network to learn from external reward r, and provide the critic network with a detailed internal reward s instantly).
Theodoridis and Ni are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the fuzzy Sarsa (λ) learning approach (as taught by Theodoridis) to provide a general reward function that can be adaptively tuned according to the possible change of the environment (as taught by Ni, page 2039 right hand para 1) to approximate the discounted total future external reward with internal reward signal s by introducing the goal network (Ni, page 2040 left hand last para).

Regarding Claim 6, Theodoridis in view of Ni discloses a neural network system as claimed in claim 1
wherein the reward function network includes one or more learnable parameters to determine a λ-value; and wherein the one or more target values are dependent upon the λ-value (Theodoridis page 1771 section V-A discloses the learning parameters of the FSλL controller are shown in table III).

Regarding Claim 7, Theodoridis in view of Ni discloses a neural network system as claimed in claim 6. But Theodoridis may not explicitly disclose wherein the reward function network includes a λ-network coupled to the reinforcement learning neural network to determine the λ-value from a state of the reinforcement learning neural network.
However, Ni (Fig. 1) teaches wherein the reward function network (Ni Fig. 1: goal network) includes a λ-network coupled to the reinforcement learning neural network (Ni Fig. 1: critic network and action network) to determine the λ-value from a state of the reinforcement learning neural network (Ni Fig. 1: inputs x(k); r(k); output u(k); section: II discloses to closely connect the critic network with the goal network, the internal reward s is set to be included in the inputs for the critic network. Therefore, the input of goal network and critic network can be denoted as xg = [X, u] and xc = [X, u, s], respectively).
Theodoridis and Ni are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the fuzzy Sarsa (λ) learning approach (as taught by Theodoridis) to provide a general reward function that can be adaptively tuned according to the possible change of the environment (as taught by Ni, page 2039 right hand para 1) to approximate the discounted total future external reward with internal reward signal s by introducing the goal network (Ni, page 2040 left hand last para).

Regarding Claim 8, Theodoridis in view of Ni discloses a neural network system as claimed in claim 6
wherein the one or more target values comprise λ-return values (Theodoridis page 1768 section II-B discloses the FSλL model is described by a learning architecture in Fig. 2, in which the FSλL is connected to a static environment that feeds the robot agent with state information. According to the perceived states the agent applies actions to the environment that responds with rewards. This is the state-action-reward cycle which takes place in all RL algorithms as well).
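
For illustration only, the term "λ-return" can be sketched as follows. This is a minimal sketch that assumes the standard textbook recursion G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λG_{t+1}]; it is not a formula taken from Theodoridis, Ni, or the application, and the function name `lambda_returns` and its parameters are hypothetical.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute lambda-returns by backward recursion (illustrative assumption).

    rewards[t]     : reward received after the action at step t
    next_values[t] : value estimate V(s_{t+1}) of the state reached at step t
    gamma          : discount factor
    lam            : the lambda parameter mixing n-step returns
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have the same length")
    if not rewards:
        return []
    returns = [0.0] * len(rewards)
    # Beyond the last recorded step, bootstrap from the final value estimate.
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * (
            (1.0 - lam) * next_values[t] + lam * next_return
        )
        next_return = returns[t]
    return returns
```

Under these assumptions, λ = 0 yields the one-step bootstrapped target at every step, and λ = 1 with a final value estimate of zero yields the full discounted sum of the remaining rewards.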

Regarding Claim 9, Theodoridis in view of Ni discloses a neural network system as claimed in claim 1 further comprising
a reward function target generator to generate reward function targets for training one or more learnable parameters of the reward function network (Theodoridis table III).

Regarding Claim 10, Theodoridis in view of Ni discloses a neural network system as claimed in claim 9
wherein the reward function targets comprise alternate λ-return values generated independently of the one or more target values from the reward function network (Theodoridis page 1768 section II-C discloses the learning controller for the route learning task is based on the Sarsa(λ) learning that can explore the state space and exploit it at the same time by obtaining instant experience. Thus it does not have to wait until the goal area is reached in order to propagate back the collected rewards and then to build a “delayed” experience. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs rather than to states only. Obviously, a trace of eligibility is needed for each state-action pair et(s, α)).
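
For illustration only, the eligibility-trace mechanism referred to in the cited passage can be sketched as a single tabular Sarsa(λ) update. This is a minimal sketch that assumes accumulating traces and the commonly published update rule; it is not the fuzzy FSλL controller of Theodoridis, and the function name `sarsa_lambda_step` and its parameters are hypothetical.

```python
from collections import defaultdict


def sarsa_lambda_step(q, traces, s, a, r, s_next, a_next, alpha, gamma, lam):
    """Apply one tabular Sarsa(lambda) update in place (illustrative assumption).

    q and traces map (state, action) pairs to floats; both are expected to be
    defaultdict(float) so that unseen pairs start at zero.
    """
    # Temporal-difference error for the transition (s, a) -> (s_next, a_next).
    td_error = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Accumulating trace: mark the pair that was just visited.
    traces[(s, a)] += 1.0
    # Every pair with a trace is credited, then its trace decays.
    for key in list(traces):
        q[key] += alpha * td_error * traces[key]
        traces[key] *= gamma * lam
    return td_error


q = defaultdict(float)
traces = defaultdict(float)
first_error = sarsa_lambda_step(q, traces, "s0", "right", 1.0, "s1", "right", 0.5, 0.9, 0.8)
second_error = sarsa_lambda_step(q, traces, "s1", "right", 1.0, "s2", "right", 0.5, 0.9, 0.8)
```

In this sketch the pair visited at the first step receives additional credit at the second step through its decayed trace, without waiting for the goal to be reached.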

Regarding Claim 11, Theodoridis in view of Ni discloses a neural network system as claimed in claim 10
wherein the reward function target generator is configured to perform an alternate rollout from a current state of the environment to determine the reward function targets (Theodoridis page 1768 section II-B para 3 discloses the defuzzifier engine undertakes the updating of the Q-function. The updated Q-function is propagated back to the Sarsa(λ) algorithm and then, according to the updated Q(s, a) values, the Sarsa(λ) algorithm selects an action with probability 1-ε [ε-greedy policy] and the agent performs that action to the environment. Immediately after that, a reward r evaluates the action performed and the algorithm receives the next state s’).
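
For illustration only, the ε-greedy action selection mentioned in the cited passage can be sketched as follows. This is a minimal sketch of the commonly described rule (greedy action with probability 1 − ε, uniformly random action otherwise); the function name `epsilon_greedy` and its parameters are hypothetical and are not taken from Theodoridis.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by an epsilon-greedy rule (illustrative).

    q_values : sequence of Q(s, a) estimates for the current state
    epsilon  : exploration probability in [0, 1]
    rng      : object providing random() and randrange(), e.g. random.Random
    """
    if not q_values:
        raise ValueError("q_values must not be empty")
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda i: q_values[i])
```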

Regarding Claim 12, Theodoridis in view of Ni discloses a neural network system as claimed in claim 10 further comprising
memory to store the target values for training the reinforcement learning neural network, and wherein the reward function target generator is configured to retrieve the stored target values to provide the reward function targets (Theodoridis page 1768 section II-C last para discloses one of the most significant characteristics for the Sarsa algorithm is that it credits each action taken during the agent's locomotive procedure so that on-line learning can be achieved. Each trace of eligibility constitutes a short term memory of the frequency of the state-action pair visits or triggered rules in the case of Reinforcement Fuzzy Learning. Retrieving from memory is known in the art).

Regarding Claim 13, Theodoridis in view of Ni discloses a neural network system as claimed in claim 1. But Theodoridis may not explicitly disclose wherein the reinforcement learning neural network comprises a recurrent neural network to provide a representation of a sequence of states of the environment comprising a sequence of state-dependent values, and wherein the reward function network is configured to generate the one or more target values from the sequence of state-dependent values.
However, Ni (Fig. 1) teaches wherein the reinforcement learning neural network comprises a recurrent neural network to provide a representation of a sequence of states of the environment comprising a sequence of state-dependent values (Ni page 2039 left hand para 2 discloses the typical MDP benchmark, namely the maze navigation problem, was tested with adaptive-critic designs in a closed-loop form with simultaneous recurrent neural network [SRN]), and
wherein the reward function network is configured to generate the one or more target values from the sequence of state-dependent values (Ni page 2040 section II first para discloses the interaction between proposed GrHDP agent and the maze/environment is shown in Fig. 1, where one can see that the agent observes the system state from the environment and provide the action based on the current state. The corresponding reward will be provided by the environment based on the performance of the action. In the agent block, the similar design is kept like the traditional HDP. That is to say, model-free action dependent design for GrHDP is adopted and online learning for the neural networks in agent is used).
Theodoridis and Ni are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the fuzzy Sarsa (λ) learning approach (as taught by Theodoridis) to provide a general reward function that can be adaptively tuned according to the possible change of the environment (as taught by Ni, page 2039 right hand para 1) to approximate the discounted total future external reward with internal reward signal s by introducing the goal network (Ni, page 2040 left hand last para).

Regarding Claim 14, Theodoridis in view of Ni discloses a neural network system as claimed in claim 13. But Theodoridis may not explicitly disclose wherein the reward function network has an input to receive state-dependent reward value data for the sequence of states of the environment.
However, Ni (Fig. 1) teaches wherein the reward function network has an input to receive state-dependent reward value data for the sequence of states of the environment (Ni page 2040 section II first para discloses the proposed GrHDP design integrates a goal network to learn from external reward r, and provide the critic network with a detailed internal reward s instantly).
Theodoridis and Ni are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the fuzzy Sarsa (λ) learning approach (as taught by Theodoridis) to provide a general reward function that can be adaptively tuned according to the possible change of the environment (as taught by Ni, page 2039 right hand para 1) to approximate the discounted total future external reward with internal reward signal s by introducing the goal network (Ni, page 2040 left hand last para).

Regarding Claim 15, Theodoridis in view of Ni discloses a neural network system as claimed in claim 13 including
episodic memory to store state and reward data from previous states of the system, and wherein the reward function network is configured to receive reward data from the episodic memory (Theodoridis page 1768 section II-C last para discloses one of the most significant characteristics for the Sarsa algorithm is that it credits each action taken during the agent's locomotive procedure so that on-line learning can be achieved. Each trace of eligibility constitutes a short term memory of the frequency of the state-action pair visits or triggered rules in the case of Reinforcement Fuzzy Learning. Retrieving from memory is known in the art).

Claims 17-18 and 20-21 are rejected for the same reasons as set forth in Claims 1 and 6-15.
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Theodoridis et al. (The Fuzzy Sarsa (λ) Learning Approach Applied to a Strategic Route Learning Robot Behaviour) in view of Ni et al. (Goal Representation Heuristic Dynamic Programming on Maze Navigation) further in view of Graepel et al. (US #2018/0032864).

Regarding Claim 2, Theodoridis in view of Ni discloses a neural network system as claimed in claim 1 but may not explicitly disclose wherein the reward function comprises a weighted sum of reward predictions for n future time steps, where n is an integer equal to or greater than 1.
However, Graepel teaches wherein the reward function comprises a weighted sum of reward predictions for n future time steps, where n is an integer equal to or greater than 1 (Graepel ¶0096 discloses for example, when combined, the leaf evaluation score can be a weighted sum of the value score and the rollout long-term reward).
Theodoridis, Ni and Graepel are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Theodoridis in view of Ni in light of the teachings of Graepel to provide the weighted sum of reward (as taught by Graepel, ¶0096) to perform actions by an agent interacting with an environment that has a very large state space which can be effectively selected to maximize the rewards resulting from the performance of the action (Graepel, ¶0007).



Claims 3-5 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Theodoridis et al. (The Fuzzy Sarsa (λ) Learning Approach Applied to a Strategic Route Learning Robot Behaviour) in view of Ni et al. (Goal Representation Heuristic Dynamic Programming on Maze Navigation) further in view of Graepel et al. (US #2018/0032864) and further in view of Spoerre et al. (US Patent #5602761).

Regarding Claim 3, Theodoridis in view of Ni and Graepel discloses a neural network system as claimed in claim 2 but may not explicitly disclose wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ defining a decay of the weighted sum away from a current time step.
However, Spoerre teaches wherein the weighted sum comprises an exponentially weighted sum with decay parameter λ defining a decay of the weighted sum away from a current time step (Spoerre col. 9 lines 34-48 discloses the EWMA [Exponentially Weighted Moving Average] constant can be as in Equation 14, where the $w_i$ are the weights and $w_i = \lambda (1 - \lambda)^{t-i}$. The sum of weights can be found [refer to the equation]. The constant λ determines the “memory” of EWMA statistics. That is, λ determines the rate of decay of weighted sum with decay parameter λ).
Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Theodoridis in view of Ni and Graepel in light of the teachings of Spoerre to provide the weighted sum with a decay parameter (as taught by Spoerre, col. 9 lines 34-48) to perform the action (Spoerre, col. 1 lines 43-47).
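
For illustration only, the exponentially decaying weights of the form λ(1 − λ)^(t − i) recited above can be sketched as follows. This is a minimal sketch that assumes observations indexed i = 1..t and a starting EWMA value of zero; the closed form for the sum of weights is the ordinary geometric-series result and is not quoted from Spoerre. The function names `ewma_weights` and `ewma` are hypothetical.

```python
def ewma_weights(lam, t):
    """Weights lam * (1 - lam) ** (t - i) for i = 1..t (illustrative assumption)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    return [lam * (1.0 - lam) ** (t - i) for i in range(1, t + 1)]


def ewma(observations, lam):
    """Weighted sum of the observations using the weights above.

    Assumes the moving average starts from zero, so the weights sum to
    1 - (1 - lam) ** t rather than exactly 1.
    """
    weights = ewma_weights(lam, len(observations))
    return sum(w * x for w, x in zip(weights, observations))
```

Under these assumptions the most recent observation always receives weight λ, and each older observation receives a weight smaller by a factor of (1 − λ), so a larger λ corresponds to a shorter "memory".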

Regarding Claim 4, Theodoridis in view of Ni, Graepel, and Spoerre discloses a neural network system as claimed in claim 3. But Theodoridis in view of Ni and Graepel may not explicitly disclose wherein the reward function network is configured to perform computations weighted by λ, wherein the reward function network is configured to learn a value for λ.
However, Spoerre teaches wherein the reward function network is configured to perform computations weighted by λ, wherein the reward function network is configured to learn a value for λ (Spoerre col. 10 lines 23-24 discloses Lambda (λ) determines the "memory" of the EWMA statistic; that is, λ determines the rate of decay of the weight).

Regarding Claim 5, Theodoridis in view of Ni, Graepel, and Spoerre discloses a neural network system as claimed in claim 4. But Theodoridis may not explicitly disclose wherein λ is a function of the time step.
However, Ni teaches wherein λ is a function of the time step (Ni page 2046 left hand section IV discloses the control action u, the internal goal signal s and the value function J are updated according to the corresponding error functions that change from one time step to another).
Theodoridis and Ni are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the fuzzy Sarsa (λ) learning approach (as taught by Theodoridis) to provide a general reward function that can be adaptively tuned according to the possible change of the environment (as taught by Ni, page 2039 right hand para 1) to approximate the discounted total future external reward with internal reward signal s by introducing the goal network (Ni, page 2040 left hand last para).

Regarding Claim 16, Theodoridis in view of Ni discloses a neural network system as claimed in claim 1 but may not explicitly disclose wherein the reward function comprises a weighted sum of reward predictions for n future action steps, where n is an integer equal to or greater than 1, and wherein the reward function network is configured to determine a parameter defining a rate of exponential decay of the weighted sum n away from a current time step.
However, Graepel teaches wherein the reward function comprises a weighted sum of reward predictions for n future action steps, where n is an integer equal to or greater than 1 (Graepel ¶0096 discloses for example, when combined, the leaf evaluation score can be a weighted sum of the value score and the rollout long-term reward).
Theodoridis, Ni and Graepel are analogous art as they pertain to action selection neural networks. Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Theodoridis in view of Ni in light of the teachings of Graepel to provide the weighted sum of reward (as taught by Graepel, ¶0096) to perform actions by an agent interacting with an environment that has a very large state space which can be effectively selected to maximize the rewards resulting from the performance of the action (Graepel, ¶0007).
Theodoridis in view of Ni and Graepel may not explicitly disclose wherein the reward function network is configured to determine a parameter defining a rate of exponential decay of the weighted sum n away from a current time step.
However, Spoerre teaches wherein the reward function network is configured to determine a parameter defining a rate of exponential decay of the weighted sum n away from a current time step (Spoerre col. 9 lines 34-48 discloses the EWMA [Exponentially Weighted Moving Average] constant can be as in Equation 14, where the $w_i$ are the weights and $w_i = \lambda (1 - \lambda)^{t-i}$. The sum of weights can be found [refer to the equation]. The constant λ determines the “memory” of EWMA statistics. That is, λ determines the rate of decay of weighted sum with decay parameter λ).
Therefore it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Theodoridis in view of Ni and Graepel in light of the teachings of Spoerre to provide the weighted sum with a decay parameter (as taught by Spoerre, col. 9 lines 34-48) to perform the action (Spoerre, col. 1 lines 43-47).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571)272-3957. The examiner can normally be reached 7:30 AM-4 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen, can be reached at 571-272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2651