DETAILED ACTION
This action is in response to the communications filed 10/12/2021 in which claims 1 and 14 were amended; claims 3, 8, 16, and 21 were cancelled; and claims 1-2, 4-7, 9-15, 17-20, and 22-25 are still pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 10/25/2015.  These drawings are acceptable.

Examiner remarks
MPEP 714 requires that all claims being currently amended be presented with markings to indicate the changes that have been made relative to the immediate prior version. The changes in any amended claim must be shown by strike-through (for deleted matter) or underlining (for added matter) with 2 exceptions: (1) for deletion of five or fewer consecutive characters, double brackets may be used (e.g., [[eroor]]); (2) if strike-through cannot be easily perceived (e.g., deletion of number "4" or certain punctuation marks), double brackets must be used (e.g., [[4]]). As an alternative to using double brackets, however, extra portions of text may be included before and after text being deleted, all in strike-through, followed by including and underlining the extra text with the desired change (e.g., number 4 as number 14 as). An accompanying clean version is not required and should not be presented. Only claims of the status "currently amended" or "withdrawn" will include markings. See MPEP 714(C). Please follow this practice for all responses.



Examiner notes that the cited NPL, titled “Large-scale synthesis of functional spiking neural circuits” and published May 5, 2014, lists the applicant as an author. This NPL is considered prior art that does not meet the exception under 35 U.S.C. 102(b)(1)(A) because the cited disclosure was published more than one year (approximately one year and three months) before the effective filing date of the claimed invention, August 26, 2015.
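For illustration only, the date comparison above can be sketched as follows. The function name `outside_grace_period` is hypothetical, and the sketch assumes the grace period is simply the one calendar year ending on the effective filing date; it is not a substitute for the statutory analysis.

```python
from datetime import date


def outside_grace_period(publication: date, effective_filing: date) -> bool:
    """Return True when the publication is more than one year before filing.

    Assumption: the grace period starts on the same month and day one
    calendar year before the effective filing date.
    """
    try:
        grace_start = effective_filing.replace(year=effective_filing.year - 1)
    except ValueError:
        # Filing date of Feb 29 has no same-day counterpart in the prior year.
        grace_start = date(effective_filing.year - 1, 2, 28)
    return publication < grace_start


publication = date(2014, 5, 5)        # NPL publication date noted above
effective_filing = date(2015, 8, 26)  # effective filing date noted above

print(outside_grace_period(publication, effective_filing))
print((effective_filing - publication).days)
```

Under these assumptions the publication date falls before the start of the one-year window, which is consistent with the remark above.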


Response to Arguments
Applicant’s arguments and amendments filed 10/12/2021 have been fully considered.

Applicant’s arguments with respect to the 35 USC § 112 rejections have been fully considered.
Applicant has argued that the disclosed exponential signal provides support for the amended claim limitations. Examiner respectfully disagrees. 
The standard for determining compliance with the written description requirement is noted in MPEP 2163.02, which states that, to satisfy the written description requirement, an applicant must convey with reasonable clarity to those skilled in the art that, as of the filing date sought, he or she was in possession of the invention, and that the invention, in that context, is whatever is now claimed. The test for sufficiency of support in a parent application is whether the disclosure of the application relied upon "reasonably conveys to the artisan that the inventor had possession at that time of the later claimed subject matter." Ralston Purina Co. v. Far-Mar-Co., Inc., 772 F.2d 1570, 1575, 227 USPQ 177, 179 (Fed. Cir. 1985) (quoting In re Kaslow, 707 F.2d 1366, 1375, 217 USPQ 1089, 1096 (Fed. Cir. 1983)).
The examiner provides below the findings of fact which support the lack of written description conclusion (see MPEP § 2163 for examination guidelines pertaining to the written description requirement):
 Identify the claim limitation(s) at issue; 

[Image: media_image1.png (equation at issue, with the noted portion marked by a dotted circle)]

The equation includes the noted portion in the dotted circle above and the use of a discount factor with a value greater than 0 and less than 1. The previous claim amendments recited equation 12, and the differences between the equations are not supported by the original disclosure.

Establish a prima facie case by providing reasons why a person skilled in the art at the time the application was filed would not have recognized that the inventor was in possession of the invention as claimed in view of the disclosure of the application as filed.
In the instant case: Applicant has not pointed out where the amended claim is supported, nor does there appear to be a written description of the claim limitation in the applicant’s originally filed disclosure.
Applicant specification discloses the computation of temporal difference using equation 12 that can be computed using continuous integrals however, the 
The specification discloses the use of a discount factor in equation 4 and as an exponentially decaying signal. The newly amended limitation recites a range of values for the discount factor that is not disclosed in the original specification, and there is no supporting evidence or teaching that the range provided in the claim amendments is a direct result or a derived output of the function disclosed in equation 4 and the exponentially decaying signal disclosed in paragraph 0073. One of ordinary skill in the art would not equate an exponentially decaying signal to a set of values that are greater than 0 and less than 1.

Per the analysis above, the amended claim limitation includes subject matter that adds to the disclosure of the application as filed, see MPEP 2163.02. Therefore, the amended limitation is directed to new matter. Thus, the rejection under 35 USC § 112(a) made in the previous office action has been maintained.

Regarding applicant’s remarks with respect to the rejection of claims under 35 U.S.C. § 103, the arguments have been fully considered. The applicant has argued that the cited prior art fails to render obvious the disclosed equation as recited by the amended claim limitation.



Claim Interpretation
Regarding claims 1, 3-5, and 7-10, the claim limitations are not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim construction recited in claim 1 discloses that the modules are computer executable instructions, and the applicant remarks submitted 07/08/2018 disclose that the modules were intended to be claimed as software modules that carry out the claimed functions. Furthermore, claim 1 recites that the executable instructions are implemented on a computer processor, which is considered the hardware associated with the software module elements. Claims 2, 4, 5-7, and 9-13 depend on claim 1 and are associated with the same hardware and claim interpretation.

Claim Rejections - 35 USC § 112 –Written Description and New Matter
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-2, 4-7, 9-15, 17-20, and 22-25 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Regarding claims 1 and 14, the claims recite the limitations below, which include matter not clearly supported by the original specification:

[Image: media_image2.png (claimed equation)]

[Image: media_image3.png]


Specifically, the applicant’s original disclosure does not discuss the claimed equation and how it is derived to account for the term [Image: media_image4.png] and the value of the discount factor with the range noted in the amended claim limitations. The published specification, US Pub. No. 20170061283 ¶0065, discusses the claimed TD error equation as previously claimed, where the discount factor can be simply subtracted rather than combined multiplicatively, as expressed in equation 12:

[Image: media_image5.png (equation 12)]

Here, there is no discussion of how the revised term is derived from the disclosed equation to produce the newly claimed error equation. Specifically, the specification does not make clear how the amended claim limitation is an obvious derivation and/or equivalent of equation 12 disclosed as part of the original specification. Rather, the specification discloses the derivation/computation of the disclosed [Image: media_image6.png], which converges to [Image: media_image7.png], as is obvious to one of ordinary skill in the art given the recitation in the original disclosure. The new equation is thus directed to new matter.

Additionally, the amended claims recite that the discount factor is taken within the specified range, which also does not appear to be supported by the original disclosure. The published specification, US Pub. No. 20170061283 ¶0063, discloses that the discount factor is an exponentially decaying signal that is multiplied by incoming rewards across the SMDP delay period and that also scales the value of the next action at the end of a time delay period. It appears that the recited equation discloses the discount factor as a value between 0 and 1, but it is not disclosed in the specification how this new information is derived from or related to the decaying signal described in the original specification. The recitation in the specification does not provide sufficient support for the newly claimed limitation in such a way as to meet the written description requirement, and thus the limitation encompasses new matter.
Regarding dependent claims 2, 4, 5-7, and 9-13, which depend on claim 1, and claims 15, 17-20, and 22-25, which depend on claim 14, these claims do not resolve the deficiencies noted in the claims above and are therefore appropriately rejected.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
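A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.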


Claims 1-2, 4-7, 9-15, 17-20, and 22-25 are rejected under 35 U.S.C. 103 as being unpatentable over Eliasmith et al. (US Patent Application Publication No. 2014/0156577, hereinafter ‘Eliasmith’), in view of Kawato et al. (NPL: “Efficient reinforcement learning: computational theories, neuroscience and robotics”, hereinafter ‘Ka’).

	
Regarding independent claim 1 limitations, Eliasmith teaches a system implementing reinforcement learning:
the system comprising a computer processor and a computer readable medium having computer executable instructions executed by said processor; said computer readable medium including instructions for providing: (Eliasmith teaches the use of a non-transitory computer-readable storage medium configured to execute computer programs, that is, instructions, in [0043]: Each program may be implemented in a high level procedural or object oriented programming or scripting language, or both, to communicate with a computer system… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. Embodiments of the system may also be considered to be implemented as a non-transitory computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein; the instructions are executed by the computer processor, in [0041]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. These embodiments may be implemented in computer programs executing on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.)
a neural network including a plurality of synapses; (Eliasmith teaches the model of the brain structure as a neural network including a plurality of connection synapses, in [0090]: ... FIG. 3B is a schematic block diagram of the Spaun system that contains elements analogous to those highlighted in FIG. 3A. Lines terminating in circles indicate connections with neurons that produce output simulating the effects of gamma-Aminobutyric acid (GABA) at their output, so-called GABAergic (inhibitory) connections or synapses. Lines terminating in open squares indicate modulatory activity emulating dopaminergic (adaptive) connections.)
an action values module that receives environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries out the task, that is, the action values module, of receiving input in the form of internal states, in [0152]: In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the input comprises an action value based on an error signal, receiving signals from the environment as observed actions, considered an environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal, in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; where the system tasks are carried out as executable computer program instructions, in [0041].)
an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module as computer program code coupled using a computer system for implementing programming instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output information… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
an error calculation module coupled to both the action values and action selection module, which computes an error signal used to update state and/or action values in the action values module based…; (Eliasmith teaches an error calculation module coupled to both the action values and action selection module, as computer programs coupled to a computing system, in [0041]-[0043], for executing operations used to generate an error signal of the observed result and the desired results associated with a task (action values) where the results are selected to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history., and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out a task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200., and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
 a learning sub-module wherein (Eliasmith teaches computer executable instruction programs, in [0041], as executable programs in a computer system, [0046].)
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one synaptic weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. ...; where the nonlinear component computes, that is, generates, semantic pointers that are vector representations associated with the connections (that include an output signal) between networks, in [0046]; where the neural connection synaptic weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the synaptic weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier such as in image processing tasks, that is, the function that facilitates the outputs based on the connection synaptic weights and input vectors, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the state and/or action values being updated are separated from the reward signal…; (Eliasmith teaches the output of one or more actions (updated action values) or decisions to carry out a task as a way of maximizing a real or perceived reward (separated reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the system uses reinforcement learning to associate rewards with task activity trials, in [0145]: Reinforcement learning-Perform a three-armed bandit task, in which it is determined which of three possible choices generates the greatest stochastically generated reward. Reward contingencies can change from trial to trial.)
the learning sub-module is configured to update connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046] configured to execute functions as updating the connection weights, that is learning based on initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082]: … The artificial neurons are formed into networks of neurons with interconnections with varying weights, which can be regulated to disinhibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neurochemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9)... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule; (Eliasmith teaches updating action values by training with a neural learning rule, namely a biologically plausible spike-based rule, where each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate updating the action values, including the neuron connection weights and error signals generated by user observation action for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011) … and in [0071]-[0072]: All such transformations are updateable by error signals, which may come from the action selection component, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; such as updating action values as draw actions based on the reinforcement learning rule and learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task as depicted in Fig. 6A, in [0027]; where the image inputs are captured as discrete data pixels captured over a time associated with the observation input as depicted in Fig. 13, in [0172].)
the input to the system is one of a scalar and a multidimensional vector. (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048] & [0098].)
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], and the use of reinforcement learning techniques to associate rewards with task activity trials, in [0145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following claim 1 limitations:
computes an error signal used to update state and/or action values … based on the equation
[Image: media_image2.png (claimed equation)]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of time steps after a and s, τ is the number of timesteps separating a/s from a'/s' and is greater than 1, r_t is the reward signal at timestep t, and γ is a discount factor greater than zero and less than 1;
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions;
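For illustration only, the symbols defined in the limitation above can be exercised with a short sketch. The claimed equation itself appears only as an image and is not reproduced here, so the form used below, a conventional multi-step discounted temporal difference error, δ = Σ_{t=0}^{τ-1} γ^t r_t + γ^τ Q(s',a') − Q(s,a), is an assumption and should not be read as the claimed equation. The function name `multistep_td_error` and the numeric values are hypothetical.

```python
def multistep_td_error(q_sa, q_next, rewards, gamma):
    """Assumed multi-step TD error (not the claimed equation itself).

    q_sa    -- Q(s, a), value of taking action a in state s
    q_next  -- Q(s', a'), value tau timesteps later
    rewards -- sequence r_0 .. r_{tau-1}, one reward per timestep
    gamma   -- discount factor, required to lie strictly between 0 and 1
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be greater than 0 and less than 1")
    tau = len(rewards)
    if tau <= 1:
        raise ValueError("tau must be greater than 1")
    # Discounted sum of the rewards collected over the tau timesteps.
    discounted_rewards = sum(gamma ** t * r for t, r in enumerate(rewards))
    # Bootstrap from the later value, then subtract the current estimate.
    return discounted_rewards + gamma ** tau * q_next - q_sa


# Hypothetical values: tau = 2, gamma = 0.5.
print(multistep_td_error(q_sa=1.0, q_next=2.0, rewards=[0.0, 1.0], gamma=0.5))
```

With these hypothetical values the discounted rewards (0.5) plus the discounted later value (0.5) equal the current estimate (1.0), so the sketched error is zero.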
Ka teaches the following claim 1 limitations:
computes an error signal used to update state and/or action values … based on the equation
[Image: media_image2.png (claimed equation)]

(Ka teaches computing the temporal difference error in reinforcement learning controller systems. Mapping Ka to applicant’s equation: τ = 2, so the summation covers two timesteps (i.e., t = 2) of observation; γ is a discount factor equal to 0.5; r_t is r(t) = r(1) + r(2) = r(t) given r(1) = 0; -2*Q(s,a) = 0.5*V(t+1); and Q(s’,a’) = -V(t), with the discount factor γ considered an integrated discount factor equal to 0.5, as depicted in the equation under Fig. 2(a) as an error calculation, δ(t), corresponding to the claimed temporal difference equation, that is, based on the equation recited in applicant’s claim, in pg. 207, Left Col.: “…The most important role of temporal-difference error is in solving the temporal credit assignment problem in reinforcement learning theory. Houk, Adams and Barto [5] proposed an explicit neural circuit model for computing the temporal-difference error (Figure 2a) …”

[Image: media_image8.png]

[Image: media_image9.png]

)
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions; (Ka teaches the use of a temporal difference reinforcement learning method where the state/action values, depicted as the x and u values respectively, are separated by a reward signal via the reward module, as depicted in Fig. 1:

[Image: media_image10.png])


The Eliasmith and Ka references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
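It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning process disclosed by Eliasmith by integrating an intermediate action associated with a state and reward signal to learn system learning parameters and values using a temporal difference (TD) reinforcement learning method, as disclosed by Ka.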
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes temporal difference error based on reinforcement learning models representing connections in the basal ganglia, which helps resolve the three theoretical difficulties of reinforcement learning, namely slowness, computation of temporal-difference error, and global neural networks (Ka, Fig. 2 and Sec. Hierarchical reinforcement learning model). Doing so would help capture temporal differences for constructing useful learning models in a reinforcement learning task environment that help resolve theoretical issues for modeling behavioral learning that depends on reward and penalty (Ka, Abstract).
The examiner notes that the equation has variables that have not been appropriately described in the claim limitation; therefore, the variables can be mapped to a broader scope as highlighted. It is improper to limit the scope to a preferred embodiment not required by the claim limitation; see MPEP 2111.
Examiner notes that all claimed modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Regarding claim 2, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein multiple instances of the system are composed into a hierarchical structure, (Eliasmith teaches the artificial intelligence system comprising multiple instances composed into a hierarchy configuration, as the hierarchical structure, in [0007]: In a first broad aspect, some embodiments provide an artificial intelligence system comprising: at least one interface hierarchy configured to receive an input of a high-dimensional representation and to compress the high-dimensional representation to generate a lower-dimensional representation of the input; at least one processing module configured to receive the lower-dimensional representation and to generate a further representation; an action selection controller configured to control communication of the lower-dimensional representation and the further representation between the at least one interface hierarchy and the at least one processing module; and as processing layers of consecutive neural network layers that are part of the multiple instances composed into a hierarchical structure, in [0096]-[0099]: Referring now to FIG. 5, there is illustrated a simplified schematic diagram of the compression hierarchy of the visual input hierarchy module 302 in one embodiment. [0097] The visual compression hierarchy has a 28x28 dimensional input layer (e.g., for receiving a 784-pixel input image) and consecutive hidden layers of 1000, 500, 300 and 50 nodes. [0098] The initial, 1000-node hierarchical layer, which generates a 1000-dimensional semantic pointer, can be considered analogous to the primary visual cortex (V1). A second, 500-node layer can be considered analogous to the secondary visual cortex (V2). A third, 300-node layer can be considered analogous to the extrastriate visual cortex (V4). Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), …)
wherein the output of one instance performs one or more of 
adding new state input to the input of the downstream instance;
modifying state in the downstream instance; and
modifies the reward signal of the downstream instance. (Eliasmith teaches multiple instances composed into a hierarchical structure in which the output of one instance is a modified, lower-dimensional state space of the downstream instance, as depicted in Fig. 5, in [0099]; modifying the reward at the input of the downstream instance, in [0128]; and adding state input by determining which states should be switched in accordance with the current task goal, in [0056].)
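
As an illustration only of the layer dimensions quoted from Eliasmith [0097]-[0099] (a 784-element input followed by layers of 1000, 500, 300 and 50 nodes), the following minimal Python sketch passes a vector through untrained, randomly weighted layers. The tanh nonlinearity, the random weights, and all names are assumptions made for this sketch; it is not Eliasmith's spiking-neuron implementation and only shows how the final level yields a 50-dimensional vector.

```python
import math
import random

# Layer sizes taken from the quoted passage: 784-pixel input, then
# consecutive layers of 1000, 500, 300 and 50 nodes.
LAYER_SIZES = [784, 1000, 500, 300, 50]


def make_layer(n_in, n_out, rng):
    """Return an n_out x n_in matrix of small random weights (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


def forward(x, layers):
    """Propagate x through every layer and record each vector's length."""
    dims = [len(x)]
    for weights in layers:
        # Weighted sum of the inputs followed by an assumed tanh nonlinearity.
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
             for row in weights]
        dims.append(len(x))
    return x, dims


rng = random.Random(0)
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
image = [rng.random() for _ in range(LAYER_SIZES[0])]  # stand-in for 28x28 pixels
pointer, dims = forward(image, layers)
print(dims)  # lengths of the vector at each level of the hierarchy
```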

Regarding claim 4, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the module representing state/action values consists of two interconnected sub-modules, each of which receives state information with or without time delay as input, and the output of one sub-module is used to train the other in order to allow state and/or action value updates to be transferred over time. (Eliasmith teaches the modules for representing the activity state using artificial neurons that receive state information input and are connected by weights to approximate a function, in [0079]; where the inputs to the nodes are images over time, as depicted in Fig. 4B and Fig. 5, used to train the neurons that represent each hierarchical compression state space, which allows state updates and can be processed to learn and train, in [0099].)

Regarding claim 5, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein there are initial couplings within and between different modules of the system, where each weighted coupling has a corresponding connection weight such that the output generated by each nonlinear component is weighted by the corresponding connection weights to generate a weighted output. (Eliasmith teaches the modules as executable programs in a computer system that comprise coupled artificial neurons, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the weighted output responses to the inputs, in [0081], as modeled by the neurons, in [0079], to learn optimized values from initial couplings, in [0081].)

Regarding claim 6, the rejection of claim 5 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 5:
wherein a neural compiler is used to determine the initial couplings and connection weights. (Eliasmith teaches the use of a neural simulator to model neuron models where the couplings and connection weights can be modeled and learned computationally, that is, from a determined initial set of parameters; where the neural models may be simulated using a suitable neural simulator, such as the Nengo neural simulator (<http://www.nengo.ca/>), which comprises a neural compiler for compiling scripted software for simulating neural system models, in [0080]-[0082].)

Regarding claim 7, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein at least one of the nonlinear components in an adaptive sub module that generates a multidimensional output is coupled to the action selection and/or error (Eliasmith teaches the sub module associated with a hierarchical level as an adaptive sub module that generates multidimensional output as a neural network of N nodes associated with N dimensions, in [0098]; where each neural network comprises neuron nodes weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, that is, the function that facilitates the outputs based on the connection weights and input vector, in [0081]; where the transformations at each hierarchical level are facilitated by the coupled error calculation modules that calculate the error signals, in [0071]-[0072].)

Regarding claim 9, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the learning sub-module is configured to update the connection weights based on an outer product of the initial output and the outputs from the nonlinear. (Eliasmith teaches adjusting the connection weights, that is, updating them from an initial value, using the output generated to compute the error signal that accounts for the difference between the observed output result and the desired output generated by the nonlinear component neurons of each level, in [0070]-[0072]; where the output is computed using an outer product by the information encoding that is implemented as a neural network level in the system, in [0123]-[0126].)
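
To make the recited outer-product update concrete, the following is a minimal Python sketch assuming a generic rule in which each connection weight changes in proportion to the product of one element of a signal vector (for example, an error or initial output) and one neuron's activity. The function name, the learning rate, and the example values are hypothetical; the sketch is not taken from Eliasmith or from applicant's specification.

```python
def outer_product_update(weights, signal, activities, learning_rate=0.1):
    """Return new weights W + learning_rate * outer(signal, activities).

    weights    -- matrix with one row per output dimension
    signal     -- vector with one entry per output dimension (assumed to be
                  an error or initial-output vector)
    activities -- outputs of the nonlinear components (one per neuron)
    """
    if len(weights) != len(signal):
        raise ValueError("one weight row is required per signal element")
    return [
        [w_ij + learning_rate * s_i * a_j for w_ij, a_j in zip(row, activities)]
        for row, s_i in zip(weights, signal)
    ]


# Two output dimensions, two neurons; all numbers are illustrative only.
old_weights = [[0.0, 0.0], [0.0, 0.0]]
new_weights = outer_product_update(old_weights, signal=[1.0, -1.0],
                                   activities=[0.5, 2.0])
```

Each entry of the weight change is the product of one signal element and one activity, which is the defining property of an outer product.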

Regarding claim 10, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein each nonlinear component has a tuning curve that determines the output generated by the nonlinear component in response to any input and the tuning curve for each nonlinear component may be generated randomly. (Eliasmith teaches that each component of the 50D representation, that is, the IT row, has associated tuning curves used to indicate the output category of any input image in a two-dimensional representation space, depicted in Fig. 6D, that is, a tuning curve where the mean value of the pointers is indicated with a large dark circle, in [0104]; where the activity for respective structure labels associated with the 50D layer row IT of spiking neurons is generated by randomly selecting neurons from the population, that is, the generating of curves randomly, in [0166], based on the neuron response curve of the raster plot for capturing the spiking activity associated with a labeled structure, in [0166]; see [0055], incorporating by reference “Eliasmith-2” (“A Unified Approach to Building and Controlling Spiking Attractor Networks”), which teaches the use of a tuning process that implements a tuning curve for each neuron component, in pg. 1278: Sec. 2.1.)

Regarding claim 11, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the nonlinear components are simulated neurons. (Eliasmith teaches neuron model components may be simulated, in [0080].)

Regarding claim 12, the rejection of claim 11 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 11:
wherein the neurons are spiking neurons. (Eliasmith teaches the neurons are spiking neurons in [0046].)

Regarding claim 13, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the components are implemented in hardware specialized for simulating the nonlinear components. (Eliasmith teaches implementing the system in specialized hardware, in [0041], to simulate artificial neuron components, [0079]-[0080].)
	
Regarding independent claim 14 limitations, Eliasmith teaches a computer implemented method for reinforcement learning comprising:
receiving by an action values module stored on a computer readable medium environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries out the task, that is, the action value module, of receiving input in the form of internal states, in [0152]: In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the system tasks are carried out as executable computer program instructions, in [0041]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. These embodiments may be implemented in computer programs executing on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.)
providing on the computer readable medium an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module, depicted as the visual input having internal state levels defined by the information encoding process as depicted in Fig. 3B, in [0091]; where the process is provided using a non-transitory computer-readable storage medium configured to execute computer programs, that is, instructions for implementing programming instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output information… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
computing an error signal to update state and/or action values in the action values module by a calculation module coupled to both the action values and action selection module; the update being based on a rewards signal …; (Eliasmith teaches generating an error signal of the observed result and the desired results associated with a task (action values) where the results are selected (computing error to update action values) to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. 
In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200; and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
wherein
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. The connections between these networks can be used to compute "semantic pointers", which model compressed representations of the activity of neural networks. Semantic pointers are vector representations that can be thought of as elements of a neural vector space, and can implement a form of abstraction level filtering or "compression", in which high-dimensional structures can be abstracted…; where the nonlinear component computes, that is, generates, semantic pointers that are vector representations associated with the connections (that includes an output signal) between networks, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier such as in image processing tasks, that is, the function that facilitates the outputs based on the connection synaptic weights and input vector, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task, as depicted in Fig. 6A, in [0027]: FIG. 6A is an array of original input images for a copy drawing task; where the image inputs are captured as discrete data pixels over a time associated with the observation input, as depicted in Fig. 13, in [00172]: FIG. 13 illustrates the performance of system 300 in a fluid reasoning task. The spiking activity encoding the currently inferred rule is shown in the VMPFC row. This is a running average of the inverse convolution (i.e., the inferred transformation) between representations in DLPFC1 and DLPFC2, as appropriate. The time course of the systems' activity can be observed, in which the system infers that the pattern in the input is "increase the number of elements by one" (see DLPFC2 row, for example); see Fig. 6A and Fig. 13.)
the input to the system is one of a scalar and a multidimensional vector (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048]: For example, an image of the numeral "2" to be processed may be input as a 28x28 matrix of pixels. This image is at first represented as a 784-dimensional vector... For example, 50 dimensions may be used to represent underlying conceptual features of the image… & [0098]-[0099] … Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach…)
updating connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, in [0046], configured to execute functions such as updating the connection weights, that is, learning based on the initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082]: … The artificial neurons are formed into networks of neurons with interconnections with varying weights, which can be regulated to disinhibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neurochemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9)... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule. (Eliasmith teaches updating action values by training with a neural learning rule, namely a biologically plausible spike-based rule, where each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate updating the action values, including the neuron connection weights and error signals generated by observation of actions for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011) … and in [0071]-[0072]: All such transformations are updateable by error signals, which may come from the action selection component, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; such as updating action values as draw actions based on the reinforcement learning rule and the learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
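
The three-armed bandit task quoted from Eliasmith [0169] can be illustrated with the following minimal Python sketch, in which each action's value estimate is updated from a reward prediction error. The reward probabilities, the epsilon-greedy choice rule, the sample-average step size, and all names are assumptions chosen for illustration; this is not the spiking model of Eliasmith.

```python
import random


def run_bandit(reward_probs, trials=2000, epsilon=0.1, seed=0):
    """Learn one value estimate per arm from reward prediction errors."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    values = [0.0] * n_arms  # estimated value of each action
    counts = [0] * n_arms    # number of times each action was chosen
    for _ in range(trials):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = int(rng.random() * n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda i: values[i])
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        error = reward - values[arm]        # reward prediction error
        values[arm] += error / counts[arm]  # sample-average update
    return values, counts


# Hypothetical reward probabilities; the middle arm is the best choice.
values, counts = run_bandit([0.12, 0.72, 0.12])
best_arm = max(range(len(values)), key=lambda i: values[i])
```

The prediction error for the best arm shrinks as its estimate approaches the true reward rate, which mirrors the behavior described in the quoted passage.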
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], and the use of reinforcement learning techniques to associate rewards with task activity trials, in [00145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following claim 14 limitation:
the update being based on the equation
[Equation image: media_image2.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and γ is a discount factor greater than zero and less than 1;
Ka teaches the following claim 14 limitation:
the update being based on the equation
[Equation image: media_image2.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and γ is a discount factor greater than zero and less than 1; (Ka teaches computing the temporal-difference error in reinforcement learning controller systems in a form that maps onto applicant’s equation when τ = 2, that is, when the summation covers two timesteps (t = 2) of observation; γ is a discount factor equal to .5; rt is r(t) = r(1) + r(2) given r(1) = 0; -2*Q(s,a) = .5*V(t+1); and Q(s’,a’) = -V(t), where the discount factor γ is considered an integrated discount factor of .5. This mapping is depicted in the equation under Fig. 2(a) as an error calculation, δ(t), which corresponds to the claimed temporal-difference equation and is based on the equation recited in applicant’s claim, in pg. 207, Left Col.: “…The most important role of temporal-difference error is in solving the temporal credit assignment problem in reinforcement learning theory. Houk, Adams and Barto [5] proposed an explicit neural circuit model for computing the temporal-difference error (Figure 2a) …”

[Image: media_image8.png]

[Image: media_image9.png]

)
The Eliasmith and Ka references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning process disclosed by Eliasmith by integrating an intermediate action associated with a state and reward signal to learn system learning parameters and values using a temporal difference (TD) reinforcement learning method, as disclosed by Ka.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes temporal difference error based on reinforcement learning models representing connections in the basal ganglia, which helps resolve the three theoretical difficulties of reinforcement learning, namely slowness, computation of temporal-difference error, and global neural networks (Ka, Fig. 2 and Sec. Hierarchical reinforcement learning model). Doing so would help capture temporal differences for constructing useful learning models in a reinforcement learning task environment that help resolve theoretical issues for modeling behavioral learning that depends on reward and penalty (Ka, Abstract).
Examiner notes that all modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Regarding claim 15, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka further teaches the method of claim 14:
further comprising repeating the method in a hierarchical manner (Eliasmith teaches the artificial intelligence system comprising multiple instances composed into a hierarchy configuration, as the hierarchical manner for repeating a method using an artificial neural network learning process, in [0007]: In a first broad aspect, some embodiments provide an artificial intelligence system comprising: at least one interface hierarchy configured to receive an input of a high-dimensional representation and to compress the high-dimensional representation to generate a lower-dimensional representation of the input; at least one processing module configured to receive the lower-dimensional representation and to generate a further representation; an action selection controller configured to control communication of the lower-dimensional representation and the further representation between the at least one interface hierarchy and the at least one processing module. And as processing layers of consecutive neural network layers that are part of the multiple instances composed into a hierarchical structure, in [0096]-[0099]: Referring now to FIG. 5, there is illustrated a simplified schematic diagram of the compression hierarchy of the visual input hierarchy module 302 in one embodiment. [0097] The visual compression hierarchy has a 28x28 dimensional input layer (e.g., for receiving a 784-pixel input image) and consecutive hidden layers of 1000, 500, 300 and 50 nodes. [0098] The initial, 1000-node hierarchical layer, which generates a 1000-dimensional semantic pointer, can be considered analogous to the primary visual cortex (V1). A second, 500-node layer can be considered analogous to the secondary visual cortex (V2). A third, 300-node layer can be considered analogous to the extrastriate visual cortex (V4). Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT).
Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), …)
such that the output of one instance of the method performs one or more of
adding new state input to the input of the downstream instance;
modifying state in the downstream instance; and
modifies the reward signal of the downstream instance. (Eliasmith teaches multiple instances composed into a hierarchical structure in which the output of one instance is a modified, lower-dimensional state space for the downstream instance, as depicted in Fig. 5, in [0099]; modifying the reward at the input of the downstream instance, in [0128]; and adding state input by determining which states should be switched in accordance with the current task goal, in [0056].)
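As an illustrative aid, the layer sizes quoted from Eliasmith [0097]-[0099] (784 → 1000 → 500 → 300 → 50) can be sketched as a chain of progressively lower-dimensional representations, where each layer's output is the input state of the next. The random untrained weights, the tanh nonlinearity, and the names `build_hierarchy` and `compress` are assumptions made for this sketch; they are not Eliasmith's implementation.

```python
import numpy as np

# Layer sizes as quoted from Eliasmith [0097]: 28x28 input, then 1000, 500, 300, 50 nodes.
LAYER_SIZES = [784, 1000, 500, 300, 50]


def build_hierarchy(rng: np.random.Generator) -> list:
    """Create one random (untrained, hypothetical) weight matrix per pair of layers."""
    return [
        rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def compress(image: np.ndarray, weights: list) -> list:
    """Pass a 28x28 image down the hierarchy; return each layer's representation."""
    representations = []
    x = image.reshape(-1)  # flatten to a 784-dimensional vector
    for w in weights:
        x = np.tanh(w @ x)  # output of one level becomes the input of the next
        representations.append(x)
    return representations


rng = np.random.default_rng(0)
weights = build_hierarchy(rng)
reps = compress(rng.random((28, 28)), weights)
print([r.shape[0] for r in reps])  # [1000, 500, 300, 50]
```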


Regarding claim 17, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein the module representing state/action values consists of two interconnected sub-modules, each of which receives state information with or without time delay as input, and the output of one sub-module is used to train the other in order to allow state and/or action value updates to be transferred over time. (Eliasmith teaches the modules for representing the activity state using artificial neurons that receive state information input and are connected by weights to approximate a function, in [0079]; where the inputs to the nodes are images over time, as depicted in Fig. 4B and Fig. 5, used to train the neurons that represent each hierarchical compression state space and allow state updates, and can be processed to learn and train, in [0099].)

Regarding claim 18, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein there are initial couplings within and between different modules, where each weighted coupling has a corresponding connection weight such that the output (Eliasmith teaches the modules as executable programs in a computer system that comprise coupled artificial neurons, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the weighted outputs responses to the inputs, in [0081] as modeled by the neurons, in [0079] to learn optimized values  from initial couplings, in [0081].)

Regarding claim 19, the rejection of claim 18 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 18:
further comprising determining by a neural compiler the initial couplings and connection weights. (Eliasmith teaches the use of a neural simulator to simulate neuron models in which the couplings and connection weights can be modeled and learned computationally, that is, from a determined initial set of parameters; where the neural models may be simulated using a suitable neural simulator, such as the Nengo neural simulator (<http://www.nengo.ca/>), which comprises a neural compiler for compiling a scripting-based software package for simulating neural system models, in [0080]-[0082].)

Regarding claim 20, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein at least one of the nonlinear components in an adaptive submodule that generates a multidimensional output is coupled to the action selection and/or error calculation modules by a plurality of weighted couplings, one weighted coupling for each dimension of the multidimensional output modifier. (Eliasmith teaches the sub-module associated with a hierarchical level as an adaptive sub-module that generates multidimensional output as a neuron network of N nodes associated with N dimensions, in [0098]; where each neuron network comprises neuron nodes weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, that is, the function that facilitates the outputs based on the connection weights and input vector, in [0081]; where the transformations at each hierarchical level are facilitated by the coupled error calculation modules that calculate the error signals, in [0071]-[0072].)


Regarding claim 22, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
further comprising updating by the learning sub­module the connection weights based on an outer product of the initial output and the outputs from the nonlinear components. (Eliasmith teaches the adjusting of the connection weights, that is updates based on an initial value, and the output generated to compute the error signal that accounts for the difference in the observed output result generated and the desired output generated by the nonlinear components neurons of each level, in [0070]-[0072]; where the output is computed using an outer product by the information encoding that is implemented as a neural network level in the system, in [0123]-[0126].)
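For illustration only, a weight update built from an outer product can be sketched as follows. This is a generic delta-rule-style update in which an error vector (actual output minus target) is combined with the activities of the nonlinear components; the function name `outer_product_update`, the learning rate, and the sample values are assumptions, and the sketch is not asserted to be the rule used by Eliasmith or recited in the claim.

```python
import numpy as np


def outer_product_update(weights: np.ndarray, error: np.ndarray,
                         activities: np.ndarray, learning_rate: float = 0.1) -> np.ndarray:
    """Return updated weights: W - lr * outer(error, activities).

    weights:    (n_dims, n_neurons) connection weights
    error:      (n_dims,) actual output minus target output
    activities: (n_neurons,) outputs of the nonlinear components
    """
    return weights - learning_rate * np.outer(error, activities)


# Hypothetical example: 3 components driving a 2-dimensional output.
weights = np.zeros((2, 3))
activities = np.array([1.0, 0.0, 2.0])
target = np.array([1.0, -1.0])

error = weights @ activities - target        # [-1.0, 1.0]
new_weights = outer_product_update(weights, error, activities)
new_error = new_weights @ activities - target

print(np.abs(new_error).sum() < np.abs(error).sum())  # True
```

Each weight changes in proportion to the product of one error dimension and one component's activity, so inactive components (activity 0) leave their weights unchanged.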

Regarding claim 23, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein each nonlinear component has a tuning curve that determines the output generated by the nonlinear component in response to any input and the tuning curve for each nonlinear component may be generated randomly. (Eliasmith teaches that each component of the 50D representation, that is, the IT row, has associated tuning curves used to indicate the output category of any input image in a two-dimensional representation space, depicted in Fig. 6D, that is, a tuning curve where the mean value of the pointers is indicated with a large dark circle, in [0104]; where the activity for respective structure labels associated with the 50D layer row IT of spiking neurons is generated by randomly selecting neurons from the population, that is, the generating of curves randomly, based on the neuron response curve of the raster plot for capturing the spiking activity associated with a labeled structure, in [0166]; see [0055], incorporating by reference Eliasmith-2 (“A Unified Approach to Building and Controlling Spiking Attractor Networks”), which teaches the use of a tuning process that implements a tuning curve for each neuron component, in pg. 1278: Sec. 2.1.)
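As a hedged illustration of what randomly generated tuning curves can look like, the sketch below gives each simulated component a rectified-linear response with a randomly drawn gain, bias, and preferred direction. The rectified-linear form, the parameter ranges, and the name `random_tuning_curves` are assumptions chosen for brevity; they are not taken from Eliasmith or from the application.

```python
import numpy as np


def random_tuning_curves(n_neurons: int, x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return an (n_neurons, len(x)) array of responses to a scalar input x.

    Each component's curve is max(0, gain * encoder * x + bias) with randomly
    drawn parameters (hypothetical ranges).
    """
    encoders = rng.choice([-1.0, 1.0], size=n_neurons)   # preferred direction
    gains = rng.uniform(0.5, 2.0, size=n_neurons)
    biases = rng.uniform(-1.0, 1.0, size=n_neurons)
    drive = gains[:, None] * encoders[:, None] * x[None, :] + biases[:, None]
    return np.maximum(drive, 0.0)                        # rectification nonlinearity


rng = np.random.default_rng(42)
x = np.linspace(-1.0, 1.0, 101)
curves = random_tuning_curves(20, x, rng)
print(curves.shape)  # (20, 101)
```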

Regarding claim 24, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein the nonlinear components are simulated neurons. (Eliasmith teaches neuron model components may be simulated, in [0080].)

Regarding claim 25, the rejection of claim 24 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 24:
wherein the neurons are spiking neurons. (Eliasmith teaches the neurons are spiking neurons in [0046].)

Alternatively, claims 1 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Eliasmith et al. (US Patent Application Publication No. 2014/0156577, hereinafter ‘Eliasmith’), in view of Rasmussen et al. (NPL: “A neural model of hierarchical reinforcement learning”, hereinafter ‘RasE’).

	
Regarding independent claim 1 limitations, Eliasmith teaches a system implementing reinforcement learning:
the system comprising a computer processor and a computer readable medium having computer executable instructions executed by said processor; said computer readable medium including instructions for providing: (Eliasmith teaches the use of a non-transitory computer-readable storage medium configured to execute computer programs that is instructions, in [0043]: Each program may be implemented in a high level procedural or object oriented programming or scripting lan­guage, or both, to communicate with a computer system… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. Embodiments of the system may also be considered to be implemented as a non-transitory computer-readable stor­age medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein, that are executed by the computer processor, in [0041]: The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. These embodiments may be implemented in computer programs executing on program­mable computers, each computer including at least one pro­cessor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication inter­face.)
a neural network including a plurality of synapses; (Eliasmith teaches the model of the brain structure as a neural network including a plurality of connection synapse, in [0090]: ... FIG. 3B is a schematic block diagram of the Spaun system that contains elements analogous to those highlighted in FIG. 3A. Lines terminating in circles indicate connections with neurons that produce output simu­lating the effects of gamma-Aminobutyric acid (GABA) at their output-so-called GABAergic (inhibitory) connections or synapses. Lines terminating in open squares indicate modulatory activity emulating dopaminergic (adaptive) con­nections.)
an action values module that receives environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries the task, that is the action value module, of receiving input in the form of internal states, in [0152]: In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the input comprises an action value based on an error signal to receive an signals from the environment as observed actions, considered an environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal, in [0072]:  Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; where the system tasks are carried out as executable computer program instructions, in [0041].)
an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module  as computer program code coupled using a computer system for implement programing instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output informa­tion… Each such computer program may be stored on a storage media or a device ( e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
an error calculation module coupled to both the action values and action selection module, which computes an error signal used to update state and/or action values in the action values module based…; (Eliasmith teaches an error calculation module coupled to both the action values and action selection module, as computer programs coupled to a computing system, in [0041]-[0043], for executing operations used to generate an error signal of the observed result and the desired results associated with a task (action values) where the results are selected to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history., and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out a task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step.
For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200., and the system tasks are carried out as executable computer program instructions, in [0041], comprising a multiple computer programs coupled to communicate with a computer system, in [0043].)
 a learning sub-module wherein (Eliasmith teaches computer executable instruction programs, in [0041], as executable programs in a computer system, [0046].)
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one synaptic weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. ...; where the nonlinear component computes, that is, generates, semantic pointers that are vector representations associated with the connections (that include an output signal) between networks, in [0046]; where the neural connection synaptic weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the synaptic weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the outputs responses to form a output modifier such as in image processing tasks, that is the function that facilities the outputs based on the connection synaptic weights and inputs vector, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal. pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analo­gous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a SO-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a SO-di­mensional (SOD) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connec­tions between these layers define the vector space transfor­mations that can be implemented in connection weights between the layers. 
In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during opera­tion…)
the state and/or action values being updated are separated from the reward signal…; (Eliasmith teaches the output of one or more actions (updated action values) or decisions to carry out a task as a way of maximizing a real or perceived reward (separated reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the system uses reinforcement learning to associate rewards with task activity trials, in [0145]: Reinforcement learning: Perform a three-armed bandit task, in which it is determined which of three possible choices generates the greatest stochastically generated reward. Reward contingencies can change from trial to trial.)
the learning sub-module is configured to update connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046] configured to execute functions as updating the connection weights, that is learning based on initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082] : … The artificial neu­rons are formed into networks of neurons with interconnec­tions with varying weights, which can be regulated to disin­hibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neuro­chemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9)... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule; (Eliasmith teaches updating action values by training and learning rule based on a plausible spike-based rule neural learning rule and the use of neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate update the action values including the neuron connection weights and error signals generated by user observation action for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal. pone.0022885 (2011) … and in [0071]-[0072]:  All such transformations are updateable by error signals, which may come from the action selection compo­nent, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. 
One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; such as updating action value as draw actions based on the reinforcement learning rule and learned model with the updated weights based on the error signal, in [0169]: FIG. llA illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the begin­ning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task, as depicted in Fig. 6A, in [0027]; where the image inputs are captured as discrete pixel data over the time associated with the observation input, as depicted in Fig. 13, in [0172].)
the input to the system is one of a scalar and a multidimensional vector. (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048] & [0098].)
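For illustration only, the "bandit" task and reward prediction error described in the quoted passages of Eliasmith ([0072], [0169]) can be sketched with a simple tabular learner. The epsilon-greedy selection, the learning rate, the reward probabilities, and the names `update_value` and `run_bandit` are all assumptions introduced for this sketch; Eliasmith's system uses spiking neural populations rather than a lookup table.

```python
import numpy as np


def update_value(values: np.ndarray, action: int, reward: float, learning_rate: float) -> float:
    """Move the chosen action's value toward the reward; return the prediction error."""
    error = reward - values[action]          # reward prediction error
    values[action] += learning_rate * error
    return error


def run_bandit(reward_probs, n_trials: int, rng: np.random.Generator,
               learning_rate: float = 0.05, epsilon: float = 0.1) -> np.ndarray:
    """Three-armed bandit with epsilon-greedy action selection (hypothetical settings)."""
    values = np.zeros(len(reward_probs))
    for _ in range(n_trials):
        if rng.random() < epsilon:
            action = int(rng.integers(len(reward_probs)))   # explore
        else:
            action = int(np.argmax(values))                 # exploit highest value
        reward = float(rng.random() < reward_probs[action])  # stochastic 0/1 reward
        update_value(values, action, reward, learning_rate)
    return values


values = run_bandit([0.1, 0.9, 0.1], n_trials=5000, rng=np.random.default_rng(0))
print(int(np.argmax(values)))
```

The error here depends only on the immediate reward, which is the distinction the rejection draws against an error computed across multiple timesteps.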
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], and the use of reinforcement learning techniques to associate rewards with task activity trials, in [0145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following claim 1 limitations:

computes an error signal used to update state and/or action values … based on  the equation
[Image: media_image11.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor greater than zero and less than 1;
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions;
RasE does teach claim 1 limitation:
computes an error signal used to update state and/or action values … based on the equation
[Image: media_image2.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor greater than zero and less than 1; (RasE teaches computing the claimed temporal difference as the temporal difference error in equations (2) and (4), where discount factor = .5 and tau = 2, thus r(s,a) = claimed 
[Image: media_image12.png] = r(t=1) + r(t=2) = 0 + r(s,a) = [Image: media_image13.png] = claimed [Image: media_image14.png], and [Image: media_image15.png] = claimed Q(s’, a’), [Image: media_image16.png] = claimed [Image: media_image17.png], in pg. 1253: Sec. Hierarchical reinforcement learning:

[Image: media_image18.png]

)

The Eliasmith and RasE references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith with the use of time delays in computing the temporal difference (TD) error in reinforcement learning, as disclosed by RasE.
One of ordinary skill in the art would have been motivated to combine the disclosed methods in order to improve reinforcement learning by constructing a reinforcement learning model that computes the temporal difference error based on more sophisticated models of human decision making that use time delays to encapsulate the activity of the subpolicy (RasE, pg. 1253: Sec. Hierarchical reinforcement learning). Doing so will provide a general 

Examiner notes that all claimed modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Regarding independent claim 14 limitations, Eliasmith teaches a computer implemented method for reinforcement learning comprising:
receiving by an action values module stored on a computer readable medium environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries the task, that is the action value module, of receiving input in the form of internal states, in [0152] In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the system tasks are carried out as executable computer program instructions, in [0041] The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. These embodiments may be implemented in computer programs executing on program­mable computers, each computer including at least one pro­cessor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication inter­face.)
providing on the computer readable medium an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module depicted as the visual input having internal state levels defined by the information encoding process as depicted in Fig. 3B, in [0091]; where the process is provided using a non-transitory computer-readable storage medium configured to execute computer programs, that is instructions for implement programing instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output informa­tion… Each such computer program may be stored on a storage media or a device ( e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
computing an error signal to update state and/or action values in the action values module by a calculation module coupled to both the action values and action selection module; the update being based on a rewards signal …; (Eliasmith teaches generating an error signal of the observed result and the desired results associated with a task (action values) where the results are selected (computing error to update action values) to produce the highest immediate reward based on past history (based on reward signal), in [0072], Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history., and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. 
In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200; and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
wherein
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. The connections between these networks can be used to compute "semantic pointers", which model compressed representations of the activity of neural networks. Semantic pointers are vector representations that can be thought of as elements of a neural vector space, and can implement a form of abstraction level filtering or "compression", in which high-dimensional structures can be abstracted…; where the nonlinear component computes, that is generates, semantic pointers that are vector representations associated with the connections (that includes an output signal) between networks, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier such as in image processing tasks, that is the function that facilitates the outputs based on the connection synaptic weights and input vector, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task as depicted in Fig. 6A, in [0027]: FIG. 6A is an array of original input images for a copy drawing task; where the image inputs are captured as discrete data pixels captured over a time associated with the observation input as depicted in Fig. 13, in [0172]: FIG. 13 illustrates the performance of system 300 in a fluid reasoning task. The spiking activity encoding the currently inferred rule is shown in the VMPFC row. This is a running average of the inverse convolution (i.e., the inferred transformation) between representations in DLPFC1 and DLPFC2, as appropriate. The time course of the systems' activity can be observed, in which the system infers that the pattern in the input is "increase the number of elements by one" (see DLPFC2 row, for example))
the input to the system is one of a scalar and a multidimensional vector (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048]: For example, an image of the numeral "2" to be processed may be input as a 28x28 matrix of pixels. This image is at first represented as a 784-dimensional vector... For example, 50 dimensions may be used to represent underlying conceptual features of the image… & [0098]-[0099] … Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach…)
updating connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046] configured to execute functions as updating the connection weights, that is learning based on initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082] : … The artificial neu­rons are formed into networks of neurons with interconnec­tions with varying weights, which can be regulated to disin­hibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neuro­chemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9),... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule. (Eliasmith teaches updating action values by training and learning rule based on a plausible spike-based rule neural learning rule and the use of neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate update the action values including the neuron connection weights and error signals generated by user observation action for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal. pone.0022885 (2011) … and  in [0071]-[0072]:  All such transformations are updateable by error signals, which may come from the action selection compo­nent, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. 
One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; such as updating the action values of draw actions based on the reinforcement learning rule and the learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
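For illustration only, the error-driven update of action values in a bandit task described in the passages cited above can be sketched as follows. This is a minimal sketch using a generic delta rule over non-spiking activities; it is not the spike-based rule of the cited MacNeil and Eliasmith reference, and the function name, the activity values, and the learning rate are hypothetical assumptions introduced here.

```python
def update_chosen_action(weights, activities, chosen, reward, learning_rate=0.1):
    """Apply one generic delta-rule update to the weights of the chosen action.

    Hypothetical illustration: weights[i] holds the connection weights that
    decode the value of action i from the component activities.
    """
    # Current value estimate of the chosen action (weighted sum of activities).
    value = sum(w * a for w, a in zip(weights[chosen], activities))
    # Error signal: difference between the observed reward and the estimate.
    error = reward - value
    # Change each weight in proportion to the error and the component activity.
    weights[chosen] = [
        w + learning_rate * error * a
        for w, a in zip(weights[chosen], activities)
    ]
    return error


# Three-armed bandit with assumed fixed activities; action 1 is rewarded each time.
weights = [[0.0, 0.0] for _ in range(3)]
activities = [1.0, 0.5]
errors = [
    update_chosen_action(weights, activities, chosen=1, reward=1.0)
    for _ in range(3)
]
```

In this sketch the error for the repeatedly rewarded action shrinks on each trial, while the weights of the actions that were not chosen are left unchanged.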
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], Eliasmith does not expressly teach the following claim 14 limitation:
the update being based on the equation
    [Equation image: media_image2.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number of timesteps separating a/s from a'/s' and is greater than 1, rt is the reward signal at timestep t, and γ is a discount factor greater than zero and less than 1; 
RasE does teach claim 14 limitation:
computes an error signal used to update state and/or action values … based on the equation

    [Equation image: media_image2.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number of timesteps separating a/s from a'/s' and is greater than 1, rt is the reward signal at timestep t, and γ is a discount factor greater than zero and less than 1; (RasE teaches computing the claimed temporal difference as the temporal difference error in equations (2) and (4), where discount factor = 0.5 and tau = 2, thus r(s,a) = claimed [image: media_image12.png] = r(t=1) + r(t=2) = 0 + r(s,a) = [image: media_image13.png] = claimed [image: media_image14.png], and [image: media_image15.png] = claimed Q(s’, a’), [image: media_image16.png] = claimed [image: media_image17.png], in pg. 1253: Sec. Hierarchical reinforcement learning:

    [Image: media_image18.png])
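For illustration only, a temporal difference error computed over a transition spanning several timesteps can be sketched as follows. The claimed equation is an image that is not reproduced in this text, so the sketch assumes the common semi-Markov form, namely the discounted sum of the rewards collected during the transition, plus the discounted value Q(s', a'), minus Q(s,a). The function name, the indexing of rewards from t = 0, and the numeric values are hypothetical assumptions introduced here; only the discount factor of 0.5 and tau = 2 follow the mapping above.

```python
def smdp_td_error(q_sa, q_next, rewards, gamma):
    """Temporal difference error for a transition lasting tau = len(rewards) steps.

    Assumed form (not taken from the claim image):
        sum_{t=0}^{tau-1} gamma**t * r_t + gamma**tau * Q(s', a') - Q(s, a)
    """
    tau = len(rewards)
    if tau < 1:
        raise ValueError("at least one timestep must separate (s, a) from (s', a')")
    if not 0.0 < gamma < 1.0:
        raise ValueError("discount factor must be greater than zero and less than 1")
    # Discounted sum of the rewards received during the transition.
    discounted_reward = sum(gamma ** t * r for t, r in enumerate(rewards))
    # Add the discounted value of the later state/action, subtract the current value.
    return discounted_reward + gamma ** tau * q_next - q_sa


# Discount factor 0.5 and tau = 2, with no reward on the first step.
delta = smdp_td_error(q_sa=1.0, q_next=2.0, rewards=[0.0, 1.0], gamma=0.5)
```

With these assumed values the discounted reward is 0.5 and the discounted later value is 0.5, so the error against Q(s,a) = 1.0 is zero, meaning the current estimate already agrees with the two-step return.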

The Eliasmith and RasE references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith with the method of using time delays in computing the error as a TD computation in reinforcement learning, as disclosed by RasE.
One of ordinary skill in the art would have been motivated to combine the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes the temporal difference error, because such models are more sophisticated models of human decision making that use the time delays to encapsulate the activity of the subpolicy (RasE, pg. 1253: Sec. Hierarchical reinforcement learning). Doing so will provide a general 
Examiner notes that all modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Listed Below:
Potjans et al. (NPL: “An Imperfect Dopaminergic Error Signal Can Drive Temporal-Difference Learning”) teaches the range of values [0,1] for a discount factor in computing the temporal difference error.
Rasmussen et al. (NPL: “A neural reinforcement learning model for tasks with unknown time delays”, hereinafter ‘RasC’): teaches computing temporal difference in SMDP framework similar to claimed equation.
Rasmussen (NPL: “Hierarchical reinforcement learning in a biologically plausible neural architecture”, hereinafter ‘Ras’): teaches the use of the error equation as claimed in equation (12).
Tan et al. (Non-Patent Literature: “Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback”) teaches the use of TD error in reinforcement learning. 
Mnih et al. (US Pub. No. 2015/0100530) teaches the method and systems for reinforcement learning using action-values as the q-value for training neural networks by updating weight values with hidden layers for mapping instances in a hierarchical manner; and teaches reinforcement learning framework as a learning method for 
Bouvier et al. (NPL: “Spiking Neural Networks Hardware Implementations and Challenges: A Survey”) teaches the schematic representation of a biological neuron with synapses that receive and process information from external pre-synaptic and post synaptic neurons. 
Eleftheriou et al. (US Pub No. 20160267379): teaches the synapse in an artificial neural network as nonlinear connection components.
Taylor et al. (Non-Patent Literature: “Comparing evolutionary and temporal difference methods in a reinforcement learning domain”): teaches a temporal difference reinforcement learning method that uses a state-action-reward-state-action (SARSA) approach to estimate the action value function (updating action values), where the values are separated from the immediate reward (reward signal) by an intermediate action state chosen subsequent to a state s, in pg. 1323: Sec. 2.2: Sarsa.
Petroff (US Patent Application Publication No. 20090327011): teaches reinforcement learning algorithms including state-action-reward-state-action (SARSA), where the state action occurs before feedback is provided, in [0036].
Rom (US Pub. No. 2010/0145402): teaches Hebbian synapse states that are used for learning and training in a reinforcement learning scheme based on post-neuron spikes.
Any inquiry concerning this communication or earlier communications from the examiner should be 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/O.O.A./Examiner, Art Unit 2126     
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129