DETAILED ACTION
This action is in response to the communications filed 01/22/2021 in which claims 1 and 14 were amended; claims 3, 8, 16, and 21 were cancelled; and claims 1-2, 4-7, 9-15, 17-20, and 22-25 are still pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/22/2021 has been entered.
 
Drawings
The drawings were received on 10/25/2015.  These drawings are acceptable.

Examiner remarks
Examiner notes that applicant did not provide markings showing all the changes to the current claim set. MPEP 714 requires that all claims being currently amended be presented with markings to indicate the changes that have been made relative to the immediate prior version. The changes in any amended claim must be shown by strike-through (for deleted matter) or underlining (for added matter) with 2 exceptions: (1) for deletion of five or fewer consecutive characters, double brackets may be used
Please follow this practice for all responses.

Examiner notes that the cited patent application publication (US Pub. No. 2014/0156577) lists the applicant as the author and was published on 06/05/2014. This is considered art that does not meet the exception under 102(b)(1)(A) because the cited disclosure was published 1 year and two months before the effective filing date of the current claimed invention, 08/26/2015.

Examiner notes that the cited NPL, titled “Large-scale synthesis of functional spiking neural circuits” and published May 5, 2014, lists the applicant as the author. This is considered art that does not meet the exception under 102(b)(1)(A) because the cited disclosure was published 1 year and three months before the effective filing date of the current claimed invention, August 26, 2015.
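
For illustration only, the date comparison behind the two notes above can be sketched as follows. The function name `outside_grace_period` is hypothetical, and the sketch assumes the one-year window is measured back from the effective filing date to the same calendar day of the previous year; it does not handle a February 29 filing date and is not a statement of how the statute is applied in practice.

```python
from datetime import date


def outside_grace_period(publication: date, effective_filing: date) -> bool:
    """Return True when the publication predates the one-year window.

    Assumption: the window opens on the same month and day, one year
    before the effective filing date (Feb 29 is not handled here).
    """
    window_start = effective_filing.replace(year=effective_filing.year - 1)
    return publication < window_start


effective_filing_date = date(2015, 8, 26)

# Dates taken from the two examiner notes above.
patent_publication = date(2014, 6, 5)
npl_publication = date(2014, 5, 5)

print(outside_grace_period(patent_publication, effective_filing_date))
print(outside_grace_period(npl_publication, effective_filing_date))
```

Both publication dates fall before August 26, 2014, which is consistent with the examiner's statement that each disclosure was published more than one year before the effective filing date.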


Response to Arguments
Applicant’s arguments and amendments filed 01/22/2021 have been fully considered.

Applicant’s arguments with respect to the 35 USC § 112 rejections have been fully considered.
The 35 USC § 112(a) and § 112(b) rejections made in the previous office action have been withdrawn, as the current amendments have removed the problematic terms.
The 35 USC § 112(b) rejection has been updated to address the current claim amendments.
Regarding applicant’s remarks with respect to the rejection of claims under 35 U.S.C. § 103, the arguments have been fully considered. The applicant has argued that the cited prior art failed to render the disclosed equation as recited by the amended claim limitation.
The examiner respectfully disagrees. The MPEP notes that claims are given their broadest reasonable interpretation (BRI) in light of applicant's specification, see MPEP 2111. The examiner also notes that the applicant has admitted that the cited prior art, Kawato et al (NPL: “Efficient reinforcement learning: computational theories, neuroscience and robotics”, hereinafter ‘Ka’), teaches the claim limitation for a time step of 1, where τ is greater than 1. Specifically, Ka teaches the equations disclosed in Fig. 2 for computing the temporal differences in a reinforcement learning environment. Given τ=2, the summation includes t=0 and t=1, and the second summation does not expressly depend on τ, so it can be considered V(t); the equation is therefore the same for t=1 at τ=2. In addition, for τ=2 and γ=1, the equation recited in claims 1 and 14 is the same as the one cited in Rasmussen et al. (NPL: “A neural model of hierarchical reinforcement learning”, hereinafter ‘RasE’) in equation 4, on pg. 1253, for computing the temporal difference (TD) error used to compute the Q-values in a hierarchical reinforcement learning environment, as recited by applicant's amended claims. Therefore, the rejection of claims under 35 U.S.C. § 103 has been maintained.
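
As an illustrative sketch only, and not a reproduction of the claimed equation (which appears in this action only as an image), the kind of comparison discussed above can be worked through numerically. The sketch assumes a multi-step temporal-difference error built from the variables recited in claim 1, namely the discounted sum of rewards over τ timesteps, plus the discounted value Q(s',a'), minus Q(s,a), and a one-step error built from the r(t), V(t+1), and V(t) terms attributed to Ka's Fig. 2. The function names and the exact placement of the discount factor are assumptions.

```python
from typing import Sequence


def multistep_td_error(rewards: Sequence[float], q_sa: float,
                       q_next: float, gamma: float) -> float:
    """Assumed form: sum_{t=0}^{tau-1} gamma^t * r_t + gamma^tau * Q(s',a') - Q(s,a).

    tau is taken to be the number of rewards supplied.
    """
    tau = len(rewards)
    discounted_rewards = sum(gamma ** t * r for t, r in enumerate(rewards))
    return discounted_rewards + gamma ** tau * q_next - q_sa


def one_step_td_error(reward: float, v_now: float,
                      v_next: float, gamma: float) -> float:
    """Assumed one-step form: r(t) + gamma * V(t+1) - V(t)."""
    return reward + gamma * v_next - v_now


# With a single reward (tau = 1) the assumed multi-step form equals the one-step form.
single = multistep_td_error([1.0], q_sa=0.5, q_next=3.0, gamma=0.9)
baseline = one_step_td_error(1.0, v_now=0.5, v_next=3.0, gamma=0.9)
print(abs(single - baseline) < 1e-12)

# With tau = 2 and gamma = 1 the error is r_0 + r_1 + Q(s',a') - Q(s,a).
print(multistep_td_error([1.0, 2.0], q_sa=0.5, q_next=3.0, gamma=1.0))
```

Under these assumptions, the two forms coincide only when a single reward term is summed; with two reward terms the multi-step error carries an additional reward and a differently discounted successor value.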


Claim Interpretation
Regarding claims 1, 3-5, and 7-10, the claim limitations are not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph because the claim construction as recited in claim 1, 

Claim Objections
Claims 1 and 14 are objected to because of the following informalities: the claims recite an equation that is unclear:

[Image: media_image1.png, the equation recited in claims 1 and 14]

The equation is blurry, which makes it hard to clearly determine all the variables recited in the equation. Appropriate correction is required.

Claim Rejections - 35 USC § 112-Indefiniteness
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-2, 4-7, 9-15, 17-20, and 22-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the

Regarding claims 1 and 14, the limitation recited below renders the claims indefinite because the applicant has expressed an equation whose variables are presented in an improper form:

[Image: media_image1.png, the equation recited in claims 1 and 14]

The second summation is over a time step t that does not appear in the discount factor or in Q(s,a); neither is expressed as a function of t, as required by the summation in:

[Image: media_image2.png, the summation excerpted from the recited equation]

It is therefore unclear what scope applicant intends for the recited equation.
Applicant has also recited ‘(12)’ that is assumed to not be part of the equation. Applicant should clarify the expression of the claim to indicate if the notation is part of the equation or not.
Therefore, the claim is rendered indefinite. The examiner interprets any parameter associated with an error calculation using temporal-difference computation for updating state and/or action values in a reinforcement learning model/algorithm as within the scope of the claim limitation.



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4-7, 9-15, 17-20, and 22-25 are rejected under 35 U.S.C. 103 as being unpatentable over Eliasmith et al. (US Patent Application Publication No. 2014/0156577, hereinafter ‘Eliasmith’), in view of Kawato et al (NPL: “Efficient reinforcement learning: computational theories, neuroscience and robotics”, hereinafter ‘Ka’).

	
Regarding independent claim 1 limitations, Eliasmith teaches a system implementing reinforcement learning:
the system comprising a computer processor and a computer readable medium having computer executable instructions executed by said processor; said computer readable medium including instructions for providing: (Eliasmith teaches the use of a non-transitory computer-readable storage medium configured to execute computer programs, that is, instructions, in [0043]: Each program may be implemented in a high level procedural or object oriented programming or scripting language, or both, to communicate with a computer system… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. Embodiments of the system may also be considered to be implemented as a non-transitory computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein; where the instructions are executed by the computer processor, in [0041]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. These embodiments may be implemented in computer programs executing on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.)
a neural network including a plurality of synapses; (Eliasmith teaches the model of the brain structure as a neural network including a plurality of synaptic connections, in [0090]: ... FIG. 3B is a schematic block diagram of the Spaun system that contains elements analogous to those highlighted in FIG. 3A. Lines terminating in circles indicate connections with neurons that produce output simulating the effects of gamma-Aminobutyric acid (GABA) at their output, so-called GABAergic (inhibitory) connections or synapses. Lines terminating in open squares indicate modulatory activity emulating dopaminergic (adaptive) connections.)
an action values module that receives environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries out the task, that is, the action values module, of receiving input in the form of internal states, in [0152]: In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the input comprises an action value based on an error signal, receiving signals from the environment as observed actions, considered an environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal, in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; where the system tasks are carried out as executable computer program instructions, in [0041].)
an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module as computer program code coupled using a computer system for implementing programming instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output information… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
an error calculation module coupled to both the action values and action selection module, which computes an error signal used to update state and/or action values in the action values module based…; (Eliasmith teaches an error calculation module coupled to both the action values and action selection module, as computer programs coupled to a computing system, in [0041]-[0043], for executing operations used to generate an error signal from the observed result and the desired result associated with a task (action values), where the results are selected to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; and, as depicted in Fig. 3, the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out a task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200; and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
 a learning sub-module wherein (Eliasmith teaches computer executable instruction programs, in [0041], as executable programs in a computer system, [0046].)
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one synaptic weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. ...; where the nonlinear component computes, that is, generates, semantic pointers that are vector representations associated with the connections (that include an output signal) between networks, in [0046]; where the neural connection synaptic weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the synaptic weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches that the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, such as in image processing tasks, that is, the function that facilitates the outputs based on the connection synaptic weights and input vectors, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the state and/or action values being updated are separated from the reward signal…; (Eliasmith teaches the output of more actions (updated action values) or decisions to carry out a task as a way of maximizing a real or perceived reward (separated reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the system uses reinforcement learning to associate rewards with task activity trials, in [0145]: Reinforcement learning: Perform a three-armed bandit task, in which it is determined which of three possible choices generates the greatest stochastically generated reward. Reward contingencies can change from trial to trial.)
the learning sub-module is configured to update connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046], configured to execute functions such as updating the connection weights, that is, learning based on the initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082]: … The artificial neurons are formed into networks of neurons with interconnections with varying weights, which can be regulated to disinhibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neurochemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9),... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule; (Eliasmith teaches updating action values by training with a neural learning rule, namely a biologically plausible spike-based rule, in which each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate updating the action values, including the neuron connection weights and error signals generated by observing actions for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011) … and in [0071]-[0072]: All such transformations are updateable by error signals, which may come from the action selection component, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; such as updating an action value, as draw actions, based on the reinforcement learning rule and the learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task as depicted in Fig. 6A, in [0027]; where the image inputs are captured as discrete data pixels over a time associated with the observation input as depicted in Fig. 13, in [0172].)
the input to the system is one of a scalar and a multidimensional vector. (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048] & [0098].)
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], using reinforcement learning techniques to associate rewards with task activity trials, in [0145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following claim 1 limitations:
computes an error signal used to update state and/or action values … based on the equation
[Image: media_image1.png, the equation recited in claims 1 and 14]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor;
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions;
Ka does teach claim 1 limitation:
computes an error signal used to update state and/or action values … based on the equation
[Image: media_image1.png, the equation recited in claims 1 and 14]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor; (Ka teaches computing the temporal difference error in reinforcement learning controller systems. Applying applicant’s equation with τ=2, the summation covers timesteps 0 and 1 (i.e. t=0 and t=τ-1=1) as units of observation time, where γ is a discount factor, rt is r(t), Q(s,a) is V(t+1), and Q(s’,a’) is V(t), and the negative discount factor γ is considered an integrated discount factor, as depicted in the equation under Fig. 2(b) as an error calculation, δ(t), that is based on the equation recited in applicant's claim, in pg. 209: “…Furthermore, the firing rates of this second population of neurons predict the amount of future reward, thus seeming to encode the predicted reward or the value function V(t + 1). These remarkable findings suggest two possible neural mechanisms (Figure 2c,d) for computation of temporal-difference error. In Figure 2c, some intranuclear circuits within the PPN or SNc, or some membrane properties of dopaminergic neurons, execute either temporal difference or differentiation (box in Figure 2c). By contrast, the model in Figure 2d predicts that the primary reward information r(t) and the expected reward at the next time step V(t+1) are carried by excitatory inputs from the PPN to the SNc, whereas the inhibitory input from the striatum conveys the subtracted predicted reward information at the current time V(t)…”

[Image: media_image3.png]
)
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions; (Ka teaches the use of a temporal difference reinforcement learning method where the state/action values, depicted as x and u values respectively, are separated from the reward signal by the reward module as depicted in Fig. 1:

[Image: media_image4.png]
)

The Eliasmith and Ka references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith to integrate an intermediate action associated with a state and reward signal, in order to learn system parameters and values using a temporal difference (TD) reinforcement learning method, as disclosed by Ka.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes temporal difference error, yielding predictable results, and based on reinforcement learning models representing connections in the basal ganglia to help resolve the three theoretical
The examiner notes that the equation has variables that have not been appropriately described in the claim limitation; therefore, the variables can be mapped to a broader scope as highlighted. It is improper to limit the scope to a preferred embodiment not required by the claim limitation, see MPEP 2111.
Examiner notes that all claimed modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].
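
For illustration only, the "bandit" task quoted from Eliasmith in [0072] and [0169] above can be pictured with a plain tabular sketch. This is not Eliasmith's spiking neural model: the delta-rule update, the greedy choice, the optimistic starting values, and all function names are assumptions chosen to keep the example small and deterministic.

```python
from typing import List, Sequence, Tuple


def update_value(value: float, reward: float, learning_rate: float) -> float:
    """Delta rule: move the estimate toward the observed reward."""
    prediction_error = reward - value
    return value + learning_rate * prediction_error


def choose_action(values: Sequence[float]) -> int:
    """Greedy choice: index of the highest estimate (first index wins ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def run_bandit(rewards_per_trial: Sequence[Sequence[float]],
               learning_rate: float = 0.5) -> Tuple[List[float], List[int]]:
    """Run a three-armed bandit with optimistic initial estimates of 1.0."""
    values = [1.0, 1.0, 1.0]
    choices = []
    for trial_rewards in rewards_per_trial:
        action = choose_action(values)
        choices.append(action)
        values[action] = update_value(values[action], trial_rewards[action],
                                      learning_rate)
    return values, choices


# Hypothetical reward schedule: only the third option ever pays out.
schedule = [[0.0, 0.0, 1.0]] * 4
final_values, picks = run_bandit(schedule)
print(picks)         # [0, 1, 2, 2]
print(final_values)  # [0.5, 0.5, 1.0]
```

In this sketch the prediction error for the rewarded option is zero once its estimate matches the reward, while the unrewarded options are tried once and then abandoned.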

Regarding claim 2, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein multiple instances of the system are composed into a hierarchical structure, (Eliasmith teaches the artificial intelligence system comprising multiple instances composed into a hierarchy configuration, as the hierarchical structure, in [0007]: In a first broad aspect, some embodiments provide an artificial intelligence system comprising: at least one interface hierarchy configured to receive an input of a high-dimensional representation and to compress the high-dimensional representation to generate a lower-dimensional representation of the input; at least one processing module configured to receive the lower-dimensional representation and to generate a further representation; an action selection controller configured to control communication of the lower-dimensional representation and the further representation between the at least one interface hierarchy and the at least one processing module; and as processing layers of consecutive neural network layers that are part of the multiple instances composed into a hierarchical structure, in [0096]-[0099]: Referring now to FIG. 5, there is illustrated a simplified schematic diagram of the compression hierarchy of the visual input hierarchy module 302 in one embodiment. [0097] The visual compression hierarchy has a 28x28 dimensional input layer (e.g., for receiving a 784-pixel input image) and consecutive hidden layers of 1000, 500, 300 and 50 nodes. [0098] The initial, 1000-node hierarchical layer, which generates a 1000-dimensional semantic pointer, can be considered analogous to the primary visual cortex (V1). A second, 500-node layer can be considered analogous to the secondary visual cortex (V2). A third, 300-node layer can be considered analogous to the extrastriate visual cortex (V4). Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), …)
wherein the output of one instance performs one or more of 
adding new state input to the input of the downstream instance;
modifying state in the downstream instance; and
modifies the reward signal of the downstream instance. (Eliasmith teaches multiple instances composed into a hierarchical structure in which the output of one instance is a modified, lower-dimensional state space for the downstream instance, as depicted in Fig. 5, in [0099]; modifying the reward at the input of the downstream instance, in [0128]; and adding state input by determining which states should be switched in accordance with the current task goal, in [0056].)
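For illustration only, the layer widths quoted from Eliasmith [0097] can be laid out as a chain of consecutive stages. The sketch below is not taken from either reference; the names `LAYER_SIZES` and `layer_shapes` are hypothetical, and only the numeric widths and the V1/V2/V4/IT labels come from the quoted passage.

```python
# Layer widths quoted in Eliasmith [0097]: a 28x28 (784-pixel) input layer
# followed by hidden layers of 1000, 500, 300 and 50 nodes.
# Variable and function names here are hypothetical, for illustration only.
LAYER_SIZES = [784, 1000, 500, 300, 50]
LAYER_NAMES = ["V1", "V2", "V4", "IT"]  # analogies drawn in [0098]


def layer_shapes(sizes):
    """Return (n_inputs, n_outputs) pairs for each consecutive pair of layers."""
    return list(zip(sizes[:-1], sizes[1:]))


shapes = layer_shapes(LAYER_SIZES)
for name, (n_in, n_out) in zip(LAYER_NAMES, shapes):
    print(f"{name}: {n_in} -> {n_out}")
```

Each pair corresponds to one set of connection weights between adjacent hierarchical layers, ending in the 50-dimensional representation of the IT layer.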


wherein the module representing state/action values consists of two interconnected sub-modules, each of which receives state information with or without time delay as input, and the output of one sub-module is used to train the other in order to allow state and/or action value updates to be transferred over time. (Eliasmith teaches modules that represent the activity state using artificial neurons that receive state information as input and are connected by weights to approximate a function, in [0079]; where the inputs to the nodes are images presented over time, as depicted in Fig. 4B and Fig. 5, that are used to train the neurons, which represent each hierarchical compression state space, allow state updates, and can be processed to learn and train, in [0099].)

Regarding claim 5, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein there are initial couplings within and between different modules of the system, where each weighted coupling has a corresponding connection weight such that the output generated by each nonlinear component is weighted by the corresponding connection weights to generate a weighted output. (Eliasmith teaches the modules as executable programs in a computer system that comprise coupled artificial neurons, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the weighted outputs responses to the inputs, in [0081] as modeled by the neurons, in [0079] to learn optimized values  from initial couplings, in [0081].)
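As a minimal sketch of the weighted-coupling language in this limitation, each nonlinear component's output can be scaled by its connection weight and the results summed. The rectified-linear response, the parameter values, and the function names are assumptions chosen for illustration; neither the claim nor Eliasmith prescribes them.

```python
def component_outputs(x, gains, biases):
    """Response of each nonlinear component to a scalar input x.

    A rectified-linear response is assumed here purely for illustration.
    """
    return [max(0.0, g * x + b) for g, b in zip(gains, biases)]


def weighted_output(activities, weights):
    """Weight each component output by its connection weight and sum."""
    return sum(w * a for w, a in zip(weights, activities))


# Hypothetical parameters for three components.
gains = [1.0, -1.0, 2.0]
biases = [0.0, 0.5, -1.0]
weights = [0.5, 0.25, 0.1]

activities = component_outputs(1.0, gains, biases)
output = weighted_output(activities, weights)
print(activities, output)
```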


wherein a neural compiler is used to determine the initial couplings and connection weights. (Eliasmith teaches the use of a neural simulator to model neurons, where the couplings and connection weights can be modeled and learned computationally, that is, from a determined initial set of parameters; where the neural models may be simulated using a suitable neural simulator, such as the Nengo neural simulator (<http://www.nengo.ca/>), which comprises a neural compiler for compiling scripted software for simulating neural system models, in [0080]-[0082].)

Regarding claim 7, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein at least one of the nonlinear components in an adaptive sub module that generates a multidimensional output is coupled to the action selection and/or error calculation modules by a plurality of weighted couplings, one weighted coupling for each dimension of the multidimensional output modifier. (Eliasmith teaches the sub module associated with a hierarchical level as an adaptive sub module that generates multidimensional output as a neuron network of N nodes associated with N dimensions, in [0098]; where each neuron node network comprises neuron nodes weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, that is, the function that facilitates the outputs based on the connection weights and input vector, in [0081]; where the transformations at each hierarchical level are facilitated by the coupled error calculation modules that calculate the error signals, in [0071]-[0072].)

Regarding claim 9, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the learning sub-module is configured to update the connection weights based on an outer product of the initial output and the outputs from the nonlinear. (Eliasmith teaches the adjusting of the connection weights, that is updates based on an initial value, and the output generated to compute the error signal that accounts for the difference in the observed output result generated and the desired output generated by the nonlinear components neurons of each level, in [0070]-[0072]; where the output is computed using an outer product by the information encoding that is implemented as a neural network level in the system, in [0123]-[0126].)
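To make the outer-product language concrete, the following sketch forms a weight change from the outer product of an output vector and the component outputs. It is an assumption-laden illustration, not the rule of the claim or of Eliasmith: the learning rate `lr`, the additive update, and all names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list (len(u) rows, len(v) columns)."""
    return [[ui * vj for vj in v] for ui in u]


def update_weights(weights, initial_output, activities, lr=0.1):
    """Add lr * outer(initial_output, activities) to the weight matrix.

    The additive form and the learning rate are assumptions for illustration.
    """
    delta = outer(initial_output, activities)
    return [
        [w + lr * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, delta)
    ]


# Two output dimensions, three nonlinear components, weights starting at zero.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W = update_weights(W, initial_output=[1.0, -2.0], activities=[0.5, 0.0, 1.0], lr=0.5)
print(W)
```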

Regarding claim 10, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein each nonlinear component has a tuning curve that determines the output generated by the nonlinear component in response to any input and the tuning curve for each nonlinear component may be generated randomly. (Eliasmith teaches that each component of the 50D representation, that is the IT row, has an associated tuning curve used to indicate the output category of any input image in a two-dimensional representation space, depicted in Fig. 6D, that is a tuning curve where the mean value of the pointers is indicated with a large dark circle, in [0104]; where the activity for respective structure labels associated with the 50D layer row IT of spiking neurons is generated by randomly selecting neurons from the population, that is the generating of curves randomly, in [0166], based on the neuron response curve of the raster plot for capturing the spiking activity associated with a labeled structure, in [0166]; see [0055], which incorporates by reference “Eliasmith-2” (“A Unified Approach to Building and Controlling Spiking Attractor Networks”), which teaches the use of a tuning process that implements a tuning curve for each neuron component, in pg. 1278: Sec. 2.1.)
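One way to picture a randomly generated tuning curve is a response function whose parameters are drawn at random for each component. The sketch below assumes a rectified-linear curve with random gain, bias, and preferred direction; the parameter ranges and names are hypothetical and are not taken from the claim or the cited references.

```python
import random


def random_tuning_curve(rng):
    """Build one tuning curve with randomly drawn parameters.

    Gain, bias and preferred direction (encoder) ranges are illustrative assumptions.
    """
    gain = rng.uniform(0.5, 2.0)
    bias = rng.uniform(-1.0, 1.0)
    encoder = rng.choice([-1.0, 1.0])

    def curve(x):
        # Output of the component in response to scalar input x.
        return max(0.0, gain * encoder * x + bias)

    return curve


rng = random.Random(0)  # fixed seed so the sketch is reproducible
curves = [random_tuning_curve(rng) for _ in range(5)]
responses = [curve(0.3) for curve in curves]
print(responses)
```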

Regarding claim 11, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the nonlinear components are simulated neurons. (Eliasmith teaches neuron model components may be simulated, in [0080].)

Regarding claim 12, the rejection of claim 11 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 11:
wherein the neurons are spiking neurons. (Eliasmith teaches the neurons are spiking neurons in [0046].)

Regarding claim 13, the rejection of claim 1 is incorporated and Eliasmith in combination with Ka further teaches the system of claim 1:
wherein the components are implemented in hardware specialized for simulating the nonlinear components. (Eliasmith teaches implementing the system in specialized hardware, in [0041], to simulate artificial neuron components, [0079]-[0080].)
	
Regarding independent claim 14 limitations, Eliasmith teaches a computer implemented method for reinforcement learning comprising:
receiving by an action values module stored on a computer readable medium environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries the task, that is the action value module, of receiving input in the form of internal states, in [0152] In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the system tasks are carried out as executable computer program instructions, in [0041] The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. These embodiments may be implemented in computer programs executing on program­mable computers, each computer including at least one pro­cessor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication inter­face.)
providing on the computer readable medium an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module depicted as the visual input having internal state levels defined by the information encoding process as depicted in Fig. 3B, in [0091]; where the process is provided using a non-transitory computer-readable storage medium configured to execute computer programs, that is instructions for implement programing instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output informa­tion… Each such computer program may be stored on a storage media or a device ( e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
computing an error signal to update state and/or action values in the action values module by a calculation module coupled to both the action values and action selection module; the update being based on a rewards signal …; (Eliasmith teaches generating an error signal of the observed result and the desired results associated with a task (action values) where the results are selected (computing error to update action values) to produce the highest immediate reward based on past history (based on reward signal), in [0072], Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history., and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. 
In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200; and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
wherein
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. The connections between these networks can be used to compute "semantic pointers", which model compressed representations of the activity of neural networks. Semantic pointers are vector representations that can be thought of as elements of a neural vector space, and can implement a form of abstraction level filtering or "compression", in which high-dimensional structures can be abstracted…; where the nonlinear component computes, that is generates, semantic pointers that are vector representations associated with the connections (that includes an output signal) between networks, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier such as in image processing tasks, that is, the function that facilitates the outputs based on the synaptic connection weights and input vector, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task as depicted in Fig. 6A, in [0027]: FIG. 6A is an array of original input images for a copy drawing task; where the image inputs are captured as discrete data pixels over a time associated with the observation input as depicted in Fig. 13, in [0172]: FIG. 13 illustrates the performance of system 300 in a fluid reasoning task. The spiking activity encoding the currently inferred rule is shown in the VMPFC row. This is a running average of the inverse convolution (i.e., the inferred transformation) between representations in DLPFC1 and DLPFC2, as appropriate. The time course of the systems' activity can be observed, in which the system infers that the pattern in the input is "increase the number of elements by one" (see DLPFC2 row, for example)
[Image: media_image5.png]
[Image: media_image6.png]
)
the input to the system is one of a scalar and a multidimensional vector (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048]: For example, an image of the numeral "2" to be processed may be input as a 28x28 matrix of pixels. This image is at first represented as a 784-dimensional vector... For example, 50 dimensions may be used to represent underlying conceptual features of the image… & [0098]-[0099] … Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach…)
updating connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046] configured to execute functions as updating the connection weights, that is learning based on initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082] : … The artificial neu­rons are formed into networks of neurons with interconnec­tions with varying weights, which can be regulated to disin­hibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neuro­chemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9),... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule. (Eliasmith teaches updating action values by training and learning rule based on a plausible spike-based rule neural learning rule and the use of neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate update the action values including the neuron connection weights and error signals generated by user observation action for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal. pone.0022885 (2011 ) … and  in [0071]-[0072]:  All such transformations are updateable by error signals, which may come from the action selection compo­nent, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. 
One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; such as updating action value as draw actions based on the reinforcement learning rule and learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], and using reinforcement learning techniques to associate rewards with task activity trials, in [0145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following limitation of claim 14:
the update being based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor;
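The equation itself appears in this text only as an image reference (media_image1.png). As a hedged reconstruction, not a reproduction of the claim, one form consistent with the symbol definitions above and with a summation running from t = 0 to t = τ-1 is the multi-step temporal-difference error below. The symbol δ, the use of γ for the discount factor printed as "y", and the absence of any learning-rate term are assumptions.

```latex
% Assumed reconstruction from the stated symbol definitions; the claim's exact form may differ.
\[
  \delta(s,a) \;=\; \sum_{t=0}^{\tau-1} \gamma^{t}\, r_{t} \;+\; \gamma^{\tau}\, Q(s',a') \;-\; Q(s,a)
\]
```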
Ka does teach claim 14 limitation:
computes an error signal used to update state and/or action values … based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, rt is the reward signal at timestep t, and y is a discount factor; (Ka teaches computing the temporal difference error in reinforcement learning controller systems in a manner corresponding to applicant's equation: for τ = 2, the summation covers timesteps 0 and 1 (i.e., t = 0 and t = τ-1 = 1) of the unit-of-time observation, γ is a discount factor, rt is r(t), Q(s,a) is V(t+1), and Q(s',a') is V(t), with the negative discount factor γ considered an integrated discount factor, as depicted in the equation under Fig. 2(b) as an error calculation, δ(t), that is based on the equation recited in applicant's claim, in pg. 209: “…Furthermore, the firing rates of this second population of neurons predict the amount of future reward, thus seeming to encode the predicted reward or the value function V(t + 1). These remarkable findings suggest two possible neural mechanisms (Figure 2c,d) for computation of temporal-difference error. In Figure 2c, some intranuclear circuits within the PPN or SNc, or some membrane properties of dopaminergic neurons, execute either temporal difference or differentiation (box in Figure 2c). By contrast, the model in Figure 2d predicts that the primary reward information r(t) and the expected reward at the next time step V(t+1) are carried by excitatory inputs from the PPN to the SNc, whereas the inhibitory input from the striatum conveys the subtracted predicted reward information at the current time V(t)…”
[Image: media_image3.png]
)
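For readers who want a numeric feel for a temporal-difference error computed over several timesteps, the sketch below assumes the common multi-step form: the discounted sum of the intervening rewards, plus the discounted value of the later state/action, minus the value of the earlier state/action. That form, the function name `td_error`, and the sample numbers are assumptions for illustration and are not asserted to be the exact equation of the claim or of Ka.

```python
def td_error(rewards, q_sa, q_next, gamma):
    """Multi-step temporal-difference error (assumed standard form).

    rewards : rewards r_0 .. r_{tau-1} observed between (s, a) and (s', a')
    q_sa    : current estimate Q(s, a)
    q_next  : current estimate Q(s', a')
    gamma   : discount factor
    """
    tau = len(rewards)
    discounted_rewards = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return discounted_rewards + (gamma ** tau) * q_next - q_sa


# tau = 2: rewards at timesteps 0 and 1 separate (s, a) from (s', a').
delta = td_error(rewards=[1.0, 0.0], q_sa=0.5, q_next=2.0, gamma=0.5)
print(delta)
```

With a single reward (tau = 1) the same function reduces to the one-step error r + gamma * Q(s', a') - Q(s, a).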
The Eliasmith and Ka references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith to integrate the temporal difference (TD) reinforcement learning method disclosed by Ka, in which an intermediate action associated with a state and reward signal is used to learn system learning parameters and values.
One of ordinary skill in the arts would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks for constructing reinforcement learning model computing temporal difference error for yielding predictable results and based on reinforcement 
Examiner notes that all modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Regarding claim 15, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka further teaches the method of claim 14:
further comprising repeating the method in a hierarchical manner (Eliasmith teaches the artificial intelligence system comprising multiple instances composed into a hierarchy configuration, as the hierarchical manner for repeating a method using an artificial neural network learning process, in [0007]: In a first broad aspect, some embodiments provide an artificial intelligence system comprising: at least one interface hierarchy configured to receive an input of a high-dimensional representation and to compress the high-dimensional representation to generate a lower-dimensional representation of the input; at least one processing module configured to receive the lower-dimensional representation and to generate a further representation; an action selection controller configured to control communication of the lower-dimensional representation and the further representation between the at least one interface hierarchy and the at least one processing module. Eliasmith further teaches consecutive neural network processing layers as part of the multiple instances composed into a hierarchical structure, in [0096]-[0099]: Referring now to FIG. 5, there is illustrated a simplified schematic diagram of the compression hierarchy of the visual input hierarchy module 302 in one embodiment. [0097] The visual compression hierarchy has a 28x28 dimensional input layer (e.g., for receiving a 784-pixel input image) and consecutive hidden layers of 1000, 500, 300 and 50 nodes. [0098] The initial, 1000-node hierarchical layer, which generates a 1000-dimensional semantic pointer, can be considered analogous to the primary visual cortex (V1). A second, 500-node layer can be considered analogous to the secondary visual cortex (V2). A third, 300-node layer can be considered analogous to the extrastriate visual cortex (V4). Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), …)
such that the output of one instance of the method performs one or more of
adding new state input to the input of the downstream instance;
modifying state in the downstream instance; and
modifies the reward signal of the downstream instance. (Eliasmith teaches multiple instances composed into a hierarchical structure in which the output of one instance is a modified, lower-dimensional state space for the downstream instance, as depicted in Fig. 5, in [0099]; modifying the reward at the input of the downstream instance, in [0128]; and adding state input by determining which states should be switched in accordance with the current task goal, in [0056].)


Regarding claim 17, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein the module representing state/action values consists of two interconnected sub-modules, each of which receives state information with or without time delay as input, and the output of one sub-module is used to train the other in order to allow state and/or action value updates to be transferred over time. (Eliasmith teaches modules that represent the activity state using artificial neurons that receive state information as input and are connected by weights to approximate a function, in [0079]; where the inputs to the nodes are images presented over time, as depicted in Fig. 4B and Fig. 5, that are used to train the neurons, which represent each hierarchical compression state space, allow state updates, and can be processed to learn and train, in [0099].)

Regarding claim 18, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein there are initial couplings within and between different modules, where each weighted coupling has a corresponding connection weight such that the output generated by each nonlinear component is weighted by the corresponding connection weights to generate a weighted output. (Eliasmith teaches the modules as executable programs in a computer system that comprise coupled artificial neurons, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the weighted outputs responses to the inputs, in [0081] as modeled by the neurons, in [0079] to learn optimized values  from initial couplings, in [0081].)
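For illustration only, the weighted-output limitation discussed above can be sketched as follows. The function names `rectified_linear` and `weighted_output`, the choice of a rectified-linear nonlinearity, and all numeric values are assumptions made for this sketch; they are not taken from Eliasmith or from the claims.

```python
def rectified_linear(x):
    # Hypothetical nonlinearity standing in for a "nonlinear component".
    return max(0.0, x)


def weighted_output(component_inputs, connection_weights):
    # Each nonlinear component produces an output in response to its input.
    activities = [rectified_linear(x) for x in component_inputs]
    # Each output is scaled by its connection weight, and the results are summed.
    return sum(w * a for w, a in zip(connection_weights, activities))


# Illustrative values only.
component_inputs = [0.5, -1.0, 2.0]
connection_weights = [0.2, 0.7, -0.1]
result = weighted_output(component_inputs, connection_weights)
```

In this sketch the second component is silent for a negative input, so only the first and third couplings contribute to the weighted output.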


further comprising determining by a neural compiler the initial couplings and connection weights. (Eliasmith teaches the use of a neural simulator to model neurons, where the couplings and connection weights can be modeled and learned computationally, that is, from a determined initial set of parameters; where the neural models may be simulated using a suitable neural simulator, such as the Nengo neural simulator (<http://www.nengo.ca/>), which comprises a neural compiler for compiling a scripting-based software package for simulating neural system models, in [0080]-[0082].)

Regarding claim 20, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein at least one of the nonlinear components in an adaptive submodule that generates a multidimensional output is coupled to the action selection and/or error calculation modules by a plurality of weighted couplings, one weighted coupling for each dimension of the multidimensional output modifier. (Eliasmith teaches the sub-module associated with a hierarchical level as an adaptive sub-module that generates multidimensional output as a neuron network of N nodes associated with N dimensions, in [0098]; where each neuron node network comprises neuron nodes weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, that is, the function that facilitates the outputs based on the connection weights and input vector, in [0081]; where the transformations at each hierarchical level are facilitated by the coupled error calculation modules that calculate the error signals, in [0071]-[0072].)


Regarding claim 22, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
further comprising updating by the learning sub-module the connection weights based on an outer product of the initial output and the outputs from the nonlinear components. (Eliasmith teaches the adjusting of the connection weights, that is, updates based on an initial value and the output generated to compute the error signal that accounts for the difference between the observed output result and the desired output generated by the nonlinear component neurons of each level, in [0070]-[0072]; where the output is computed using an outer product by the information encoding that is implemented as a neural network level in the system, in [0123]-[0126].)
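For illustration only, a weight update based on an outer product of two vectors can be sketched as follows. The names `outer_product` and `update_weights`, the learning rate, and the numeric values are assumptions made for this sketch; neither the claim nor Eliasmith is represented as specifying this exact form.

```python
def outer_product(u, v):
    # Entry [i][j] of the result is u[i] * v[j].
    return [[ui * vj for vj in v] for ui in u]


def update_weights(weights, initial_output, activities, learning_rate=0.1):
    # Hypothetical rule: add a scaled outer product of the initial output
    # vector and the nonlinear component outputs to the weight matrix.
    delta = outer_product(initial_output, activities)
    return [
        [w + learning_rate * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, delta)
    ]


# Illustrative values only: two output dimensions, two nonlinear components.
weights = [[0.0, 0.0], [0.0, 0.0]]
initial_output = [1.0, -2.0]
activities = [0.5, 1.0]
new_weights = update_weights(weights, initial_output, activities)
```

The resulting matrix has one row per output dimension and one column per nonlinear component, so each coupling receives its own adjustment.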

Regarding claim 23, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein each nonlinear component has a tuning curve that determines the output generated by the nonlinear component in response to any input and the tuning curve for each nonlinear component may be generated randomly. (Eliasmith teaches that each component of the 50D representation, that is, the IT row, has associated tuning curves used to indicate the output category of any input image in a two-dimensional representation space, depicted in Fig. 6D, that is, a tuning curve where the mean value of the pointers is indicated with a large dark circle, in [0104]; where the activity for respective structure labels associated with the 50D layer row IT of spiking neurons is generated by randomly selecting neurons from the population, that is, the generating of curves randomly, in [0166], based on the neuron response curve of the raster plot for capturing the spiking activity associated with a labeled structure, in [0166]; see [0055], which incorporates by reference Eliasmith-2 (“A Unified Approach to Building and Controlling Spiking Attractor Networks”), which teaches the use of a tuning process that implements a tuning curve for each neuron component, in pg. 1278: Sec. 2.1.)
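For illustration only, randomly generated tuning curves can be sketched as follows. The rectified-linear curve shape, the parameter ranges, and the names `make_random_tuning_curve`, `gain`, `bias`, and `encoder` are assumptions made for this sketch and are not drawn from the cited references.

```python
import random


def make_random_tuning_curve(rng):
    # Hypothetical parameters, each drawn at random for one component.
    gain = rng.uniform(0.5, 2.0)
    bias = rng.uniform(-1.0, 1.0)
    encoder = rng.choice([-1.0, 1.0])  # preferred direction of the input

    def tuning_curve(x):
        # The curve maps any scalar input to a non-negative output.
        return max(0.0, gain * encoder * x + bias)

    return tuning_curve


rng = random.Random(0)  # fixed seed so the sketch is repeatable
curves = [make_random_tuning_curve(rng) for _ in range(5)]
responses = [[curve(x) for x in (-1.0, 0.0, 1.0)] for curve in curves]
```

Each entry of `responses` holds one component's output at three sample inputs; because the parameters are random, different components respond differently to the same input.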

Regarding claim 24, the rejection of claim 14 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 14:
wherein the nonlinear components are simulated neurons. (Eliasmith teaches neuron model components may be simulated, in [0080].)

Regarding claim 25, the rejection of claim 24 is incorporated and Eliasmith in combination with Ka, further teaches the method of claim 24:
wherein the neurons are spiking neurons. (Eliasmith teaches the neurons are spiking neurons in [0046].)

Alternatively, claims 1 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Eliasmith et al. (US Patent Application Publication No. 2014/0156577, hereinafter ‘Eliasmith’), in view of Rasmussen et al. (NPL: “A neural reinforcement learning model for tasks with unknown time delays”, hereinafter ‘RasC’) and in further view of Rasmussen et al. (NPL: “A neural model of hierarchical reinforcement learning”, hereinafter ‘RasE’).

	
Regarding independent claim 1 limitations, Eliasmith teaches a system implementing reinforcement learning:
the system comprising a computer processor and a computer readable medium having computer executable instructions executed by said processor; said computer readable medium including instructions for providing: (Eliasmith teaches the use of a non-transitory computer-readable storage medium configured to execute computer programs that is instructions, in [0043]: Each program may be implemented in a high level procedural or object oriented programming or scripting lan­guage, or both, to communicate with a computer system… Each such computer program may be stored on a storage media or a device ( e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. Embodiments of the system may also be considered to be implemented as a non-transitory computer-readable stor­age medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein, that are executed by the computer processor, in [0041]: The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. These embodiments may be implemented in computer programs executing on program­mable computers, each computer including at least one pro­cessor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication inter­face.)
a neural network including a plurality of synapses; (Eliasmith teaches the model of the brain structure as a neural network including a plurality of connection synapses, in [0090]: ... FIG. 3B is a schematic block diagram of the Spaun system that contains elements analogous to those highlighted in FIG. 3A. Lines terminating in circles indicate connections with neurons that produce output simulating the effects of gamma-Aminobutyric acid (GABA) at their output, so-called GABAergic (inhibitory) connections or synapses. Lines terminating in open squares indicate modulatory activity emulating dopaminergic (adaptive) connections.)
an action values module that receives environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries the task, that is the action value module, of receiving input in the form of internal states, in [0152]: In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the input comprises an action value based on an error signal to receive an signals from the environment as observed actions, considered an environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal, in [0072]:  Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; where the system tasks are carried out as executable computer program instructions, in [0041].)
an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module as computer program code coupled using a computer system for implementing programming instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output information… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
an error calculation module coupled to both the action values and action selection module, which computes an error signal used to update state and/or action values in the action values module based…; (Eliasmith teaches an error calculation module coupled to both the action values and action selection module, as computer programs coupled to a computing system, in [0041]-[0043, for executing operations used to generate error signal of the observed result and the desired results associated with a task (action values) where the results are selected to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task ( e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history., and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. 
For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200., and the system tasks are carried out as executable computer program instructions, in [0041], comprising a multiple computer programs coupled to communicate with a computer system, in [0043].)
 a learning sub-module wherein (Eliasmith teaches computer executable instruction programs, in [0041], as executable programs in a computer system, [0046].)
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one synaptic weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. ...; where the nonlinear component computes, that is, generates, semantic pointers that are vector representations associated with the connections (that include an output signal) between networks, in [0046]; where the neural connection synaptic weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the synaptic weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier such as in image processing tasks, that is, the function that facilitates the outputs based on the connection synaptic weights and input vector, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. 
In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the state and/or action values being updated are separated from the reward signal…; (Eliasmith teaches the output of one or more actions (updated action values) or decisions to carry out a task as a way of maximizing a real or perceived reward (separated reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the system uses reinforcement learning to associate rewards with task activity trials, in [0145]: Reinforcement learning-Perform a three-armed bandit task, in which it is determined which of three possible choices generates the greatest stochastically generated reward. Reward contingencies can change from trial to trial.)
the learning sub-module is configured to update connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, [0046] configured to execute functions as updating the connection weights, that is learning based on initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082] : … The artificial neu­rons are formed into networks of neurons with interconnec­tions with varying weights, which can be regulated to disin­hibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neuro­chemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9),... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule; (Eliasmith teaches updating action values by training and learning rule based on a plausible spike-based rule neural learning rule and the use of neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate update the action values including the neuron connection weights and error signals generated by user observation action for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal. pone.0022885 (2011 ) … and  in [0071]-[0072]:  All such transformations are updateable by error signals, which may come from the action selection compo­nent, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated exter­nally or internally. External error signals are those that pro­vide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. 
One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history.; such as updating action value as draw actions based on the reinforcement learning rule and learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
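For illustration only, the kind of three-armed bandit task quoted above can be sketched as a simple value learner driven by a reward prediction error. The epsilon-greedy selection, the learning rate, the reward probabilities, and the name `run_bandit` are assumptions made for this sketch; it is a non-neural simplification and does not reproduce the spiking model described in Eliasmith.

```python
import random


def run_bandit(reward_probs, trials=2000, learning_rate=0.1,
               epsilon=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # one value estimate per action
    for _ in range(trials):
        if rng.random() < epsilon:
            # Occasionally explore a random action.
            action = rng.randrange(len(values))
        else:
            # Otherwise choose the action with the highest estimate.
            action = max(range(len(values)), key=lambda a: values[a])
        # Stochastic reward from the environment.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Reward prediction error for the chosen action only.
        error = reward - values[action]
        values[action] += learning_rate * error
    return values


# Hypothetical reward contingencies: the second action pays off most often.
values = run_bandit([0.1, 0.9, 0.2])
best_action = max(range(len(values)), key=lambda a: values[a])
```

As in the quoted passage, a run of unrewarded trials lowers the estimate for the chosen action, which can temporarily shift the choice to another action.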
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task, as depicted in Fig. 6A, in [0027]; where the image inputs are captured as discrete data pixels over a time associated with the observation input, as depicted in Fig. 13, in [0172].)
the input to the system is one of a scalar and a multidimensional vector. (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048] & [0098].)
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], and the use of reinforcement learning techniques to associate rewards with task activity trials, in [0145], based on error signals, in [0071]-[0072], Eliasmith does not expressly teach the following claim 1 limitations:
computes an error signal used to update state and/or action values … based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, r_t is the reward signal at timestep t, and γ is a discount factor;
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions;
RasC does teach claim 1 limitation:
computes an error signal used to update state and/or action values … based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s and is greater than 1, r_t is the reward signal at timestep t, and γ is a discount factor; (RasC teaches computing the learning error using an integrator for computing the discount as an integrative discount, as depicted in Fig. 2:

[Figure image: media_image7.png]

where the error is computed as the temporal difference error in equation (3) that is based on the recited claimed equation (12), in pgs. 3259-3260: Sec. Learning & Sec. Error Calculation: “The basic process of TD reinforcement learning involves updating the agent’s estimation of the value of each action, the Q values…For this model, the error is the desired change in the Q value, i.e., ∆Q(s, a) from Equation 4… The previous section raises the question of where the error, E, comes from. That is, how is Equation 4 computed? The network that performs this calculation is shown in Figure 2. Note that this is the E component shown in Figure 1, and receives the inputs shown there (the Q value of the selected action, and the reward from the environment). One challenge is the integration of the incoming reward (the summation in Equation 4). This is accomplished by the top-right component in the network… When a state transition occurs, the bottom population will then be representing the value of the selected action in the new state, Q(s’,a’), while the “stored value” population maintains Q(s,a). The discount is calculated by integrating the value represented in the “stored value” population, using the same recurrent setup as is used to integrate the incoming reward. This value is then subtracted from the current Q input to calculate a discounted action value. This is not identical to the discount expressed in Equation 4, but it has a similar computational effect: it reduces the value of future states proportional to the time elapsed and the value of the state. The final “error” [δ(s,a)] population thus has all the pieces it needs to compute the SMDP learning update. It adds the accumulated reward [∑r_t summed/integrated over a delay period τ] and the discounted Q(s’,a’) value, and subtracts the stored Q(s,a) value [Q(s,a) + ∑γ Q(s,a); where the summation is over delay period τ], resulting in the error signal required by the neural learning rule (Equation 8)”.)
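For illustration only, the three terms named in the quoted passage (accumulated reward over the delay period, discounted value of the later state/action, and stored value of the earlier state/action) can be combined as in the sketch below. The sketch assumes the textbook multiplicative discount of semi-Markov temporal difference learning, which the quoted passage itself notes is not identical to the integrative discount used in the RasC network; the function name `smdp_td_error` and all numeric values are hypothetical, and the sketch is not a reproduction of the claimed equation.

```python
def smdp_td_error(rewards, q_current, q_next, discount=0.9):
    # rewards: the reward received at each timestep of the delay period,
    # so the delay period tau is the number of entries.
    tau = len(rewards)
    # Accumulated reward, with each timestep discounted (assumed form).
    accumulated = sum((discount ** t) * r for t, r in enumerate(rewards))
    # Add the discounted later value and subtract the stored earlier value.
    return accumulated + (discount ** tau) * q_next - q_current


# Illustrative values only: reward arrives on the third timestep (tau = 3).
error = smdp_td_error([0.0, 0.0, 1.0], q_current=0.5, q_next=1.0,
                      discount=0.5)
```

With these values the accumulated reward is 0.25 and the discounted later value is 0.125, so the error is 0.25 + 0.125 - 0.5 = -0.125.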
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions; (RasC teaches the use of a temporal difference reinforcement learning method where the state/action values, depicted as s and a values respectively, are separated by a reward signal by the reward module, as depicted in Fig. 1:

[Figure image: media_image7.png])


The Eliasmith and RasC references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith to integrate an intermediate action associated with a state and reward signal, so as to learn system learning parameters and values using a temporal difference (TD) reinforcement learning method, as disclosed by RasC.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes temporal difference error, yielding predictable results based on reinforcement learning models that are more sophisticated models of human decision making (RasC, Abstract).
Additionally, RasE teaches delay period τ as a known factor for computing the error calculations, in pg. 1253: Sec. Hierarchical reinforcement learning:

[Figure image: media_image8.png]

The Eliasmith, RasC, and RasE references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning methods collectively disclosed by Eliasmith and RasC to use a time delay in computing the error as a TD computation in a reinforcement learning method, as disclosed by RasE.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes temporal difference error, yielding predictable results based on reinforcement learning models that are more sophisticated models of human decision making and that use the time delays to encapsulate the activity of the subpolicy (RasE, pg. 1253: Sec. Hierarchical reinforcement learning).

Examiner notes that all claimed modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Regarding independent claim 14 limitations, Eliasmith teaches a computer implemented method for reinforcement learning comprising:
receiving by an action values module stored on a computer readable medium environmental state as input, containing at least one adaptive element that learns state and/or action values based on an error signal; (Eliasmith teaches the system that carries the task, that is the action value module, of receiving input in the form of internal states, in [0152] In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and pro­ducing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward; where the system tasks are carried out as executable computer program instructions, in [0041] The embodiments of the systems and methods described herein may be implemented in hardware or soft­ware, or a combination of both. These embodiments may be implemented in computer programs executing on program­mable computers, each computer including at least one pro­cessor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication inter­face.)
providing on the computer readable medium an action selection module coupled to the action values module; (Eliasmith teaches the action selection module component that is coupled to the action values module, depicted as the visual input having internal state levels defined by the information encoding process as depicted in Fig. 3B, in [0091]; where the process is provided using a non-transitory computer-readable storage medium configured to execute computer programs, that is instructions for implementing programming instructions, in [0041]-[0043]: The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. … Program code is applied to input data to perform the functions described herein and to generate output information… Each such computer program may be stored on a storage media or a device (e.g., ROM, magnetic disk, optical disc), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein…)
computing an error signal to update state and/or action values in the action values module by a calculation module coupled to both the action values and action selection module; the update being based on a rewards signal …; (Eliasmith teaches generating an error signal of the observed result and the desired results associated with a task (action values) where the results are selected (computing error to update action values) to produce the highest immediate reward based on past history (based on reward signal), in [0072]: Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; and as depicted in Fig. 3 the reward evaluator is used to determine the reward (reward signal) associated with an input and the determination of one or more actions to carry out a task for maximizing a real or perceived reward (based on a reward signal), in [0152]: …In general, the described systems and methods are capable of carrying out tasks that involve receiving input, for example in the form of internal or external stimuli, manipulating one or more internal states or representations of the input, and producing an output, where output can be the end result of the task or an intermediary step. For example, output may be in the form of one or more actions or decisions. 
In some cases, the system may be configured to carry out tasks as a way of maximizing a real or perceived reward…; where the error module couples to the action selection controller module that influences the routing of information in the system, in [0073]: Action selection controller 220 influences routing of information throughout the system 200; and the system tasks are carried out as executable computer program instructions, in [0041], comprising multiple computer programs coupled to communicate with a computer system, in [0043].)
wherein
each module or sub-module comprises a plurality of nonlinear components, wherein each nonlinear component is configured to generate a scalar or vector output in response to the input and is coupled to the output module by at least one weighted coupling; (Eliasmith teaches modules as executable programs in a computer system that comprise artificial neurons, as the plurality of nonlinear components, in [0046]: Components of the system can perform processing, and communicate, using artificial neurons that implement neural networks. In some cases, non-neural data may also be communicated without the use of artificial neurons. In some cases, one or more components may be implemented without the use of artificial neurons (e.g., motor controls in some embodiments). In the example embodiments presented herein, the artificial neurons are spiking, although non-spiking may also be used. The connections between these networks can be used to compute "semantic pointers", which model compressed representations of the activity of neural networks. Semantic pointers are vector representations that can be thought of as elements of a neural vector space, and can implement a form of abstraction level filtering or "compression", in which high-dimensional structures can be abstracted…; where the nonlinear component computes, that is generates, semantic pointers that are vector representations associated with the connections (that includes an output signal) between networks, in [0046]; where the neural connection weights are coupled to allow computation of functions that facilitate the output responses to the inputs to be expressed as vectors, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights...)
the output from each nonlinear component is weighted by the connection weights of the corresponding weighted couplings and the weighted outputs are provided to the output module to form the output modifier; (Eliasmith teaches that the output from each neural component is weighted by neural connection weights coupled to allow computation of functions that facilitate the output responses to form an output modifier, such as in image processing tasks, that is the function that facilitates the outputs based on the synaptic connection weights and input vectors, in [0081] & [0098]-[0099]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011), the entire contents of which are hereby incorporated by reference. In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights. & [0098]-[0099]: …Finally, a further hierarchical level can be considered analogous to the inferior temporal cortex (IT). Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach. In one preferred neural engineering framework (NEF), the connections between these layers define the vector space transformations that can be implemented in connection weights between the layers. 
In an example embodiment, training is performed prior to implementation. However, in other embodiments, learning and training may occur during operation…)
the input to the system is either discrete or continuous in time and space; and, (Eliasmith teaches the input to the system as an input of images associated with a drawing task as depicted in Fig. 6A, in [0027]: FIG. 6A is an array of original input images for a copy drawing task; where the image inputs are captured as discrete data pixels captured over a time associated with the observation input as depicted in Fig. 13, in [0172]: FIG. 13 illustrates the performance of system 300 in a fluid reasoning task. The spiking activity encoding the currently inferred rule is shown in the VMPFC row. This is a running average of the inverse convolution (i.e., the inferred transformation) between representations in DLPFC1 and DLPFC2, as appropriate. The time course of the systems' activity can be observed, in which the system infers that the pattern in the input is "increase the number of elements by one" (see DLPFC2 row, for example)
[Figure image: img-media_image5.png]
[Figure image: img-media_image6.png]
)
the input to the system is one of a scalar and a multidimensional vector (Eliasmith teaches the input to the system as the image represented as a multidimensional vector, in [0048]: For example, an image of the numeral "2" to be processed may be input as a 28x28 matrix of pixels. This image is at first represented as a 784-dimensional vector... For example, 50 dimensions may be used to represent underlying conceptual features of the image… & [0098]-[0099] … Each hierarchical layer generally generates progressively lower-dimensional semantic pointers, with the result that a 784-dimensional representation at the input can be reduced to a 50-dimensional representation in the IT layer. [0099] The visual compression hierarchy network learns the compression needed to reduce any input image to a 50-dimensional (50D) semantic pointer. Each of the hierarchical layers define vector spaces that can be embedded into spiking neurons using a functional neural network approach…)
updating connection weights based on the initial output and the outputs generated by the nonlinear components; (Eliasmith teaches the learning sub-module as a part of the computer executable instruction programs, in [0041], as executable programs in a computer system, in [0046], configured to execute functions such as updating the connection weights, that is learning based on the initial output and the outputs generated by the nonlinear neuron components of the artificial neurons in the network, in [0081]-[0082]: … The artificial neurons are formed into networks of neurons with interconnections with varying weights, which can be regulated to disinhibit (that is, allow) communication between neurons or to inhibit such communication, as is the case in their biological counterparts. In general, the artificial neurons are responsive to control signals that approximate the functions of neurochemicals… The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9),... In at least some embodiments, other more efficient optimization methods can be used to determine the synaptic weights…)
the action values are updated based on a change in the synaptic weights on the output of neural populations and the change in synaptic weights is based on a given error signal and is computed based on a neural learning rule. (Eliasmith teaches updating action values by training with a biologically plausible spike-based neural learning rule, where the neural components are weighted by neural connection weights coupled to allow computation of functions that facilitate updating the action values, including the neuron connection weights, and error signals generated by observation of actions for learning to choose an option, out of a set of available choices, in [0081]: The use of neural connection weights ("synaptic weights") allows the computation of particular functions, where inputs and outputs can be expressed as vectors. The neural connection weights can be learned with a biologically plausible spike-based rule as described by D. MacNeil, C. Eliasmith, PLoS ONE 6(9), e22885, doi:10.1371/journal.pone.0022885 (2011) … and in [0071]-[0072]: All such transformations are updateable by error signals, which may come from the action selection component, or which may be internally generated. Error signals generally guide the learning of the transformation modules between two populations. Most often this may facilitate adjusting the connection weights between neurons within each module, but error signals may also be applied to adjust transformations at the level of the semantic pointers. Error signals themselves can be generated externally or internally. External error signals are those that provide feedback from the environment to the system about the error in its responses. Internal error signals are those that are generated within the system by observing the results of its actions, calculating an error between an actual result and the desired result, and using that error internally generated error to guide learning. 
One example of the use of error signals is in a "bandit" task (e.g., modeled on the "one-armed bandit" casino machine) where the system can learn to choose an option, out of a set of available choices, that results in what is anticipated to produce the highest immediate reward based on past history; such as updating the action value as draw actions based on the reinforcement learning rule and the learned model with the updated weights based on the error signal, in [0169]: FIG. 11A illustrates the reward and behavioral time course of system 300 in a reinforcement learning task. In the illustrated example, the task is a three-armed bandit task. Here the best action was to draw a "2". After some incorrect guesses, this contingency is learned by the model at the beginning of the task. However, it can be observed that two "unlucky" rewards at the end of the trial (at 9 s and 11 s) cause the "utility" trace (a decoding of the Str activity) to decrease, and hence the system chooses a "1" for its next guess. The reward prediction error signal is shown separately for each of the three possible actions (this can also be thought of as a reward vector, which is the decoding of a subset of vStr activity). As can be seen, "error 2" decreases as the trial proceeds, until the unlucky rewards occur.)
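For illustration only, the bandit behavior described in the passages quoted above from Eliasmith [0072] and [0169] can be sketched as a simple value update driven by a reward prediction error. This is a minimal sketch under stated assumptions, not Eliasmith's spiking implementation: the function name run_bandit, the epsilon-greedy choice rule, the learning rate, and the reward probabilities are all hypothetical and do not appear in the cited reference.

```python
import random


def run_bandit(reward_probs, trials=500, learning_rate=0.1, epsilon=0.1, seed=0):
    """Learn one value per arm from a reward prediction error.

    All parameters are hypothetical illustration values, not taken from
    the cited reference.
    """
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # estimated value of each action
    for _ in range(trials):
        # Assumed epsilon-greedy selection: mostly pick the best-valued arm.
        if rng.random() < epsilon:
            action = rng.randrange(len(values))
        else:
            action = max(range(len(values)), key=lambda i: values[i])
        # Stochastic binary reward from the chosen arm.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Reward prediction error: observed reward minus current estimate.
        error = reward - values[action]
        # Move the chosen action's value a fraction of the way toward the reward.
        values[action] += learning_rate * error
    return values


# Three-armed bandit in which the second arm pays off most often.
values = run_bandit([0.1, 0.8, 0.1])
```

Because each update is a weighted average of the old estimate and a reward of 0 or 1, every learned value stays between 0 and 1; a run of unrewarded pulls lowers the chosen arm's value, which mirrors the "unlucky" rewards discussed in [0169].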
While Eliasmith teaches the use of learning methods to update the connection weights based on input and output information using local methods, in [0081]-[0082], Eliasmith does not expressly teach the following claim 14 limitation:
the update being based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s' and is greater than 1, r_t is the reward signal at timestep t, and γ is a discount factor; 
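For readability, the equation that appears above only as an image (media_image1.png) can be compared against the standard semi-Markov decision process (SMDP) temporal difference update. The following is a reconstruction from the symbol definitions recited in the limitation, not a copy of the image; the learning rate α and the summation starting at t = 0 are assumptions that do not appear in the recited definitions.

```latex
\Delta Q(s,a) = \alpha \left[ \sum_{t=0}^{\tau-1} \gamma^{t} r_{t} + \gamma^{\tau} Q(s',a') - Q(s,a) \right]
```

Under this form, the bracketed term is the error signal: the discounted reward accumulated over the τ intervening timesteps, plus the discounted value of the later state and action, minus the stored value of the earlier state and action.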
RasC does teach claim 14 limitation:
computes an error signal used to update state and/or action values … based on the equation
[Equation image: media_image1.png]

where Q(s,a) is the value of taking action a in state s and Q(s', a') is the value of taking action a' in state s', where a' and s' are states and actions occurring some number of timesteps after a and s, τ is the number timesteps separating a/s from a'/s' and is greater than 1, r_t is the reward signal at timestep t, and γ is a discount factor; (RasC teaches computing the learning error using an integrator for computing the discount as an integrative discount, as depicted in Fig. 2:

[Figure image: media_image7.png]

Where the error is computed as the temporal difference error in equation (3) that is based on the recited claimed equation (12), in pgs. 3259-3260: Sec. Learning & Sec. Error Calculation: "The basic process of TD reinforcement learning involves updating the agent's estimation of the value of each action, the Q values…For this model, the error is the desired change in the Q value, i.e., ∆Q(s, a) from Equation 4… The previous section raises the question of where the error, E, comes from. That is, how is Equation 4 computed? The network that performs this calculation is shown in Figure 2. Note that this is the E component shown in Figure 1, and receives the inputs shown there (the Q value of the selected action, and the reward from the environment). One challenge is the integration of the incoming reward (the summation in Equation 4). This is accomplished by the top-right component in the network… When a state transition occurs, the bottom population will then be representing the value of the selected action in the new state, Q(s',a'), while the "stored value" population maintains Q(s,a). The discount is calculated by integrating the value represented in the "stored value" population, using the same recurrent setup as is used to integrate the incoming reward. This value is then subtracted from the current Q input to calculate a discounted action value. This is not identical to the discount expressed in Equation 4, but it has a similar computational effect: it reduces the value of future states proportional to the time elapsed and the value of the state. The final "error" [δ(s,a)] population thus has all the pieces it needs to compute the SMDP learning update. It adds the accumulated reward [∑r_t summed/integrated over a delay period τ] and the discounted Q(s',a') value, and subtracts the stored Q(s,a) value [Q(s,a) + ∑γ Q(s,a); where the summation is over delay period τ], resulting in the error signal required by the neural learning rule (Equation 8)".)
the state and/or action values being updated are separated from the reward signal by one or more intermediate states and/or actions; (RasC teaches the use of a temporal difference reinforcement learning method where the state/action values, depicted as s and a values respectively, are separated from the reward signal by the reward module as depicted in Fig. 1:
[Figure image: media_image7.png])
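For illustration only, the error computation that RasC describes in Sec. Error Calculation (accumulate the reward over the delay period τ, add the discounted Q(s',a'), and subtract the stored Q(s,a)) can be sketched with the standard discrete SMDP discount. This is a minimal sketch under stated assumptions: the function name smdp_td_error and the numeric values are hypothetical, and the exponential discount used here is the textbook form, which RasC itself notes is not identical to its integrative discount.

```python
def smdp_td_error(q_sa, q_next, rewards, gamma):
    """Temporal difference error over a multi-step (SMDP) transition.

    q_sa    -- stored value Q(s, a) of the earlier state and action
    q_next  -- value Q(s', a') of the later state and action
    rewards -- rewards r_0 .. r_(tau-1) received during the delay period
    gamma   -- discount factor (textbook exponential discount, an assumption)
    """
    tau = len(rewards)  # number of timesteps separating (s, a) from (s', a')
    # Accumulated reward, with each timestep discounted by gamma ** t.
    discounted_reward = sum((gamma ** t) * r for t, r in enumerate(rewards))
    # Add the discounted later value and subtract the stored earlier value.
    return discounted_reward + (gamma ** tau) * q_next - q_sa


# Hypothetical three-step transition in which only the last step is rewarded.
error = smdp_td_error(q_sa=1.0, q_next=2.0, rewards=[0.0, 0.0, 1.0], gamma=0.5)
```

In this example the reward arrives two timesteps after the action whose value is being updated, so the updated value is separated from the reward by intermediate steps, which is the situation the limitation addresses.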


Eliasmith and RasC references would have been recognized by those of ordinary skill in the art as useful for applicant's purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method disclosed by Eliasmith to integrate an intermediate action associated with a state and reward signal, so as to learn system parameters and values using a temporal difference (TD) reinforcement learning method, as disclosed by RasC.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes a temporal difference error, yielding predictable results based on reinforcement learning models that are more sophisticated models of human decision making (RasC, Abstract).
Additionally, RasE teaches the delay period τ as a known factor for computing the error calculations, in pg. 1253: Sec. Hierarchical reinforcement learning:

[Figure image: media_image8.png]

Eliasmith, RasC, and RasE references would have been recognized by those of ordinary skill in the art as useful for applicant's purpose in developing reinforcement learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the reinforcement learning method collectively disclosed by Eliasmith and RasC to use a time delay in computing the TD error, as disclosed by RasE.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to improve reinforcement learning tasks by constructing a reinforcement learning model that computes a temporal difference error, yielding predictable results based on reinforcement learning models that are more sophisticated models of human decision making and that use time delays to encapsulate the activity of the subpolicy (RasE, pg. 1253: Sec. Hierarchical reinforcement learning).

Examiner notes that all modules are interpreted as computer executable instruction programs as taught by Eliasmith in [0041].

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure and is listed below:
Rasmussen (NPL: "Hierarchical reinforcement learning in a biologically plausible neural architecture", hereinafter 'Ras'): teaches the use of the error equation as claimed in equation (12).
Tan et al. (NPL: "Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback"): teaches the use of TD error in reinforcement learning.
Mnih et al. (US Pub. No. 2015/0100530): teaches methods and systems for reinforcement learning using action-values as the Q-value for training neural networks by updating weight values with hidden layers for mapping instances in a hierarchical manner; and teaches a reinforcement learning framework as a learning method for Markov Decision Processes using state sequences and action observations per time-step, using Q-learning algorithms based on the Bellman equation for optimizing the action-value function Q(s',a'), where the assumption is that Q(s',a') is known for the possible actions a', in [0049]-[0055].
Bouvier et al. (NPL: "Spiking Neural Networks Hardware Implementations and Challenges: A Survey"): teaches the schematic representation of a biological neuron with synapses that receive and process information from external pre-synaptic and post-synaptic neurons.
Eleftheriou et al. (US Pub. No. 2016/0267379): teaches the synapse in an artificial neural network as nonlinear connection components.
Taylor et al. (NPL: "Comparing evolutionary and temporal difference methods in a reinforcement learning domain"): teaches a temporal difference reinforcement learning method that uses a state-action-reward-state-action (SARSA) approach to estimate the action value function (updating action values), where the action values are separated from the immediate reward (reward signal) by an intermediate action state chosen subsequent to a state s, in pg. 1323: Sec. 2.2: Sarsa.
Petroff (US Pub. No. 2009/0327011): teaches reinforcement learning algorithms including state-action-reward-state-action (SARSA), where the state action occurs before feedback is provided, in [0036].
Rom (US Pub. No. 2010/0145402): teaches Hebbian synapse states that are used for learning and training in a reinforcement learning scheme based on post neuron spikes.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516.  The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Ann Lo, can be reached at (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 






/O.O.A./Examiner, Art Unit 2126     
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126