DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) use(s) a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “a first input to receive an observation”, “at least one actor neural network, coupled to receive the state data and configured to define a policy function mapping the state data to action data defining an action”, “at least one critic neural network, coupled to receive the action data, the state data, and return data derived from the reward data, and configured to define a value function which generates an error signal”, “a replay buffer to store reinforcement learning transitions”, “a second input to receive training data defining demonstration transition data”, and “the neural network system is configured to train the at least one actor neural network and the at least one critic neural network off-policy using the error signal” in claim 1; “a sample selection system to sample the reinforcement learning transitions” in claim 4; “the neural network system is configured to update the learning critic neural network off-policy”, “the system is configured to update the learning actor neural network”, and “the system is configured to, at intervals, update weights of the target actor neural network” in claim 7; and “a safety controller to impose safety or other constraints on the action data” in claim 9.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-18, 20, and 22 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-9 of co-pending Application No. 17/962008 (reference application).  Although the claims at issue are not identical, they are not patentably distinct from each other because they are obvious variants of the same invention.  Furthermore, the claims of the co-pending application anticipate the claims of the instant application.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Claims of co-pending application (claims 1-9, listed first), followed by claims of instant application (claims 1-9, listed second)

1. A computer-implemented method for controlling an agent interacting with an environment to perform a task, comprising: receiving an observation comprising input state data characterizing a state of the environment; processing the input state data using an actor neural network to generate an output including output action data defining an output action; and controlling the agent to perform the output action; wherein the actor neural network has been trained jointly with a critic neural network that is configured to define a value function that generates an error signal based on an input comprising action data defining an action, state data characterizing the state of the environment, and return data derived from reward data representing a reward from the action performed in a training process; wherein the training process has been performed based on training data including data from a demonstration of the task within the environment that includes demonstration transition data for a series of demonstration transitions including demonstration examples of the state data, the action data, the reward data, and new state data representing a new state; wherein during the training process, the actor neural network has been used to operate on the environment to generate operation transition data comprising operational examples of the state data, the action data, the reward data and the new state data, and the actor neural network and the critic neural network have been trained off-policy using the error signal and using stored tuples sampled from a replay buffer comprising tuples from both the operation transition data and the demonstration transition data.

2. The method of claim 1, wherein the reward comprises a sparse reward that has a plurality of discrete values dependent upon the state of the environment.  

3. The method of claim 1, wherein only a minority subset of states of the environment provide the reward.  

4. The method of claim 1, wherein during the training process, the stored tuples have been sampled from the replay buffer for training the actor neural network and the critic neural network according to a sampling probability that prioritizes sampling of tuples of the demonstration examples.  
5. The method of claim 1, wherein the return data comprises a combination of the reward data and values from the critic neural network obtained from an (n-1)-step forward rollout of actions selected using the actor neural network; and wherein the training process employs at least two different values of n to train the network.  


6. The method of claim 1, wherein during the training process, the critic neural network has been trained using return data which comprises a mix of 1-step and n-step returns.  

7. The method of claim 1, wherein during the training process: weights of a learning critic neural network have been updated off-policy using the error signal determined from a target actor neural network and a target critic neural network, and updating weights of a learning actor neural network using a deterministic policy gradient comprising a product of a gradient of the output of the learning critic neural network and a gradient of the output of the learning actor neural network evaluated using the stored tuples of both the operation transition data and the demonstration transition data; wherein: weights of the target actor neural network have been updated at intervals using the learning actor neural network; and weights of the target critic neural network have been updated at intervals using the learning critic neural network.  






8. The method of claim 1, wherein the training data comprises kinesthetic teaching data from manipulation of a mechanical system.  


9. The method of claim 1, further comprising a safety controller to impose safety or other constraints on the action data.



1. (Original) An off-policy reinforcement learning actor-critic neural network system, to select actions to be performed by an agent interacting with an environment to perform a task, the system comprising: a first input to receive an observation comprising state data characterizing a state of the environment, and reward data representing a reward from operating with an action in the environment; at least one actor neural network, coupled to receive the state data and configured to define a policy function mapping the state data to action data defining an action, wherein the at least one actor neural network has an output to provide the action data for the agent to perform the action, and wherein the environment transitions to a new state in response to the action; at least one critic neural network, coupled to receive the action data, the state data, and return data derived from the reward data, and configured to define a value function which generates an error signal; a replay buffer to store reinforcement learning transitions comprising operation transition data from operation of the system, wherein the operation transition data comprises tuples of said state data, said action data, said reward data and new state data representing said new state; and a second input to receive training data defining demonstration transition data, the demonstration transition data comprising a set of said tuples from a demonstration of the task within the environment, wherein reinforcement learning transitions stored in the replay buffer further comprise the demonstration transition data; and wherein the neural network system is configured to train the at least one actor neural network and the at least one critic neural network off-policy using the error signal and using stored tuples from the replay buffer comprising tuples from both the operation transition data and the demonstration transition data.  

2. (Original) The system as claimed in claim 1 wherein said reward comprises a sparse reward which has a plurality of discrete values dependent upon the state of the environment. 
 
3. (Previously Presented) The system as claimed in claim 1 wherein only a minority subset of states of the environment provide the reward.  
4. (Previously Presented) The system as claimed in claim 1, further comprising a sample selection system to sample the reinforcement learning transitions according to a sampling probability, wherein the sampling probability prioritizes sampling of the demonstration transition data tuples.  
5. (Previously Presented) The system as claimed in claim 1, wherein the return data comprises a combination of the reward data and values from the critic neural network obtained from an (n-1)-step forward rollout of actions selected using the actor neural network; and wherein the system is configured to employ at least two different values of n to train the network.  

6. (Previously Presented) The system as claimed in claim 1 configured to train the critic neural network using return data which comprises a mix of 1-step and n-step returns.  

7. (Previously Presented) The system as claimed in claim 1 comprising learning and target actor neural networks and learning and target critic neural networks, wherein the neural network system is configured to update the learning critic neural network off-policy using the error signal, wherein the error signal is derived from the target critic target neural network, the target actor neural network, and the stored tuples of both the operation transition data and the demonstration transition data; wherein the system is configured to update the learning actor neural network using a deterministic policy gradient comprising a product of a gradient of the output of the learning critic neural network and a gradient of the output of the learning actor neural network evaluated using the stored tuples of both the operation transition data and the demonstration transition data; and wherein the system is configured to, at intervals, update weights of the target actor neural network using the learning actor neural network and to update weights of the target critic neural network using the learning critic neural network.  
8. (Previously Presented) The system as claimed in claim 1 wherein the training data comprises kinesthetic teaching data from manipulation of a mechanical system. 
 
9. (Previously Presented) The system as claimed in claim 1 further comprising a safety controller to impose safety or other constraints on the action data.
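
The replay buffer arrangement recited in claims 1 and 4 above (a single buffer holding both demonstration and operation tuples, sampled with a probability that prioritizes the demonstration tuples) can be illustrated with a minimal sketch. The sketch is not taken from either application: the class name `MixedReplayBuffer`, the `demo_priority` weight, and the fixed per-tuple weighting scheme are assumptions chosen only to show one way such prioritization could work.

```python
import random
from collections import namedtuple

# Tuple layout follows the claim language: state, action, reward, new state.
Transition = namedtuple("Transition", ["state", "action", "reward", "new_state"])


class MixedReplayBuffer:
    """Hypothetical buffer storing demonstration and operation tuples together."""

    def __init__(self, demo_priority=3.0, seed=0):
        # demo_priority is an assumed weight; a value above 1.0 makes each
        # demonstration tuple more likely to be drawn than an operation tuple.
        self.demo_priority = demo_priority
        self.items = []    # stored transitions
        self.weights = []  # unnormalized sampling weight of each transition
        self.rng = random.Random(seed)

    def add(self, transition, is_demo=False):
        """Store one tuple, tagged as demonstration or operation data."""
        self.items.append(transition)
        self.weights.append(self.demo_priority if is_demo else 1.0)

    def sampling_probability(self, index):
        """Probability that a single draw returns the tuple at `index`."""
        return self.weights[index] / sum(self.weights)

    def sample(self, batch_size):
        """Draw a batch with replacement, proportionally to the weights."""
        return self.rng.choices(self.items, weights=self.weights, k=batch_size)
```

With one demonstration tuple and one operation tuple stored and `demo_priority=3.0`, the demonstration tuple is drawn with probability 3/4 and the operation tuple with probability 1/4.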

Claims 10-20 are similar to the above claims.
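
The return data recited in claims 5 and 6 of both applications (rewards combined with critic values from a forward rollout, using a mix of 1-step and n-step returns) can likewise be illustrated with a minimal sketch. The function names, the discount factor `gamma`, and the equal weighting of the two squared errors are assumptions made for illustration; neither application is quoted here as specifying them.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of n rewards plus the discounted critic estimate.

    rewards: the n rewards collected along the rollout (n = len(rewards)).
    bootstrap_value: assumed critic estimate for the state reached after n steps.
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


def mixed_critic_loss(q_value, one_step, n_step, n_step_weight=0.5):
    """Weighted sum of squared errors against the 1-step and n-step returns.

    n_step_weight is an assumed mixing coefficient between 0 and 1.
    """
    one_step_error = (one_step - q_value) ** 2
    n_step_error = (n_step - q_value) ** 2
    return (1.0 - n_step_weight) * one_step_error + n_step_weight * n_step_error
```

For example, with rewards `[1.0, 0.0, 2.0]`, `gamma=0.5`, and a bootstrap value of 4.0, the 3-step return is 1.0 + 0.0 + 0.5 + 0.5 = 2.0. A 1-step return from the same starting point with a bootstrap value of 3.0 is 1.0 + 1.5 = 2.5.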


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows: 
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-9 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.

Although claims 1-9 appear to fall within a statutory category (i.e., a neural network system), claims 1-9 encompass nothing more than data structures, because a neural network system is nothing more than a computer program.  Also, it is not clear whether the “replay buffer” is an allocation within the neural network for learning transitions or a physical memory.  The specification never makes it clear what form the memory takes.  Therefore, it is reasonable to construe the “replay buffer” as a data structure of the neural network instead of some physical memory.  Furthermore, it is well known in the art that buffers can be implemented by using a virtual data buffer in software.  Thus, claims 1-9 are directed to non-statutory subject matter because their scope includes a computer program embodiment, an abstract data structure which does not fall within one of the four statutory categories.  See also MPEP § 2106.IV.B.1.a.  Data structures not claimed as embodied in computer readable media are descriptive material per se and are not statutory because they are not capable of causing functional change in the computer.  See, e.g., Warmerdam, 33 F.3d at 1361, 31 USPQ2d at 1760 (claim to a data structure per se held nonstatutory).  Such claimed data structures do not define any structural and functional interrelationships between the data structure and other claimed aspects of the invention, which permit the data structure's functionality to be realized.  In contrast, a claimed computer readable medium encoded with a data structure defines structural and functional interrelationships between the data structure and the computer software and hardware components which permit the data structure's functionality to be realized, and is thus statutory.
Similarly, computer programs claimed as computer listings per se, i.e., the descriptions or expressions of the programs, are not physical “things.”  They are neither computer components nor statutory processes, as they are not “acts” being performed.  Such claimed computer programs do not define any structural and functional interrelationships between the computer program and other claimed elements of a computer, which permit the computer program's functionality to be realized.
Claims 1-18, 20, and 22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Independent claim 1 recites “… receive an observation …”, “… neural network … define a policy function …”, “… critic neural network … define a value function …”, “a replay buffer to store reinforcement learning transition …”, “a second input … defining demonstration transition data …”, and “…train … actor neural network … critic neural network …”.   These limitations, under their broadest reasonable interpretation, are directed to nothing more than a series of mathematical calculations occurring within the neural network to accomplish training, combined with a data gathering step at an input, without additional steps or a practical application.  Therefore, the claim falls within the “Mathematical Concepts” grouping of abstract ideas.  Accordingly, the claim recites an abstract idea. This judicial exception is not integrated into a practical application.  In particular, the claim only recites additional elements – neural network system and replay buffer.  These elements are recited at a high level of generality (i.e., as a generic computer device performing a generic computer function) such that their use amounts to no more than mere instructions to apply the exception using a generic computer component.  Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claim is directed to an abstract idea.  The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to integration of the abstract idea into a practical application, the additional elements are merely for the purpose of data gathering and/or insignificant extra-solution activity that amounts to no more than mere instructions to apply the exception using a generic computer component.
Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.
Similarly, dependent claims 2-9 include additional steps/elements that are considered “insignificant extra-solution activity to the judicial exception” because they fail to provide meaningful significance that goes beyond generally linking the use of an abstract idea to a particular technological environment.  Therefore, these claims are also not patent eligible.
Independent claims 10 and 20 recite “capturing training data …”, “storing the demonstration transition data …”, “operating on the environment …”, “storing the operation transition data …”, and “sampling”.   These limitations, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components.  That is, other than reciting “computer”, nothing in the claim elements precludes the steps from practically being performed in the mind.  For example, but for the “computer” language, these steps in the context of these claims encompass a user manually capturing training data, storing or memorizing the demonstration transition data, operating the actor-critic system to generate operation transition data, storing or memorizing the generated data, and then sampling or selecting data to train the actor-critic system by making adjustments according to the selected data.  All of these steps can be performed in the mind and/or using a pen and paper.  If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.  Accordingly, the claim recites an abstract idea. This judicial exception is not integrated into a practical application.  In particular, the claim only recites additional elements - using a computer to perform these steps.  The use of a computer is recited at a high level of generality (i.e., as a generic computer device performing a generic computer function) such that it amounts to no more than mere instructions to apply the exception using a generic computer component.  Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.  The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to integration of the abstract idea into a practical application, the additional element of a computer is for the purpose of data gathering and/or insignificant extra-solution activity that amounts to no more than mere instructions to apply the exception using a generic computer component.  Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  Therefore, the claims are not patent eligible.
Similarly, dependent claims 11-18 and 22 include additional steps that are considered “insignificant extra-solution activity to the judicial exception” because they fail to provide meaningful significance that goes beyond generally linking the use of an abstract idea to a particular technological environment.  Therefore, these claims are also not patent eligible.
Allowable Subject Matter
Claims 1-18, 20, and 22 are allowable and will be allowed when the 101 issue is resolved.  An examiner’s statement of reasons for allowance was already provided in a previous communication.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
The prior art made of record and not relied upon, listed below and, with additional references, in the attached PTO-892 form, is considered pertinent to applicant's disclosure.  Gu et al. (US-PGPUB 2017/0228662 A1) relates to reinforcement learning using advantage estimates (see Fig.1).  VAN SEIJEN et al. (US-PGPUB 2018/0165603 A1) relates to a hybrid reward architecture for reinforcement learning (see Figs.1-2).  Dulac-Arnold et al. (U.S. Patent No. 10,885,432 B1) relates to selecting actions from large discrete action sets using reinforcement learning (see Fig.1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUYEN X VO whose telephone number is (571)272-7631. The examiner can normally be reached M-F, 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/HUYEN X VO/
Primary Examiner, Art Unit 2656