DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	This communication is in response to the Applicant’s submission filed 25 March 2022, where:
Claim 1 is pending.
Claim 1 is rejected.
Information Disclosure Statement
3.	An information disclosure statement was submitted on 27 May 2022. The submission complies with the provisions of 37 CFR 1.97. Accordingly, the Examiner considered the information disclosure statement.
Claim Interpretation
3.	The following is a quotation of 35 U.S.C. § 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. § 112(f) is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. § 112(f):
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. § 112(f). The presumption that the claim limitation is interpreted under 35 U.S.C. § 112(f) is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. § 112(f). The presumption that the claim limitation is not interpreted under 35 U.S.C. § 112(f) is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. § 112(f) except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. § 112(f), except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. § 112(f) because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: 
“subsystem” in claim 1.
Because this claim limitation is being interpreted under 35 U.S.C. § 112(f), it is being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If Applicant does not intend to have this limitation interpreted under 35 U.S.C. § 112(f), Applicant may: 
(1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. § 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or 
(2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. § 112(f).
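For illustration only, the three-prong test of MPEP § 2181, subsection I, quoted above can be read as a conjunction of prongs (A), (B), and (C). The following Python sketch is an informal aid and not a statement of the legal standard; the `Limitation` record, its field names, and the function `invokes_112f` are hypothetical names introduced here and do not appear in the MPEP or in the record.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    """Hypothetical summary of how a single claim limitation is drafted."""
    uses_means_or_placeholder: bool      # prong (A): "means", "step", or a generic placeholder
    modified_by_function: bool           # prong (B): linked to functional language
    recites_sufficient_structure: bool   # prong (C) is met when this is False


def invokes_112f(lim: Limitation) -> bool:
    """Return True only when prongs (A), (B), and (C) are all met."""
    return (
        lim.uses_means_or_placeholder
        and lim.modified_by_function
        and not lim.recites_sufficient_structure
    )


# "subsystem configured to ..." as characterized in this Office action:
# a generic placeholder, functional language, no sufficient structure.
subsystem = Limitation(True, True, False)
print(invokes_112f(subsystem))
```

Under this reading, reciting sufficient structure in the claim (option (1) above) flips the third field and the limitation is no longer interpreted under 35 U.S.C. § 112(f).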
Double Patenting
4.	The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claim 1 is rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over claim 6 of US Patent 11288568 as follows:
Instant Application 17704721, Claim 1:

1. A reinforcement learning system for computing Q values for actions to be performed by an agent interacting with an environment from a continuous action space of actions, the system comprising:
a value subnetwork configured to:
receive an observation characterizing a current state of the environment; and
process the observation to generate a value estimate, the value estimate being an estimate of an expected return resulting from the environment being in the current state;
a policy subnetwork configured to:
receive the observation, and
process the observation to generate an ideal point in the continuous action space; and
a subsystem configured to:
receive a particular point in the continuous action space representing a particular action;
generate an advantage estimate for the particular action from a distance between the ideal point and the particular point; and
generate a Q value for the particular action that is an estimate of an expected return resulting from the agent performing the particular action when the environment is in the current state by combining the advantage estimate and the value estimate.

US Patent 11288568 (SN 15429088), Claim 6:

6. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, the system configured to cause the one or more computers to perform a method for training a policy neural network of a reinforcement learning system, wherein the policy neural network is configured to select actions to be performed by an agent interacting with an environment from a set of actions that lie on a continuous domain, the method comprising:
* * *
the training comprising:
obtaining an experience tuple identifying i) a training observation characterizing a training state of the environment, . . . ;
. . . the training comprising:
processing the training observation using a value neural network and in accordance with current values of parameters of the value neural network to generate a first value estimate, the first value estimate being an estimate of an expected return resulting from the environment being in the training state . . . ;
providing the training observation as input to the policy neural network;
for the training observation, obtaining, as output from the policy neural network and generated in accordance with current values of the parameters of the policy neural network, an output action in the set of actions that lie on the continuous domain;
determining a distance in the continuous domain between i) the output action in the set of actions that lie on the continuous domain that is obtained as output from the policy neural network by processing the training observation and ii) the training action that was performed by the agent in response to the training observation;
generating an advantage estimate for the training action that was performed by the agent in response to the training observation from the determined distance in the continuous domain between i) the output action in the set of actions that lie on the continuous domain that is obtained as output from the policy neural network by processing the training observation and ii) the training action that was performed by the agent in response to the training observation, 
* * *
generating a Q value for the training action performed in response to the training observation by combining the advantage estimate for the training action performed in response to the training observation and the first value estimate that is an estimate of an expected return resulting from the environment being in the training state . . . ;
* * *

The claims of US Patent 11288568 anticipate the claim of the instant application. Though the claims of US Patent 11288568 recite a training context, the instant claim recites some of the features of US Patent 11288568 because training tuples include examples of “particular” actions or points, of “current” states, and include “observations.” 
Also, although the claim of US Patent 11288568 does not recite the term “subsystem,” the subsystem is inherent in the combination of elements recited therein, which corresponds to the combination of elements recited as comprising the subsystem in the instant claim.
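For illustration only, the computation common to the compared claims (a value estimate, a distance between the policy output and a given action, an advantage estimate derived from that distance, and a Q value combining the two) may be sketched as follows. The claims recite only "a distance" and "combining"; the choice of a negative squared Euclidean distance for the advantage and of addition for the combination are assumptions made here for the sketch, and the function names are hypothetical.

```python
from typing import Sequence


def advantage_estimate(ideal_point: Sequence[float],
                       particular_point: Sequence[float]) -> float:
    """Advantage derived from the distance between the two points.

    Assumption: negative squared Euclidean distance, so the advantage is
    zero at the ideal point and decreases as the action moves away from it.
    """
    squared_distance = sum(
        (a - m) ** 2 for a, m in zip(particular_point, ideal_point)
    )
    return -squared_distance


def q_value(value_estimate: float,
            ideal_point: Sequence[float],
            particular_point: Sequence[float]) -> float:
    """Combine the value estimate and the advantage estimate.

    Assumption: the combination is a simple sum.
    """
    return value_estimate + advantage_estimate(ideal_point, particular_point)


# Value estimate 2.0, ideal point (0.5, -1.0), particular action (1.5, -1.0):
# squared distance is 1.0, so the Q value is 2.0 - 1.0 = 1.0.
print(q_value(2.0, [0.5, -1.0], [1.5, -1.0]))
```

In the training context of claim 6, the "particular point" would be the training action from the experience tuple and the "ideal point" the output action of the policy neural network; the arithmetic in the sketch is otherwise unchanged.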

Claim Rejections - 35 USC § 112
5.	The following is a quotation of 35 U.S.C. § 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
6.	Claim 1 is rejected under 35 U.S.C. § 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim limitation “a subsystem” invokes 35 U.S.C. § 112(f). However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. For example, the disclosure simply repeats the claim language pertaining to the subsystem without disclosing a corresponding structure, material, or acts for performing the entire claimed function (see Specification ¶ 0008, 0011; claim 1). Therefore, the claim is indefinite and is rejected under 35 U.S.C. § 112(b).
Applicant may:
(a)	Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. § 112(f); 
(b)	Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)	Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)	Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)	Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
7.	Claim 1 is rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
The term “ideal” in claim 1 is a relative term which renders the claim indefinite. The term “ideal” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The claim recites “process the observation to generate an ideal point in the continuous action space,” but does not provide a standard for ascertaining “ideal.” The Specification characterizes an “ideal point,” however, such characterization does not provide a standard for ascertaining the requisite degree of achieving “ideal.” See Specification ¶ 0026 (“The ideal point 122 represents an action that, if performed in response to the observation, is expected to produce a maximum Q value of all actions in the continuous space. That is, the ideal point comprises output of the currently trained neural network indicating an optimal action given the current internal state of the neural network.”).
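For illustration only, the characterization quoted from Specification ¶ 0026 (an action "expected to produce a maximum Q value of all actions in the continuous space") can be sketched numerically. The one-dimensional action space, the candidate grid, and the quadratic form of the Q function below are assumptions made for the sketch and are not taken from the Specification.

```python
def q_value(value_estimate: float, ideal_point: float, action: float) -> float:
    """Assumed Q function: value estimate minus squared distance to the ideal point."""
    return value_estimate - (action - ideal_point) ** 2


ideal_point = 0.25  # hypothetical output of the policy subnetwork
# Hypothetical grid of candidate actions in [-1.0, 1.0] with step 0.01.
candidates = [i / 100 for i in range(-100, 101)]

# Under the assumed Q function, the candidate with the largest Q value
# coincides with the ideal point.
best_action = max(candidates, key=lambda a: q_value(1.0, ideal_point, a))
print(best_action)
```

The sketch shows only that, under the assumed Q function, the maximizing action and the policy output coincide; it does not supply a standard for the term in the claim.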
Claim Rejections - 35 U.S.C. § 101
8.	The following is a quotation of 35 U.S.C. § 101:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
9.	Claim 1 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 1 recites a reinforcement learning system, which is an article of manufacture, one of the four statutory categories of patentable subject matter. However, claim 1 further recites “process the observation to generate a value estimate, . . . ,” “process the observation to generate an ideal point . . . ,” “generate an advantage estimate . . . from a distance between the ideal point and the particular point,” and “generate a Q value for the particular action . . . by combining the advantage estimate and the value estimate.” These limitations recite a “mathematical process,” which is one of the three groupings of abstract ideas. Thus, claim 1 recites an abstract idea.
The abstract idea of claim 1 is not integrated into a practical application, because the only other additional elements recited in claim 1 are (a) a value subnetwork, a policy subnetwork, and a subsystem, which are platforms for executing or performing the limitations on generic computer components (MPEP § 2106.04(d)), and (b) an agent and (c) an environment, which generally link the abstract idea to an intended use of an agent in an environment, such as a continuous action space for actions (MPEP § 2106.04(d)). Also, the claim recites the limitations of “receive an observation . . . ,” “receive the observation,” and “receive a particular point . . . representing a particular action,” which recite receiving data over a network, and are insignificant extra-solution activities (MPEP § 2106.05(d), subsection II.i). Such additional elements cannot integrate the judicial exception into a practical application. Therefore, claim 1 is directed to the abstract idea.
Finally, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself. Generally linking the abstract idea to a field of use (i.e., specifying the intended use of the values in the matrices) does not provide an inventive concept (MPEP § 2106.05(h)); execution on generic computer components cannot provide significantly more than the abstract idea itself (MPEP § 2106.05(d)); and there is no nexus between the field of use and the generic computer components which, when taken in combination, could provide an inventive concept or significantly more than an abstract idea. Therefore, claim 1 is subject-matter ineligible.
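For illustration only, the order of the eligibility inquiries applied above (statutory category, recitation of a judicial exception, integration into a practical application, and significantly more) may be sketched as a decision sequence. The step labels in the comments follow the flow set out in MPEP § 2106; the function name, parameter names, and `Outcome` type are hypothetical, and the sketch is an informal aid rather than a statement of the legal standard.

```python
from enum import Enum


class Outcome(Enum):
    ELIGIBLE = "eligible"
    INELIGIBLE = "ineligible"


def eligibility(statutory_category: bool,
                recites_judicial_exception: bool,
                integrated_into_practical_application: bool,
                significantly_more: bool) -> Outcome:
    """Walk the eligibility inquiries in order and stop at the first that decides."""
    if not statutory_category:                   # Step 1
        return Outcome.INELIGIBLE
    if not recites_judicial_exception:           # Step 2A, Prong One
        return Outcome.ELIGIBLE
    if integrated_into_practical_application:    # Step 2A, Prong Two
        return Outcome.ELIGIBLE
    if significantly_more:                       # Step 2B
        return Outcome.ELIGIBLE
    return Outcome.INELIGIBLE


# Findings stated in this rejection for claim 1: statutory category, recites an
# abstract idea, not integrated, and no significantly more.
print(eligibility(True, True, False, False))
```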
Claim Rejections - 35 U.S.C. § 102
10.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. § 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless – 
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
11.	Claim 1 is rejected under 35 U.S.C. § 102(a)(2) as being anticipated by US Patent 9536191 to Arel et al [hereinafter Arel].
Regarding claim 1, Arel teaches [a] reinforcement learning system for computing Q values for actions to be performed by an agent interacting with an environment (Arel 3:51-56 teaches a reinforcement learning system that uses a confidence function representation to adjust a value function representation, to select actions to be performed by an agent interacting with an environment by performing actions selected from a set of actions, or both) from a continuous action space of actions (Arel 3:12-17 teaches using the confidence function representation in selecting actions can increase (continuous) the state space visited by the agent (from a continuous action space of actions) . . . ), the system (Arel, Fig. 1, teaches an example reinforcement learning system (Examiner annotations in dashed text boxes):

[Figure: Arel, Fig. 1, example reinforcement learning system with Examiner annotations (media_image1.png)])

comprising:
a value subnetwork (Arel 1:22-26 teaches reinforcement learning systems using a neural network (value subnetwork)) configured to:
receive an observation characterizing a current state of the environment (Arel 4:8-12 teaches [t]he reinforcement learning system receives data (receive an observation) that . . . characterizes the current state of the environment); and
process the observation to generate a value estimate (Arel 4:37-40 teaches [w]hile the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize the expected return (process the observation to generate a value estimate)), the value estimate being an estimate of an expected return resulting from the environment being in the current state (Arel 4:40-43 teaches the expected return is a function of the rewards anticipated to be received over time in response to future actions performed by the agent (the value estimate being an estimate of an expected return resulting from the environment being in the current state); Examiner points out “value estimate” is synonymous with “expected return”);
a policy subnetwork (Arel 1:22-26 teaches reinforcement learning systems using a neural network (policy subnetwork)) configured to:
receive the observation (Arel 4:8-12 teaches [t]he reinforcement learning system receives data (receive an observation) that . . . characterizes the current state of the environment), and
process the observation to generate an ideal point in the continuous action space (Arel 4:37-43 teaches [w]hile the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize (ideal) the expected return (process the observation to generate an ideal point in the continuous action space)); and
a subsystem (Arel 1:22-26 teaches reinforcement learning systems using a neural network; for purposes of examination, a neural network is construed as a subsystem) configured to:
receive a particular point in the continuous action space representing a particular action (Arel 4:50-53 teaches in response to a given observation, the reinforcement learning system selects the action (receive a particular point in the continuous action space representing a particular action) to be performed by the agent by generating value function estimates in accordance with a value function representation);
generate an advantage estimate for the particular action (Arel 7:27-29 teaches a system [that] determines a respective confidence score for each action (generate an advantage estimate for the particular action) when the environment is in the current state (step 206)) from a distance between the ideal point and the particular point (Arel 4:4-7 teaches [e]xample tasks may include assembly tasks performed by industrial robots which may involve grasping and manipulation of objects within a given space of operation (that is, states include positional relationships, teaching a confidence score for each action entails a distance between the ideal point and the particular point)); and
generate a Q value for the particular action that is an estimate of an expected return resulting from the agent performing the particular action when the environment is in the current state (Arel 7:49-56 teaches the system adjusts, for each action, the respective function for the action using the respective confidence score for the action to determine a respective adjusted value function estimate for the action (generate a Q value for the particular action) . . .) by combining the advantage estimate and the value estimate (Arel 7:57-67 teaches the adjusted value function estimate pt(st,at) for an action at when the environment is in a state t (when the environment is in the current state) satisfies:
pt(st,at) = (Q(st,at) - Qmin) x c(st,at),
where st is the state representation of the state t, Q(st,at) is the value function estimate for the action when the environment is in the state t, Qmin is a predetermined minimal possible value function estimate for any action, and c(st,at) is the confidence score for the action at when the environment is in the state t (by combining the advantage estimate and the value estimate)).
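For illustration only, the relation cited from Arel 7:57-67 can be evaluated directly from the quantities defined above. In the following Python sketch the function and parameter names are hypothetical and the sample numbers are invented for the example; only the arithmetic form (the value function estimate less the minimal estimate, multiplied by the confidence score) comes from the cited passage.

```python
def adjusted_value_estimate(q: float, q_min: float, confidence: float) -> float:
    """Adjusted estimate: (Q(st,at) - Qmin) multiplied by c(st,at)."""
    return (q - q_min) * confidence


# Hypothetical numbers: Q(st,at) = 3.0, Qmin = -1.0, c(st,at) = 0.5,
# giving (3.0 - (-1.0)) * 0.5 = 2.0.
print(adjusted_value_estimate(3.0, -1.0, 0.5))
```

In this form the confidence score scales the offset value function estimate multiplicatively, so a confidence score of zero yields an adjusted estimate of zero regardless of Q(st,at).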
Examiner notes that the term “network” or “subnetwork” recited in Applicant's claims is interpreted to be a well-known hardware structure. 
Examiner notes that the Applicant’s preamble does not afford patentable weight to the Applicant’s claims because the claim preamble is not “necessary to give life, meaning, and vitality” to the claim. Moreover, because the Applicant’s preamble merely states the purpose or intended use of the invention rather than any distinct definition of any of the claimed invention’s limitations, the preamble is not considered a limitation and is of no significance to claim construction.
Conclusion
12.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
(US Published Application 20170213150 to Arel et al.) teaches maintaining data defining a plurality of partitions of a space of reinforcement learning (RL) input states, each partition corresponding to a respective supervised learning model such as a neural network.
(Fonteneau et al, “Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories,” Springer Science+Business Media New York (2012)) teaches, for a continuous state space case, an alternative to the use of function approximators by relying on the synthesis of “artificial trajectories” from the given sample of trajectories for designing and analyzing algorithms for batch mode reinforcement learning.
13.	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI, can be reached at 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.L.S./
Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122