DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Regarding claim 1,
Step 1: Is the claim to a process, machine, manufacture or composition of matter?
Yes.
	Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
	The claim limitations of:
computing a learned prior distribution from the sample results
computing one or more adaptive weights using the learned prior distribution
generating an estimate of a gradient using the one or more adaptive weights
updating a policy for the learning model using the estimated gradient
are an abstract idea since these limitations can reasonably be performed in the human mind using pen and paper, i.e. mental processes. 
	Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
	The claims do not recite any additional elements that integrate the judicial exception into a practical application.
	Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
	The claim elements of:
receiving sample results from one or more agents implementing a learning model, wherein each sample result can be either successful or unsuccessful
do not amount to significantly more than the judicial exception, since receiving sample results is insignificant extra-solution activity, i.e., mere data gathering.
Note that independent claims 7 and 13 recite substantially the same subject matter as independent claim 1, and thus the same analysis applies to both claims.
Dependent claims 2-6, 8-12, and 14-18 recite, in part, generating high-reward and zero-reward credits (extra-solution), generating an estimate of a gradient (mental process), applying the task to semantic parsing (extra-solution high level recitation, not a practical application), and sample results corresponding to software code (extra-solution).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2, 6-8, 12-14, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kalashnikov et al. US 2021/0237266 [herein Kal] in view of Beckman et al. USPAT 10,792,810.
	Regarding claims 1, 7, and 13, Kal teaches “a method for efficient off-policy credit assignment in reinforcement learning, the method comprising: receiving sample results from one or more agents implementing a learning model, wherein each sample result can be either successful or unsuccessful” ([0009] “For example, the stochastic optimization can be a derivative-free optimization algorithm, such as the cross-entropy method (CEM). CEM samples a batch of N values at each iteration, fits a Gaussian distribution to the best M<N of these samples, and then samples next batch of N from that Gaussian”) 
“computing a learned prior distribution from the sample results” ([0009] “CEM samples a batch of N values at each iteration, fits a Gaussian distribution to the best M<N of these samples, and then samples next batch of N from that Gaussian. As one non-limiting example, N can be 64 and M can be 6. During inference, CEM can be used to select 64 candidate actions, those actions evaluated in view of a current state and using the policy model, and the 6 best can be selected (e.g., the 6 with the highest Q-values generated using the policy model). A Gaussian distribution can be fit to those 6, and 64 more actions selected from that Gaussian. Those 64 actions can be evaluated in view of the current state and using the policy model, and the best one (e.g., the one with the highest Q-value generated using the policy model) can be selected as the action to be implemented”); 
“computing one or more adaptive weights using the learned prior distribution” ([0054] “The policy model 152 represents a Q-function that can be represented as Q.sub.θ(s, a), where θ denotes the learned weights in the neural network model.”); 
“updating a policy for the learning model using the estimated gradient” ([0020] “Generating the predicted Q-value includes processing the retrieved state data and the retrieved action using a current version of the neural network model, where the current version of the neural network model is updated relative to the version. The method further includes generating a loss based on the predicted Q-value and the target Q-value and updating the current version of the neural network model based on the loss.”)
	While Kal teaches the above limitations, Kal does not explicitly teach generating an estimate of a gradient using adaptive weights. Beckman, however, teaches “generating an estimate of a gradient using the one or more adaptive weights” (Beckman col. 18 ¶2 “the robotic control system 220 can assign a positive reward to the higher-performing observation, generate an update vector using reinforcement learning, and then update the network parameters via a gradient update by weighting the update vector with the change vector. This approach enables the robotic control system 220 to identify where and by how much policy A differed from policy B, and then leverage the feedback saying that policy A caused better performance in order to weight the updates to these areas more heavily”).
	It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kal with those of Beckman, since a combination of known methods would yield predictable results. As shown in Beckman, it is known in the art to have adaptive weights based on rewards. Thus, the techniques would operate in a similar and predictable manner when combined with Kal.
	Note that independent claims 7 and 13 recite substantially the same subject matter as independent claim 1, differing only in embodiment. Thus the claims are subject to the same rejection. The differences in embodiment, including a system with a processor and memory and a non-transitory computer-readable medium, are taught by Kal, Figure 10.
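For reference, the cross-entropy method (CEM) described in the cited passage of Kal [0009] can be sketched as below. This is a minimal illustration only, not code from Kal: it assumes scalar actions, a standard normal initial sampling distribution, and a hypothetical caller-supplied `q_value` function standing in for the policy model. The defaults N=64 and M=6 follow the non-limiting example quoted above.

```python
import random
import statistics


def cem_select_action(q_value, n=64, m=6, iterations=2, seed=0):
    """Pick an action by sampling N candidates, refitting a Gaussian to the
    best M, and resampling (illustrative sketch; scalar actions assumed)."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # assumed initial sampling distribution
    best_action = None
    for _ in range(iterations):
        # Sample a batch of N candidate actions from the current Gaussian.
        candidates = [rng.gauss(mean, std) for _ in range(n)]
        # Rank candidates by Q-value, highest first.
        ranked = sorted(candidates, key=q_value, reverse=True)
        best_action = ranked[0]
        elites = ranked[:m]
        # Fit a Gaussian to the best M candidates for the next batch.
        mean = statistics.fmean(elites)
        std = statistics.stdev(elites) + 1e-6  # keep the spread non-zero
    return best_action


# Hypothetical Q-function that peaks at action = 1.0.
chosen = cem_select_action(lambda a: -(a - 1.0) ** 2)
```

With `iterations=2`, the function returns the highest-scoring action from the second batch, which corresponds to the two-round selection described in the quoted passage.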
	Regarding claims 2, 8, and 14, the Kal and Beckman references have been addressed above. Beckman further teaches “wherein computing one or more adaptive weights comprises generating at least one of a high-reward result credit and a zero-reward result credit” (Beckman col. 18 ¶2 “This approach logically presumes that the differences between policy A and policy B account for the superior performance of policy A, and so rewards these differences with more heavily weighted updates. This can increase the likelihood that actions that yielded positive rewards will continue to occur. Some embodiments can also assign negative rewards to the policy that was not favored.”)
Regarding claims 6, 12, and 18, the Kal and Beckman references have been addressed above. Beckman further teaches “wherein the sample results correspond to executable software code generated by the learning model based on natural language instructions text” (Beckman col. 19 ¶4 “Although the present disclosure discusses policy updates via reinforcement learning, this example can be considered more like contextual bandits because there is no closed feedback loop. However, the algorithm used extends to the reinforcement learning domain naturally. The code can be written in a way that expects stateful control, and just happens to have one state per episode.”)
Claim(s) 3-4, 9-10, and 15-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kalashnikov et al. US 2021/0237266 [herein Kal] in view of Beckman et al. USPAT 10,792,810 further in view of Olson et al. US 2017/0031361.
Regarding claims 3, 9, and 15, the Kal and Beckman references have been addressed above. The references do not explicitly teach the claim limitations. Olson however teaches “wherein generating an estimate of a gradient comprises generating a high-reward score function gradient using the high-reward result credit to weight successful sample results” (Olson [0093] “To avoid overreacting to small variations (e.g., two policies 46 get the host vehicle 14 10.0 and 10.1 meters closer to the goal because they do almost the same thing), weights may be computed that get set to zero when the range across a single metric is too low to be informative. The final weights for each metric are either zero for uninformative metrics or a pre-determined weight chosen by the designer. The final reward for each sampled rollout is a weighted sum of all the metrics.”)
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kal and Beckman with those of Olson, since a combination of known methods would yield predictable results. Olson shows that weighted rewards are known in the context of policy decision making. Thus one would combine the teachings of the references in order to obtain more robust policies than those shown in Kal and Beckman, allowing for better training and learning.
Regarding claims 4, 10, and 16, the Kal, Beckman, and Olson references have been addressed above. Olson further teaches “wherein generating an estimate of a gradient comprises generating a zero-reward score function gradient using the zero-reward result credit to weight unsuccessful sample results” (Olson [0081] “For a full policy 46 assignment (π,s) with rollout Ψ.sup.π,s, we compute the rollout reward r.sub.π,s as the weighted sum r.sub.π,s=Σ.sub.q=1.sup.|M|w.sub.qm.sub.q(Ψ.sup.π,s). Each m.sub.q(Ψ.sub.π,s) is normalized across all rollouts to ensure comparability between metrics. To avoid biasing decisions, a weight w.sub.q may be set to zero when the range of m.sub.q(•) across all samples is too small to be informative.”)
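For reference, the weighted-sum rollout reward quoted from Olson [0081] and [0093] can be sketched as below. This is an illustration under stated assumptions, not code from Olson: the min-max normalization and the `min_range` threshold are hypothetical choices, since the quoted passages say only that metrics are normalized across rollouts and that a weight may be set to zero when a metric's range is too small to be informative.

```python
def rollout_rewards(metrics, design_weights, min_range=0.5):
    """Return one reward per rollout as a weighted sum of normalized metrics.

    metrics[i][q] is metric q for rollout i; design_weights[q] is the
    designer-chosen weight for metric q. min_range is an assumed threshold.
    """
    rewards = [0.0] * len(metrics)
    for q, weight in enumerate(design_weights):
        column = [row[q] for row in metrics]
        low, high = min(column), max(column)
        spread = high - low
        if spread < min_range:
            # Uninformative metric: its weight is effectively set to zero.
            continue
        for i, value in enumerate(column):
            # Min-max normalization across rollouts (an assumption).
            rewards[i] += weight * (value - low) / spread
    return rewards


# Two rollouts: metric 0 differs by only 0.1 m (as in Olson's 10.0 vs. 10.1
# example), so only metric 1 contributes to the reward.
rewards = rollout_rewards([[10.0, 0.0], [10.1, 4.0]], [1.0, 2.0])
```

In this example the first metric is dropped as uninformative, so the rewards depend only on the second metric.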
Claim(s) 5, 11, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kalashnikov et al. US 2021/0237266 [herein Kal] in view of Beckman et al. USPAT 10,792,810 further in view of Liang, Chen, et al. "Memory augmented policy optimization for program synthesis and semantic parsing."
Regarding claims 5, 11, and 17, the Kal and Beckman references have been addressed above. The references do not explicitly teach semantic parsing. Liang however teaches “wherein the learning model is applied to a task of semantic parsing” (Liang pg. 6 ¶5 “We evaluate MAPO on two program synthesis from natural language (also known as semantic parsing) benchmarks, WIKITABLEQUESTIONS and WIKISQL, which requires generating programs to query and process data from tables to answer natural language questions”)
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kal and Beckman with that of Liang since as shown in Liang, semantic parsing is a common field of application for policy optimization, which has direct correlations with the applied references. Thus this would operate in a normal and predictable way.
It is noted that, while not explicitly mapped, the Liang reference additionally teaches the limitations of the independent claims above. This is noted in the IDS dated 11/06/2020, which contains the PCT written opinion. Should Applicant make any claim amendments, Applicant should consider this document as well when attempting to overcome the cited art.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN W FIGUEROA whose telephone number is (571)272-4623. The examiner can normally be reached Monday-Friday, 10AM-6PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MIRANDA HUANG, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

KEVIN W FIGUEROA
Primary Patent Examiner
Art Unit 2124



/Kevin W Figueroa/Primary Examiner, Art Unit 2124