DETAILED ACTION
This Office action is in reply to correspondence filed 7 November 2022 in regard to application no. 16/657,533.  Claims 1-20 are pending, of which claims 8-13 are withdrawn from consideration.  Claims 1-7 and 14-20 are considered below.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-7 and 14-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.  The claims are directed to statutory categories of invention, as each is directed to a system (machine) or method (process).  The claims recite predicting a reward, determining how to make a measurement, using the measurement to make another prediction about an “expected reward metric”, and conditionally using a second policy rather than a first.  Because all of this rests on an expected measurement of something associated with a reward, and the specification makes clear (e.g., ¶ 17) that a reward is based on a user action, such as that a “user engages with [a] content item”, the claims are directed to certain methods of organizing human activity, specifically managing personal behavior, e.g., incentivizing a person to take a desired action.
Further, aside from the bare inclusion of a generic computer, these are all steps that can be done mentally or with a pen and paper.  An advertiser can mentally predict whether a customer will respond to something, can make another prediction as to whether she will respond to some different thing, can base this on his own memory of past interactions, and can use whichever offering he thinks most likely to lead to a response.  None of this presents any difficulty, and none requires any technology at all.
This judicial exception is not integrated into a practical application because aside from the bare inclusion of a generic computer, discussed below, nothing is done beyond what was set forth above, which does not go beyond using a computer as a tool to implement the abstract idea.  See MPEP § 2106.05(f).
As the claims only manipulate data about rewards, policies, metrics and the like, they do not improve the “functioning of a computer” or of “any other technology or technical field”. See MPEP § 2106.05(a). They do not apply the abstract idea “with, or by use of a particular machine”, MPEP § 2106.05(b), as the below-cited Guidance makes clear that a generic computer is not the particular machine envisioned.
They do not effect a “transformation or reduction of a particular article to a different state or thing”, MPEP § 2106.05(c), first because such data, being intangible, are not a particular article at all, and second because the claimed manipulation is neither transformative nor reductive; as the courts have pointed out, in the end, data are still data.
They do not apply the abstract idea “in some other meaningful way beyond generally linking [it] to a particular technological environment”, MPEP § 2106.05(e), as the lack of technical and algorithmic detail in the claims is such as to not go beyond a general linkage.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional claim elements, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea.  Claim 1 recites a processor and memory storing instructions.  These elements are recited at a high degree of generality, and the specification makes clear, ¶ 40, that no particular computer is required but that any of a number of broad classes of known, pre-existing, generic computers will suffice, such as “a personal computer, a laptop computer, a tablet computer, a mobile computing device, or a distributed computing device”.
The computer only performs generic computer functions of sharing data and nondescriptly manipulating data.  Generic computers performing generic computer functions, without an inventive concept, do not amount to significantly more than the abstract idea.  The type of information being manipulated does not impose meaningful limitations or render the idea less abstract.
The claim elements when considered as an ordered combination — a generic computer performing a chronological sequence of abstract steps — do nothing more than when they are considered individually.  The other independent claim is simply a different embodiment but is similarly directed to a generic computer performing essentially the same process.
The dependent claims further do not amount to significantly more than the abstract idea: claims 2, 5, 15 and 18 simply recite further, abstract manipulation of data; claims 3, 4, 6, 16, 17 and 19 are simply further descriptive of the type of information being manipulated; claims 7 and 20 simply recite a source of data.
The claims are not patent eligible.  For further guidance please see “2019 Revised Patent Subject Matter Eligibility Guidance”, 84 Fed. Reg. 50, 55 (7 January 2019), now incorporated into the MPEP as MPEP §§ 2106.03 through 2106.07(c).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 4, 7, 14, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Schrittwieser et al. (WIPO Publication No. 2018/215665) in view of Zhang et al. (U.S. Publication No. 2019/0318648, filed 12 April 2018) further in view of Chang et al. (U.S. Publication No. 2009/0254413).

In-line citations are to Schrittwieser.
With regard to Claim 1:
Schrittwieser teaches: A system comprising:
at least one processor; [0019; “processor”] and
memory storing instructions that, when executed by the at least one processor, causes the system to perform a set of operations, [0113; “modules of computer program instructions” are stored in a “memory device”] the set of operations comprising: 
generating a reward predictor for historical data associated with a logging policy… [0053; an “estimate of a return” based on “numeric rewards received” based on an “agent interacting with [an] environment”; the interactions, of course, must have been in the past if anything is to be based on their having taken place; that the “rewards reflect the progress of the agent toward accomplishing [a] specific result” reads on their being associated with a policy] 
determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator... [0010; “determining a target return based on evaluating progress of the task”; as this makes no mention of quality at all, it reads on a quality-agnostic estimator] 

Schrittwieser does not explicitly teach evaluating, using the off-policy evaluation model and based on the historical data associated with the logging policy, a target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; nor does it explicitly teach, when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy.  Both are, however, known in the art.  Zhang teaches a learning system [title] which uses a particular “model” when it “gets higher rewards” than an alternative model; [0080] the system is used to “estimate [an] expected accumulated future reward”. [0102] Zhang and Schrittwieser are analogous art as each is directed to electronic means for making predictions regarding rewards. 

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Zhang with that of Schrittwieser in order to improve computer features and uses, as taught by Zhang; [0001] further, it is simply a combination of known parts with predictable results, simply performing the steps of Zhang after those of Schrittwieser and on the data of either; each part works independently of the other, and each works in combination identically to how it works when not combined, with no new and unexpected result inherent or disclosed.
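For illustration only, the evaluating and indicating steps recited in claim 1 can be sketched as follows. This is a minimal sketch under assumptions not drawn from the claims, the specification, or the cited references: each logged record is a hypothetical `LoggedRecord` of context, action, reward, and logging propensity; the target policy is a hypothetical function `favored` returning the probability it assigns to an action; and the off-policy estimator is plain inverse propensity scoring (IPS).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedRecord:
    """One historical interaction collected under the logging policy (hypothetical schema)."""
    context: str
    action: str
    reward: float
    propensity: float  # probability the logging policy gave to `action` in `context`


# target_prob(context, action) -> probability the target policy gives to `action`
TargetProb = Callable[[str, str], float]


def ips_estimate(records: Sequence[LoggedRecord], target_prob: TargetProb) -> float:
    """Inverse propensity scoring estimate of the target policy's expected reward."""
    total = 0.0
    for r in records:
        importance_weight = target_prob(r.context, r.action) / r.propensity
        total += importance_weight * r.reward
    return total / len(records)


def logging_policy_value(records: Sequence[LoggedRecord]) -> float:
    """Mean reward actually observed under the logging policy."""
    return sum(r.reward for r in records) / len(records)


def indicate_switch(records: Sequence[LoggedRecord], target_prob: TargetProb) -> bool:
    """True plays the role of the 'indication to use the target policy'."""
    return ips_estimate(records, target_prob) > logging_policy_value(records)


# Usage: a uniform logging policy over two actions and a deterministic target policy.
records = [
    LoggedRecord("sports", "ad_a", 1.0, 0.5),
    LoggedRecord("sports", "ad_b", 0.0, 0.5),
    LoggedRecord("news", "ad_a", 0.0, 0.5),
    LoggedRecord("news", "ad_b", 1.0, 0.5),
]


def favored(context: str, action: str) -> float:
    preferred = {"sports": "ad_a", "news": "ad_b"}
    return 1.0 if preferred[context] == action else 0.0


print(ips_estimate(records, favored), logging_policy_value(records))  # 1.0 0.5
print(indicate_switch(records, favored))  # True
```

The comparison is strict, so a target policy whose estimate merely equals the logging policy's observed reward produces no indication; the claim language ("higher than") supports that reading.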

Schrittwieser does not explicitly teach that a reward predictor is usable to generate an expected reward based on the historical data; this limitation is of no patentable significance, as explained below, and is in any event known in the art.  Chang teaches a campaign optimization system. [title] It creates a “rewards product model” based on “historical transactions”, [0030] computes a “predicted value”, [0043] and an “expected profitability” of a consumer who may choose to, or not to, participate in a campaign. [0056] Chang and Schrittwieser are analogous art as each is directed to electronic means for managing reward-related information.

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Chang with that of Schrittwieser in order to improve consumer loyalty, as taught by Chang; [abstract] further, it is simply a substitution of one known part for another with predictable results, simply creating a reward predictor for Chang’s purpose rather than that of Schrittwieser; the substitution produces no new and unexpected result.

In this and the subsequent claims, as it is not positively claimed that a reward predictor is used to generate an expected reward, that it is “usable” to do so consists entirely of intended-use language which is considered but given no patentable weight.  The reference is provided for the purpose of compact prosecution.

With regard to Claim 4:
The system of claim 1, wherein determining the off-policy evaluation model comprises determining a hyperparameter for the estimator. [0099; a value is only reduced “after a threshold number” of operations; per the applicant’s specification, ¶ 26, a hyperparameter may be a threshold]

With regard to Claim 7:
The system of claim 1, wherein the set of operations further comprises: accessing the historical data from a historical data store, [0053 as cited above in regard to claim 1; any source of historical data reads on an historical data store] wherein the historical data comprises a context, an action associated with the context, and a reward for the action. [id.; the interactions read on actions; 0003; a reward is dependent on the effect of the performance of an action on the environment; the environment reads on the claimed context]

This claim is not patentably distinct from claim 1.  As explained, any source of historical data reads on an historical data store, and what the data “comprises” is of no patentable significance, first because it is nonfunctional, descriptive language, and second, as the data only “comprises” these, they can also include other items, and any further processing could be based entirely on the other items.  The reference is provided for the purpose of compact prosecution.

With regard to Claim 14:
Schrittwieser teaches: A method for off-policy evaluation of a target policy, the method comprising:
generating a reward predictor for historical data associated with a logging policy… [0053; an “estimate of a return” based on “numeric rewards received” based on an “agent interacting with [an] environment”; the interactions, of course, must have been in the past if anything is to be based on their having taken place; that the “rewards reflect the progress of the agent toward accomplishing [a] specific result” reads on their being associated with a policy] 
determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator... [0010; “determining a target return based on evaluating progress of the task”; as this makes no mention of quality at all, it reads on a quality-agnostic estimator] 

Schrittwieser does not explicitly teach evaluating, using the off-policy evaluation model and based on the historical data associated with the logging policy, a target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; nor does it explicitly teach, when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy.  Both are, however, known in the art.  Zhang teaches a learning system [title] which uses a particular “model” when it “gets higher rewards” than an alternative model; [0080] the system is used to “estimate [an] expected accumulated future reward”. [0102] Zhang and Schrittwieser are analogous art as each is directed to electronic means for making predictions regarding rewards. 

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Zhang with that of Schrittwieser in order to improve computer features and uses, as taught by Zhang; [0001] further, it is simply a combination of known parts with predictable results, simply performing the steps of Zhang after those of Schrittwieser and on the data of either; each part works independently of the other, and each works in combination identically to how it works when not combined, with no new and unexpected result inherent or disclosed.

Schrittwieser does not explicitly teach that a reward predictor is usable to generate an expected reward based on the historical data; this limitation is of no patentable significance, as explained above, and is in any event known in the art.  Chang teaches a campaign optimization system. [title] It creates a “rewards product model” based on “historical transactions”, [0030] computes a “predicted value”, [0043] and an “expected profitability” of a consumer who may choose to, or not to, participate in a campaign. [0056] Chang and Schrittwieser are analogous art as each is directed to electronic means for managing reward-related information.

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Chang with that of Schrittwieser in order to improve consumer loyalty, as taught by Chang; [abstract] further, it is simply a substitution of one known part for another with predictable results, simply creating a reward predictor for Chang’s purpose rather than that of Schrittwieser; the substitution produces no new and unexpected result.

With regard to Claim 17:
The method of claim 14, wherein determining the off-policy evaluation model comprises generating a hyperparameter for the estimator. [0099; a value is only reduced “after a threshold number” of operations; per the applicant’s specification, ¶ 26, a hyperparameter may be a threshold]

With regard to Claim 20:
The method of claim 14, further comprising:
accessing the historical data from a historical data store, [0053 as cited above in regard to claim 1; any source of historical data reads on an historical data store] wherein the historical data comprises a context, an action associated with the context, and a reward for the action. [id.; the interactions read on actions; 0003; a reward is dependent on the effect of the performance of an action on the environment; the environment reads on the claimed context]

This claim is not patentably distinct from claim 14.  As explained, any source of historical data reads on an historical data store, and what the data “comprises” is of no patentable significance, first because it is nonfunctional, descriptive language, and second, as the data only “comprises” these, they can also include other items, and any further processing could be based entirely on the other items.  The reference is provided for the purpose of compact prosecution.

Claim(s) 2 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Schrittwieser et al. in view of Zhang et al. further in view of Chang et al. further in view of Flockhart et al. (U.S. Publication No. 2013/0223612).

These claims are similar so are analyzed together.
With regard to Claim 2:
The system of claim 1, wherein determining the off-policy evaluation model comprises: 
generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; [0016; an MSE is generated to measure the error between output in a training target network and that in a training network] 
generating, for the quality-based estimator, a second MSE metric; [0109; the same measurement is taken between two weighted sums] 
when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and 
when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator.

With regard to Claim 15:
The method of claim 14, wherein determining the off-policy evaluation model comprises:
generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; [0016; an MSE is generated to measure the error between output in a training target network and that in a training network] 
generating, for the quality-based estimator, a second MSE metric; [0109; the same measurement is taken between two weighted sums] 
when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and 
when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator. 
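For illustration only, the selection logic recited in claims 2 and 15 can be sketched as follows. The sketch assumes, without support from the claims or the references, that each estimator yields several value estimates (for example, from bootstrap resamples of the logged data) and that a reference value is available against which the MSE can be computed; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def mse(estimates: Sequence[float], reference: float) -> float:
    """Mean squared error of a set of value estimates against a reference value."""
    return sum((e - reference) ** 2 for e in estimates) / len(estimates)


def select_estimator(
    agnostic_estimates: Sequence[float],
    based_estimates: Sequence[float],
    reference: float,
) -> Tuple[str, float]:
    first_mse = mse(agnostic_estimates, reference)   # quality-agnostic estimator
    second_mse = mse(based_estimates, reference)     # quality-based estimator
    if second_mse < first_mse:
        return "quality-based", second_mse
    # Covers first_mse < second_mse; the claims are silent on ties, and this
    # sketch arbitrarily keeps the quality-agnostic estimator in that case.
    return "quality-agnostic", first_mse


# Usage with made-up estimates around a reference value of 1.0.
choice, err = select_estimator([0.8, 1.3, 0.9], [0.95, 1.05, 1.0], reference=1.0)
print(choice)  # quality-based
```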

Schrittwieser, Zhang and Chang teach the system of claim 1 and method of claim 14, including using an MSE metric and making a choice based on relative numeric values as cited above and in regard to claim 1, but do not explicitly teach quality-based and quality-agnostic values; these are, however, known in the art.  Flockhart teaches a system for administering a contact center [abstract] in which “non-quality sensitive algorithms” are given a certain weight, [0072] and “quality-based” objectives are also determined. [0074] Flockhart and Schrittwieser are analogous art as each is directed to electronic means for making estimates. 

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Flockhart with that of Schrittwieser, Zhang and Chang in order to vary input weights based on conditions, as taught by Flockhart; [0004] further, it is simply a substitution of known parts for others with predictable results, simply using Flockhart’s interpretation of data in place of that of Schrittwieser; the substitution produces no new and unexpected result.

Claim(s) 3 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Schrittwieser et al. in view of Zhang et al. further in view of Chang et al. further in view of Gunjan et al. (U.S. Publication No. 2016/0005056).

These claims are similar so are analyzed together.
With regard to Claim 3:
The system of claim 1, wherein:
the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring; and
a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator.

With regard to Claim 16:
The method of claim 14, wherein:
the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring; and
a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator.
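For illustration only, one simple way to combine direct modeling of a reward predictor with inverse propensity scoring is a convex blend of the two estimates, with the reward predictor's weight supplied as a parameter. This form is an assumption chosen for clarity; the claims do not specify the form of the combination, and other combinations (such as the doubly robust estimator) exist. All names and data below are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (context, logged action, observed reward, logging propensity) -- hypothetical schema
Record = Tuple[str, str, float, float]
Policy = Callable[[str, str], float]       # target_prob(context, action)
RewardModel = Callable[[str, str], float]  # reward_predictor(context, action)


def direct_method(records: Sequence[Record], actions: Sequence[str],
                  target_prob: Policy, reward_model: RewardModel) -> float:
    """Direct modeling: average predicted reward of the target policy's choices."""
    total = 0.0
    for context, _, _, _ in records:
        total += sum(target_prob(context, a) * reward_model(context, a) for a in actions)
    return total / len(records)


def inverse_propensity(records: Sequence[Record], target_prob: Policy) -> float:
    """Inverse propensity scoring over the logged rewards."""
    total = 0.0
    for context, action, reward, propensity in records:
        total += target_prob(context, action) / propensity * reward
    return total / len(records)


def blended_estimate(records: Sequence[Record], actions: Sequence[str],
                     target_prob: Policy, reward_model: RewardModel,
                     predictor_weight: float) -> float:
    """predictor_weight = 1.0 is pure direct modeling; 0.0 is pure IPS."""
    if not 0.0 <= predictor_weight <= 1.0:
        raise ValueError("predictor_weight must lie in [0, 1]")
    dm = direct_method(records, actions, target_prob, reward_model)
    ips = inverse_propensity(records, target_prob)
    return predictor_weight * dm + (1.0 - predictor_weight) * ips


records = [("sports", "ad_a", 1.0, 0.5), ("sports", "ad_b", 0.0, 0.5),
           ("news", "ad_a", 0.0, 0.5), ("news", "ad_b", 1.0, 0.5)]
preferred = {"sports": "ad_a", "news": "ad_b"}


def target(context: str, action: str) -> float:
    return 1.0 if preferred[context] == action else 0.0


def predictor(context: str, action: str) -> float:
    # Hypothetical reward predictor fitted to the logged data.
    return 0.8 if preferred[context] == action else 0.2


print(round(blended_estimate(records, ["ad_a", "ad_b"], target, predictor, 0.5), 6))  # 0.9
```

Here the weight is simply an argument; how a selected estimator would set it is left open, as the claims do not say.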

Schrittwieser, Zhang and Chang teach the system of claim 1 and method of claim 14 but do not explicitly teach combining direct and inverse modeling; such a combination is, however, known in the art.  Gunjan teaches an affinity predictor [title] that assigns “weights” to “determined differences” in order to predict an affinity. [0007] Weights may be “directly proportional” to certain coefficients or may be “inversely proportional”. [0067; claims 2-3] All of the information may be combined. [0049] Gunjan and Schrittwieser are analogous art as each is directed to electronic means for making predictions. 

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Gunjan with that of Schrittwieser, Zhang and Chang in order to provide improved recommendations, as taught by Gunjan; [0006] further, it is simply a substitution of one known part for another with predictable results, simply combining Gunjan’s proportions to arrive at a value in place of the method of Schrittwieser; the substitution produces no new and unexpected result.

Claim(s) 5 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Schrittwieser et al. in view of Zhang et al. further in view of Chang et al. further in view of Nandi et al. (U.S. Publication No. 2017/0364371).

These claims are similar so are analyzed together.
With regard to Claim 5:
The system of claim 1, wherein the set of operations further comprises: 
receiving, from a user device, a second indication of a context;
determining, according to the target policy, an action based on the received context; and 
providing, in response to the second indication, a third indication of the determined action.

With regard to Claim 18:
The method of claim 14, further comprising: 
receiving, from a user device, a second indication of a context; 
determining, according to the target policy, an action based on the received context; and 
providing, in response to the second indication, a third indication of the determined action. 
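For illustration only, the three operations recited in claims 5 and 18 amount to a request/response exchange. The sketch below assumes, hypothetically, that the target policy is a table of action scores per context and that it acts greedily; the message fields `request_id` and `context` are invented for the example.

```python
from typing import Dict, Mapping

# Hypothetical target policy: a score for each candidate action, keyed by context.
TargetPolicy = Mapping[str, Mapping[str, float]]


def determine_action(policy: TargetPolicy, context: str) -> str:
    """Greedy choice: the action the target policy scores highest for this context."""
    scores = policy[context]
    return max(scores, key=scores.get)


def respond(policy: TargetPolicy, second_indication: Dict[str, str]) -> Dict[str, str]:
    """Receive a context indication from a user device; return an action indication."""
    action = determine_action(policy, second_indication["context"])
    return {"in_response_to": second_indication["request_id"], "action": action}


policy = {"sports": {"ad_a": 0.7, "ad_b": 0.3}, "news": {"ad_a": 0.1, "ad_b": 0.9}}
print(respond(policy, {"request_id": "r1", "context": "news"}))
# {'in_response_to': 'r1', 'action': 'ad_b'}
```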

Schrittwieser, Zhang and Chang teach the system of claim 1 and method of claim 14, but do not explicitly teach providing an indication of an action determined from a received indication of context; this is, however, known in the art.  Nandi teaches a context-dependent digital action assistance tool [title] which receives “context-specific rules” to “determine when to display” data. [0057] It includes “receiving context information” of a “user” to make information available “for presentation by the user device”. [0130] A rule reads on a policy. The user makes a request “on a predictable basis”. [0029] Nandi and Schrittwieser are analogous art as each is directed to electronic means for managing information related to predictions.

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Nandi with that of Schrittwieser, Zhang and Chang in order to leverage behavioral patterns, as taught by Nandi; [0002] further, it is simply a combination of known parts with predictable results, simply performing Nandi’s steps after those of Schrittwieser and Zhang; each part works independently of the other, and each works in combination identically to how it works when not combined, with no new and unexpected result inherent or disclosed.

Claim(s) 6 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Schrittwieser et al. in view of Zhang et al. further in view of Chang et al. further in view of Li et al. (U.S. Publication No. 2019/0303995, filed 3 April 2018).

These claims are similar so are analyzed together.
With regard to Claim 6:
The system of claim 1, wherein the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold.

With regard to Claim 19:
The method of claim 14, wherein the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold.
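Clipping an importance weight at a threshold, as recited in claims 6 and 19, is a common variance-reduction device for importance-weighted estimators. A minimal sketch, assuming the same hypothetical tuple schema of context, action, reward, and logging propensity; the function `clipped_ips` and the data are invented for illustration. The threshold is a free parameter here; treating it as the “hyperparameter” of claims 4 and 17 is only one possible reading.

```python
from typing import Callable, Sequence, Tuple

# (context, logged action, observed reward, logging propensity) -- hypothetical schema
Record = Tuple[str, str, float, float]


def clipped_ips(records: Sequence[Record],
                target_prob: Callable[[str, str], float],
                threshold: float) -> float:
    """IPS estimate in which any importance weight above `threshold` is clipped to it."""
    total = 0.0
    for context, action, reward, propensity in records:
        weight = target_prob(context, action) / propensity
        if weight > threshold:
            weight = threshold  # clip: the weight exceeded the threshold
        total += weight * reward
    return total / len(records)


# A rarely logged action (propensity 0.25) receives an importance weight of 4.0.
records = [("c", "a", 1.0, 0.25), ("c", "b", 0.0, 0.75)]


def always_a(context: str, action: str) -> float:
    return 1.0 if action == "a" else 0.0


print(clipped_ips(records, always_a, threshold=float("inf")))  # 2.0 (no clipping)
print(clipped_ips(records, always_a, threshold=2.0))           # 1.0
```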

Schrittwieser, Zhang and Chang teach the system of claim 1 and method of claim 14, including the use of a quality-agnostic estimator as cited above, but do not explicitly teach clipping a weight that exceeds a threshold; this is, however, known in the art.  Li teaches a training system to evaluate digital content. [title] It “predicts a performance value” indicating a probability of interaction by a user. [abstract] A weight is clipped if it “exceeds” a “clipping threshold”; it is explicitly an “importance weight”. [0028] Li and Schrittwieser are analogous art as each is directed to electronic means for making predictions. 

It would have been obvious to one of ordinary skill in the art just prior to the filing of the claimed invention to combine the teaching of Li with that of Schrittwieser, Zhang and Chang in order to improve accuracy, as taught by Li; [0003, 0007] further, it is simply a substitution of one known part for another with predictable results, simply using Li’s clipped weight in place of the parameter of Schrittwieser; the substitution produces no new and unexpected result.

Response to Arguments
Applicant's arguments filed 7 November 2022 in regard to rejections made under 35 U.S.C. § 101 have been fully considered but they are not persuasive.  As an initial matter, the applicant’s amendment has overcome the objections and the rejections made under § 112(b) in the previous Office action; these have been withdrawn.  In regard to § 101, first, the applicant asserts that the claims do not recite any abstract idea; however, “following rules or instructions” is explicitly one of the certain methods of organizing human activity deemed abstract, and the applicant does not explain his assertion that the “recited aspects are not akin to such examples”.
In regard to mental activity, again, the applicant makes a conclusory statement that generating an expected reward, etc., cannot “practically be performed in the human mind”, but gives no explanation as to why any of this would present the slightest difficulty for a person to do mentally.
The applicant then asserts that the elements improve “another technology or technical field” but does not explain how any of the non-abstract elements provide for any such improvement, and the Examiner finds none.  The applicant does not attempt to traverse the Examiner’s finding of fact in regard to step 2B.  The claims are not patent eligible and the rejection is maintained.

Applicant’s arguments with respect to claim(s) 1-7 and 14-20 in regard to rejections made under § 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.  The applicant refers mainly to language added by amendment and for which the reference to Chang has been incorporated herein.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SCOTT C ANDERSON whose telephone number is (571)270-7442. The examiner can normally be reached M-F 9:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bennett Sigmond can be reached on (303) 297-4411. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SCOTT C ANDERSON/           Primary Examiner, Art Unit 3694