DETAILED ACTION

	This action is in response to claims filed 31 July 2018 for application 16/050,176. Currently claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant's claim for priority based on provisional application 62/697,242, filed on 12 July 2018.
Information Disclosure Statement
An information disclosure statement (IDS) was submitted on 31 July 2018. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. 
Claim Objections
Claim 3 is objected to because of the following informalities: the word “Thomson” appears to be a misspelling of “Thompson.” Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, according to the first step (Step 1) of the 101 analysis, claim 1 is directed to a system (manufacture) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). In the next step (Step 2A, prong 1) of the analysis, the limitations of a recommendation component that recommends a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy; and an explanation component that generates an explanation of the decision, the explanation comprising one or more factors contributing to the decision, under the broadest reasonable interpretation, cover performance of these limitations in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, the limitations of a memory that stores computer executable components; and a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise, are considered to be additional elements. However, the judicial exceptions are not integrated into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that they are a memory that stores computer executable components and a processor that executes the computer executable components in the memory) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea is not indicative of integration into a practical application. (Also see MPEP 2106.05(b)). In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the judicial exceptions. 
As explained with respect to Step 2A Prong Two, the memory that stores 
Regarding claim 2, in Step 2A, prong 1, the limitation of further comprising a learning component that learns the constrained decision policy implicitly based on a set of example decisions, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 3, in Step 2A, prong 1, the limitation of further comprising a learning component that learns the constrained decision policy implicitly using classical Thomson sampling, under the broadest reasonable interpretation, involves mathematical calculations. So, the claim recites judicial exceptions and it falls within the “Mathematical concepts” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same 
Regarding claim 4, in Step 2A, prong 1, the limitation of further comprising a selection component that selects a decision policy of the one or more decision policies, using a constrained contextual multi-armed bandit setting, wherein the selection component selects the decision policy based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, under the broadest reasonable interpretation, involves mathematical relationships and calculations. So, the claim recites judicial exceptions and it falls within the “Mathematical concepts” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 5, in Step 2A, prong 1, the limitation of wherein the one or more decision policies are selected from a group consisting of: one or more second constrained decision policies; a reward-based decision policy; and a hybrid decision policy, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic 
Regarding claim 6, in Step 2A, prong 1, the limitation of further comprising a learning component that learns a reward-based decision policy, in an online environment, based on feedback corresponding to a plurality of decisions recommended by the recommendation component in the online environment, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 7, in Step 2A, prong 1, the limitation of further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 8, in Step 2A, prong 1, the limitation of further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, the claim recites an additional element, namely, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor. However, it is not integrated into a practical application because it only recites the desired result of improved processing accuracy or improved processing efficiency associated with the processor without reciting the actual improvements to the technology. (See MPEP 2106.05(a)). In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the abstract idea because the additional elements do not meaningfully limit the judicial exception when considered both individually and as a combination and thus fail to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 9, in Step 2A, prong 1, the limitation of further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, the claim recites an additional element, namely, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor. However, it is not integrated into a practical application because it only recites the desired result of improved processing accuracy or improved processing 
Regarding claim 10, in Step 2A, prong 1, the limitation of wherein the constrained decision policy is selected from a group consisting of: an ethical decision policy; a legal decision policy; a value decision policy; and a preference decision policy, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application because it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 11, according to the first step (Step 1) of the 101 analysis, claim 11 is directed to a method (process) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). In the next step (Step 2A, prong 1) of the analysis, the limitations of recommending a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy; and generating, by the system, an explanation of the decision, the explanation comprising one or more factors contributing to the decision, under the broadest reasonable interpretation, 
Regarding claim 12, in Step 2A, prong 1, the limitation of further comprising learning, by the system, the constrained decision policy implicitly, based on a set of example decisions, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same 
Regarding claim 13, in Step 2A, prong 1, the limitation of further comprising selecting, by the system, a decision policy of the one or more decision policies, based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting, under the broadest reasonable interpretation, involves mathematical relationships and calculations. So, the claim recites judicial exceptions and it falls within the “Mathematical concepts” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 14, in Step 2A, prong 1, the limitation of further comprising learning, by the system, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to 
Regarding claim 15, in Step 2A, prong 1, the limitation of further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, the claim recites an additional element, namely, thereby facilitating at least one of improved processing efficiency or improved processing time associated with the processor. However, it is not integrated into a practical application because it only recites the desired result of improved processing time or improved processing efficiency associated with the processor without reciting the actual improvements to the technology. (See MPEP 2106.05(a)). In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the abstract idea because the additional elements do not meaningfully limit the judicial exception when considered both individually and as a combination and thus fail to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 16, according to the first step (Step 1) of the 101 analysis, claim 16 is directed to a computer program product (manufacture) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). In the next step (Step 2A, prong 1) of the analysis, the limitations of recommend, by the processor, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy; and generate, by the processor, an explanation of the decision, the explanation comprising one or more factors contributing to the decision, under the broadest reasonable interpretation, cover performance of these limitations 
Regarding claim 17, in Step 2A, prong 1, the limitations of learn, the constrained decision policy implicitly, based on a set of example decisions; and learn, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions, under the broadest reasonable interpretation, cover performance of 
Regarding claim 18, in Step 2A, prong 1, the limitation of select, a decision policy of the one or more decision policies, based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting, under the broadest reasonable interpretation, involves mathematical relationships and calculations. So, the claim recites judicial exceptions and it falls within the “Mathematical concepts” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.
Regarding claim 19, in Step 2A, prong 1, the limitation of to blend, two or more of the one or more decision policies to generate a hybrid decision policy, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical 
Regarding claim 20, in Step 2A, prong 1, the limitation of to update, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, under the broadest reasonable interpretation, covers performance of this limitation in the mind. So, the claim recites judicial exceptions and it falls within the “Mental Processes” grouping of abstract ideas. In the next step (Step 2A, prong 2) of the analysis, the claim recites an additional element, namely, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor. However, it is not integrated into a practical application because it only recites the desired result of improved processing accuracy or improved processing efficiency associated with the processor without reciting the actual improvements to the technology. (See MPEP 2106.05(a)). In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the abstract idea because the additional elements do not meaningfully limit the judicial exception when considered both individually and as a combination and thus fail to add an inventive concept to the claim. The claim is not patent eligible.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 2, 5, 10, 11, 12, and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Etzioni et al. (WO 2012071543 A2).
Regarding claim 1, Etzioni teaches: A system, comprising: a memory that stores computer executable components (the execution of instructions that may be stored in a system memory 722 [00241]); and a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise: (one or more processors 720 to communicate with each subsystem and to control the execution of instructions that may be stored in a system memory 722 [00241]): a recommendation component that recommends a decision based on one or more decision policies (The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions [0050]. Another embodiment may create a policy for buy/wait decisions [00158]), wherein the decision complies with one or more constraints of a constrained decision policy (Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period [00158]); and an explanation component that generates an explanation of the decision, the explanation comprising one or more factors contributing to the decision (A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis [0050]).
Regarding claim 2, Etzioni teaches: The system of claim 1, further comprising a learning component that learns the constrained decision policy implicitly based on a set of example decisions (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning [00159]. Note: Training data corresponds to the set of example decisions).
Regarding claim 5, Etzioni teaches: The system of claim 1, wherein the one or more decision policies are selected from a group consisting of: one or more second constrained decision policies; a reward-based decision policy; and a hybrid decision policy (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data [00159]. Note: This corresponds to a reward-based decision policy).
Regarding claim 10, Etzioni teaches: The system of claim 1, wherein the constrained decision policy is selected from a group consisting of: an ethical decision policy; a legal decision policy; a value decision policy; and a preference decision policy (The user decision support component 210 may take into account user preferences when providing predictions, recommendations and decision support information. Such user preferences may be stored in corresponding user profiles in a user account database 216 managed by a user account management component 218 [0047]. Note: This corresponds to preference decision policy).
Regarding claim 11, Etzioni teaches: A computer-implemented method, comprising: recommending, by a system operatively coupled to a processor, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy (one or more processors [00241]. The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions [0050]. Another embodiment may create a policy for buy/wait decisions [00158]. Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period [00158]); and generating, by the system, an explanation of the decision, the explanation comprising one or more factors contributing to the decision (A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis [0050]).
Regarding claim 12, Etzioni teaches: The computer-implemented method of claim 11, further comprising learning, by the system, the constrained decision policy implicitly, based on a set of example decisions (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning [00159]. Note: Training data corresponds to the set of example decisions).
Regarding claim 16, Etzioni teaches: A computer program product facilitating a constrained decision-making and explanation of a recommendation process, the computer program product comprising a computer readable storage medium having program instructions (The software code may be stored as a series of instructions, or commands on a computer readable medium [00242]. One or more processors [00241]), recommend, by the processor, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy (The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions [0050]. Another embodiment may create a policy for buy/wait decisions [00158]. Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period [00158]); and generate, by the processor, an explanation of the decision, the explanation comprising one or more factors contributing to the decision (A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis [0050]).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al. (WO 2012071543 A2) in view of Xia et al. (Thompson Sampling for Budgeted Multi-armed Bandits, 2015).
Regarding claim 3, Etzioni teaches: The system of claim 1, further comprising a learning component that learns (as shown above).
However, Etzioni does not explicitly disclose: the constrained decision policy implicitly using classical Thomson sampling.
Xia teaches in an analogous system: the constrained decision policy implicitly using classical Thomson sampling (Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend the Thompson sampling to Budgeted MAB, where there is random cost for pulling an arm and the total cost is constrained by a budget [Page 1, Abstract]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Xia to use Thompson sampling. One would have been motivated to do this modification because doing so would give the benefit of achieving good performances in practical applications as taught by Xia [Page 1, Introduction, 2nd column, 2nd paragraph].
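For illustration only, the technique named in the rejection of claim 3 can be sketched as follows. This is a minimal sketch of classical Thompson sampling for a Bernoulli multi-armed bandit, assuming uniform Beta(1, 1) priors; it is not taken from Etzioni, Xia, or the application, and it omits the budget constraint that Xia adds. The function name `thompson_sampling` and the parameters `true_probs`, `rounds`, and `seed` are hypothetical.

```python
import random


def thompson_sampling(true_probs, rounds, seed=0):
    """Run classical Thompson sampling on a simulated Bernoulli bandit.

    true_probs: hypothetical success probability of each arm (simulation only).
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Play the arm whose sampled success rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)
        # Simulate a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Update the posterior counts for that arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls
```

Calling `thompson_sampling([0.2, 0.5, 0.8], rounds=2000)` returns per-arm pull counts; in this simulated setting the arm with the highest success probability would be expected to receive most of the pulls as the posteriors sharpen.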

Claims 7, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al. (WO 2012071543 A2) in view of Henderson et al. (Hybrid Reinforcement/Supervised Learning of Dialogue Policies from Fixed Data Sets, 2008).
Regarding claim 7, Etzioni teaches: The system of claim 1 (as shown above).
However, Etzioni does not explicitly disclose: further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy.
Henderson teaches in an analogous system: further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy (a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy [Page 1, first paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Henderson to use a hybrid model. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].
Regarding claim 15, Etzioni teaches: The computer-implemented method of claim 11 (as shown above).
However, Etzioni does not explicitly disclose: further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy, thereby facilitating at least one of improved processing efficiency or improved processing time associated with the processor.
Henderson teaches in an analogous system: further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy (a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy [Page 1, first paragraph]), thereby facilitating at least one of improved processing efficiency or improved processing time associated with the processor (The best hybrid policy improves over the average COMMUNICATOR system policy by 10% on our metric [Page 508, 2nd paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Henderson to use a hybrid model. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].
Regarding claim 19, Etzioni teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor to (as shown above).
However, Etzioni does not explicitly disclose: blend, by the processor, two or more of the one or more decision policies to generate a hybrid decision policy.
Henderson teaches in an analogous system: blend, by the processor, two or more of the one or more decision policies to generate a hybrid decision policy (a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy [Page 1, first paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Henderson to use a hybrid model. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].
Claims 6, 8, 9, 14, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Caron et al (EP 2816511 A1).
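Regarding claim 6, Etzioni teaches: The system of claim 1, further comprising a learning component that learns a reward-based decision policy (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data [00159]. Note: This corresponds to a reward-based decision policy).
However, Etzioni does not explicitly disclose: in an online environment, based on feedback corresponding to a plurality of decisions recommended by the recommendation component in the online environment.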
Caron teaches in an analogous system: in an online environment, based on feedback corresponding to a plurality of decisions recommended by the recommendation component in the online environment (recommender system to recommend items to a new user includes calculating reward estimates from multiple multi-armed bandit models of a user and her social network friends [Abstract]. When new users sign in to a recommender system for the first time, their social graph information is gathered. This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service [0014]. Learn online the top artists of u [0022]. Note: Online is being interpreted as being over the internet).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to use feedback in an online environment to learn. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].
Regarding claim 8, Etzioni teaches: The system of claim 1 (as shown above).
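However, Etzioni does not explicitly disclose: further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor.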
Caron teaches in an analogous system: further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor (In the MAB model, a decision maker repeatedly chooses among a finite set of K actions. At each step t, the action a chosen yields a reward X.sub.a,t drawn from a probability distribution intrinsic to a and unknown to the decision maker. The goal for the latter is to learn, as fast as possible, which are the actions yielding maximum reward in expectation [0010]. Note: Action corresponds to decision and the goal of learning as fast as possible corresponds to improved processing efficiency).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to yield a reward for an action. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible the actions yielding maximum reward as taught by Caron paragraph [0010].
Regarding claim 9, Etzioni teaches: The system of claim 1 (as shown above).
However, Etzioni does not explicitly disclose: further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor.
Caron teaches in an analogous system: further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor (At step 330, an update to the user's empirical estimate (Xa) is updated as a result of the feedback received from the new user. This is represented as line 7 of algorithm 2 or line 9 of algorithm 3. The feedback, provided by the new user over time, will eventually enhance the multi-armed bandit model of the new user such that the bandit will begin to generate high reward recommendations to the new user [0041]. Note: Generating high reward recommendations corresponds to improved processing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to update as a result of feedback. One would have been motivated to do this modification because doing so would give the benefit of enhancing the model such that the bandit will begin to generate high reward recommendations to the new user as taught by Caron paragraph [0041].
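The feedback-driven update of an empirical estimate described in Caron paragraph [0041] can be illustrated, under assumptions, by an incremental running mean. The helper `update_estimate` and its parameters are hypothetical. This is a generic running-average sketch and is not a reproduction of line 7 of algorithm 2 or line 9 of algorithm 3 of Caron.

```python
def update_estimate(mean, count, reward):
    """Fold one observed reward into an arm's empirical mean (illustrative assumption).

    mean: current empirical mean reward of the arm.
    count: number of rewards already folded into `mean`.
    reward: newly observed feedback for the arm.
    Returns the updated (mean, count) pair.
    """
    count += 1
    # Incremental form of the arithmetic mean; no reward history is stored.
    mean += (reward - mean) / count
    return mean, count
```

For example, starting from `(0.0, 0)`, a reward of 1.0 followed by a reward of 0.0 gives an estimate of 0.5 over two observations.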
Regarding claim 14, Etzioni teaches: The computer-implemented method of claim 11 and a reward-based decision policy (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data [00159]. Note: This corresponds to a reward-based decision policy).
However, Etzioni does not explicitly disclose: further comprising learning, by the system, in an online environment, based on feedback corresponding to a plurality of recommended decisions.
Caron teaches in an analogous system: further comprising learning, by the system, in an online environment, based on feedback corresponding to a plurality of recommended decisions (recommender system to recommend items to a new user includes calculating reward estimates from multiple multi-armed bandit models of a user and her social network friends [Abstract].  When new users sign in to a recommender system for the first time, their social graph information is gathered. This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service [0014]. Learn online the top artists of u [0022]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to use feedback in an online environment to learn. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].
Regarding claim 17, Etzioni teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor to: learn, by the processor, the constrained decision policy implicitly, based on a set of example decisions (Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning [00159]. Note: Training data corresponds to the set of example decisions).
However, Etzioni does not explicitly disclose: and learn, by the processor, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions.
Caron teaches in an analogous system: and learn, by the processor, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions (recommender system to recommend items to a new user includes calculating reward estimates from multiple multi-armed bandit models of a user and her social network friends [Abstract].  When new users sign in to a recommender system for the first time, their social graph information is gathered. This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service [0014]. Learn online the top artists of u [0022]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to use feedback in an online environment to learn. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].
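Regarding claim 20, Etzioni teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor (as shown above).
However, Etzioni does not explicitly disclose: update, by the processor, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor.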
Caron teaches in an analogous system: update, by the processor, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor (At step 330, an update to the user's empirical estimate (Xa) is updated as a result of the feedback received from the new user. This is represented as line 7 of algorithm 2 or line 9 of algorithm 3. The feedback, provided by the new user over time, will eventually enhance the multi-armed bandit model of the new user such that the bandit will begin to generate high reward recommendations to the new user [0041]. Note: Generating high reward recommendations corresponds to improved processing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to update as a result of feedback. One would have been motivated to do this modification because doing so would give the benefit of enhancing the model such that the bandit will begin to generate high reward recommendations to the new user as taught by Caron paragraph [0041].
Claims 4, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Huasen et al (Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits, 2015) and further in view of Slivkins (Contextual bandits with similarity information, 2011).
Regarding claim 4, Etzioni teaches: The system of claim 1, further comprising a selection component that selects a decision policy of the one or more decision policies (Apply policy to dataset to create new training data [00159]).
However, Etzioni does not explicitly disclose: using a constrained contextual multi-armed bandit setting, wherein the selection component selects the decision policy based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy.
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting (The contextual bandit problem is an important extension of the classic multi-armed bandit (MAB) problem [Page 1, Introduction]. Constrained contextual bandits involve complicated interactions between information acquisition and decision making [Page 7, last paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].
Slivkins teaches in an analogous system: wherein the selection component selects the decision policy based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy (Prior work on contextual bandits with similarity uses "uniform" partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on "uniform" partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of "benign" payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g. MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a) [Abstract]. Our stochastic contextual MAB setting, and specifically the contextual zooming algorithm, can be fruitfully applied beyond the ad placement scenario described above and beyond MAB with similarity information per se. First, writing xt = t one can incorporate "temporal constraints" (across time, for each arm), and combine them with "spatial constraints" (across arms, for each time) [Page 682, paragraph 3]. Note: Selecting an arm corresponds to selecting . Also algorithm 1 seems to use threshold [Page 688]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Slivkins to use similarity information. One would have been motivated to do this modification because doing so would give the benefit of maintaining a finer partition in high-payoff regions of the similarity space and in popular regions of the context space as taught by Slivkins [Abstract].
Regarding claim 13, Etzioni teaches: The computer-implemented method of claim 11, further comprising selecting, by the system, a decision policy of the one or more decision policies (Apply policy to dataset to create new training data [00159]).
However, Etzioni does not explicitly disclose: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting.
Slivkins teaches in an analogous system: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy (Prior work on contextual bandits with similarity uses "uniform" partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on "uniform" partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of "benign" payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g. MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a) [Abstract]. Our stochastic contextual MAB setting, and specifically the contextual zooming algorithm, can be fruitfully applied beyond the ad placement scenario described above and beyond MAB with similarity information per se. First, writing xt = t one can incorporate "temporal constraints" (across time, for each arm), and combine them with "spatial constraints" (across arms, for each time) [Page 682, paragraph 3]. Note: Selecting an arm corresponds to selecting. Also algorithm 1 seems to use threshold [Page 688]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Slivkins to use similarity information. One would have been motivated to do this modification because doing so would give the benefit of maintaining a finer partition in high-payoff regions of the similarity space and in popular regions of the context space as taught by Slivkins [Abstract].
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting (Constrained contextual bandits involve complicated interactions between information acquisition and decision making [Page 7, last paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].
Regarding claim 18, Etzioni teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor to select, by the processor, a decision policy of the one or more decision policies (Apply policy to dataset to create new training data [00159]).
However, Etzioni does not explicitly disclose: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting.
Slivkins teaches in an analogous system: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy (Prior work on contextual bandits with similarity uses "uniform" partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on "uniform" partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of "benign" payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g. MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a) [Abstract]. Our stochastic contextual MAB setting, and specifically the contextual zooming algorithm, can be fruitfully applied beyond the ad placement scenario described above and beyond MAB with similarity information per se. First, writing xt = t one can incorporate "temporal constraints" (across time, for each arm), and combine them with "spatial constraints" (across arms, for each time) [Page 682, paragraph 3]. Note: Selecting an arm corresponds to selecting. Also algorithm 1 seems to use threshold [Page 688]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Slivkins to use similarity information. One would have been motivated to do this modification because doing so would give the benefit of maintaining a finer partition in high-payoff regions of the similarity space and in popular regions of the context space as taught by Slivkins [Abstract].
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting (Constrained contextual bandits involve complicated interactions between information acquisition and decision making [Page 7, last paragraph]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
He et al (US 20170103413 A1) discloses Device, Method, and Computer readable medium of Generating Recommendations via Ensemble Multi-arm bandit with an LPBoost.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHAITANYA RAMESH JAYAKUMAR whose telephone number is (571)272-3369.  The examiner can normally be reached on Mon-Fri 7am-3.30pm, alt Fri off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/CHAITANYA R JAYAKUMAR/               Examiner, Art Unit 2122
/ERIC NILSSON/               Primary Examiner, Art Unit 2122