DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This action is in response to the submission filed 22 August 2022 for application 16/050,176. Claims 7 and 15 are amended. Claim 6 is canceled. Currently claims 1-5 and 7-20 are pending and have been examined. 

Response to Arguments
Applicant’s arguments, see Pages 7-9 of remarks, filed 22 August 2022, with respect to the rejection of claims 1-5 and 7-20 under 35 USC § 101 have been fully considered but are not persuasive because the amended limitations are directed towards abstract ideas without significantly more and they do not integrate the abstract idea into a practical application as shown below.
Regarding applicant’s arguments on page 7, that the claim does not recite any of the judicial exceptions enumerated in the 2019 PEG. On page 8, applicant argues, that the claim is eligible because it does not recite a judicial exception, and on page 9, applicant continues to argue, that the subject claims recite elements of learning which require a processor and a computer memory, that cannot be practically applied or performed as a mental process in the human mind, and also does not recite any method of organizing human activity and therefore, does not recite a judicial exception. 
Examiner’s Response: Examiner respectfully disagrees because, although the claims do not recite mathematical concepts or any method of organizing human activity, they do recite a mental process: the limitations identified in Step 2A, prong 1 of the 101 analysis, under the broadest reasonable interpretation, include an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components of a processor or computer memory. For example, one can recommend a decision based on a combination of decision policies that comply with constraints, generate an explanation of the decision, learn based on feedback, and select a decision policy based on a comparison, all without a computer. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas. Hence, the claims recite judicial exceptions as shown in detail below.
Regarding applicant’s arguments on page 9, that even if the claims could be considered to be directed to a judicial exception, they are directed to a practical application that provides an explanation of a decision, wherein the explanation comprises one or more factors contributing to the decision entity as discussed throughout the specification, and thus would be eligible under part 2A, Prong 2 of the Alice/Mayo two-step framework. 
Examiner’s response: Examiner respectfully disagrees because generating an explanation is identified as a mental process in Step 2A, prong 1 and is not directed to a practical application. Mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
Regarding applicant’s arguments on page 9, that even if the claims could be considered to be directed to a judicial exception, they are directed to a practical application that decreases the number of processing cycles utilized by the system, and thus would be eligible under part 2A, Prong 2 of the Alice/Mayo two-step framework.
Examiner’s response: Examiner respectfully disagrees because decreasing the number of processing cycles utilized by the system merely states a desirable result and so is not given any patentable weight, as it does not further limit the claim. As discussed in MPEP 2106.05(f), the recitation of claim limitations that attempt to cover any solution to an identified problem with no restriction on how the result is accomplished and no description of the mechanism for accomplishing the result does not integrate a judicial exception into a practical application or provide significantly more, because this type of recitation is equivalent to the words "apply it".
Applicant’s arguments, see Pages 9-17 of remarks, with respect to the rejection of claims under 35 USC § 103 have been fully considered but are not persuasive. 
Regarding applicant’s arguments on page 12, with respect to claims 1, 2, 5, 8, 9 and 10, that while Fernando et al. may disclose selecting a similar policy of the same type to learn a new policy based on effectiveness of reuse, the cited references, alone or in combination, fail to teach, disclose or suggest ...a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference ... selects the reward-based decision policy if the similarity threshold is greater than the difference... (emphasis added) as recited in independent claim 1.
Examiner’s response: Examiner respectfully disagrees because, although applicant cites page 723 of the Fernando reference, page 723 was not relied upon to teach that limitation. Instead, Fernando teaches that limitation on pages 720, 725, and 726 as shown in the detailed rejection below. Applicant also cites specification paragraph [0076] in the arguments, but it is to be noted that the specification is not read into the claims.
Regarding applicant’s arguments on pages 13-17, with respect to the rejection of claims 11, 12, 16, 3 and 4, that for at least similar reasons as those provided for claim 1, none of the cited references, alone or in combination, teach, disclose or suggest the elements of claims 11, 16, 3, and 4.
Examiner’s response: Examiner respectfully disagrees because, for the same reasons as explained above and shown in the detailed rejection below, the cited references do teach each and every element of claim 1, and hence of claims 11, 12, 16, 3, and 4.
Regarding applicant’s arguments on page 15, with respect to the rejection of claim 7, that while Henderson et al. may disclose generating a policy using reinforcement learning and supervised learning, the cited references, alone or in combination, fail to teach, disclose or suggest ...blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold... (emphasis added) as recited in dependent claim 7. 
Examiner’s response: Examiner respectfully disagrees because Henderson teaches “blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold” on pages 1 and 498. Page 1, first paragraph states, a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy. Page 498, Section 2.4, Paragraph 1 states, To do this, we propose a novel hybrid approach that combines RL with supervised learning. A discriminant function Qhybrid(s, a) is derived that combines these two criteria in a principled way. The resulting policy can be adjusted to be as similar as necessary to the policy in the data. Page 498, Section 2.4, Paragraph 2 states, Because in general multiple policies were used, we model the data’s policy as a probabilistic policy, using the estimate Sdata(s, a) of P(a|s) presented in the previous section. Considered together, under the broadest reasonable interpretation, examiner is interpreting these passages as “blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold”. Hence, Henderson teaches that limitation.
Regarding applicant’s arguments on pages 15-17, with respect to the rejection of claims 14, 17, 20, 15, 19, 13, and 18, that for at least similar reasons as those provided for claims 11 and 16, none of the cited references, alone or in combination, teach, disclose or suggest the elements of claims 14, 17, 20, 15, 19, 13, and 18.
Examiner’s response: Examiner respectfully disagrees because, for the same reasons as explained above and shown in the detailed rejection below, the cited references do teach each and every element of claims 11 and 16, and hence of claims 14, 17, 20, 15, 19, 13, and 18.

	
Claim Objections
Claim 15 is objected to because of the following informalities: The phrase “…wherein the blending the two or more …” is awkwardly worded.  Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-5 and 7-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed towards abstract ideas without significantly more. 
Regarding claim 1
According to the first step (Step 1) of the 101 analysis, claim 1 is directed to a system (manufacture) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). 
In the next step (Step 2A, prong 1) of the analysis, the limitations of a recommendation component that recommends a decision based on a combination of one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy, wherein the compliance with the one or more constraints; an explanation component that generates an explanation of the decision, the explanation comprising one or more factors contributing to the decision; a learning component that learns, a reward-based decision policy, based on feedback corresponding to a plurality of decisions recommended by the recommendation component; and a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy, under the broadest reasonable interpretation, cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, the limitations of a memory that stores computer executable components; and a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise, electronically, via machine learning, in an online environment, electronic, in the online environment, are considered to be additional elements. However, the judicial exceptions are not integrated into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that they are a memory that stores computer executable components and a processor that executes the computer executable components in the memory and electronically learns via machine learning in an online environment based on electronic feedback) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application. (Also see MPEP 2106.05(b)).
In the same step (Step 2A, prong 2), the limitation of, facilitates improved processing time and processing efficiency of the processor based on increased accuracy of the decision recommended, is considered to be an additional element; however, it merely states a desirable result and so is not given any patentable weight, as it does not further limit the claim. As discussed in MPEP 2106.05(f), the recitation of claim limitations that attempt to cover any solution to an identified problem with no restriction on how the result is accomplished and no description of the mechanism for accomplishing the result does not integrate a judicial exception into a practical application or provide significantly more, because this type of recitation is equivalent to the words "apply it".
In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, the limitations of memory that stores computer executable components and the processor that executes the computer executable components in the memory and electronically, via machine learning, in an online environment, electronic, in the online environment, are at best the equivalent of merely adding the words “apply it” to the judicial exception.
In the same step (Step 2B), the limitation of “facilitates improved processing time …” is not indicative of an improvement in technology because the additional limitation provides only a result-oriented solution and lacks details as to how the computer performs the modifications, which is equivalent to the words "apply it". Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception. The claim is not patent eligible.
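For illustration only, the comparison recited in the selection limitation of claim 1 can be sketched as below. This is a minimal sketch under stated assumptions, not the applicant's implementation and not taken from any cited reference: the function name, parameter names, and return labels are hypothetical, and the claim language does not state what happens when the similarity threshold equals the difference in similarity, so the fallback chosen here is an assumption.

```python
from typing import Literal

# Hypothetical labels for the two policies named in claim 1.
PolicyName = Literal["constrained", "reward_based"]


def select_policy(similarity_threshold: float,
                  difference_in_similarity: float) -> PolicyName:
    """Compare a similarity threshold against a difference in similarity.

    Mirrors the claim wording: the reward-based policy is selected when the
    threshold is greater than the difference, and the constrained policy is
    selected when the threshold is less than the difference.
    """
    if similarity_threshold > difference_in_similarity:
        return "reward_based"
    # Assumption: the claim is silent on equality, so this sketch falls back
    # to the constrained policy in that case.
    return "constrained"
```

As written, the rule reduces to a single numeric comparison between two values.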

Regarding claim 2
In step 2A, prong 1, the limitation of wherein the learning component also learns the constrained decision policy implicitly based on a set of example decisions, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. 
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.

Regarding claim 3
In Step 2A, prong 1 of the analysis, the limitation of, wherein the learning component also learns the constrained decision policy implicitly, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitation of, using classical Thompson sampling, is considered to be an additional element and it does not integrate the abstract idea into a practical application because the additional element is recited so generically (no details whatsoever are provided other than that it uses classical Thompson sampling) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, the system using classical Thompson sampling is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 4
In Step 2A, prong 1 of the analysis, the limitation of, wherein the selection component also selects a decision policy of the one or more decision policies, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitation of, using a constrained contextual multi-armed bandit setting, is considered to be an additional element and it does not integrate the abstract idea into a practical application because the additional element is recited so generically (no details whatsoever are provided other than that it uses a constrained contextual multi-armed bandit setting) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, the system using a constrained contextual multi-armed bandit setting is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 5
In step 2A, prong 1, the limitation of wherein the one or more decision policies are selected from a group comprising one or more second constrained decision policies; and the reward-based decision policy, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application. 
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.


Regarding claim 7
In step 2A, prong 1, the limitation of, further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy based on a similarity threshold, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application.
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.

Regarding claim 8
In step 2A, prong 1, the limitation of further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application.
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.

Regarding claim 9
In step 2A, prong 1, the limitation of further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application as it does not add any additional elements that integrate the abstract idea into a practical application.
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.

Regarding claim 10
In step 2A, prong 1, the limitation of, wherein the constrained decision policy is selected from a group consisting of: an ethical decision policy; a legal decision policy; a value decision policy; and a preference decision policy, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, it is not integrated into a practical application because it does not add any additional elements that integrate the abstract idea into a practical application. 
In the last step (Step 2B) of the analysis, it does not add any additional elements that amount to significantly more than the abstract idea as the same generic computer components are used to perform the steps and thus fails to add an inventive concept to the claim. The claim is not patent eligible.

Regarding claim 11
According to the first step (Step 1) of the 101 analysis, claim 11 is directed to a method (process) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
In the next step (Step 2A, prong 1) of the analysis, the limitations of, recommending, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy, facilitating concurrent maximization of a cumulative reward received based on a recommended decision and compliance with the constrained decision policy; and generating, an explanation of the decision, the explanation comprising one or more factors contributing to the decision; and wherein the explanation comprises a list of a ranking of a number of most influential features employed to make the decision based on the constrained decision policy or the reward-based decision policy; and selecting, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy, under the broadest reasonable interpretation, cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, the limitations, a computer-implemented method, by a system operatively coupled to a processor, by the system, are considered to be additional elements. However, the judicial exceptions are not integrated into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that it is a method implemented by a computer and a system operatively coupled to a processor) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the same step (Step 2A, prong 2), the limitation of, and electronically displaying, by the system, the explanation on the decision via a graphical user interface operably coupled to the system, wherein the explanation comprises an electronic visual list, is considered to be another additional element and as recited represents insignificant extra-solution activity that is data output, because it is a mere nominal or tangential addition to the claim and is therefore not indicative of integration into a practical application. See MPEP 2106.05(g).
In the last step (Step 2B) of the analysis, the additional elements, a computer-implemented method, by a system operatively coupled to a processor, by the system, do not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented by a computer and a system operatively coupled to a processor is at best the equivalent of merely adding the words “apply it” to the judicial exception. Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception.
In the same step (Step 2B) of the analysis, the recitation of the “electronically displaying…” limitation amounts to insignificant extra-solution activity because it is a mere nominal or tangential addition to the claim, amounting to mere data output (see MPEP 2106.05(g)). The courts have similarly found limitations directed to displaying a result, recited at a high level of generality, to be well-understood, routine, and conventional. See MPEP 2106.05(d)(II) (“presenting offers and gathering statistics”; “determining an estimated outcome and setting a price”). These limitations therefore remain insignificant extra-solution activity even upon reconsideration, and do not amount to significantly more. Even when considered in combination, these additional elements represent mere instructions to apply an exception and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.
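For illustration only, the “list of a ranking of a number of most influential features” recited in the explanation limitation of claim 11 can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the per-feature contribution scores, and the use of absolute magnitude as the measure of influence are all hypothetical and are not drawn from the application or the cited references.

```python
from typing import Dict, List


def rank_influential_features(contributions: Dict[str, float],
                              top_k: int) -> List[str]:
    """Return the names of the top_k features, most influential first.

    Assumption: "influence" is measured here as the absolute value of a
    per-feature contribution score supplied by the caller.
    """
    ranked = sorted(contributions.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical contribution scores for a single recommended decision.
example = {"age": 0.1, "price": -0.7, "rating": 0.4}
top_two = rank_influential_features(example, top_k=2)
```

Under these assumptions, the ranking amounts to sorting a short list of numbers and keeping the first few entries.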

Regarding claim 12
In step 2A, prong 1, the limitation of further comprising learning, the constrained decision policy implicitly, based on a set of example decisions, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, the limitation, by the system, is considered to be an additional element. However, the judicial exceptions are not integrated into a practical application because the additional element is recited so generically (no details whatsoever are provided other than that it is a method implemented by a computer and a system) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element, by the system, does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented by a computer and a system is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception. The claim is not patent eligible.



Regarding claim 13
In Step 2A, prong 1 of the analysis, the limitation of, further comprising selecting, a decision policy of the one or more decision policies, based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Process” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitations of, by the system and using a constrained contextual multi-armed bandit setting, are considered to be additional elements, and they do not integrate the abstract idea into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that it is a method implemented on a computer by a system and uses a constrained contextual multi-armed bandit setting) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional elements do not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented on a computer by a system and using a constrained contextual multi-armed bandit setting is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and do not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 14
In Step 2A, prong 1 of the analysis, the limitation of, further comprising learning, the reward-based decision policy based on feedback corresponding to a plurality of recommended decisions, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitation of, by the system, in an online environment, is considered to be an additional element and it does not integrate the abstract idea into a practical application because the additional element is recited so generically (no details whatsoever are provided other than that it is a method implemented on a computer by a system in an online environment) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea is not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented on a computer by a system in an online environment, is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 15
In Step 2A, prong 1 of the analysis, the limitation of, further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold, wherein the blending the two or more of the one or more decision policies, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitation of, by the system, is considered to be an additional element and it does not integrate the abstract idea into a practical application because the additional element is recited so generically (no details whatsoever are provided other than that it is a method implemented on a computer by a system) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea is not indicative of integration into a practical application.
In the same step (Step 2A, prong 2) of the analysis, the limitation of, facilitates at least one of improved processing efficiency or improved processing time associated with the processor, is considered to be an additional element as it just explains the desirable result and so is not given any patentable weight as it does not limit the claim further. As discussed in MPEP 2106.05(f), the recitation of claim limitations that attempt to cover any solution to an identified problem with no restriction on how the result is accomplished and no description of the mechanism for accomplishing the result, does not integrate a judicial exception into a practical application or provide significantly more because this type of recitation is equivalent to the words "apply it".  
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented on a computer by a system, is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. 
In the same step (step 2B) the limitation of “facilitates at least one of improved processing efficiency …” is not indicative of improvement in technology because the additional limitation provided only a result-oriented solution and lacked details as to how the computer performed the modifications, which was equivalent to the words "apply it". Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.


Regarding claim 16
According to the first step (Step 1) of the 101 analysis, claim 16 is directed to a computer program product (manufacture) and falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter). 
In the next step (Step 2A, prong 1) of the analysis, the limitations of recommend, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy; generate, an explanation of the decision, the explanation comprising one or more factors contributing to the decision; wherein the ranking comprises decision context vector factors based on the constrained decision policy; and select, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy, under the broadest reasonable interpretation, cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, the limitations of, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to, and by the processor, are considered to be additional elements. However, the judicial exceptions are not integrated into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that it is a computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application. 
In the same step (Step 2A, prong 2), the limitation of, and electronically render, the explanation on the decision via a graphical user interface operably coupled to the system, wherein the explanation comprises an electronic visual list of a ranking of a number of most influential features employed to make the decision, is considered to be another additional element and as recited represents insignificant extra-solution activity that is data output, because it is a mere nominal or tangential addition to the claim and is therefore not indicative of integration into a practical application. See MPEP 2106.05(g).
In the last step (Step 2B) of the analysis, the additional elements, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to, by the processor,  do not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to, is at best the equivalent of merely adding the words “apply it” to the judicial exception. Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. 
In the same step (Step 2B) of the analysis, the recitation of the “electronically render…” limitation amounts to insignificant extra-solution activity because it is a mere nominal or tangential addition to the claim, amounting to mere data output (see MPEP 2106.05(g)). The courts have similarly found limitations directed to displaying a result, recited at a high level of generality, to be well-understood, routine, and conventional. See MPEP 2106.05(d)(II) (“presenting offers and gathering statistics”; “determining an estimated outcome and setting a price”). These limitations therefore remain insignificant extra-solution activity even upon reconsideration, and do not amount to significantly more. Even when considered in combination, these additional elements represent mere instructions to apply an exception and insignificant extra-solution activity, which cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 17 
In step 2A, prong 1, the limitations of learn, the constrained decision policy implicitly, based on a set of example decisions; and learn, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions, under the broadest reasonable interpretation, cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2) of the analysis, the limitation, by the processor, and in an online environment, are considered to be additional elements. However, the judicial exceptions are not integrated into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that it is a method implemented by a computer and a system) that it represents no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea is not indicative of integration into a practical application. 
In the last step (Step 2B) of the analysis, the additional elements, by the processor, and in an online environment, do not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two, a method implemented by a computer and a system is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.



Regarding claim 18
In Step 2A, prong 1 of the analysis, the limitation of, further comprising select, a decision policy of the one or more decision policies, based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitations of, wherein the program instructions are further executable by the processor to cause the processor to, by the processor, and using a constrained contextual multi-armed bandit setting, are considered to be additional elements and they do not integrate the abstract idea into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that the program instructions are further executable by the processor to cause the processor to select and using a constrained contextual multi-armed bandit setting) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two the program instructions are further executable by the processor to cause the processor to select and using a constrained contextual multi-armed bandit setting is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 19
In step 2A, prong 1, the limitation of, to blend, two or more of the one or more decision policies to generate a hybrid decision policy, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitations of, wherein the program instructions are further executable by the processor to cause the processor to, and by the processor, are considered to be additional elements and they do not integrate the abstract idea into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that the program instructions are further executable by the processor to cause the processor to) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two the program instructions are further executable by the processor to cause the processor to, is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 20
In step 2A, prong 1, the limitation of, to update, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, under the broadest reasonable interpretation, covers mental processes including an observation, evaluation, judgment or opinion that could be performed in the mind or with the aid of pencil and paper but for the recitation of generic computer components. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
In the next step (Step 2A, prong 2), the limitations of, wherein the program instructions are further executable by the processor to cause the processor to, by the processor, and associated with the processor, are considered to be additional elements and they do not integrate the abstract idea into a practical application because the additional elements are recited so generically (no details whatsoever are provided other than that the program instructions are further executable by the processor to cause the processor to) that they represent no more than mere instructions to apply the judicial exception on a computer. As discussed in MPEP 2106.05(f), mere instructions to implement an abstract idea on a computer as a tool to perform an abstract idea are not indicative of integration into a practical application.
In the same step (Step 2A, prong 2), the limitation of, thereby facilitating at least one of improved processing accuracy or improved processing efficiency, is considered to be an additional element as it just explains the desirable result and so is not given any patentable weight as it does not limit the claim further. As discussed in MPEP 2106.05(f), the recitation of claim limitations that attempt to cover any solution to an identified problem with no restriction on how the result is accomplished and no description of the mechanism for accomplishing the result, does not integrate a judicial exception into a practical application or provide significantly more because this type of recitation is equivalent to the words "apply it".
In the last step (Step 2B) of the analysis, the additional element does not amount to significantly more than the judicial exceptions. As explained with respect to Step 2A Prong Two the program instructions are further executable by the processor to cause the processor to, is at best the equivalent of merely adding the words “apply it” to the judicial exception. See MPEP 2106.05(f). Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. 
In the same step (step 2B) the limitation of “thereby facilitating at least one of improved processing accuracy …” is not indicative of improvement in technology because the additional limitation provided only a result-oriented solution and lacked details as to how the computer performed the modifications, which was equivalent to the words "apply it". Mere instructions to apply an exception cannot provide an inventive concept and does not amount to significantly more than the judicial exception. The claim is not patent eligible.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 5, 8, 9, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Caron et al (EP 2816511 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006).
Regarding claim 1
Etzioni teaches: A system, comprising: a memory that stores computer executable components ([00241] the execution of instructions that may be stored in a system memory 722); 
and a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise: ([00241] one or more processors 720 to communicate with each subsystem and to control the execution of instructions that may be stored in a system memory 722): 
a recommendation component that recommends a decision based on a combination of one or more decision policies ([0050] The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions. [00158] Another embodiment may create a policy for buy/wait decisions), wherein the decision complies with one or more constraints of a constrained decision policy, wherein the compliance with the one or more constraints ([00158] Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period), facilitates improved processing time and processing efficiency of the processor based on increased accuracy of the decision recommended (Note: Claim scope is not limited by claim language that suggests or makes optional but does not require steps to be performed, or by claim language that does not limit a claim to a particular structure. A “thereby” clause in a method claim is not given weight when it simply expresses the intended result of a process step positively recited. It is an intended result and, as discussed in MPEP 2111.04, is not given patentable weight); 
and an explanation component that generates an explanation of the decision, the explanation comprising one or more factors contributing to the decision ([0050] A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis); 
a learning component that electronically learns, via machine learning, a reward-based decision policy, corresponding to a plurality of decisions recommended by the recommendation component ([0050] The decision support component 300 may include a purchase timing recommendation component 304. [0051] Various components of the prediction service 200 (Figure 2) and/or the user decision support component 300 (Figure 3) may incorporate and/or be incorporated by one or more machine learning components. Such machine learning components may utilize any suitable machine learning technique. [00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose).
However, Etzioni does not explicitly disclose: in an online environment, based on electronic feedback, in the online environment; and a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy.
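For clarity of the record, the two-branch comparison recited in the selection limitation can be read as in the minimal sketch below. This is only an illustration of the claim language as quoted above, not code from the application or from any cited reference. The function name `select_policy`, the string labels, and the treatment of the equality case (which the claim language does not address) are assumptions made for the sketch.

```python
from typing import Optional


def select_policy(similarity_threshold: float,
                  similarity_difference: float) -> Optional[str]:
    """Illustrative reading of the claimed selection rule (hypothetical helper).

    similarity_difference stands for the "difference in similarity between the
    constrained decision policy and the reward-based decision policy".
    """
    if similarity_threshold < similarity_difference:
        # Threshold is less than the difference: constrained policy is selected.
        return "constrained"
    if similarity_threshold > similarity_difference:
        # Threshold is greater than the difference: reward-based policy is selected.
        return "reward_based"
    # Assumption: the claim is silent when the two values are equal.
    return None
```

Under this reading, a threshold of 0.2 against a difference of 0.5 selects the constrained decision policy, and the reverse ordering selects the reward-based decision policy.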
Caron teaches, in an analogous system: in an online environment, based on electronic feedback, in the online environment ([0014] This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service. [0022] learn online the top artists of u. Note: Online is being interpreted as being over the internet).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the learning system of Etzioni to incorporate the teachings of Caron to use feedback in an online environment to learn. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].
Fernando teaches, in an analogous system: and a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. PLPR is therefore capable of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, δ)
wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. [Page 725, Column 2, Paragraph 2] The algorithm makes a decision on whether to add Π2 to the Policy Library or not. This decision is based on how similar Π1 is to Π2, [Page 726, Column 1, First Paragraph] where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: Policy library corresponds to the constrained decision policy),
and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. [Page 726, Column 1, First Paragraph] The new policy learned is inserted in the library if Wmax is lower than δ times the gain obtained by using the new policy (WΩ), where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: The learned policy corresponds to the reward-based decision policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni and Caron to incorporate the teachings of Fernando to use a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold. One would have been motivated to do this modification because doing so would give the benefit of being capable of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, as taught by Fernando paragraph [Page 720, Column 2, Last Paragraph].
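The library-insertion criterion quoted from Fernando at Page 726 (the new policy is inserted if Wmax is lower than the similarity threshold times the gain obtained by using the new policy) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the PLPR algorithm: the function and parameter names are hypothetical, and Wmax is assumed here to be the largest gain obtained by reusing a policy already in the library, which the quoted passage does not define.

```python
def should_add_to_library(w_max: float, w_new: float, delta: float) -> bool:
    """Hypothetical helper mirroring the quoted insertion criterion.

    w_max : assumed to be the best gain from reusing a past (library) policy
    w_new : gain obtained by using the newly learned policy
    delta : similarity threshold, required by the quoted text to lie in [0, 1]
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    # Insert the new policy only when no past policy comes within a factor
    # delta of the new policy's gain, i.e. the policies are "sufficiently" different.
    return w_max < delta * w_new
```

For example, with delta = 0.5 and a new-policy gain of 1.0, a best past gain of 0.3 leads to insertion, while a best past gain of 0.8 does not.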



Regarding claim 2
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above). 
Etzioni further teaches: wherein the learning component also learns the constrained decision policy implicitly based on a set of example decisions ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning. Note: Training data corresponds to the set of example decisions).

Regarding claim 5
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
Etzioni further teaches wherein the one or more decision policies are selected from a group comprising: one or more second constrained decision policies; and the reward-based decision policy ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Note: This corresponds to a reward-based decision policy).

Regarding claim 8
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
However, Etzioni does not explicitly disclose: further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision.
Caron teaches in an analogous system: further comprising a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision  ([0010] In the MAB model, a decision maker repeatedly chooses among a finite set of K actions. At each step t, the action a chosen yields a reward X.sub.a,t drawn from a probability distribution intrinsic to a and unknown to the decision maker. The goal for the latter is to learn, as fast as possible, which are the actions yielding maximum reward in expectation. Note: Action corresponds to decision and the goal of learning as fast as possible corresponds to improved processing efficiency).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to comprise a reward component adapted to receive a reward signal from an entity based on the decision, wherein the reward signal is indicative of quality of the decision. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible the actions yielding maximum reward as taught by Caron paragraph [0010].

Regarding claim 9
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
However, Etzioni does not explicitly disclose: further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision.
Caron teaches in an analogous system: further comprising an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision ([0041] At step 330, an update to the user's empirical estimate (Xa) is updated as a result of the feedback received from the new user. This is represented as line 7 of algorithm 2 or line 9 of algorithm 3. The feedback, provided by the new user over time, will eventually enhance the multi-armed bandit model of the new user such that the bandit will begin to generate high reward recommendations to the new user. Note: Generating high reward recommendations corresponds to improved processing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni to incorporate the teachings of Caron to comprise an update component that updates at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision. One would have been motivated to do this modification because doing so would give the benefit of enhancing the model such that the bandit will begin to generate high reward recommendations to the new user as taught by Caron paragraph [0041].
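For illustration only, the feedback-driven update described in the Caron passages cited for claims 8 and 9 (each chosen action yields a reward, and the empirical estimate for that action is updated from the feedback) can be sketched with a running mean per arm. This is a generic multi-armed bandit bookkeeping sketch, not Caron's algorithm 2 or 3; the class name, the arm names, and the reward values are hypothetical.

```python
class ArmEstimate:
    """Running empirical mean of the rewards observed for one action (arm)."""

    def __init__(self) -> None:
        self.pulls = 0
        self.mean = 0.0

    def update(self, reward: float) -> None:
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.pulls += 1
        self.mean += (reward - self.mean) / self.pulls


def best_arm(estimates: dict) -> str:
    """Return the arm whose current empirical mean reward is highest."""
    return max(estimates, key=lambda name: estimates[name].mean)


# Hypothetical feedback stream: (recommended arm, observed reward signal).
estimates = {"artist_a": ArmEstimate(), "artist_b": ArmEstimate()}
feedback = [("artist_a", 1.0), ("artist_b", 0.0), ("artist_a", 0.0), ("artist_b", 0.0)]
for arm, reward in feedback:
    estimates[arm].update(reward)
```

After these four feedback events the estimate for the first arm is 0.5 and the estimate for the second arm is 0.0, so later recommendations would favor the first arm.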

Regarding claim 10
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
Etzioni further teaches wherein the constrained decision policy is selected from a group consisting of: an ethical decision policy; a legal decision policy; a value decision policy; and a preference decision policy ([0047] The user decision support component 210 may take into account user preferences when providing predictions, recommendations and decision support information. Such user preferences may be stored in corresponding user profiles in a user account database 216 managed by a user account management component 218. Note: This corresponds to preference decision policy).

Claims 11, 12, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Lee et al (US 20170098236 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006).
Regarding claim 11
Etzioni teaches: A computer-implemented method, comprising: recommending, by a system operatively coupled to a processor, a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy ([00241] one or more processors. [0050] The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions. [00158] Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period); facilitating concurrent maximization of a cumulative reward received based on a recommended decision and compliance with the constrained decision policy ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data);
generating, by the system, an explanation of the decision, the explanation comprising one or more factors contributing to the decision ([0050] A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis); 
and electronically displaying, by the system, the explanation on the decision via a graphical user interface operably coupled to the system, wherein the explanation comprises an electronic visual list ... of a number of most influential features employed to make the decision based on the constrained decision policy or the reward-based decision policy ([0048] For example, the prediction service 200 may include a web-based graphical user interface. [00240] At step 608, a purchase timing recommendation may be determined, for example, by the purchase timing recommendation component 304 (Figure 3). Purchase timing recommendations may be determined based on the price prediction(s) of step 606 and the successor availability date predictions of step 510 and/or step 512. At step 610, one or more factors that significantly contributed to the recommendation of step 608 may be determined, for example, by the prediction explanation component 308, and the factor(s) may be mapped to human-readable explanation(s) at step 612. In accordance with at least one embodiment of the invention, step 612 may be incorporated into step 610. At step 614, the recommendation of step 608 and support information such as the explanation of step 610, may be provided for presentation, for example, with a suitable user interface 220 (Figure 2)).
However, Etzioni does not explicitly disclose: of a ranking of a number of most influential features; and selecting, by the system, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy.
Lee teaches in an analogous system: of a ranking of a number of most influential features ([0055] to generate a limited set of ranked combinations of contextual and advertiser features yielding high expected response rates).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Etzioni to incorporate the teachings of Lee to use a ranking of a number of most influential features. One would have been motivated to do this modification because doing so would give the benefit of determining an optimal feature combination to provide the highest expected reward as taught by Lee [0055].
Fernando teaches, in an analogous system: and selecting, by the system, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. PLPR is therefore capable of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, δ),
wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. [Page 725, Column 2, Paragraph 2] The algorithm makes a decision on whether to add Π2 to the Policy Library or not. This decision is based on how similar Π1 is to Π2, [Page 726, Column 1, First Paragraph] where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: Policy library corresponds to the constrained decision policy),
and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. [Page 726, Column 1, First Paragraph] The new policy learned is inserted in the library if Wmax is lower than δ times the gain obtained by using the new policy (WΩ), where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: The learned policy corresponds to the reward-based decision policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Etzioni and Lee to incorporate the teachings of Fernando to use a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy. One would have been motivated to do this modification because doing so would give the benefit of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, as taught by Fernando [Page 720, Column 2, Last Paragraph].

Regarding claim 12
The system of Etzioni, Lee, and Fernando teaches: The computer-implemented method of claim 11 (as shown above). 
Etzioni further teaches: further comprising learning, by the system, the constrained decision policy implicitly, based on a set of example decisions ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning. Note: Training data corresponds to the set of example decisions).

Regarding claim 16
Etzioni teaches: A computer program product facilitating a constrained decision-making and explanation of a recommendation process, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: recommend, by the processor ([00242] The software code may be stored as a series of instructions, or commands on a computer readable medium. [00241] one or more processors), a decision based on one or more decision policies, wherein the decision complies with one or more constraints of a constrained decision policy ([0050] The decision support component 300 may include a purchase timing recommendation component 304 configured at least to determine beneficial purchase timing recommendations based at least in part on product price and successor availability predictions. [00158] Another embodiment may create a policy for buy/wait decisions based on a cost-sensitive classifier where the desired prediction is WAIT when the cost of BUYING = 0 and cost of WAITING = profit - penalty for length of waiting period);
and generate, by the processor, an explanation of the decision, the explanation comprising one or more factors contributing to the decision ([0050] A prediction explanation component 308 may be configured at least to determine one or more human-readable explanations for predictions made by the prediction service 200. For example, such explanations may correlate with most significant factors as determined with factor analysis); 
electronically render, by the processor, the explanation on the decision via a graphical user interface operably coupled to the system, wherein the explanation comprises an electronic visual list of a number of most influential features employed to make the decision ([0048] For example, the prediction service 200 may include a web-based graphical user interface. [00240] At step 608, a purchase timing recommendation may be determined, for example, by the purchase timing recommendation component 304 (Figure 3). Purchase timing recommendations may be determined based on the price prediction(s) of step 606 and the successor availability date predictions of step 510 and/or step 512. At step 610, one or more factors that significantly contributed to the recommendation of step 608 may be determined, for example, by the prediction explanation component 308, and the factor(s) may be mapped to human-readable explanation(s) at step 612. In accordance with at least one embodiment of the invention, step 612 may be incorporated into step 610. At step 614, the recommendation of step 608 and support information such as the explanation of step 610, may be provided for presentation, for example, with a suitable user interface 220 (Figure 2)).
However, Etzioni does not explicitly disclose: of a ranking of a number of most influential features, wherein the ranking comprises decision context vector factors based on the constrained decision policy; and select, by the processor, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy, wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy, and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy.
Lee teaches in an analogous system: of a ranking of a number of most influential features, wherein the ranking comprises decision context vector factors based on the constrained decision policy ([0055] The present description describes a feature recommendation and bidding system that combines a collaborative filtering system with a multi-arm bandit system to make bidding decisions. The collaborative filtering system may be a part of the feature recommendation part of the system and may include a factorization machine that uses a training data set to generate a limited set of ranked combinations of contextual and advertiser features yielding high expected response rates. The multi-armed bandit system may be a component of the bidding part of the system and may receive the limited set of ranked feature combinations from the feature recommendation part).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the computer program product facilitating a constrained decision-making and explanation of a recommendation process system of Etzioni to incorporate the teachings of Lee to use a ranking of a number of most influential features, wherein the ranking comprises decision context vector factors based on the constrained decision policy. One would have been motivated to do this modification because doing so would give the benefit of determining an optimal feature combination to provide the highest expected reward as taught by Lee [0055].
Fernando teaches, in an analogous system: and select, by the processor, a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. PLPR is therefore capable of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, δ),
wherein the selection component selects the constrained decision policy if the similarity threshold is less than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. Otherwise, it does not. [Page 725, Column 2, Paragraph 2] The algorithm makes a decision on whether to add Π2 to the Policy Library or not. This decision is based on how similar Π1 is to Π2, [Page 726, Column 1, First Paragraph] where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: Policy library corresponds to the constrained decision policy),
and wherein the selection component selects the reward-based decision policy if the similarity threshold is greater than the difference in similarity between the constrained decision policy and the reward-based decision policy ([Page 720, Column 2, Last Paragraph] When solving a new problem by policy reuse, the PLPR algorithm determines how different the learned policy is from the past policies as a function of the effectiveness of the reuse. If the past and new policies are “sufficiently” different, PLPR decides to add the new policy to the library of policies. [Page 726, Column 1, First Paragraph] The new policy learned is inserted in the library if Wmax is lower than δ times the gain obtained by using the new policy (WΩ), where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library. Note: The learned policy corresponds to the reward-based decision policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Etzioni and Lee to incorporate the teachings of Fernando to use a selection component that selects a decision policy of the one or more decision policies based on a comparison between a similarity threshold and a difference in similarity between the constrained decision policy and a reward-based decision policy. One would have been motivated to do this modification because doing so would give the benefit of identifying a set of “core” policies that need to be saved to solve any new task in the domain within a threshold of similarity, as taught by Fernando [Page 720, Column 2, Last Paragraph].

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Caron et al (EP 2816511 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) and further in view of Xia et al (Thompson Sampling for Budgeted Multi-armed Bandits, 2015).
Regarding claim 3
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1, further comprising a learning component that learns (as shown above).
However, the system of Etzioni, Caron, and Fernando does not explicitly disclose: the constrained decision policy implicitly using classical Thompson sampling.
Xia teaches in an analogous system: the constrained decision policy implicitly using classical Thompson sampling ([Page 1, Abstract] Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend the Thompson sampling to Budgeted MAB, where there is random cost for pulling an arm and the total cost is constrained by a budget).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Caron, and Fernando to incorporate the teachings of Xia to use classical Thompson sampling. One would have been motivated to do this modification because doing so would give the benefit of achieving good performances in practical applications as taught by Xia [Page 1, Introduction, 2nd column, 2nd paragraph].
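For illustration only, classical Thompson sampling for a Bernoulli multi-armed bandit can be sketched as follows. This is the textbook Beta-Bernoulli form and not Xia's budgeted extension, which additionally models a random cost per pull and a total budget; the function name, the reward rates, the number of rounds, and the seed are hypothetical.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Run classical Beta-Bernoulli Thompson sampling; return pulls per arm.

    true_rates : hidden success probability of each arm (simulation only)
    rounds     : number of decisions to make
    seed       : seed for a private random generator, for repeatable runs
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one sample per arm from its Beta(successes + 1, failures + 1) posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        # Play the arm whose sampled reward rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)
        # Simulated binary reward signal for the chosen arm.
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical two-arm problem with reward rates 0.2 and 0.8.
pulls = thompson_sampling([0.2, 0.8], rounds=2000)
```

Because each posterior concentrates as feedback accumulates, the arm with the higher reward rate ends up being selected in most rounds.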

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Caron et al (EP 2816511 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) and further in view of Henderson et al (Hybrid Reinforcement/Supervised Learning of Dialogue Policies from Fixed Data Sets, 2008).
Regarding claim 7
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
However, the system of Etzioni, Caron, and Fernando does not explicitly disclose: further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold.
Henderson teaches in an analogous system: further comprising a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold ([Page 1, first paragraph] a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy  [Page 498, Section 2.4, Paragraph 1] To do this, we propose a novel hybrid approach that combines RL with supervised learning. A discriminant function Qhybrid(s, a) is derived that combines these two criteria in a principled way. The resulting policy can be adjusted to be as similar as necessary to the policy in the data.  [Page 498, Section 2.4, Paragraph 2] Because in general multiple policies were used, we model the data’s policy as a probabilistic policy, using the estimate Sdata(s, a) of P(a|s) presented in the previous section).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Caron, and Fernando to incorporate the teachings of Henderson to use a blending component that blends two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].

Claims 14, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Lee et al (US 20170098236 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) and further in view of Caron et al (EP 2816511 A1).
Regarding claim 14
The system of Etzioni, Lee, and Fernando teaches: The computer-implemented method of claim 11 (as shown above).
Etzioni further teaches: and the reward-based decision policy ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Note: This corresponds to a reward-based decision policy).
However, the system of Etzioni, Lee, and Fernando does not explicitly disclose: further comprising learning, by the system, in an online environment, based on feedback corresponding to a plurality of recommended decisions.
Caron teaches in an analogous system: further comprising learning, by the system, in an online environment, based on feedback corresponding to a plurality of recommended decisions ([Abstract] recommender system to recommend items to a new user includes calculating reward estimates from multiple multi-armed bandit models of a user and her social network friends.  [0014] When new users sign in to a recommender system for the first time, their social graph information is gathered. This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service. [0022] Learn online the top artists of u).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni, Lee, and Fernando to incorporate the teachings of Caron to learn, by the system, in an online environment, based on feedback corresponding to a plurality of recommended decisions. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].

Regarding claim 17
The system of Etzioni, Lee, and Fernando teaches: The computer program product of claim 16 (as shown above). 
Etzioni further teaches: wherein the program instructions are further executable by the processor to cause the processor to: learn, by the processor, the constrained decision policy implicitly, based on a set of example decisions ([00159] Another embodiment may create a policy for buy/wait decisions that is explicitly optimized to maximize the reward for a customer over a set of training data. Such an embodiment may develop algorithms based on reinforcement learning and sequential decision making for this purpose. Create an initial training dataset for learning. Note: Training data corresponds to the set of example decisions).
However, the system of Etzioni, Lee, and Fernando does not explicitly disclose: and learn, by the processor, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions.
Caron teaches in an analogous system: and learn, by the processor, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions ([Abstract] recommender system to recommend items to a new user includes calculating reward estimates from multiple multi-armed bandit models of a user and her social network friends.  [0014] When new users sign in to a recommender system for the first time, their social graph information is gathered. This may be from their account on social networking sites such as Facebook™, their email address books, or an interface where users can explicitly friend other users of the recommender system. Once a user is signed in, the recommender system picks an artist and samples a song by that artist to recommend to the user. The user may choose to skip the song if she does not enjoy it and move to the next song. From such repetitive feedback, the system wants to learn as fast as possible a set of artists that the user likes, giving her an incentive to continue to use the service. [0022] Learn online the top artists of u).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Lee, and Fernando to incorporate the teachings of Caron to learn, by the processor, in an online environment, a reward-based decision policy based on feedback corresponding to a plurality of recommended decisions. One would have been motivated to do this modification because doing so would give the benefit of learning as fast as possible giving users incentive to use the service as taught by Caron paragraph [0014].

Regarding claim 20
The system of Etzioni, Lee, and Fernando teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor (as shown above).
However, the system of Etzioni, Lee, and Fernando does not explicitly disclose: update, by the processor, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor.
Caron teaches in an analogous system: update, by the processor, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor ([0041] At step 330, an update to the user's empirical estimate (Xa) is updated as a result of the feedback received from the new user. This is represented as line 7 of algorithm 2 or line 9 of algorithm 3. The feedback, provided by the new user over time, will eventually enhance the multi-armed bandit model of the new user such that the bandit will begin to generate high reward recommendations to the new user. Note: Generating high reward recommendations corresponds to improved processing).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Lee, and Fernando to incorporate the teachings of Caron to update, by the processor, at least one of the one or more decision policies based on a reward signal indicative of feedback corresponding to the decision, thereby facilitating at least one of improved processing accuracy or improved processing efficiency associated with the processor. One would have been motivated to do this modification because doing so would give the benefit of enhancing the model such that the bandit will begin to generate high reward recommendations to the new user as taught by Caron paragraph [0041].
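For illustration only, the reward-signal update quoted from Caron [0041] can be pictured as a running-average update of a per-arm empirical estimate. The Python sketch below is not Caron's algorithm 2 or algorithm 3; the class name ArmEstimate, the method update, and the binary listened/skipped reward encoding are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ArmEstimate:
    """Empirical reward estimate for one arm (hypothetical helper, e.g. one artist)."""
    pulls: int = 0
    mean: float = 0.0

    def update(self, reward: float) -> float:
        """Fold one observed reward into the running mean and return the new mean."""
        self.pulls += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.mean += (reward - self.mean) / self.pulls
        return self.mean


# Assumed feedback encoding: 1.0 = user listened, 0.0 = user skipped.
arm = ArmEstimate()
for feedback in (1.0, 0.0, 1.0, 1.0):
    arm.update(feedback)
```

Under these assumptions, each item of feedback moves the estimate toward the observed reward, so arms that are repeatedly well received accumulate higher estimates over time.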

Claims 15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Lee et al (US 20170098236 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) and further in view of Henderson et al (Hybrid Reinforcement/Supervised Learning of Dialogue Policies from Fixed Data Sets, 2008).
Regarding claim 15
The system of Etzioni, Lee, and Fernando teaches: The computer-implemented method of claim 11 (as shown above).
However, the system of Etzioni, Lee, and Fernando does not explicitly disclose: further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold, wherein the blending the two or more of the one or more decision policies facilitates at least one of improved processing efficiency or improved processing time associated with the processor.
Henderson teaches in an analogous system: further comprising blending, by the system, two or more of the one or more decision policies to generate a hybrid decision policy based on the similarity threshold ([Page 1, first paragraph] a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy. [Page 498, Section 2.4, Paragraph 1] To do this, we propose a novel hybrid approach that combines RL with supervised learning. A discriminant function Qhybrid(s, a) is derived that combines these two criteria in a principled way. The resulting policy can be adjusted to be as similar as necessary to the policy in the data. [Page 498, Section 2.4, Paragraph 2] Because in general multiple policies were used, we model the data’s policy as a probabilistic policy, using the estimate Sdata(s, a) of P(a|s) presented in the previous section), wherein the blending the two or more of the one or more decision policies facilitates at least one of improved processing efficiency or improved processing time associated with the processor ([Page 508, 2nd paragraph] This gave us two hybrid RL/SL methods, both of which outperform both the RL and SL policies alone. The best hybrid policy performs 302% better than the standard RL policy, and 1.4% better than the SL policy, according to our automatic evaluation method. The best hybrid policy improves over the average COMMUNICATOR system policy by 10% on our metric).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Lee, and Fernando to incorporate the teachings of Henderson to blend two or more of the one or more decision policies to generate a hybrid decision policy, based on the similarity threshold. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].
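For illustration only, blending two probabilistic decision policies into a hybrid policy can be pictured as a weighted mixture of their action probabilities. The Python sketch below is a generic convex combination and is not Henderson's Qhybrid(s, a) discriminant function; the function name blend_policies, the weight parameter, and the example action names are assumptions introduced here.

```python
from typing import Dict

# A policy for a single state: action name -> probability (assumed representation).
Policy = Dict[str, float]


def blend_policies(first: Policy, second: Policy, weight: float) -> Policy:
    """Return a hybrid policy: weight * first + (1 - weight) * second, renormalized."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    actions = set(first) | set(second)
    # Actions missing from one policy are treated as having probability 0 there.
    mixed = {
        a: weight * first.get(a, 0.0) + (1.0 - weight) * second.get(a, 0.0)
        for a in actions
    }
    total = sum(mixed.values())
    if total <= 0.0:
        raise ValueError("blended policy has no probability mass")
    return {a: p / total for a, p in mixed.items()}


# Hypothetical example: equal-weight blend of two two-action policies.
hybrid = blend_policies({"ask": 0.8, "confirm": 0.2}, {"ask": 0.2, "confirm": 0.8}, 0.5)
```

A weight of 1.0 reproduces the first policy and a weight of 0.0 reproduces the second, so the weight controls how similar the hybrid stays to either input.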

Regarding claim 19
The system of Etzioni, Lee, and Fernando teaches: The computer program product of claim 16, wherein the program instructions are further executable by the processor to cause the processor to (as shown above).
However, the system of Etzioni, Lee, and Fernando does not explicitly disclose: blend, by the processor, two or more of the one or more decision policies to generate a hybrid decision policy.
Henderson teaches in an analogous system: blend, by the processor, two or more of the one or more decision policies to generate a hybrid decision policy ([Page 1, first paragraph] a hybrid model that combines reinforcement learning with supervised learning. The reinforcement learning is used to optimize a measure of dialogue reward, while the supervised learning is used to restrict the learned policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Lee, and Fernando to incorporate the teachings of Henderson to blend two or more of the one or more decision policies to generate a hybrid decision policy. One would have been motivated to do this modification because doing so would give the benefit of improving techniques for bootstrapping and automatic optimization of dialogue management policies from limited initial data sets as taught by Henderson [Page 1, paragraph 1].

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Caron et al (EP 2816511 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) as applied to claim 1 and further in view of Huasen et al (Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits, 2015).
Regarding claim 4
The system of Etzioni, Caron, and Fernando teaches: The system of claim 1 (as shown above).
Etzioni further teaches: wherein the selection component also selects a decision policy of the one or more decision policies ([00159] Apply policy to dataset to create new training data).
However, the system of Etzioni, Caron, and Fernando does not explicitly disclose: using a constrained contextual multi-armed bandit setting.
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting ([Page 1, Introduction] The contextual bandit problem is an important extension of the classic multi-armed bandit (MAB) problem. [Page 7, last paragraph] Constrained contextual bandits involve complicated interactions between information acquisition and decision making).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni, Caron, and Fernando to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].

	
Claims 13 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Etzioni et al (WO 2012071543 A2) in view of Lee et al (US 20170098236 A1) and further in view of Fernando et al (Probabilistic Policy Reuse in a Reinforcement Learning Agent, 2006) as applied to claims 11 and 16 respectively, and further in view of Huasen et al (Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits, 2015).
Regarding claim 13
The system of Etzioni, Lee, and Fernando teaches: The computer-implemented method of claim 11 (as shown above).
Etzioni further teaches: further comprising selecting, by the system, a decision policy of the one or more decision policies ([00159] Apply policy to dataset to create new training data).
However, the system of Etzioni and Lee does not explicitly disclose: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting.
Fernando further teaches in an analogous system: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy ([Page 726, Column 1, First Paragraph] where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni and Lee to incorporate the teachings of Fernando to base the selection of the decision policy on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy. One would have been motivated to do this modification because doing so would give the benefit of defining the “resolution” of the library as taught by Fernando [Page 726, Column 1, Paragraph 2].
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting ([Page 7, last paragraph] Constrained contextual bandits involve complicated interactions between information acquisition and decision making).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni, Lee, and Fernando to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].

Regarding claim 18
The system of Etzioni, Lee, and Fernando teaches: The computer program product of claim 16 (as shown above).
Etzioni further teaches: wherein the program instructions are further executable by the processor to cause the processor to select, by the processor, a decision policy of the one or more decision policies ([00159] Apply policy to dataset to create new training data).
However, the system of Etzioni and Lee does not explicitly disclose: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy, using a constrained contextual multi-armed bandit setting.
Fernando teaches in an analogous system: based on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy ([Page 726, Column 1, First Paragraph] where δ ∈ [0, 1] defines the similarity threshold, i.e., whether the new policy is δ-similar with respect to the Policy Library).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined system of Etzioni and Lee to incorporate the teachings of Fernando to base the selection of the decision policy on a similarity threshold value indicative of a similarity between the constrained decision policy and the decision policy. One would have been motivated to do this modification because doing so would give the benefit of defining the “resolution” of the library as taught by Fernando [Page 726, Column 1, Paragraph 2].
Huasen teaches in an analogous system: using a constrained contextual multi-armed bandit setting ([Page 7, last paragraph] Constrained contextual bandits involve complicated interactions between information acquisition and decision making).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Etzioni, Lee, and Fernando to incorporate the teachings of Huasen to use a constrained contextual multi-armed bandit setting. One would have been motivated to do this modification because doing so would give the benefit of studying computationally efficient algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits as taught by Huasen [Page 8, Conclusion].


Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Lebanon et al (2009) discloses Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHAITANYA RAMESH JAYAKUMAR whose telephone number is (571)272-3369. The examiner can normally be reached Mon-Fri 7am-1pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at (571)272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.R.J./ Examiner, Art Unit 2128
/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128