DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 03/21/2019. Claims 1-12 are pending and have been examined. Claims 1, 10, and 11 are independent claims.
The present application claims foreign priority to application no. EP18163225.8 (filed on 03/22/2018).

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 03/21/2019. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: 
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention. 
Claims 1-7 and 9-12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Serban (“A Deep Reinforcement Learning Chatbot”).
Claim 1.
Serban teaches a method implemented by one or more processors as part of a human-to-computer dialog between a subject and a counseling chatbot, the method comprising (Page 1, 3rd paragraph “200,000 labels on Amazon Mechanical Turk and to maintain over 32 dedicated Tesla K80 GPUs for running our live system” teaches one or more processors, and Page 2, 1st paragraph “These models are combined into an ensemble, which generates a candidate set of dialogue responses” teaches a human-to-computer dialogue):
determining a first state of the subject based on one or more signals (2 System Overview & Page 3, Paragraph 1 “As input, the dialogue manager expects to be given a dialogue history (i.e. all utterances recorded in the dialogue so far, including the current user utterance)” teaches the dialogue history (first state) given to the dialogue manager);
selecting, from a plurality of candidate natural language responses, based on the first state and a decision model, a given natural language response (2 System Overview & Page 3, Paragraph 1 “After generating the candidate response set, the dialogue manager uses a model selection policy to select the response it returns to the user……. we consider selecting the appropriate response as a sequential decision making problem” teaches selecting from a set of candidate responses based on the state and a decision model);
providing, by the counseling chatbot, at one or more output components of one or more computing devices operated by the subject to engage in the human-to-computer dialog with the counseling chatbot, the given natural language response (Page 4, Table 1 “dialogues and corresponding candidate responses generated by response models. The response of the final system is marked in bold” teaches that the response marked in bold is the natural language response output by the chatbot);
receiving, at one or more input components of one or more of the computing devices, a free-form natural language input from the subject (4.1 Input Features & Page 11, 5th paragraph “As input to the scoring model we compute 1458 features based on the given dialogue history” teaches receiving input);
determining a second state of the subject based on speech recognition output generated from the free-form natural language input (2 System Overview & Page 3, Paragraph 1 “As input, the dialogue manager expects to be given a dialogue history (i.e. all utterances recorded in the dialogue so far, including the current user utterance) and confidence values of the automatic speech recognition system (ASR confidences). To generate a response, the dialogue manager follows a three-step procedure” and Table 1 teach using speech recognition output to generate the response (output) of the system), 
wherein the second state comprises a positive or negative valence towards a target behavior change (4.6 Off-policy REINFORCE with Learned Reward Function & Page 21, Paragraph 3, Equation 16 “if user utterance at time t + 1 has negative sentiment” teaches negative sentiment (valence));
calculating an instant reward based on the second state (4.6 Off-policy REINFORCE with Learned Reward Function & Page 21, Paragraph 3 “We use the reward model to compute a new estimate for the reward at each time step in each dialogue” teaches calculating a reward); 
and training the decision model based on the instant reward (4.7 Q-learning with the Abstract Discourse Markov Decision Process & Page 21, Paragraph 4 and Page 22, Paragraph 1 “4.7 Q-learning with the Abstract Discourse Markov Decision Process Learned Reward” teaches training the decision model based on the reward).
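
For orientation only, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation and not Serban's system: the candidate responses, the keyword-based valence function, the score table, and all names (CANDIDATES, valence, select, dialog_turn) are hypothetical.

```python
import random
import re

# Hypothetical candidate responses; not taken from Serban or the application.
CANDIDATES = ["Tell me more.", "Here is a fact.", "That sounds hard."]
POSITIVE = {"yes", "good", "will"}
NEGATIVE = {"no", "never", "bad"}


def valence(utterance):
    """Toy stand-in for a sentiment model: returns +1, -1, or 0."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    if words & POSITIVE:
        return 1
    if words & NEGATIVE:
        return -1
    return 0


def select(scores, state, rng):
    """Pick the highest-scoring candidate for this state (ties broken randomly)."""
    row = scores.setdefault(state, [0.0] * len(CANDIDATES))
    best = max(row)
    return rng.choice([i for i, s in enumerate(row) if s == best])


def dialog_turn(scores, first_state, user_reply, rng, lr=0.5):
    """One turn: select a response, read the reply, compute reward, train."""
    action = select(scores, first_state, rng)      # selecting step
    second_state = valence(user_reply)             # second state (valence)
    reward = float(second_state)                   # instant reward
    row = scores[first_state]
    row[action] += lr * (reward - row[action])     # training step
    return CANDIDATES[action], second_state, reward
```

Here the "decision model" is simply a table of scores per state, which is one of many possible choices; a neural policy such as the one Serban describes would replace the table lookup and the update line.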

Claim 2. 
Serban teaches The method of claim 1, 
Serban further teaches wherein the decision model comprises a decision matrix, and training the decision model comprises updating the decision matrix based on the instant reward (4.7 Q-learning with the Abstract Discourse Markov Decision Process & Page 21, Paragraph 4 and Page 22, Paragraph 1 “4.7 Q-learning with the Abstract Discourse Markov Decision Process Learned Reward” and Figure 9 teach a matrix and updating the matrix during training based on the reward).
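
As a point of reference, updating a decision matrix from an instant reward can be illustrated with the standard tabular Q-learning update. The sketch below is an assumption-laden illustration: the matrix layout (rows are states, columns are actions), the function name q_update, and the values of alpha and gamma are hypothetical and are not drawn from Serban or from the application.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new entry.

    q is a list of rows (one per state), each a list of action values.
    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    best_next = max(q[next_state])            # value of the best next action
    td_target = reward + gamma * best_next    # bootstrapped target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Usage: a 2-state, 2-action matrix; the entry for (state 0, action 1)
# moves from 0.0 toward the target 1.0 + 0.9 * 1.0.
q = [[0.0, 0.0], [0.0, 1.0]]
updated = q_update(q, state=0, action=1, reward=1.0, next_state=1)
```

Only the single entry for the visited state and action changes on each update; every other entry of the matrix is left as it was.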

Claim 3. 
Serban teaches The method of claim 1, 
Serban further teaches wherein the decision model comprises a neural network (4.7 Q-learning with the Abstract Discourse Markov Decision Process & Page 24, Paragraph 2nd “Q-learning is a simple off-policy reinforcement learning algorithm, which has been shown to be effective for training policies parametrized by neural networks”).

Claim 4. 
Serban teaches The method of claim 3, 
Serban further teaches wherein training the neural network comprises applying back propagation to adjust one or more weights associated with one or more hidden layers of the neural network, wherein applying the back propagation is based on the instant reward (4.3 Supervised AMT: Learning with Crowdsourced Labels & Page 16, Paragraph 2 “For the first hidden layer, we experiment with layer sizes in the set: {500, 200, 50}” and 4.4 Supervised Learned Reward: Learning with a Learned Reward Function & Page 18, Paragraph 4 “We learn φ by minimizing the squared error between the model’s prediction and the observed return….. As before, we optimize the model parameters with mini-batch stochastic gradient descent (SGD) using Adam” teach minimizing the squared error with stochastic gradient descent (back propagation) in the hidden layers of the neural network).

Claim 5. 
Serban teaches The method of claim 1, 
Serban further teaches wherein the one or more signals comprise speech recognition output generated from a first free-form natural language input, and the free-form natural language input comprises a second free-form natural language input (2 System Overview & Page 3, Paragraph 1 “As input, the dialogue manager expects to be given a dialogue history (i.e. all utterances recorded in the dialogue so far, including the current user utterance) and confidence values of the automatic speech recognition system (ASR confidences)” teaches speech recognition output as input, and 4.5 Off-policy REINFORCE & Page 20, Paragraph 1 “where h_t^d is the dialogue history for dialogue d at time t, a_t^d is the agent’s action for dialogue d at time t” teaches a second free-form natural language input).

Claim 6. 
Serban teaches The method of claim 1, further comprising 
Serban further teaches determining, based at least in part on the instant reward and other instant rewards calculated during the human-to-computer dialog, a cumulative reward (4 Model Selection Policy & Page 10, Paragraph 5 “The agent’s goal is to maximize the discounted sum of rewards [equation image: media_image1.png] which is referred to as the expected cumulative return (or simply expected return)” teaches a cumulative reward).
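
To make the cited notion of a "discounted sum of rewards" concrete, the following is a minimal sketch of how such a return is commonly computed. The exact equation in Serban is reproduced in this action only as an image, so the indexing convention (the first reward is weighted by gamma to the power 0), the default discount factor, and the function name discounted_return are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum per-turn rewards, weighting the reward at turn t by gamma ** t.

    Assumes turns are indexed from t = 0; other conventions shift the
    exponents by one but are otherwise equivalent.
    """
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Usage: three turns, each with an instant reward of 1.0.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With gamma set to 1.0 the function reduces to a plain sum of the instant rewards collected over the dialog.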

Claim 7. 
Serban teaches The method of claim 6, further comprising 
Serban further teaches providing, at one or more visual output components of one or more of the computing devices operated by the subject, a visual indication of the cumulative reward (4 Model Selection Policy & Page 10, Paragraph 5 “The agent’s goal is to maximize the discounted sum of rewards [equation image: media_image1.png] which is referred to as the expected cumulative return (or simply expected return)” teaches computing cumulative reward).

Claim 9. 
Serban teaches The method of claim 1,
Serban further teaches wherein the plurality of candidate natural language responses include: a first set of informational candidate responses; a second set of candidate responses designed to stimulate a response from the subject; and a third set of candidate responses designed to simulate listening or reflection on part of the counseling chatbot (3.1 Template-based Models & Page 4, Table 1 teaches candidate responses including informational responses (“male rabbits are called bucks, females are does”), responses to stimulate a response from the subject (“OK, but can you elaborate a bit?”), and responses designed to simulate listening (“Hurrah! Two is a good number of rabbits”)).

Claim 10.
Claim 10 recites a system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform precisely the method of claim 1. As Serban performs their method on a computer (Serban, Page 1, 3rd paragraph “200,000 labels on Amazon Mechanical Turk and to maintain over 32 dedicated Tesla K80 GPUs for running our live system”), in which a system is inherent, claim 10 is rejected for the reasons set forth in the rejection of claim 1.

Claims 11-12.
Claims 11-12 recite at least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform precisely the method of claims 1-2. As Serban performs their method on a computer (Serban, Page 1, 3rd paragraph “200,000 labels on Amazon Mechanical Turk and to maintain over 32 dedicated Tesla K80 GPUs for running our live system”), in which a non-transitory computer-readable medium is inherent, claims 11-12 are rejected for the reasons set forth in the rejection of claims 1-2, respectively.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Serban in view of Williams (“Partially Observable Markov Decision Processes for Spoken Dialogue Management”).
Claim 8. 
Serban teaches The method of claim 1, 
Serban further teaches wherein training the decision model includes maximizing a cumulative mean reward Rc given by the following equation: [equation image: media_image2.png] ……and ak represents an action at a given turn k, and R(ak) represents an instant reward at a given turn k (4 Model Selection Policy & Page 10, Paragraph 5 “The agent’s goal is to maximize the discounted sum of rewards [equation image: media_image1.png] which is referred to as the expected cumulative return (or simply expected return)” teaches cumulative reward).

While Serban teaches a reward for the human dialog interaction, Serban does not teach a cumulative reward divided by a positive integer.
However, Williams teaches given by the following equation: [equation image: media_image2.png] wherein K is a positive integer corresponding to a number of turns in the human-to-computer dialog (5.2 SPBVI method description, Page 73, “MDP’s transition function and reward function on those points as…ρ(b̂_n, â): [equation image: media_image3.png]” teaches integer K).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Serban by using a positive integer K in the reward model, as does Williams, to obtain a cumulative mean reward. The motivation to do so is that “optimization” can be observed over K samples: Williams reports that results for “K ≥ 20 are within bounds of error estimation” and that “overall K = 50 appears to achieve asymptotic performance for all concept error rates (perr)”. The reward functions of Williams are similar to that of Serban (Williams, 5.3 Example SPBVI application: MAXITRAVEL & Page 76, 3rd paragraph and Page 78, Figure 5.3).
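
For illustration only: the equation recited in claim 8 is reproduced in this action solely as an image. Reading the surrounding claim language literally (a cumulative mean reward Rc, K turns, and an instant reward R(ak) at each turn k), one plausible form is the arithmetic mean of the instant rewards over the K turns. The sketch below rests on that assumption; the function name cumulative_mean_reward is hypothetical, and the claim as filed governs.

```python
def cumulative_mean_reward(instant_rewards):
    """Return the arithmetic mean of the instant rewards over K turns.

    Assumed form (not confirmed by the source): Rc = (1 / K) * sum of R(a_k)
    for k = 1..K, where K is the number of turns in the dialog.
    """
    k = len(instant_rewards)
    if k == 0:
        raise ValueError("K must be a positive integer")
    return sum(instant_rewards) / k


# Usage: four turns with instant rewards of +1, -1, +1, +1.
rc = cumulative_mean_reward([1.0, -1.0, 1.0, 1.0])
```

Unlike a discounted sum, this quantity weights every turn equally and is normalized by the dialog length, which is the distinction the rejection relies on Williams to supply.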
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOKESHA G PATEL whose telephone number is (571)272-6267. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Afshar, Kamran can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LOKESHA G PATEL/Examiner, Art Unit 2125     
/BRIAN M SMITH/Primary Examiner, Art Unit 2122