Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
2.	This Office Action is in response to the amendment filed on 08/05/2022.  Claims 1-20 were pending. Claims 1-20 are rejected.

Response to Arguments
3.1.	Applicant's arguments filed 08/05/2022 with respect to claims 1, 6-8, 10-12, 14-16, and 18-19 have been fully considered but they are moot in view of the new grounds of rejection, Serban et al., “A Deep Reinforcement Learning Chatbot”, 5 Nov 2017, Montreal Institute for Learning Algorithms, Montreal, Quebec, Canada.
3.2.	Examiner apologizes that Applicant’s representative did not receive the voice message left on 08/01/2022. However, Examiner’s docket note reads: “interview request received on 08/01/2022. Left a VM and waited for information (e.g. interview agenda and/or proposed amendment).”
3.3.	On 09/06/2022, discussed with Applicant's representative amending independent claims 1, 12, and 16 to include limitations of dependent claims 2, 13, and 17. 
On 09/07/2022, Applicant's representative declined the proposal and requested an office action.
Therefore, a Final Office Action is issued.

Claim Rejections - 35 USC § 103
4.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

4.1.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


4.2.	Claims 1, 6-8, 10-12, 14-16, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Cai et al., (“Cai”, US 2018/0316791 A1) in view of Krishnamurthy et al. (“Krishnamurthy”, US 2018/0218080 A1), and further in view of Serban et al., (“Serban”, “A Deep Reinforcement Learning Chatbot”, 5 Nov 2017).

Regarding Claim 1, Cai discloses a computer-implemented method comprising: 
selecting a plurality of conversations, wherein each conversation includes an agent and a user (Cai, FIG.1, agent 180, customer 190, [0018]: receiving historical data from past calls related to the type(s) of conversation between an agent 180 and a customer 190. For example, call center conversation data can include a length of conversation about a topic, a type of topic (e.g., the weather, sports, family, etc.), a tone of the conversation, a reaction of the customer or agent to the conversation, etc.); 
identifying, in each of the plurality of conversations, a set of turns and one or more topics (Cai, [0019-20]: the topics of conversation are detected and segmented, and the responses of the customer to each segmented topic are detected. The response can be a positive response or negative response (“turn”)); 
associating the one or more topics to each turn of the set of turns (Cai, [0020]: through the customer's answers to the agent, the customer's state can be detected, as well as whether or not the conversation is successfully continued (“turn”)); 
generating, based on the set of turns, a conversation flow for each conversation, wherein the conversation flow identifies a sequence of the one or more topics (Cai, [0019]: the raw data files for each phone conversation between the agent 180 and the customer 190 are filtered to detect the topics of conversation and segment the call, and the responses of the customer to each segmented topic are detected in the conversation); 
applying an outcome score to each conversation (Cai, [0023]: the conversation process can include a Markov Decision Process, <S, A, P, R, γ>, A is a finite set of actions, R is reward (“score”). The agent can take the action, for which a reward value is fed back to either increase or decrease the confidence in the action; [0027-29]: a value (Q) (“outcome score”) is determined for every state and action pair (Q(S, A)) with a reward (R) factored into the next pair (Q(S′, A′))); 
creating a reinforced learning (RL) model, wherein the RL model includes a Markov chain and wherein the RL model is based on the conversation flow of each conversation and the outcome score of each conversation (Cai, [0022]: reinforcement learning is leveraged to build the conversation model (i.e., a model predicting what conversation topic will most likely lead to a positive result from the customer's responses to conversation)).
However, Cai does not disclose
deploying the RL model, wherein the deploying includes sending the RL model to a chatbot and the RL model is configured to maximize a future outcome score for a future conversation.
Krishnamurthy discloses
deploying the RL model, wherein the deploying includes sending the RL model to a chatbot (Krishnamurthy, FIG.1 user device 102, reinforcement learning (RL) agent 114, [0027, 58]: deploying the RL model to RL agent 114).  
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “conversational agent” of Krishnamurthy into the invention of Cai. The suggestion/motivation would have been to facilitate training a conversational agent to be a reinforcement learning (RL) agent by using a user model in which the RL agent selects agent actions in response to user actions sampled using the conditional probabilities from the user model (Krishnamurthy, Abstract, [0001-5]).
However, Cai-Krishnamurthy does not disclose
the RL model is configured to maximize a future outcome score for a future conversation.
Serban discloses
the RL model is configured to maximize a future outcome score for a future conversation (Serban, Page 10, #4. Model Selection Policy, [2]: Use the reinforcement learning framework. The dialogue manager is an agent, which takes actions in an environment in order to maximize rewards. Page 18, #4.4. Supervised Learned Reward: Learning with a Learned Reward Function, [2]: We call this a reward model, since it directly models the Alexa user score, which we aim to maximize. Page 22, [2]: A Markov decision process (MDP) is a framework for modeling sequential decision making ... An agent aims to maximize its reward during each episode).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “learned reward function” of Serban into the invention of Cai-Krishnamurthy. The suggestion/motivation would have been to incorporate learning to predict the Alexa user scores based on previously recorded dialogues. Serban calls this a reward model, since it directly models the Alexa user score, which the system aims to maximize (Serban, Page 18, #4.4, Supervised Learned Reward: Learning with a Learned Reward Function).
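For illustration only, the state-action value described in the mapping above (a value Q(S, A) updated with a reward R and the next pair Q(S′, A′)) can be sketched as follows. This is a minimal sketch assuming a standard tabular SARSA-style update; it is not asserted to be the implementation of Cai or of the claimed invention. The learning rate `alpha`, the discount `gamma`, and the state and topic names are hypothetical.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """Apply one SARSA-style update to Q(S, A) using R and Q(S', A').

    alpha (learning rate) and gamma (discount) are assumed values.
    """
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Q-table with every state-action pair initialised to 0.0.
q = defaultdict(float)

# Hypothetical turn: the customer state is "neutral", the agent raises the
# topic "weather", and the call ends successfully (reward +1).
new_value = sarsa_update(q, "neutral", "weather", 1.0, "positive", "end")
```

Under these assumed values, a successful ending raises the value of the chosen topic, which is one way a model could come to favor topics that previously led to positive outcomes.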

Regarding Claim 6, Cai-Krishnamurthy-Serban discloses the method of claim 1, wherein the Markov chain includes a current state, two or more subsequent states, and a decision probability for each of the two or more states, wherein a summation of each decision probability equals one (Cai, [0023]: Markov Decision Process, <S, A, P, R, γ>, where, S is a finite set of states; A is a finite set of actions, could be discussion topics; P is the probability that action in state s at time t will lead to another state s′ at time t+1; R is reward, successful ending is 1, unsuccessful ending is −1).  
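For illustration only, the limitation that the decision probabilities out of a current state sum to one can be sketched as a transition table with a validation step. This is a minimal sketch; the state names and probability values are hypothetical and are not taken from Cai or from the application.

```python
import random

# Hypothetical table: current state -> {subsequent state: decision probability}.
transitions = {
    "greeting": {"billing": 0.6, "shipping": 0.4},
    "billing": {"resolved": 0.7, "escalated": 0.3},
}


def validate(table, tol=1e-9):
    """Check that the probabilities leaving each current state sum to one."""
    for state, probs in table.items():
        total = sum(probs.values())
        if abs(total - 1.0) > tol:
            raise ValueError(f"probabilities for {state!r} sum to {total}, not 1")
    return True


def next_state(table, state, rng=random):
    """Sample a subsequent state according to the decision probabilities."""
    states = list(table[state])
    weights = [table[state][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]
```

Each current state here has two subsequent states, matching the "two or more subsequent states" language of the claim.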

Regarding Claim 7, Cai-Krishnamurthy-Serban discloses the method of claim 1, further comprising: 
generating additional RL models for each of the one or more topics in the plurality of conversations (Cai, FIG.1, conversation data 101, [0022, 31]: the conversation models are continuously trained and updated using historical call conversation data 101, thereby generating additional reinforcement learning models).  

Regarding Claim 8, Cai-Krishnamurthy-Serban discloses the method of claim 7, wherein applying the outcome score to each conversation includes applying a topic outcome score for each of the one or more topics (Cai, [0037-38]: Therefore, in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about. For example, during the conversation if a customer has a particular response to conversation matching (or similar to) that of a conversation model, the agent is notified of this similarity and suggested to engage the customer in the topic, because past customers have had a positive outcome when further engaging conversation around the topic).  

Regarding Claim 10, Cai-Krishnamurthy-Serban discloses the method of claim 1, wherein the outcome score is based on feedback from the user (Cai, [0023]: The agent can take the action of which a reward value is feedback to either increase or decrease the confidence in the action).  

Regarding Claim 11, Cai-Krishnamurthy-Serban discloses the method of claim 1, wherein the method is performed by the agent, executing program instructions, and wherein the program instructions are downloaded from a remote data processing system (Cai, FIG.4, computer system 12, [0016, 77]: computer readable program instructions can be downloaded to respective computing/processing devices 12).  

Regarding Claim 12, Cai discloses a system comprising: 
a processor (Cai, FIG.4, computer system/server 12, processor 16, [0129]: a computing system 12 includes one or more processors 16); and 
a computer-readable storage medium communicatively coupled to the processor and storing program instructions which, when executed by the processor, are configured to cause the processor to (Cai, FIG.4, memory 28, [0066]: system 28 includes computer system storage media and a computer program product comprising computer readable instructions configured to cause the processor 16 to carry out one or more features): 
select a plurality of conversations, wherein each conversation includes an agent and a user (Cai, FIG.1, agent 180, customer 190, [0018]: receiving historical data from past calls related to the type(s) of conversation between an agent 180 and a customer 190. For example, call center conversation data can include a length of conversation about a topic, a type of topic (e.g., the weather, sports, family, etc.), a tone of the conversation, a reaction of the customer or agent to the conversation, etc.); 
identify, in each of the plurality of conversations, a set of turns and one or more topics (Cai, [0019-20]: the topics of conversation are detected and segmented, and the responses of the customer to each segmented topic are detected. The response can be a positive response or negative response (“turn”)); 
associate the one or more topics to each turn of the set of turns (Cai, [0020]: through the customer's answers to the agent, the customer's state can be detected, as well as whether or not the conversation is successfully continued (“turn”));
generate, based on the set of turns, a conversation flow for each conversation, wherein the conversation flow identifies a sequence of the one or more topics (Cai, [0019]: the raw data files for each phone conversation between the agent 180 and the customer 190 are filtered to detect the topics of conversation and segment the call, and the responses of the customer to each segmented topic are detected in the conversation); 
apply an outcome score to each conversation (Cai, [0023]: the conversation process can include a Markov Decision Process, <S, A, P, R, γ>, A is a finite set of actions, R is reward (“score”). The agent can take the action, for which a reward value is fed back to either increase or decrease the confidence in the action; [0027-29]: a value (Q) (“outcome score”) is determined for every state and action pair (Q(S, A)) with a reward (R) factored into the next pair (Q(S′, A′))); 
create a reinforced learning (RL) model, wherein the RL model includes a Markov chain and wherein the RL model is based on the conversation flow of each conversation and the outcome score of each conversation (Cai, [0022]: reinforcement learning is leveraged to build the conversation model (i.e., a model predicting what conversation topic will most likely lead to a positive result from the customer's responses to conversation)).
However, Cai does not disclose
deploy the RL model, wherein the deploying includes sending the RL model to a chatbot and the RL model is configured to maximize a future outcome score for a future conversation. 
Krishnamurthy discloses
deploy the RL model, wherein the deploying includes sending the RL model to a chatbot (Krishnamurthy, FIG.1 user device 102, reinforcement learning (RL) agent 114, [0027, 58]: deploying the RL model to RL agent 114).  
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “conversational agent” of Krishnamurthy into the invention of Cai. The suggestion/motivation would have been to facilitate training a conversational agent to be a reinforcement learning (RL) agent by using a user model in which the RL agent selects agent actions in response to user actions sampled using the conditional probabilities from the user model (Krishnamurthy, Abstract, [0001-5]).
However, Cai-Krishnamurthy does not disclose
the RL model is configured to maximize a future outcome score for a future conversation.
Serban discloses
the RL model is configured to maximize a future outcome score for a future conversation (Serban, Page 10, #4. Model Selection Policy, [2]: Use the reinforcement learning framework. The dialogue manager is an agent, which takes actions in an environment in order to maximize rewards. Page 18, #4.4. Supervised Learned Reward: Learning with a Learned Reward Function, [2]: We call this a reward model, since it directly models the Alexa user score, which we aim to maximize. Page 22, [2]: A Markov decision process (MDP) is a framework for modeling sequential decision making ... An agent aims to maximize its reward during each episode).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “learned reward function” of Serban into the invention of Cai-Krishnamurthy. The suggestion/motivation would have been to incorporate learning to predict the Alexa user scores based on previously recorded dialogues. Serban calls this a reward model, since it directly models the Alexa user score, which the system aims to maximize (Serban, Page 18, #4.4, Supervised Learned Reward: Learning with a Learned Reward Function).

Regarding Claim 14, Cai-Krishnamurthy-Serban discloses the system of claim 12, wherein the program instructions are further configured to cause the processor to: 
generate additional RL models for each of the one or more topics in the plurality of conversations (Cai, FIG.1, conversation data 101, [0022, 31]: the conversation models are continuously trained and updated using historical call conversation data 101, thereby generating additional reinforcement learning models).  

Regarding Claim 15, Cai-Krishnamurthy-Serban discloses the system of claim 14, wherein applying the outcome score to each conversation includes applying a topic outcome score for each of the one or more topics (Cai, [0037-38]: Therefore, in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about. For example, during the conversation if a customer has a particular response to conversation matching (or similar to) that of a conversation model, the agent is notified of this similarity and suggested to engage the customer in the topic, because past customers have had a positive outcome when further engaging conversation around the topic).  

Regarding Claim 16, Cai discloses a computer program product, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing unit to cause the processing unit to (Cai, FIG.4, computer system/server 12, processor 16, [0129]: a computing system 12 includes one or more processors 16; FIG.4, memory 28, [0066]: system 28 includes computer system storage media and a computer program product comprising computer readable instructions configured to cause the processor 16 to carry out one or more features): 
select a plurality of conversations, wherein each conversation includes an agent and a user (Cai, FIG.1, agent 180, customer 190, [0018]: receiving historical data from past calls related to the type(s) of conversation between an agent 180 and a customer 190. For example, call center conversation data can include a length of conversation about a topic, a type of topic (e.g., the weather, sports, family, etc.), a tone of the conversation, a reaction of the customer or agent to the conversation, etc.); 
identify, in each of the plurality of conversations, a set of turns and one or more topics (Cai, [0019-20]: the topics of conversation are detected and segmented, and the responses of the customer to each segmented topic are detected. The response can be a positive response or negative response (“turn”)); 
associate the one or more topics to each turn of the set of turns (Cai, [0020]: through the customer's answers to the agent, the customer's state can be detected, as well as whether or not the conversation is successfully continued (“turn”));  
generate, based on the set of turns, a conversation flow for each conversation, wherein the conversation flow identifies a sequence of the one or more topics (Cai, [0019]: the raw data files for each phone conversation between the agent 180 and the customer 190 are filtered to detect the topics of conversation and segment the call, and the responses of the customer to each segmented topic are detected in the conversation); 
apply an outcome score to each conversation (Cai, [0023]: the conversation process can include a Markov Decision Process, <S, A, P, R, γ>, A is a finite set of actions, R is reward (“score”). The agent can take the action, for which a reward value is fed back to either increase or decrease the confidence in the action; [0027-29]: a value (Q) (“outcome score”) is determined for every state and action pair (Q(S, A)) with a reward (R) factored into the next pair (Q(S′, A′))); 
create a reinforced learning (RL) model, wherein the RL model includes a Markov chain and wherein the RL model is based on the conversation flow of each conversation and the outcome score of each conversation (Cai, [0022]: reinforcement learning is leveraged to build the conversation model (i.e., a model predicting what conversation topic will most likely lead to a positive result from the customer's responses to conversation)).
However, Cai does not disclose 
deploy the RL model, wherein the deploying includes sending the RL model to a chatbot and the RL model is configured to maximize a future outcome score for a future conversation.  
Krishnamurthy discloses
deploy the RL model, wherein the deploying includes sending the RL model to a chatbot (Krishnamurthy, FIG.1 user device 102, reinforcement learning (RL) agent 114, [0027, 58]: deploying the RL model to RL agent 114).  
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “conversational agent” of Krishnamurthy into the invention of Cai. The suggestion/motivation would have been to facilitate training a conversational agent to be a reinforcement learning (RL) agent by using a user model in which the RL agent selects agent actions in response to user actions sampled using the conditional probabilities from the user model (Krishnamurthy, Abstract, [0001-5]).
However, Cai-Krishnamurthy does not disclose
the RL model is configured to maximize a future outcome score for a future conversation.
Serban discloses
the RL model is configured to maximize a future outcome score for a future conversation (Serban, Page 10, #4. Model Selection Policy, [2]: Use the reinforcement learning framework. The dialogue manager is an agent, which takes actions in an environment in order to maximize rewards. Page 18, #4.4. Supervised Learned Reward: Learning with a Learned Reward Function, [2]: We call this a reward model, since it directly models the Alexa user score, which we aim to maximize. Page 22, [2]: A Markov decision process (MDP) is a framework for modeling sequential decision making ... An agent aims to maximize its reward during each episode).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “learned reward function” of Serban into the invention of Cai-Krishnamurthy. The suggestion/motivation would have been to incorporate learning to predict the Alexa user scores based on previously recorded dialogues. Serban calls this a reward model, since it directly models the Alexa user score, which the system aims to maximize (Serban, Page 18, #4.4, Supervised Learned Reward: Learning with a Learned Reward Function).

Regarding Claim 18, Cai-Krishnamurthy-Serban discloses the computer program product of claim 16, wherein the program instructions are further configured to cause the processing unit to: 
generate additional RL models for each of the one or more topics in the plurality of conversations (Cai, FIG.1, conversation data 101, [0022, 31]: the conversation models are continuously trained and updated using historical call conversation data 101, thereby generating additional reinforcement learning models).  

Regarding Claim 19, Cai-Krishnamurthy-Serban discloses the computer program product of claim 16, wherein applying the outcome score to each conversation includes applying a topic outcome score for each of the one or more topics (Cai, [0037-38]: Therefore, in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about. For example, during the conversation if a customer has a particular response to conversation matching (or similar to) that of a conversation model, the agent is notified of this similarity and suggested to engage the customer in the topic, because past customers have had a positive outcome when further engaging conversation around the topic).  

Regarding Claim 20, see the rejection set forth in section 4.4 below.


4.3.	Claims 2-5, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Cai et al., (“Cai”, US 2018/0316791 A1) in view of Krishnamurthy et al. (“Krishnamurthy”, US 2018/0218080 A1) and Serban et al., (“Serban”, “A Deep Reinforcement Learning Chatbot”, 5 Nov 2017) as applied to claim 1, and further in view of Amirloo Abolfathi et al. (“Amirloo”, US 2021/0004647A1).

Regarding Claim 2, Cai-Krishnamurthy-Serban discloses the method of claim 1 as set forth above, further comprising: 
initiating, by the chatbot, a new conversation (Serban, Page 5, #Initiatorbot, [2]: If the user gives a greeting (e.g. "hi"), then Initiatorbot will return a response with priority. This is important because we observed that greetings often indicate the beginning of a conversation, where the user does not have a particular topic they would like to talk about); 
developing, based on the RL model and based on a set of topics in the new conversation, a new conversation sequence (Cai, [0034]: the topic suggestion may include multiple topics of which multiple responses (or just the single best) can be expected); and 
completing, (Cai, [0036]: in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about).  
However, Cai-Krishnamurthy-Serban does not disclose
initiating, by the chatbot, 
completing, by the chatbot, one or more tasks
Amirloo discloses
initiating, by the chatbot, (Amirloo, FIG.1, vehicle 100, reinforcement learning (RL) agent 108, [0023]: the vehicle 100 includes a RL agent that is trained to perform a desired task); 
completing, by the chatbot, one or more tasks (Amirloo, [0023]: The RL agent 108 may be trained to drive the vehicle 100 in a safe manner (e.g., collision-free, free of sudden large changes in speed or acceleration, etc.) to reach a target destination. [0034]: The RL agent 108 is expected to have been trained to output actions that the vehicle 100 executes safely).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “predetermined performance goal” of Amirloo into the invention of Cai-Krishnamurthy-Serban. The suggestion/motivation would have been to evaluate the failure or success of the RL agent in performing the task in a certain time horizon by determining whether a predetermined performance goal has been achieved (Amirloo, Abstract, [0001-12], FIG.2, [0043-49]).

Regarding Claim 3, Cai-Krishnamurthy-Serban-Amirloo discloses the method of claim 2 as set forth above, further comprising:  
determining, based on the RL model, that a probability of a positive outcome falls below a threshold (Cai, [0022]: a model predicting what conversation topic will most likely lead to a positive result (“outcome”) from the customer's responses to conversation. Amirloo, FIG.1, RL agent 108, [0034]; Amirloo, [0047]: evaluation of the RL agent 108 involves determining that a probability of failure, ε, is below a predetermined threshold, and accordingly determining that the predetermined performance goal has been achieved); and 
transferring, in response to the probability falling below the threshold, the new conversation to a human agent (Cai, [0034]: Should the customer respond negatively (or not as expected in the decision tree), the agent can dynamically switch the topic. It would have been obvious for one of ordinary skill in the art to incorporate logic that transfers the new conversation to a human agent when the customer response is negative).  
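For illustration only, the threshold logic recited in claim 3 can be sketched as a simple routing decision. This is a minimal sketch of the claim limitation itself; it is not asserted to be the implementation of Cai or Amirloo. The function name `route_conversation` and the threshold value of 0.3 are hypothetical.

```python
def route_conversation(p_positive, threshold=0.3):
    """Decide who handles the next turn of the conversation.

    p_positive is the model's predicted probability of a positive outcome;
    threshold is an assumed cut-off below which the conversation is handed
    to a human agent.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability in [0, 1]")
    return "human_agent" if p_positive < threshold else "chatbot"
```

In this sketch a probability exactly equal to the threshold stays with the chatbot, since the claim language refers to the probability falling below the threshold.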

Regarding Claim 4, Cai-Krishnamurthy-Serban-Amirloo discloses the method of claim 3, further comprising: 
updating, in response to the probability falling below the threshold, the RL model, wherein the update alters the new conversation sequence and is configured to prevent the probability from falling below the threshold (Amirloo, FIG.1, learning controller 110, [0027]: learning controller 110 is used to improve RL agent 108. Amirloo [0044-49]: Responsive to determining that the predetermined performance goal has not been achieved, the learning controller 110 returns to training).  

Regarding Claim 5, Cai-Krishnamurthy-Serban-Amirloo discloses the method of claim 2, further comprising: 
completing the new conversation (Amirloo, [0034]: The RL agent 108 is expected to have been trained to output actions that the vehicle 100 executes safely); 
determining, in response to completing the new conversation, that the new conversation includes a negative overall outcome (Amirloo, [0049]: Responsive to determining that the predetermined performance goal has not been achieved, the learning controller 110 returns to training); and 
updating, in response to the negative overall outcome, the RL model (Amirloo, FIG.1, failure predictor 126, [0027]: failure predictor 126 is used to improve operation of the RL agent 108. [0050]: The failure predictor 126 is updated during further failure-prediction-based training of the RL agent 108).  

Regarding Claim 13, Cai-Krishnamurthy-Serban discloses the system of claim 12 as set forth above, wherein the program instructions are further configured to cause the processor to: 
initiate, by the chatbot, a new conversation (Serban, Page 5, #Initiatorbot, [2]: If the user gives a greeting (e.g. "hi"), then Initiatorbot will return a response with priority. This is important because we observed that greetings often indicate the beginning of a conversation, where the user does not have a particular topic they would like to talk about); 
develop, based on the RL model and based on a set of topics in the new conversation, a new conversation sequence (Cai, [0034]: the topic suggestion may include multiple topics of which multiple responses (or just the single best) can be expected); and 
complete, (Cai, [0036]: in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about).  
However, Cai-Krishnamurthy-Serban does not disclose
initiate, by the chatbot, 
complete, by the chatbot, one or more tasks, 
Amirloo discloses
initiate, by the chatbot, (Amirloo, FIG.1, vehicle 100, reinforcement learning (RL) agent 108, [0023]: the vehicle 100 includes a RL agent that is trained to perform a desired task);
complete, by the chatbot, one or more tasks, (Amirloo, [0023]: The RL agent 108 may be trained to drive the vehicle 100 in a safe manner (e.g., collision-free, free of sudden large changes in speed or acceleration, etc.) to reach a target destination. [0034]: The RL agent 108 is expected to have been trained to output actions that the vehicle 100 executes safely).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “predetermined performance goal” of Amirloo into the invention of Cai-Krishnamurthy-Serban. The suggestion/motivation would have been to evaluate the failure or success of the RL agent in performing the task in a certain time horizon by determining whether a predetermined performance goal has been achieved (Amirloo, Abstract, [0001-12], FIG.2, [0043-49]).

Regarding Claim 17, Cai-Krishnamurthy-Serban discloses the computer program product of claim 16 as set forth above, wherein the program instructions are further configured to cause the processing unit to: 
initiate, by the chatbot, a new conversation (Serban, Page 5, #Initiatorbot, [2]: If the user gives a greeting (e.g. "hi"), then Initiatorbot will return a response with priority. This is important because we observed that greetings often indicate the beginning of a conversation, where the user does not have a particular topic they would like to talk about); 
develop, based on the RL model and based on a set of topics in the new conversation, a new conversation sequence (Cai, [0034]: the topic suggestion may include multiple topics of which multiple responses (or just the single best) can be expected); and 
complete, (Cai, [0036]: in the run time, the next actions (topics) to take is based on the learned model and the current state where they are, not only based on the customer's feedback on the topic the agent talked about).  
However, Cai-Krishnamurthy-Serban does not disclose
initiate, by the chatbot, 
complete, by the chatbot, one or more tasks, 
Amirloo discloses
initiate, by the chatbot, (Amirloo, FIG.1, vehicle 100, reinforcement learning (RL) agent 108, [0023]: the vehicle 100 includes a RL agent that is trained to perform a desired task);
complete, by the chatbot, one or more tasks, (Amirloo, [0023]: The RL agent 108 may be trained to drive the vehicle 100 in a safe manner (e.g., collision-free, free of sudden large changes in speed or acceleration, etc.) to reach a target destination. [0034]: The RL agent 108 is expected to have been trained to output actions that the vehicle 100 executes safely).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “predetermined performance goal” of Amirloo into the invention of Cai-Krishnamurthy-Serban. The suggestion/motivation would have been to evaluate the failure or success of the RL agent in performing the task in a certain time horizon by determining whether a predetermined performance goal has been achieved (Amirloo, Abstract, [0001-12], FIG.2, [0043-49]).

4.4.	Claims 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cai et al., (“Cai”, US 2018/0316791 A1) in view of Krishnamurthy et al. (“Krishnamurthy”, US 2018/0218080 A1) and Serban et al., (“Serban”, “A Deep Reinforcement Learning Chatbot”, 5 Nov 2017) as applied to claim 1, and further in view of Mroczka, (US 2021/0012288 A1).

Regarding Claim 9, Cai-Krishnamurthy-Serban discloses the method of claim 1 as set forth above.
However, Cai-Krishnamurthy-Serban does not disclose
applying the outcome score further includes incorporating a subject matter expert outcome score.  
Mroczka discloses
applying the outcome score further includes incorporating a subject matter expert outcome score (Mroczka, FIG.1, engine 104, subject matter experts (SMEs) 106, top level scores 112, [0023]: The engine 104 includes reinforced learning in conjunction with the manual interactions of the subject matter experts 106, which results in a set of top level scores 112).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “top level score” of Mroczka into the invention of Cai-Krishnamurthy-Serban. The suggestion/motivation would have been to improve the automated process of culling out the most relevant fundamental measurement factors for a given program, which provides reliability and accuracy and saves critical amounts of time. The subject matter expert and/or artificial intelligence (AI) module can choose either material depending on the relative importance of reliability versus cost, analyze the effect on the product profile matrix, and adjust the material selection accordingly (Mroczka, Abstract, [0001-12], FIG.1, [0022-23]).

Regarding Claim 20, Cai-Krishnamurthy-Serban discloses the computer program product of claim 16 as set forth above.
However, Cai-Krishnamurthy-Serban does not disclose
the outcome score is applied by a subject matter expert.  
Mroczka discloses
the outcome score is applied by a subject matter expert (Mroczka, FIG.1, engine 104, subject matter experts (SMEs) 106, top level scores 112, [0023]: The engine 104 includes reinforced learning in conjunction with the manual interactions of the subject matter experts 106, which results in a set of top level scores 112).
Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to incorporate the “top level score” of Mroczka into the invention of Cai-Krishnamurthy-Serban. The suggestion/motivation would have been to improve the automated process of culling out the most relevant fundamental measurement factors for a given program, which provides reliability and accuracy and saves critical amounts of time. The subject matter expert and/or artificial intelligence (AI) module can choose either material depending on the relative importance of reliability versus cost, analyze the effect on the product profile matrix, and adjust the material selection accordingly (Mroczka, Abstract, [0001-12], FIG.1, [0022-23]).

Conclusion
5.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Marecki et al., US 2012/0047103 A1, Method for optimally sharing information from sender to recipients, involves computing policy for sender, and sharing information among recipients in subsequent time interval according to policy, [0059]: Markov Decision Process, yields a maximum expected reward for an agent.
Coden et al., US 2015/0286819 A1, Method for predicting insider threat, involves scoring entity according to matches of classified features to patterns of insider threat and predicting insider threat corresponding to entity according to score, [0005]: Markov chain, taking the maximal value over all input scores.

6.	THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHHIAN (AMY) LING whose telephone number is (571)270-1074.  The examiner can normally be reached on M-F 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BRIAN J GILLIS can be reached on (571) 272-7952.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.L/ Examiner, Art Unit 2446

/ARVIN ESKANDARNIA/ Primary Patent Examiner, Art Unit 2446