DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claim(s) 15 and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
35 U.S.C. 101 Rejection
Applicant argues: “In asserting the rejection of claim 15 under 35 U.S.C. § 101, the Examiner contends that the entirety of the above-reproduced claim can "be performed by the human mind". Applicant respectfully disagrees. 
It is respectfully asserted that "implement[ing] a dialog with at least one user based on the policy using an interface of an associated device" could not "be performed by the mind", and further, is not simply "insignificant extra-solution activity for displaying data" at least because a stated goal of the present invention is using a "learned policy...in dialog systems to interact with a user in a series of conversations" (Specification as filed, paragraph [0020]). Thus, such a dialog is not extra-solution activity, but rather is an important feature of the present invention.”
Examiner Response: Examiner respectfully disagrees. Examiner maintains that the claimed element “implement a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device” is an extra-solution activity.
Furthermore, “the interface being configured for user-specific personalized communication with the dialog system.” is a well-known, understood, routine, and conventional activity. “Previous research on related topics can be roughly broken up into three areas, the first focusing on personalized recommendation systems, the second on conversational interfaces, and the third on adaptive dialogue systems.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 413] And “Although research in personalized recommendation systems has become widespread only in recent years, the basic idea can be traced back to Rich (1979), discussed below with other work on conversational interfaces. Langley (1999) gives a more thorough review of recent research on the topic of adaptive interfaces and personalization.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 414] (see MPEP 2106.05(d)).
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract ideas and mathematical concepts without significantly more.
When considering subject matter eligibility under 35 U.S.C. 101, it must be determined whether the claim is directed to one of the four statutory categories of invention (i.e., process, machine, manufacture, or composition of matter). See the 2019 PEG for more details of the analysis.
Step 1:
According to the first part of the analysis, in the instant case, claims 15-19 are directed to a system comprising at least a processor and claim 20 is directed to a computer program product comprising a computer readable storage medium which is defined to exclude transitory signals in paragraph [0096]. Thus, each of the claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Claim 15 recites: 
	Step 2A, Prong 1:
“A computer system for learning a policy for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising: a memory tangibly storing the program instructions; a processor in communications with the memory for executing the program instructions, wherein the processor is configured to: obtain a policy base indicating a topic transition from a source topic to a destination topic and a short-term reward for the topic transition by analyzing data from a corpus, wherein the short-term reward is defined as probability of associating a positive response”. Save for the recitation of generic computer equipment (“computer system”, “memory”, and “processor”), this step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process. 
“calculate an expected long-term reward for the topic transition using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition”. This step is understood to be a recitation of a mathematical concept. 
“generate a personalized policy using the policy base and the expected long-term reward for the topic transition, wherein the personalized policy indicates selection of the destination topic for the source topic as an associative topic for a current topic.” This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process. 
Step 2A, Prong 2:
“A computer system for learning a policy for selection of an associative topic, by executing program instructions, the computer system comprising: a memory tangibly storing the program instructions; a processor in communications with the memory for executing the program instructions, wherein the processor is configured to”. The “memory” and “processor” are understood to be generic computer equipment used to execute computer instructions. See MPEP 2106.05(f).
“and implement a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system.” (This step appears to be directed to displaying data, which is understood to be insignificant extra-solution activity. See MPEP 2106.05(g).).
The additional elements in the claim do not integrate the judicial exception into a practical application.
Step 2B:
“A computer system for learning a policy for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising: a memory tangibly storing the program instructions; a processor in communications with the memory for executing the program instructions, wherein the processor is configured to”. The “memory” and “processor” are understood to be generic computer equipment used to execute computer instructions. See MPEP 2106.05(f).
“and implement a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system.” (This step appears to be directed to displaying data, which is understood to be insignificant extra-solution activity. See MPEP 2106.05(g).).
“…the interface being configured for user-specific personalized communication with the dialog system.” (This claim element is a well-known, understood, routine, and conventional activity. “Previous research on related topics can be roughly broken up into three areas, the first focusing on personalized recommendation systems, the second on conversational interfaces, and the third on adaptive dialogue systems.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 413] And “Although research in personalized recommendation systems has become widespread only in recent years, the basic idea can be traced back to Rich (1979), discussed below with other work on conversational interfaces. Langley (1999) gives a more thorough review of recent research on the topic of adaptive interfaces and personalization.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 414]).
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
Claim 16 recites:
	Step 2A, Prong 1:
“evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.” This step is understood to be a recitation of a mathematical concept. 
Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already discussed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
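For illustration only, the calculation recited in claims 15 and 16 can be written in the form of a standard Bellman-style recursion. The symbols below are assumptions introduced for this sketch and do not appear in the claims: r(s, d) denotes the short-term reward for the transition from source topic s to destination topic d, Q(s, d) denotes the expected long-term reward for that transition, and γ denotes a discount factor between 0 and 1. This is not asserted to be the equation of Applicant's specification.

```latex
Q(s, d) = r(s, d) + \gamma \max_{d'} Q(d, d')
```

As a worked instance under assumed values r(s, d) = 0.6, γ = 0.9, and a maximum subsequent long-term reward of 0.5, the expression evaluates to 0.6 + 0.9 × 0.5 = 1.05.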
Claim 17 recites:
	Step 2A, Prong 1:
“solve, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward.” This step is understood to be a recitation of a mathematical concept. 
Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already discussed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
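For illustration only, a dynamic programming solution of the kind recited in claim 17 might be sketched as follows. The function name `solve_long_term_rewards`, the topic names, the reward values, and the discount factor are hypothetical and are not drawn from the claims or the specification. The sketch assumes a Bellman-style recursion with a maximum over subsequent transitions and solves it by repeated sweeps until the values stop changing.

```python
from typing import Dict, Tuple

Transition = Tuple[str, str]  # (source topic, destination topic)


def solve_long_term_rewards(
    short_term: Dict[Transition, float],
    gamma: float = 0.9,
    tol: float = 1e-9,
    max_iters: int = 10_000,
) -> Dict[Transition, float]:
    """Iterate Q(s, d) = r(s, d) + gamma * max_d' Q(d, d') to a fixed point."""
    if not short_term:
        return {}
    q = {t: 0.0 for t in short_term}
    for _ in range(max_iters):
        new_q = {}
        for (src, dst), reward in short_term.items():
            # Long-term rewards of the transitions that can follow `dst`.
            following = [q[t] for t in short_term if t[0] == dst]
            best_next = max(following) if following else 0.0
            new_q[(src, dst)] = reward + gamma * best_next
        delta = max(abs(new_q[t] - q[t]) for t in short_term)
        q = new_q
        if delta < tol:
            break
    return q


# Hypothetical toy policy base: short-term rewards per topic transition.
rewards = {
    ("weather", "travel"): 0.6,
    ("travel", "food"): 0.5,
    ("weather", "sports"): 0.7,
}
long_term = solve_long_term_rewards(rewards, gamma=0.9)
print(long_term)
```

In this toy data the transition with the lower short-term reward (weather to travel) ends with the higher long-term value, because a rewarding follow-on transition is available from its destination topic.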
Claim 18 recites:
	Step 2A, Prong 1:
“convert from the expected long-term reward to probability by using a softmax function”. This step is understood to be a recitation of a mathematical concept. 
	“merge the occurrence probability of the policy and the probability converted from the expected long-term reward.” This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
	Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already discussed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
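For illustration only, the two operations recited in claim 18 might be sketched as follows. The softmax function is the standard one. The claim does not state how the two probabilities are merged, so the weighted average used here, the `weight` parameter, and all topic names and numeric values are assumptions made for this sketch.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert per-topic long-term rewards into probabilities."""
    peak = max(scores.values())  # subtracting the max avoids overflow
    exps = {topic: math.exp(value - peak) for topic, value in scores.items()}
    total = sum(exps.values())
    return {topic: e / total for topic, e in exps.items()}


def merge_probabilities(
    occurrence: Dict[str, float],
    converted: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Assumed merge rule: weighted average of the two distributions."""
    return {
        topic: weight * occurrence[topic] + (1.0 - weight) * converted[topic]
        for topic in occurrence
    }


# Hypothetical values for two candidate destination topics.
long_term = {"travel": 1.05, "sports": 0.7}
occurrence = {"travel": 0.4, "sports": 0.6}

converted = softmax(long_term)
merged = merge_probabilities(occurrence, converted)
print(converted, merged)
```

Because both inputs sum to one, the weighted average also sums to one, so the merged values remain a probability distribution over destination topics.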
Claim 19 recites:
	Step 2A, Prong 1:
	“select the destination topic as the associative topic for the current topic using the policy”. This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
	“observe a positive or negative actual response from user environment to obtain a user-provided reward”. This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
“update the expected long-term reward and the policy by using the user-provided reward.” This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
	Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already discussed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
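For illustration only, the select, observe, and update sequence recited in claim 19 might be sketched with a temporal-difference-style update, which is swapped in here as an assumption because the claim does not name an update rule. The function names, the learning rate, the discount factor, the encoding of a positive response as +1.0, and the toy values are all hypothetical.

```python
from typing import Dict, Tuple

Transition = Tuple[str, str]  # (source topic, destination topic)


def update_with_user_reward(
    q: Dict[Transition, float],
    transition: Transition,
    user_reward: float,
    gamma: float = 0.9,
    learning_rate: float = 0.1,
) -> float:
    """Move Q(transition) toward user_reward + gamma * best follow-on value."""
    _, dst = transition
    following = [value for (src, _), value in q.items() if src == dst]
    target = user_reward + gamma * (max(following) if following else 0.0)
    q[transition] += learning_rate * (target - q[transition])
    return q[transition]


def greedy_policy(q: Dict[Transition, float]) -> Dict[str, str]:
    """For each source topic, pick the destination with the highest value."""
    best: Dict[str, str] = {}
    for (src, dst), value in q.items():
        if src not in best or value > q[(src, best[src])]:
            best[src] = dst
    return best


# Hypothetical long-term rewards before any user feedback.
q_values = {
    ("weather", "travel"): 1.05,
    ("travel", "food"): 0.5,
    ("weather", "sports"): 0.7,
}
# Assumed encoding: a positive user response is observed as +1.0.
updated = update_with_user_reward(q_values, ("weather", "travel"), user_reward=1.0)
policy = greedy_policy(q_values)
print(updated, policy)
```

With these assumed values the target is 1.0 + 0.9 × 0.5 = 1.45, so the stored value moves one tenth of the way from 1.05 toward 1.45.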
Claim 20 recites:
	Step 2A, Prong 1:
	“A computer program product for learning a policy for selection of an associative topic, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method of: analyzing data from a corpus, obtaining a policy base indicating a topic transition from a source topic to a destination topic and a short-term reward for the topic transition, the short-term reward being defined as probability of associating a positive response”. Save for the recitation of generic computer equipment (“computer readable storage medium” and “program instructions”), this step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
“calculating an expected long-term reward for the topic transition using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition”. This step is understood to be a recitation of a mathematical concept.
	“generating a personalized policy using the policy base and the expected long-term reward for the topic transition, the personalized policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user;” This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.
Step 2A, Prong 2:
“A computer program product for learning a policy for selection of an associative topic, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method of”. The “computer readable storage medium” is understood to be generic computer equipment used to execute computer instructions. See MPEP 2106.05(f).
“implementing a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system.” (This step appears to be directed to displaying data, which is understood to be insignificant extra-solution activity. See MPEP 2106.05(g).).

	Step 2B:
“A computer program product for learning a policy for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method of”. The “computer readable storage medium”, device, and “processor-based system” are understood to be generic computer equipment used to execute computer instructions. See MPEP 2106.05(f).
“implementing a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system.” (This step appears to be directed to displaying data, which is understood to be insignificant extra-solution activity. See MPEP 2106.05(g).).
“…the interface being configured for user-specific personalized communication with the dialog system.” (This claim element is a well-known, understood, routine, and conventional activity. “Previous research on related topics can be roughly broken up into three areas, the first focusing on personalized recommendation systems, the second on conversational interfaces, and the third on adaptive dialogue systems.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 413] And “Although research in personalized recommendation systems has become widespread only in recent years, the basic idea can be traced back to Rich (1979), discussed below with other work on conversational interfaces. Langley (1999) gives a more thorough review of recent research on the topic of adaptive interfaces and personalization.” [Thomson et al.; A Personalized System for Conversational Recommendations; pg. 414]).
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
Claim 21 recites:
Step 2A, Prong 1:
	“wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.” (This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.)
Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already addressed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
Claim 22 recites:
Step 2A, Prong 1:
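	“wherein the processor is further configured to generate conversational text based on the associative topic and user-specific interests.” (This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.)
Step 2A, Prong 2:
The “processor” is understood to be generic computer equipment used to execute computer instructions. See MPEP 2106.05(f).
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 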
Claim 23 recites:
Step 2A, Prong 1:
	“wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.” (This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.)
Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already addressed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
Claim 24 recites:
Step 2A, Prong 1:
	“generating conversational text based on the associative topic and user-specific interests.” (This step appears to be practically implementable in the human mind and is understood to be a recitation of a mental process.)
Step 2A, Prong 2:
This claim does not appear to recite any additional elements not already addressed.
Step 2B:
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible. 
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 15, 19, 20, 22, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of Genevay et al. ("Transfer Learning for User Adaptation in Spoken Dialogue Systems.").
Regarding Claim 15,
Misu teaches a computer system for learning a policy for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising: 
a memory tangibly storing the program instructions; 
a processor in communications with the memory for executing the program instructions, wherein the processor is configured to: 
obtain a policy base (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.) indicating a topic transition from a source topic to a destination topic (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.) and a short-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function) by analyzing data from a corpus, wherein the short-term reward is defined as probability of associating a positive response (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); 
calculate an expected long-term reward for the topic transition using the short-term reward for the topic transition (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Qπ : S × A → ℜ.) with taking into account a discounted reward for a subsequent topic transition (pg. 89; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.); 
generate a personalized policy using the policy base (pg. 84; A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL. Our learned policy outperforms two baselines (including the original dialogue policy that was used for collecting the corpus) in a simulation setting.) and the expected long-term reward for the topic transition (pg. 86; and γ a discount factor weighting long-term rewards.), wherein the personalized policy indicates selection of the destination topic for the source topic as an associative topic for a current topic (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, … At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si, with initial belief state b0.); and 
implement a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.), the interface being configured for user-specific personalized communication with the dialog system.
Misu does not explicitly disclose
a personalized policy…
… the interface being configured for user-specific personalized communication with the dialog system
However, Genevay (Transfer Learning for User Adaptation in Spoken Dialogue Systems) teaches 
	a personalized policy…(pg. 2; While previous work in spoken dialogue systems mostly focused on learning a policy which is good for all users in average [27, 39], the goal here is to learn a personalised policy which is adapted to the user’s characteristics.)
… the interface being configured for user-specific personalized communication with the dialog system (pg. 2; Spoken dialogue systems have the ability to interact directly with a human through speech. Reinforcement Learning [35] is a popular framework for dialogue management in spoken dialogue systems [28, 34]. This allows the system to learn its optimal behaviour by exploring different options and getting a delayed reward denoting the performance of its actions.  And pg. 2; While previous work in spoken dialogue systems mostly focused on learning a policy which is good for all users in average [27, 39], the goal here is to learn a personalised policy which is adapted to the user’s characteristics.).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Genevay.
	Doing so would allow for adapting the dialogue system quickly to new users (Genevay, pg. 2; The goal is to use and transfer this prior knowledge to adapt the system to a new user as quickly as possible without impacting asymptotic performance.).
Regarding Claim 19,
Misu and Genevay teach the computer system of claim 15. Misu further teaches wherein the processor is further configured to: 
select the destination topic as the associative topic for the current topic using the policy (pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.); 
observe a positive or negative actual response from user environment to obtain a user-provided reward (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); and
update the expected long-term reward and the policy by using the user-provided reward (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Qπ : S × A → ℜ.).
Regarding Claim 20,
Misu teaches a computer program product for learning a policy for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method of: 
by analyzing data from a corpus, obtaining a policy base (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.) indicating a topic transition from a source topic to a destination topic (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.) and a short-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function), the short-term reward being defined as probability of associating a positive response (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); 
calculating an expected long-term reward for the topic transition using the short-term reward for the topic transition (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Qπ : S × A → ℜ.) with taking into account a discounted reward for a subsequent topic transition (pg. 89; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.); 
generating a personalized policy using the policy base (pg. 84; A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL. Our learned policy outperforms two baselines (including the original dialogue policy that was used for collecting the corpus) in a simulation setting.) and the expected long-term reward for the topic transition (pg. 86; and γ a discount factor weighting long-term rewards.), the personalized policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, … At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si, with initial belief state b0.); and 
implementing a dialog with the at least one user device based on the personalized policy using an interface of the at least one user device (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.), the interface being configured for user-specific personalized communication with the dialog system.
Misu does not explicitly disclose
a personalized policy… 
… the interface being configured for user-specific personalized communication with the dialog system.
However, Genevay (Transfer Learning for User Adaptation in Spoken Dialogue Systems) teaches
personalized policy… (pg. 2; While previous work in spoken dialogue systems mostly focused on learning a policy which is good for all users in average [27, 39], the goal here is to learn a personalised policy which is adapted to the user’s characteristics.)
… the interface being configured for user-specific personalized communication with the dialog system (pg. 2; Spoken dialogue systems have the ability to interact directly with a human through speech. Reinforcement Learning [35] is a popular framework for dialogue management in spoken dialogue systems [28, 34]. This allows the system to learn its optimal behaviour by exploring different options and getting a delayed reward denoting the performance of its actions.  And pg. 2; While previous work in spoken dialogue systems mostly focused on learning a policy which is good for all users in average [27, 39], the goal here is to learn a personalised policy which is adapted to the user’s characteristics.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Genevay.
Doing so would allow for adapting the dialogue system quickly to new users (pg. 2; The goal is to use and transfer this prior knowledge to adapt the system to a new user as quickly as possible without impacting asymptotic performance.).
Regarding Claim 22,
Misu and Genevay teach the computer system of claim 15. Misu further teaches wherein the processor is further configured to generate conversational text based on the associative topic and user-specific interests (pg. 85; Figure 1: Example dialogue between the Twins virtual characters and a museum visitor.).
Regarding Claim 24,
Misu and Genevay teach the computer program product of claim 20. Misu further teaches further comprising generating conversational text based on the associative topic and user-specific interests (pg. 85; Figure 1: Example dialogue between the Twins virtual characters and a museum visitor.).
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of Genevay et al. ("Transfer Learning for User Adaptation in Spoken Dialogue Systems.") and Stacy et al. (US-20130288222-A1; hereinafter Stacy).
Regarding Claim 16,
Misu and Genevay teach the computer system of claim 15.
Misu and Genevay do not explicitly disclose
wherein the processor is further configured to: evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
However, Stacy et al. teaches
wherein the processor is further configured to: 
evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward ([0077] Given an MDP (ignoring the partial observability aspect for the moment), the object is to construct a stationary policy π: S → A, where π(s) denotes the action to be executed in state s, that maximizes the expected accumulated reward over a horizon T of interest:
$E\left(\sum_{t=0}^{T} r_t\right)$).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of finding the optimal state transition of Misu et al. with the method of finding the optimal state transition of Stacy et al.
Doing so would allow for revision of the learning model (para [0084] Some embodiments of the learning model are able learn and revise the learning model to improve the model.).
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of Genevay et al. ("Transfer Learning for User Adaptation in Spoken Dialogue Systems.") and Uchibe et al. (US-20170147949-A1; hereinafter Uchibe).
Regarding Claim 17,
Misu and Genevay teach the computer system of claim 15.
Misu and Genevay do not explicitly disclose
solve, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward.
However, Uchibe teaches
wherein the processor is further configured to: 
solve, by a dynamic programming or Monte Carlo method (para [0217] In contrast, some previous methods use a stochastic gradient method or a Markov chain Monte Carlo method, which usually take time to optimize as compared with least-squares methods.), an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward (para [0071] [γ] is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation:
$V(x) = \min_u \left[ c(x, u) + \gamma \, \mathbb{E}_{y \sim P_T(\cdot \mid x, u)} [V(y)] \right]$ (2)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0044] An object of the present invention is to provide a new and improved inverse reinforcement learning system and method so as to obviate one or more of the problems of the existing art.).
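For illustration only, and not as a characterization of Uchibe or of the claimed invention, solving such an equation by dynamic programming can be sketched as below. The sketch assumes a reward-maximizing form, Q(s, d) = r(s, d) + γ · max over d' of Q(d, d'), with deterministic topic transitions. The function name, topic names, and reward values are hypothetical and do not appear in any cited reference.

```python
def solve_long_term_reward(short_term, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Iterate Q(s, d) = r(s, d) + gamma * max_d' Q(d, d') to a fixed point.

    short_term maps a (source_topic, destination_topic) pair to its
    short-term reward. All names and values here are illustrative assumptions.
    """
    q = {edge: 0.0 for edge in short_term}
    for _ in range(max_iter):
        new_q = {}
        delta = 0.0
        for (src, dst), reward in short_term.items():
            # Long-term values of the transitions that leave the destination topic.
            following = [q[(s, d)] for (s, d) in short_term if s == dst]
            best_next = max(following) if following else 0.0
            new_q[(src, dst)] = reward + gamma * best_next
            delta = max(delta, abs(new_q[(src, dst)] - q[(src, dst)]))
        q = new_q
        if delta < tol:
            break
    return q


# Hypothetical short-term rewards for three topic transitions.
rewards = {("art", "music"): 1.0, ("music", "art"): 0.0, ("art", "sports"): 0.5}
q_values = solve_long_term_reward(rewards, gamma=0.5)
```

In this hypothetical example the transition from "art" to "sports" keeps only its short-term reward because no further transition leaves "sports", while the other two transitions accumulate discounted reward from the transitions that follow them.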
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of Genevay et al. ("Transfer Learning for User Adaptation in Spoken Dialogue Systems."), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe) and Rasmussen et al. (US-20170061283-A1).
Regarding Claim 18,
Misu and Genevay teach the computer system of claim 15. 
Misu and Genevay do not explicitly disclose

convert from the expected long-term reward to probability by using a softmax function; and 
merge the occurrence probability of the policy and the probability converted from the expected long-term reward.
However, Uchibe teaches
wherein the policy base includes occurrence probability of the topic transition in the corpus, the processor being further configured to: 
merge the occurrence probability of the policy and the probability converted from the expected long-term reward (para [0074] $c(x, u) = q(x) + \mathrm{KL}(\pi(\cdot \mid x) \,\|\, p(\cdot \mid x))$ (3). The $p(\cdot \mid x)$ denotes the occurrence probability. para [0075] In this case, the Bellman equation (2) is simplified to the following equation:
$\exp(-V(x)) = \exp(-q(x)) \int p(y \mid x) \exp(-\gamma V(y)) \, dy$ (4). The long-term reward is merged with the occurrence probability).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0044] An object of the present invention is to provide a new and improved inverse reinforcement learning system and method so as to obviate one or more of the problems of the existing art.).
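Misu, Genevay, and Uchibe do not explicitly disclose converting the expected long-term reward to probability by using a softmax function.
However, Rasmussen teaches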
convert from the expected long-term reward to probability by using a softmax function (para [0006] Those allow for an approximation of the value of action a, which is compared to the predicted value Q(s, a) in order to compute the update to the prediction. The agent can then determine a policy by selecting the highest valued action in each state (with occasional random exploration, e.g., using e-greedy or softmax methods).);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of computing state transition of Uchibe et al. with the method of computing state transition of Rasmussen et al.
Doing so would allow for robustness against noise (para [0017] In some cases, the error module computes an error that may include an integrative discount. It is typical in RL implementations to use exponential discounting. However, in systems with uncertain noise, or that are better able to represent more linear functions, integrative discount can be more effective.).
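For illustration only, the two limitations discussed for claim 18 (softmax conversion and merging with the occurrence probability) can be sketched as below. The claim language does not state how the two probabilities are merged, so the weighted average used here is an assumption, as are the function names, the `weight` parameter, and the example values. None of them is taken from Uchibe or Rasmussen.

```python
import math


def softmax(values):
    """Convert a list of expected long-term rewards into probabilities."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def merge_probabilities(occurrence, long_term, weight=0.5):
    """Merge corpus occurrence probabilities with softmax-converted rewards.

    The weighted average is an assumed merge rule, chosen only because it
    keeps the result a valid probability distribution.
    """
    destinations = list(occurrence)
    reward_probs = softmax([long_term[d] for d in destinations])
    return {
        d: weight * occurrence[d] + (1.0 - weight) * p
        for d, p in zip(destinations, reward_probs)
    }


# Hypothetical destination topics for one source topic.
occurrence = {"music": 0.7, "sports": 0.3}   # occurrence probability in the corpus
long_term = {"music": 1.0, "sports": 1.0}    # expected long-term rewards
merged = merge_probabilities(occurrence, long_term)
```

Because the two hypothetical long-term rewards are equal, the softmax assigns each destination a probability of 0.5, and the merged distribution lies halfway between that and the corpus occurrence probabilities.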
Claims 21 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of Genevay et al. ("Transfer Learning for User Adaptation in Spoken Dialogue Systems."), and Arel et al. (US-9536191-B1).
Regarding Claim 21,
Misu and Genevay teach the computer system of claim 15. 
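Misu and Genevay do not explicitly disclose
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.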
However, Arel (US 9536191 B1) teaches
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment (col. 3 lines 1-6; By using a confidence function representation to adjust temporal difference learning updates and to select actions to be performed by an agent interacting with an environment, a reinforcement learning system can decrease the amount of agent interaction required to determine a proficient action selection policy.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Arel.
Doing so would allow for selecting high confidence actions (col. 3 lines 11-17; Moreover, using the confidence function representation in selecting actions can increase the state space visited by the agent during learning in a principled manner and avoid unnecessarily prolonging the learning process by forcing the reinforcement learning system to favor selecting higher-confidence actions.).
Regarding Claim 23,
Misu and Genevay teach the computer program product of claim 20. 
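Misu and Genevay do not explicitly disclose
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.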
However, Arel (US 9536191 B1) teaches
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment (col. 3 lines 1-6; By using a confidence function representation to adjust temporal difference learning updates and to select actions to be performed by an agent interacting with an environment, a reinforcement learning system can decrease the amount of agent interaction required to determine a proficient action selection policy.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Arel.
Doing so would allow for selecting high confidence actions (col. 3 lines 11-17; Moreover, using the confidence function representation in selecting actions can increase the state space visited by the agent during learning in a principled manner and avoid unnecessarily prolonging the learning process by forcing the reinforcement learning system to favor selecting higher-confidence actions.).
Relevant Prior Art not cited in the Rejection
Iglesias, Ana, et al. "Learning teaching strategies in an adaptive and intelligent educational system through reinforcement learning."
Prior art discloses reinforcement learning in an educational system. 


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217.  The examiner can normally be reached on Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/HENRY NGUYEN/Examiner, Art Unit 2121


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121