DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Acknowledgement is made of Applicant’s claim amendments on 03/23/2022. The claim amendments are entered. Presently, claims 15-28 remain pending. Claims 15 and 20 have been amended.
Response to Arguments
Applicant's arguments filed 03/23/2022 have been fully considered but they are not persuasive.
Applicant argues: The cited references do not teach "in response to receiving a request by at least one user, analyze data from a corpus stored on a memory associated with a hardware processor to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic and a short-term reward for the topic transition, the short-term reward being defined as a determined probability of appearances of one or more types of positive expressions associated with the topic transition in the conversation" (pages 12-17 of the remarks).
Examiner Response: Examiner respectfully disagrees. The claims do not further define what a “type of positive expression” actually is, nor do the above-cited paragraphs detail what a “type of positive expression” is. A positive response is a “type of positive expression.”
	Misu teaches analyzing a corpus to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.). The policy indicates a transition from one topic to another in a dialogue (i.e. conversation) (pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.). Misu also teaches a short-term reward (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function). The short-term reward is calculated (i.e. defined) based on answering the user’s questions correctly (i.e. positive response) in a dialogue (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).). Table 2 and Table 3 show that there is more than one type of positive expression (i.e. counts). The rewards have a dependency on a destination topic because they are used in the reward function to determine the topic transition.
[Image: media_image1.png]

SCHATZMANN further teaches the short-term reward being (pg. 9, section 3.2; It executes the discrete action a_t ∈ A, transitions into the next state s_{t+1} according to the transition probability p(s_{t+1}|s_t, a_t) and receives a reward r_{t+1}. The Markov Property ensures that the state and reward at time t + 1 only depend on the state and action at time t. P(s_{t+1}, r_{t+1}|s_t, a_t, s_{t−1}, a_{t−1}, r_{t−1}, ..., s_0, a_0) = P(s_{t+1}, r_{t+1}|s_t, a_t) (1)) defined as a determined probability (pg. 11, section 3.3; If T(s', a, s) and R(s', a, s) are unknown, we cannot systematically account for every possible combination of state and action and instead the agent needs to interact with its environment to learn these probabilities. Reward defined as a probability).
Applicant argues: Applicant presents arguments regarding claims 25 and 26.
Examiner Response: Applicant’s arguments with respect to the rejection of claims 25 and 26 have been fully considered and are persuasive. Therefore, the 103 rejection has been withdrawn. Claim 26 is objected to as being directed to allowable subject matter, while claim 25 remains rejected under double patenting.
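For illustration only, the disputed limitation (a short-term reward defined as a determined probability of appearances of positive expressions, based on an appearance count) can be read as an empirical frequency over corpus transitions. The following is a minimal sketch under stated assumptions, not the method of the application or of either cited reference: the word list `POSITIVE_EXPRESSIONS`, the function name `short_term_rewards`, and the (topic, user response) corpus layout are all hypothetical.

```python
from collections import Counter

# Hypothetical list of positive expressions; the claims do not enumerate them.
POSITIVE_EXPRESSIONS = {"great", "interesting", "thanks"}


def short_term_rewards(dialogues):
    """Estimate R(source, dest) as the fraction of source->dest transitions
    whose user response (given at the destination topic) contains at least
    one positive expression."""
    transitions = Counter()  # how often each topic transition occurs
    positives = Counter()    # how often it is followed by a positive expression
    for turns in dialogues:
        # Each turn is a (topic, user_response_text) pair (assumed layout).
        for (src, _), (dst, response) in zip(turns, turns[1:]):
            transitions[(src, dst)] += 1
            if set(response.lower().split()) & POSITIVE_EXPRESSIONS:
                positives[(src, dst)] += 1
    return {pair: positives[pair] / n for pair, n in transitions.items()}


# Toy corpus of two dialogues.
corpus = [
    [("weather", "ok"), ("travel", "great idea"), ("food", "hmm")],
    [("weather", "fine"), ("travel", "not now")],
]
rewards = short_term_rewards(corpus)
```

In this toy corpus the weather-to-travel transition occurs twice and draws a positive expression once, so its estimated reward is one half.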

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 15-19, 25, and 27 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4, 6, 15, and 17 of copending Application No. 15/800,465 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the only difference appears to be that claims 15-19, 25, and 27 of the instant application are system claims corresponding to method claims 1-4, 6, 15, and 17 of copending Application No. 15/800,465. It is obvious that a computer system comprising a memory, computer-readable program code, and a processor is needed in order to carry out the instructions for the method claims of the copending application.
The limitation “the personalized policy optimizing occurrence probability for the topic transition based on the learning using the corpus data” is obvious in view of Misu (pg. 86; There are several algorithms for learning the optimal dialogue policy and we use Natural Actor Critic (NAC) (Peters and Schaal, 2008), which adopts a natural policy gradient method for policy optimization, also used by (Thomson and Young, 2010; Jurčíček et al., 2012). And pg. 89; The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus.). It would have been obvious to combine the personalized policy of Misu with the instant application. Doing so would allow for a policy specifically tailored for a certain user. The limitation “the interface being configured for user-specific personalized communication with the dialog system” is also obvious in view of Misu (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.). It is obvious that an interface is needed for a user to interact with the system.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Copending (reference) Application No. 15/800,465 (method claims 1-4, 6, 15, and 17)
Instant Application No. 15/450,709 (system claims 15-19, 25, and 27)
Claim 1: A computer-implemented method for selection of an associative topic in a conversation, the method comprising: 












learning a policy for the selection of an associative topic using a machine learning process, the machine learning process including online learning and offline learning, comprising: 



in response to receiving a request by at least one user, analyzing data from a corpus stored on a memory associated with a hardware processor to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic and a short-term reward for the topic transition, the short-term reward being defined as a determined probability of appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation;


calculating, using the hardware processor, an expected long-term reward for the topic transition in the conversation using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition in the conversation;

 generating a policy using the policy base and the expected long-term reward for the topic transition in the conversation, the policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user; and 



implementing a dialogue between a remote computing device with at least one user based on the policy using an interface of an associated device to obtain a user-provided reward during the conversation, the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
Claim 15: A computer system for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising: 

a memory tangibly storing the program instructions; 

a processor in communications with the memory for executing the program instructions, wherein the processor is configured to: 

learn a policy for the selection of an associative topic using a machine learning process, the machine learning process including online learning and offline learning, comprising: 


in response to receiving a request by at least one user, analyze data from a corpus stored on a memory associated with a hardware processor to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic and a short-term reward for the topic transition, the short-term reward being defined as a determined probability of appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation;


calculate an expected long-term reward for the topic transition using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition; 




generate a personalized policy using the policy base and the expected long-term reward for the topic transition, wherein the personalized policy indicates selection of the destination topic for the source topic as an associative topic for a current topic, the personalized policy optimizing occurrence probability for the topic transition based on the learning using the corpus data; and 

implement a dialog between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system, the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
Claim 2: The method of claim 1, wherein the calculating comprises: 


evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
Claim 16: The computer system of claim 15, wherein the processor is further configured to: 

evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
Claim 3: The method of claim 1, wherein the calculating comprises: 


solving, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition.
Claim 17: The computer system of claim 15, wherein the processor is further configured to: 

solve, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward.
Claim 4: The method of claim 1, wherein the policy base includes occurrence probability of the topic transition in the corpus, the generating comprising: converting from the expected long-term reward to probability by using a softmax function; and merging the occurrence probability of the policy base and the probability converted from the expected long-term reward.
Claim 18: The computer system of claim 15, wherein the policy base includes occurrence probability of the topic transition in the corpus, the processor being further configured to: convert from the expected long-term reward to probability by using a softmax function; and merge the occurrence probability of the policy and the probability converted from the expected long-term reward.
Claim 6: The method of claim 1, wherein the method further comprises: selecting the destination topic as the associative topic for the current topic using the policy; observing a positive or negative actual response from user environment to obtain a user-provided reward; and updating the expected long-term reward and the policy by using the user-provided reward.
Claim 19: The computer system of claim 15, wherein the processor is further configured to: select the destination topic as the associative topic for the current topic using the policy; observe a positive or negative actual response from user environment to obtain a user-provided reward; and update the expected long-term reward and the policy by using the user-provided reward.
Claim 15:
The method of claim 1, wherein the expected long-term reward for each topic transition (Q (t, t')) using the immediate reward for each topic transition (R(t, t')) while taking into account discounted reward for a subsequent topic transition is determined as follows: 
[Equation image: media_image2.png]
where γ denotes a discount factor (γ<1) for evaluating a discounted value of the expected long-term reward for the subsequent topic transition, where Q represents the expected long-term reward function, R represents the immediate reward function, t represents an initial topic feature, t' represents a destination topic feature, and t'' represents a certain subsequent topic feature.
Claim 25:
The computer system of claim 15, wherein the expected long-term reward for each topic transition (Q(t, t')) using the immediate reward for each topic transition (R(t, t')) while taking into account discounted reward for a subsequent topic transition is determined as follows: 
[Equation image: media_image3.png]
where γ denotes a discount factor (γ<1) for evaluating a discounted value of the expected long-term reward for the subsequent topic transition, where Q represents the expected long-term reward function, R represents the immediate reward function, Q(t', t'') represents a discounted life-long reward from a selection of a subsequent topic feature, t represents an initial topic feature, t' represents a destination topic feature, and t'' represents the subsequent topic feature.
Claim 17: 
The method of claim 1, further comprising generating conversational text based on the associative topic and user-specific interests.
Claim 27:
The system of claim 15, wherein the processor is further configured for generating conversational text based on the associative topic and user-specific interests.
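For illustration only, the calculation recited in the compared claims (an expected long-term reward solved by dynamic programming, then converted to a probability by a softmax function and merged with the corpus occurrence probability) can be sketched as follows. This is a sketch under stated assumptions: the recurrence is assumed to take the standard form Q(t, t') = R(t, t') + γ · max over t'' of Q(t', t''), which the claims present only as an equation image; the linear mixing in `merged_policy` and its `weight` parameter are hypothetical, since the claims recite "merging" without specifying how; all function names are illustrative.

```python
import math


def long_term_rewards(R, gamma=0.9, sweeps=100):
    """Iterate the assumed recurrence
    Q(t, t') = R(t, t') + gamma * max over t'' of Q(t', t'')."""
    Q = {pair: 0.0 for pair in R}
    for _ in range(sweeps):
        new_Q = {}
        for (t, t_next), r in R.items():
            # Values of every transition that can follow the destination topic.
            followups = [q for (s, _), q in Q.items() if s == t_next]
            new_Q[(t, t_next)] = r + gamma * max(followups, default=0.0)
        Q = new_Q
    return Q


def merged_policy(Q, occurrence, source, weight=0.5):
    """Softmax over Q for one source topic, linearly mixed (an assumption)
    with the occurrence probabilities of the policy base."""
    dests = [d for (s, d) in Q if s == source]
    exps = {d: math.exp(Q[(source, d)]) for d in dests}
    z = sum(exps.values())
    return {
        d: weight * occurrence[(source, d)] + (1 - weight) * exps[d] / z
        for d in dests
    }


# Toy short-term rewards and occurrence probabilities for three topics.
R = {("a", "b"): 0.5, ("a", "c"): 0.2, ("b", "c"): 1.0}
Q = long_term_rewards(R, gamma=0.5)
occurrence = {("a", "b"): 0.4, ("a", "c"): 0.6}
policy = merged_policy(Q, occurrence, "a")
```

In this toy example the transition from a to b has the lower corpus occurrence probability but leads to a well-rewarded follow-up transition, so the merged policy ends up preferring it.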



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 15, 19, 20, 22, 24, and 27-28 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of SCHATZMANN et al. ("A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies"), and Bane et al. (US-20150139074-A1).
Regarding Claim 15,
Misu et al. teaches a computer system for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising: 
a memory tangibly storing the program instructions; 
a processor in communications with the memory for executing the program instructions, wherein the processor is configured to: 
learn a policy for the selection of an associative topic using a machine learning process (pg. 86, section 3; A dialogue policy is a function from contexts to (possibly probabilistic) decisions that the dialogue system will make in those contexts. Reinforcement Learning (RL) is a machine learning technique used to learn the policy of the system.), …, comprising: 
in response to receiving a request by at least one user, analyze data from a corpus stored on a memory associated with a hardware processor to obtain a policy base (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.) indicating a topic transition in the conversation from a source topic to a destination topic (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.) and a short-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function), the short-term reward being defined as … appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count (Table 2 and Table 3) of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); 
calculate an expected long-term reward for the topic transition using the short-term reward for the topic transition (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Q^π : S × A → ℝ.) with taking into account a discounted reward for a subsequent topic transition (pg. 86; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b_0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.); 
generate a personalized policy using the policy base (pg. 84; A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL. Our learned policy outperforms two baselines (including the original dialogue policy that was used for collecting the corpus) in a simulation setting.) and the expected long-term reward for the topic transition (pg. 86; and γ a discount factor weighting long-term rewards.), wherein the personalized policy indicates selection of the destination topic for the source topic as an associative topic for a current topic (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b_0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function, … At any given time step i the world is in some unobserved state s_i ∈ S. Because s_i is not known exactly, we keep a distribution over states called a belief state b, thus b(s_i) is the probability of being in state s_i, with initial belief state b_0.), the personalized policy optimizing occurrence probability for the topic transition based on the learning using the corpus data (pg. 86; There are several algorithms for learning the optimal dialogue policy and we use Natural Actor Critic (NAC) (Peters and Schaal, 2008), which adopts a natural policy gradient method for policy optimization, also used by (Thomson and Young, 2010; Jurčíček et al., 2012). And pg. 89; The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus.); and 
implement a dialog between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.),…
Misu does not explicitly disclose
the short-term reward being defined as a determined probability…; 
…the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
	However, SCHATZMANN teaches
implement a dialog between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device (pg. 4, section 2.1; The general architecture of a spoken dialogue system (SDS) is best visualised using a block diagram depicting the flow of information between the user and the major system components. As shown in Figure 2, the dialogue manager (DM) is the central component of a dialogue system and interfaces with the input- as well as the output-processing side of the system.), the interface being configured for user-specific personalized communication with the dialog system (pg. 2, section 1; In recent years, however, the application of machine-learning approaches to dialogue system design has created a need for simple statistical user models which are probabilistic in nature and trainable on existing human-computer dialogue data),
the short-term reward being (pg. 9, section 3.2; It executes the discrete action a_t ∈ A, transitions into the next state s_{t+1} according to the transition probability p(s_{t+1}|s_t, a_t) and receives a reward r_{t+1}. The Markov Property ensures that the state and reward at time t + 1 only depend on the state and action at time t. P(s_{t+1}, r_{t+1}|s_t, a_t, s_{t−1}, a_{t−1}, r_{t−1}, ..., s_0, a_0) = P(s_{t+1}, r_{t+1}|s_t, a_t) (1)) defined as a determined probability (pg. 11, section 3.3; If T(s', a, s) and R(s', a, s) are unknown, we cannot systematically account for every possible combination of state and action and instead the agent needs to interact with its environment to learn these probabilities. Reward defined as a probability) of appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation (pg. 8, section 3.1; The learning agent interacts with its dynamic environment, receives feedback in the form of positive or negative rewards according to some reward function and tries to optimise its actions so as to maximize the overall reward. By assigning small negative rewards for every action and a large positive reward for successful completion, an agent can be trained to act in a goal-oriented and efficient manner without providing explicit examples of ideal behaviour or specifying how a given task is to be completed (Sutton and Barto, 1998; Kaelbling et al., 1996)… However, it is usually possible to define what constitutes the successful completion of a dialogue. In a travel booking scenario, for instance, a positive reward can be associated with the completion of a flight booking.); 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the dialogue reinforcement learning of Misu with the dialogue reinforcement learning of SCHATZMANN.
Doing so would allow for simulating user responses instead of training based off a static (fixed) corpus. This would enable the dialogue system to be more dynamic and more realistically emulate user responses (pg. 2; In addition, no guarantee can be given that the truly optimal strategy is indeed present in a given training corpus and it may thus be argued that an optimal strategy cannot be learned from a fixed corpus, regardless of its size. An interesting solution to this problem is to train a statistical, predictive user model for simulating user responses which can then be used to learn dialogue strategies through trial-and-error interaction between the dialogue manager and the simulated user (Levin et al., 2000; Scheffler, 2002; Pietquin, 2004; Henderson et al., 2005; Filisko and Seneff, 2005).).
Bane (US 20150139074 A1) teaches
a memory tangibly storing the program instructions (para [0019] In some embodiments, the computing device 202 has at least one processor 204 and a memory area 206.); 
a processor in communications with the memory for executing the program instructions (para [0019] In some embodiments, the processor 204 is programmed to execute instructions such as those illustrated in the figures (e.g., FIG. 3 and FIG. 4).), wherein the processor is configured to: 
…the machine learning process including online learning and offline learning (para [0057] The online learning system 508 communicates with an offline learning system 512. The offline learning system 512 is discussed below with reference to FIG. 8.)
… the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively (para [0071] The data collection module 802 further sends feedback and other data to the offline learning system 512. The offline learning system 512 generates models from the received feedback and data, which is then used to generate tiles. The models and tiles are made available to the online learning system 508. The online learning system 508 loads the tiles to a cache on the front-end module 502.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of implementing devices for reinforcement learning of Misu with the method of implementing devices for reinforcement learning of Bane (para [0032]).
Doing so would allow for implementing the machine learning devices on a cloud network system to improve network traffic (para [0037] Further, the cloud service 104 may manipulate the connection quality data 214 to implement congestion control to shape network traffic to prevent overloading particular networks 108.)
Regarding Claim 19,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15. Misu further teaches wherein the processor is further configured to: 
select the destination topic as the associative topic for the current topic using the policy (pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.); 
observe a positive or negative actual response from user environment to obtain a user-provided reward (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); and
update the expected long-term reward and the policy by using the user-provided reward (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Q^π : S × A → ℝ.).
Regarding Claim 20,
Misu et al. teaches a computer program product for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform the method of: 
learning a policy for the selection of an associative topic using a machine learning process (pg. 86, section 3; A dialogue policy is a function from contexts to (possibly probabilistic) decisions that the dialogue system will make in those contexts. Reinforcement Learning (RL) is a machine learning technique used to learn the policy of the system.), … comprising:
in response to receiving a request by at least one user, analyzing data from a corpus stored on a memory associated with a hardware processor to obtain a policy base (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.) indicating a topic transition in the conversation from a source topic to a destination topic (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.)
and a short-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function),… short-term reward being defined as… appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count (table 2 & table 3) of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).);
calculating an expected long-term reward for the topic transition using the short-term reward for the topic transition (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Q^π : S × A → ℝ.) with taking into account a discounted reward for a subsequent topic transition (pg. 89; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.); 
generating a personalized policy using the policy base (pg. 84; A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL. Our learned policy outperforms two baselines (including the original dialogue policy that was used for collecting the corpus) in a simulation setting.) and the expected long-term reward for the topic transition (pg. 86; and γ a discount factor weighting long-term rewards.), the personalized policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℝ is the reward function, … At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si, with initial belief state b0.),
the personalized policy optimizing occurrence probability for the topic transition based on the learning using the corpus data (pg. 86; There are several algorithms for learning the optimal dialogue policy and we use Natural Actor Critic (NAC) (Peters and Schaal, 2008), which adopts a natural policy gradient method for policy optimization, also used by (Thomson and Young, 2010; Jurčíček et al., 2012). And pg. 89; The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus.); and 
implementing a dialog between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.),…
Misu does not explicitly disclose
the short-term reward being defined as a determined probability…; 
the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
However, SCHATZMANN teaches
implementing a dialog between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device (pg. 4, section 2.1; The general architecture of a spoken dialogue system (SDS) is best visualised using a block diagram depicting the flow of information between the user and the major system components. As shown in Figure 2, the dialogue manager (DM) is the central component of a dialogue system and interfaces with the input- as well as the output-processing side of the system.), the interface being configured for user-specific personalized communication with the dialog system (pg. 2, section 1; In recent years, however, the application of machine-learning approaches to dialogue system design has created a need for simple statistical user models which are probabilistic in nature and trainable on existing human-computer dialogue data),
the short-term reward (pg. 9, section 3.2; It executes the discrete action $a_t \in A$, transitions into the next state $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$ and receives a reward $r_{t+1}$. The Markov Property ensures that the state and reward at time t + 1 only depend on the state and action at time t. $P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0) = P(s_{t+1}, r_{t+1} \mid s_t, a_t)$ (1)) being defined as a determined probability (pg. 11, section 3.3; If $T(s', a, s)$ and $R(s', a, s)$ are unknown, we cannot systematically account for every possible combination of state and action and instead the agent needs to interact with its environment to learn these probabilities. That is, the reward is defined as a probability) of appearances of one or more types of positive expressions associated with the topic transition in the conversation based on an appearance count of any of the one or more types of positive expressions having a dependency to the destination topic in the conversation (pg. 8, section 3.1; The learning agent interacts with its dynamic environment, receives feedback in the form of positive or negative rewards according to some reward function and tries to optimise its actions so as to maximize the overall reward. By assigning small negative rewards for every action and a large positive reward for successful completion, an agent can be trained to act in a goal-oriented and efficient manner without providing explicit examples of ideal behaviour or specifying how a given task is to be completed (Sutton and Barto, 1998; Kaelbling et al., 1996)… However, it is usually possible to define what constitutes the successful completion of a dialogue. In a travel booking scenario, for instance, a positive reward can be associated with the completion of a flight booking.); 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the dialogue reinforcement learning of Misu with the dialogue reinforcement learning of SCHATZMANN.
Doing so would allow for simulating user responses instead of training based off a static (fixed) corpus. This would enable the dialogue system to be more dynamic and more realistically emulate user responses (pg. 2; In addition, no guarantee can be given that the truly optimal strategy is indeed present in a given training corpus and it may thus be argued that an optimal strategy cannot be learned from a fixed corpus, regardless of its size. An interesting solution to this problem is to train a statistical, predictive user model for simulating user responses which can then be used to learn dialogue strategies through trial-and-error interaction between the dialogue manager and the simulated user (Levin et al., 2000; Scheffler, 2002; Pietquin, 2004; Henderson et al., 2005; Filisko and Seneff, 2005).).
Bane (US 20150139074 A1) teaches
a memory tangibly storing the program instructions (para [0019] In some embodiments, the computing device 202 has at least one processor 204 and a memory area 206.); 
a processor in communications with the memory for executing the program instructions (para [0019] In some embodiments, the processor 204 is programmed to execute instructions such as those illustrated in the figures (e.g., FIG. 3 and FIG. 4).), wherein the processor is configured to: 
…the machine learning process including online learning and offline learning (para [0057] The online learning system 508 communicates with an offline learning system 512. The offline learning system 512 is discussed below with reference to FIG. 8.)
… the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively (para [0071] The data collection module 802 further sends feedback and other data to the offline learning system 512. The offline learning system 512 generates models from the received feedback and data, which is then used to generate tiles. The models and tiles are made available to the online learning system 508. The online learning system 508 loads the tiles to a cache on the front-end module 502.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of implementing devices for reinforcement learning of Misu with the method of implementing devices for reinforcement learning of Bane (para [0032]).
Doing so would allow for implementing the machine learning devices on a cloud network system to improve network traffic (para [0037] Further, the cloud service 104 may manipulate the connection quality data 214 to implement congestion control to shape network traffic to prevent overloading particular networks 108.)
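For illustration only, the claimed short-term reward (a probability derived from appearance counts of positive expressions following a topic transition) can be sketched as follows. This is a minimal sketch and is not taken from Misu, SCHATZMANN, or Bane; the tuple format, the function name short_term_rewards, and the sample corpus are all hypothetical assumptions.

```python
from collections import Counter

def short_term_rewards(transitions):
    """Estimate, per (source, destination) topic pair, the fraction of
    observed transitions that were followed by a positive expression.

    `transitions` is a hypothetical corpus format: an iterable of
    (source_topic, destination_topic, was_positive) tuples.
    """
    totals = Counter()      # how often each transition appears
    positives = Counter()   # how often it was followed by a positive expression
    for source, destination, was_positive in transitions:
        totals[(source, destination)] += 1
        if was_positive:
            positives[(source, destination)] += 1
    # Relative frequency used as the probability estimate (an assumption).
    return {pair: positives[pair] / totals[pair] for pair in totals}

# Hypothetical toy corpus.
corpus = [
    ("weather", "travel", True),
    ("weather", "travel", False),
    ("weather", "sports", True),
    ("travel", "food", True),
]
rewards = short_term_rewards(corpus)
```

In this toy corpus the transition from "weather" to "travel" appears twice and is followed by a positive expression once, so its estimated reward is one half.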
Regarding Claim 22,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15. Misu further teaches wherein the processor is further configured to generate conversational text based on the associative topic and user-specific interests (pg. 85; Figure 1: Example dialogue between the Twins virtual characters and a museum visitor.).
Regarding Claim 24,
Misu, SCHATZMANN, and Bane teach the computer program product of claim 20. Misu further teaches generating conversational text based on the associative topic and user-specific interests (pg. 85; Figure 1: Example dialogue between the Twins virtual characters and a museum visitor.).
Regarding Claim 27,
Misu, SCHATZMANN, and Bane teach the system of claim 15. Misu further teaches wherein the processor is further configured for generating conversational text based on the associative topic and user-specific interests (pg. 85; Figure 1: Example dialogue between the Twins virtual characters and a museum visitor.).
Regarding Claim 28,
Claim 28 is the computer program product claim corresponding to the system of claim 27. Claim 28 is substantially similar to claim 27 and is rejected on the same grounds.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of SCHATZMANN et al. ("A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies"), Bane et al. (US-20150139074-A1), and Stacy et al. (US-20130288222-A1; hereinafter Stacy).
Regarding Claim 16,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15.
Misu, SCHATZMANN, and Bane do not explicitly disclose
wherein the processor is further configured to: evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
However, Stacy et al. teaches
wherein the processor is further configured to: 
evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward ([0077] Given an MDP (ignoring the partial observability aspect for the moment), the object is to construct a stationary policy $\pi: S \rightarrow A$, where $\pi(s)$ denotes the action to be executed in state s, that maximizes the expected accumulated reward over a horizon T of interest: $E\left(\sum_{t=0}^{T} r_t\right)$).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of finding the most optimal state transition of Misu, SCHATZMANN, and Bane with the method of finding the most optimal state transition of Stacy et al.
Doing so would allow for revision of the learning model (para [0084] Some embodiments of the learning model are able learn and revise the learning model to improve the model.).
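For illustration only, the claim 16 limitation (evaluating a maximum long-term reward over the available subsequent topic transitions to calculate the discounted reward) can be sketched as below. This is a minimal sketch under stated assumptions, not the method of Stacy or of any other cited reference: the nested-dictionary table q, the function names, and the discount value are hypothetical.

```python
def discounted_reward(q, destination, gamma=0.9):
    """Discount the best long-term reward reachable from `destination`.

    `q` is a hypothetical table: q[topic][next_topic] -> long-term reward.
    Returns 0.0 when no subsequent transition is available.
    """
    next_values = q.get(destination, {})
    if not next_values:
        return 0.0
    return gamma * max(next_values.values())

def expected_long_term_reward(short_term, q, source, destination, gamma=0.9):
    """Short-term reward plus the discounted maximum over next transitions."""
    return short_term[(source, destination)] + discounted_reward(q, destination, gamma)

# Hypothetical values: from "travel" the best follow-up transition is worth 2.0.
q = {"travel": {"food": 2.0, "sports": 1.0}}
short_term = {("weather", "travel"): 0.5}
value = expected_long_term_reward(short_term, q, "weather", "travel", gamma=0.5)
```

With a discount factor of 0.5, the discounted reward is 0.5 × 2.0 = 1.0, giving an expected long-term reward of 1.5 for the transition from "weather" to "travel".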
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of SCHATZMANN et al. ("A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies"), Bane et al. (US-20150139074-A1), and Uchibe et al. (US-20170147949-A1; hereinafter Uchibe).
Regarding Claim 17,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15.
Misu, SCHATZMANN, and Bane do not explicitly disclose
solve, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward.
However, Uchibe teaches
wherein the processor is further configured to: 
solve, by a dynamic programming or Monte Carlo method (para [0217] In contrast, some previous methods use a stochastic gradient method or a Markov chain Monte Carlo method, which usually take time to optimize as compared with least-squares methods.), an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward (para [0071] [γ] is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation: $V(x) = \min_u \left[ c(x,u) + \gamma \, \mathbb{E}_{y \sim P_T(\cdot \mid x,u)}[V(y)] \right]$ (2)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu, SCHATZMANN, and Bane with the reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0044] An object of the present invention is to provide a new and improved inverse reinforcement learning system and method so as to obviate one or more of the problems of the existing art.)
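For illustration only, solving a cost-minimizing Bellman equation of the form quoted above by dynamic programming can be sketched with value iteration over a small discrete state space. This is a minimal sketch under stated assumptions and is not Uchibe's implementation: the dictionary-based cost and transition tables, the function name value_iteration, and the toy problem are hypothetical.

```python
def value_iteration(states, actions, cost, transition, gamma=0.9,
                    tol=1e-8, max_iter=10_000):
    """Iterate V(x) <- min_u [ c(x,u) + gamma * sum_y P(y|x,u) V(y) ].

    cost[(x, u)] is the immediate cost; transition[(x, u)] maps each
    next state y to its probability (hypothetical table layout).
    """
    v = {x: 0.0 for x in states}
    for _ in range(max_iter):
        new_v = {
            x: min(
                cost[(x, u)]
                + gamma * sum(p * v[y] for y, p in transition[(x, u)].items())
                for u in actions
            )
            for x in states
        }
        # Stop once the largest change across states is below the tolerance.
        delta = max(abs(new_v[x] - v[x]) for x in states)
        v = new_v
        if delta < tol:
            break
    return v

# Hypothetical two-state problem: staying in "b" is free, everything else costs.
states = ["a", "b"]
actions = ["stay", "move"]
cost = {("a", "stay"): 1.0, ("a", "move"): 2.0,
        ("b", "stay"): 0.0, ("b", "move"): 2.0}
transition = {("a", "stay"): {"a": 1.0}, ("a", "move"): {"b": 1.0},
              ("b", "stay"): {"b": 1.0}, ("b", "move"): {"a": 1.0}}
v = value_iteration(states, actions, cost, transition, gamma=0.5)
```

In this toy problem the value of "b" converges to 0 and the value of "a" converges to 2, since both staying (1 + 0.5 × 2) and moving (2 + 0.5 × 0) cost 2 at the fixed point.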

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of SCHATZMANN et al. ("A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies"), Bane et al. (US-20150139074-A1), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe) and Rasmussen et al. (US-20170061283-A1).
Regarding Claim 18,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15. 
Misu, SCHATZMANN, and Bane do not explicitly disclose
wherein the policy base includes occurrence probability of the topic transition in the corpus, the processor being further configured to: 
convert from the expected long-term reward to probability by using a softmax function; and 
merge the occurrence probability of the policy and the probability converted from the expected long-term reward.
However, Uchibe teaches
wherein the policy base includes occurrence probability of the topic transition in the corpus, the processor being further configured to: 
merge the occurrence probability of the policy and the probability converted from the expected long-term reward (para [0074] $c(x,u) = q(x) + \mathrm{KL}(\pi(\cdot \mid x) \,\|\, p(\cdot \mid x))$ (3). Here $p(\cdot \mid x)$ denotes the occurrence probability. Para [0075] In this case, the Bellman equation (2) is simplified to the following equation: $\exp(-V(x)) = \exp(-q(x)) \int p(y \mid x) \exp(-\gamma V(y))\, dy$ (4). The long-term reward is thus merged with the occurrence probability).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0044] An object of the present invention is to provide a new and improved inverse reinforcement learning system and method so as to obviate one or more of the problems of the existing art.)
Further, Rasmussen et al. teaches
convert from the expected long-term reward to probability by using a softmax function (para [0006] Those allow for an approximation of the value of action a, which is compared to the predicted value Q(s, a) in order to compute the update to the prediction. The agent can then determine a policy by selecting the highest valued action in each state (with occasional random exploration, e.g., using e-greedy or softmax methods).);
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of computing state transition of Uchibe et al. with the method of computing state transition of Rasmussen et al.
Doing so would allow for robustness against noise (para [0017] In some cases, the error module computes an error that may include an integrative discount. It is typical in RL implementations to use exponential discounting. However, in systems with uncertain noise, or that are better able to represent more linear functions, integrative discount can be more effective.).
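For illustration only, the two claim 18 operations (converting expected long-term rewards to probabilities with a softmax function, then merging them with corpus occurrence probabilities) can be sketched as below. This is a minimal sketch and not the method of Uchibe or Rasmussen. In particular, the linear interpolation used for merging, the weight parameter, and all names and values are assumptions, since the claim language quoted above does not specify how the merge is performed.

```python
import math

def softmax(values, temperature=1.0):
    """Convert a dict of long-term rewards into a probability distribution."""
    peak = max(values.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp((v - peak) / temperature) for k, v in values.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def merge(occurrence, reward_prob, weight=0.5):
    """Merge two distributions by linear interpolation (an assumed scheme)."""
    keys = occurrence.keys() | reward_prob.keys()
    return {
        k: weight * occurrence.get(k, 0.0) + (1.0 - weight) * reward_prob.get(k, 0.0)
        for k in keys
    }

# Hypothetical values for transitions out of a single source topic.
long_term = {"travel": 1.0, "sports": 1.0}
occurrence = {"travel": 0.8, "sports": 0.2}
reward_prob = softmax(long_term)
merged = merge(occurrence, reward_prob)
```

Because the two long-term rewards are equal, the softmax assigns each destination a probability of 0.5, and the merged distribution sits halfway between that and the corpus occurrence probabilities.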
Claims 21 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides.") in view of SCHATZMANN et al. ("A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies"), Bane et al. (US-20150139074-A1), and Arel et al. (US-9536191-B1; hereinafter Arel).
Regarding Claim 21,
Misu, SCHATZMANN, and Bane teach the computer system of claim 15. 
Misu, SCHATZMANN, and Bane do not explicitly disclose
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.
However, Arel (US 9536191 B1) teaches
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment (col. 3 lines 1-6; By using a confidence function representation to adjust temporal difference learning updates and to select actions to be performed by an agent interacting with an environment, a reinforcement learning system can decrease the amount of agent interaction required to determine a proficient action selection policy.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu, SCHATZMANN, and Bane with the reinforcement learning of Arel.
Doing so would allow for selecting high confidence actions (col. 3 lines 11-17; Moreover, using the confidence function representation in selecting actions can increase the state space visited by the agent during learning in a principled manner and avoid unnecessarily prolonging the learning process by forcing the reinforcement learning system to favor selecting higher-confidence actions.).
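For illustration only, adapting a general policy for a particular user by temporal difference learning can be sketched with a plain tabular Q-learning update. This is a minimal sketch under stated assumptions; it does not implement the confidence function representation described by Arel, and the table layout, function name td_update, and numeric values are hypothetical.

```python
def td_update(q, source, destination, reward, alpha=0.1, gamma=0.9):
    """Apply one temporal difference (Q-learning style) update in place.

    q[topic][next_topic] holds the current long-term reward estimate;
    `reward` is the feedback observed from the user environment.
    """
    next_values = q.get(destination, {})
    best_next = max(next_values.values()) if next_values else 0.0
    current = q.setdefault(source, {}).get(destination, 0.0)
    td_error = reward + gamma * best_next - current
    q[source][destination] = current + alpha * td_error
    return q[source][destination]

# Start the user-specific table as a copy of a hypothetical general policy table.
general_q = {"weather": {"travel": 1.0}, "travel": {"food": 2.0}}
user_q = {topic: dict(nexts) for topic, nexts in general_q.items()}

# The user reacts positively (reward 1.0) to the weather -> travel transition.
updated = td_update(user_q, "weather", "travel", reward=1.0, alpha=0.5, gamma=0.5)
```

Here the temporal difference error is 1.0 + 0.5 × 2.0 − 1.0 = 1.0, so the user-specific estimate moves from 1.0 to 1.5 while the general table is left unchanged.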
Regarding Claim 23,
Misu, SCHATZMANN, and Bane teach the computer program product of claim 20. 
Misu, SCHATZMANN, and Bane do not explicitly disclose
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment.
However, Arel (US 9536191 B1) teaches
wherein the generating the personalized policy comprises adapting a general policy for a particular user based on temporal difference learning with a user environment (col. 3 lines 1-6; By using a confidence function representation to adjust temporal difference learning updates and to select actions to be performed by an agent interacting with an environment, a reinforcement learning system can decrease the amount of agent interaction required to determine a proficient action selection policy.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the reinforcement learning of Misu with the reinforcement learning of Arel.
Doing so would allow for selecting high confidence actions (col. 3 lines 11-17; Moreover, using the confidence function representation in selecting actions can increase the state space visited by the agent during learning in a principled manner and avoid unnecessarily prolonging the learning process by forcing the reinforcement learning system to favor selecting higher-confidence actions.).

Allowable Subject Matter
Claim 26 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Kluckner – (US 20180032841 A1) – discloses reinforcement learning with positive rewards.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/H.N./
Examiner, Art Unit 2121

/Jue Louie/
Primary Examiner, Art Unit 2121