Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Acknowledgement is made of Applicant’s claim amendments on 04/07/2021. The claim amendments are entered. Presently, claims 1-16 remain pending. Claims 1 and 12 have been amended and claims 15-16 are newly added.
Regarding the 35 U.S.C. 101 rejection of claims 1-16, Applicant has sufficiently amended the claims to overcome the rejection. Accordingly, the 35 U.S.C. 101 rejection is withdrawn.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1 and 12 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-4 and 6 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 15-19 of copending Application No. 15450709 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the only difference appears to be that claims 15-19 of copending Application No. 15450709 are system claims corresponding to method claims 1-4 and 6 of the instant application. It would have been obvious that a computer system comprising a memory, computer-readable program code, and a processor is needed to carry out the steps of the method claims of the instant application.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Instant Application: 15800465
Corresponding Application: 15450709
Claim 1 (instant application): A computer-implemented method for selection of an associative topic in a conversation, the method comprising:
learning a policy for the selection of an associative topic using a machine learning process, the machine learning process including online learning and offline learning, comprising:
in response to receiving a request by at least one user, analyzing data from a corpus stored on a memory associated with a hardware processor to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic and a short-term reward for the topic transition, the short-term reward being defined as a determined probability of associating a positive response in the conversation;
calculating, using the hardware processor, an expected long-term reward for the topic transition in the conversation using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition in the conversation;
generating a policy using the policy base and the expected long-term reward for the topic transition in the conversation, the policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user; and
implementing a dialogue between a remote computing device with at least one user based on the policy using an interface of an associated device to obtain a user-provided reward during the conversation, the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
Claim 15 (copending application): A computer system for selection of an associative topic using a processor-based dialog system configured for interacting with at least one user device, by executing program instructions, the computer system comprising:
a memory tangibly storing the program instructions;
a processor in communications with the memory for executing the program instructions, wherein the processor is configured to:
learn a policy for the selection of an associative topic using a machine learning process, the machine learning process including online learning and offline learning, comprising:
obtain a policy base indicating a topic transition from a source topic to a destination topic and a short-term reward for the topic transition by analyzing data from a corpus data stored in the memory associated with the processor to obtain a policy base indicating a topic transition in the conversation from a source topic to a destination topic and a short-term reward for the topic transition, wherein the short-term …
calculate an expected long-term reward for the topic transition using the short-term reward for the topic transition with taking into account a discounted reward for a subsequent topic transition;
generate a personalized policy using the policy base and the expected long-term reward for the topic transition, wherein the personalized policy indicates selection of the destination topic for the source topic as an associative topic for a current topic, the personalized policy optimizing occurrence probability for the topic transition based on the learning using the corpus data; and
… between a remote computing device with the at least one user device based on the personalized policy using an interface of the at least one user device, the interface being configured for user-specific personalized communication with the dialog system, the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
Claim 2 (instant application): The method of claim 1, wherein the calculating comprises: evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
Claim 16 (copending application): The computer system of claim 15, wherein the processor is further configured to: evaluate a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
Claim 3 (instant application): The method of claim 1, wherein the calculating comprises: solving, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition.
Claim 17 (copending application): The computer system of claim 15, wherein the processor is further configured to: solve, by a dynamic programming or Monte Carlo method, an equation representing the expected long-term reward for the topic transition in order to calculate the expected long-term reward.
Claim 4 (instant application): The method of claim 1, wherein the policy base includes occurrence probability of the topic transition in the corpus, the generating comprising: converting from the expected long-term reward to probability by using a softmax function; and merging the occurrence probability of the policy base and the probability converted from the expected long-term reward.
Claim 18 (copending application): The computer system of claim 15, wherein the policy base includes occurrence probability of the topic transition in the corpus, the processor being further configured to: convert from the expected long-term reward to probability by using a softmax function; and merge the occurrence probability of the policy and the probability converted from the expected long-term reward.
Claim 6 (instant application): The method of claim 1, wherein the method further comprises: selecting the destination topic as the associative topic for the current topic using the policy; observing a positive or … updating the expected long-term reward and the policy by using the user-provided reward.
Claim 19 (copending application): The computer system of claim 15, wherein the processor is further configured to: select the destination topic as the associative topic for the current topic using the policy; observe a positive or negative actual response from user environment to obtain a user-provided reward; and update the expected long-term reward and the policy by using the user-provided reward.
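For illustration only, the computations recited in the compared claims (a discounted maximum over subsequent topic transitions, a dynamic programming solution, and a softmax conversion merged with corpus occurrence probabilities) can be sketched in Python as follows. The topic names, reward values, discount factor, equal-weight merge rule, and all function names are hypothetical assumptions, and the recursion Q(t, t') = R(t, t') + γ · max Q(t', t'') is an assumed standard form; neither application's specification is reproduced here.

```python
import math


def solve_q(rewards, gamma=0.9, sweeps=200):
    """Dynamic programming (value iteration) over topic transitions.

    rewards[(t, t2)] is the short-term reward for moving from topic t to t2.
    Assumed recursion: Q(t, t2) = R(t, t2) + gamma * max over t3 of Q(t2, t3).
    """
    q = {pair: 0.0 for pair in rewards}
    for _ in range(sweeps):
        new_q = {}
        for (t, t2), r in rewards.items():
            # Long-term rewards of the transitions available after reaching t2.
            followups = [q[(a, b)] for (a, b) in rewards if a == t2]
            new_q[(t, t2)] = r + gamma * (max(followups) if followups else 0.0)
        q = new_q
    return q


def softmax_over_destinations(q, source):
    """Convert the expected long-term rewards leaving `source` to probabilities."""
    dests = [(t2, v) for (t, t2), v in q.items() if t == source]
    peak = max(v for _, v in dests)  # subtract the peak for numerical stability
    exps = {t2: math.exp(v - peak) for t2, v in dests}
    total = sum(exps.values())
    return {t2: e / total for t2, e in exps.items()}


def merge(occurrence, converted, weight=0.5):
    """Merge corpus occurrence probability with the converted probability.

    A weighted average is an assumption; the claims do not state the rule.
    """
    return {t2: weight * occurrence[t2] + (1 - weight) * converted[t2]
            for t2 in converted}


# Hypothetical example with four topics.
rewards = {
    ("weather", "travel"): 0.6,
    ("weather", "sports"): 0.2,
    ("travel", "food"): 0.8,
    ("sports", "food"): 0.1,
}
q = solve_q(rewards)
converted = softmax_over_destinations(q, "weather")
policy = merge({"travel": 0.5, "sports": 0.5}, converted)
best = max(policy, key=policy.get)  # associative topic selected for "weather"
```

In this sketch the transition from "weather" to "travel" is preferred because its own reward is higher and because it leads to a well-rewarded follow-up transition, which is the effect the discounted term is meant to capture.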




Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 15 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims 15 and 16 recite the limitation “wherein the expected long-term reward for each topic transition (Q (t, t')) using the immediate reward for each topic transition (R(t, t)) while taking into account discounted reward for a subsequent topic transition is determined as follows:
[Equation image: media_image1.png]
where γ denotes a discount factor (γ<1) for evaluating a discounted value of the expected long-term …
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 12, 15, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane) and Schneegass et al. (US-20100205974-A1; hereinafter Schneegass).
Regarding Claim 1,
Misu teaches a computer-implemented method for learning a policy for selection of an associative topic in a conversation (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction.), the method comprising: 
 	learning a policy for the selection of an associative topic using a machine learning process… (pg. 86, section 3; A dialogue policy is a function from contexts to (possibly probabilistic) decisions that the dialogue system will make in those contexts. Reinforcement Learning (RL) is a machine learning technique used to learn the policy of the system.)
in response to receiving a request by at least one user, analyzing data from a corpus stored on a memory associated with a hardware processor to obtain a policy base (pg. 85; Then we compare our learned policy with two baselines, one of which is the dialogue policy of the original system that was used for collecting our corpus and that is currently installed at the Museum of Science in Boston.) indicating a topic transition in the conversation from a source topic to a destination topic (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.) and a short-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function), the short-term reward being defined as … associating a positive response in the conversation (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).);
calculating, using the hardware processor, an expected long-term reward for the topic transition in the conversation using the short-term reward for the topic transition (pg. 86; The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Qπ : S × A → ℜ.) with taking into account a discounted reward for a subsequent topic transition in the conversation (pg. 86; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.);
generating a policy using the policy base (pg. 84; A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL. Our learned policy outperforms two baselines (including the original dialogue policy that was used for collecting the corpus) in a simulation setting.) and the expected long-term reward for the topic transition in the conversation (pg. 86; and γ a discount factor weighting long-term rewards.), the policy indicating selection of the destination topic for the source topic as an associative topic for a current topic of conversation including at least one user (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, … At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si , with initial belief state b0.); and
implementing a dialogue between a remote computing device with at least one user based on the policy using an interface of an associated device (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.) to obtain a user-provided reward during the conversation (pg. 90; As we can see, providing an OT2 as the first off-topic response is a poor action (-7.9); it is preferable to ask the user to rephrase her question (OT1) as a first attempt to recover from the error (+13.9). On the other hand, providing an OT2 prompt, after an off-topic prompt has occured in the previous system prompt, is a reasonable action (+4.2).).
Misu does not explicitly disclose
…the machine learning process including online learning and offline learning,
…the short-term reward being defined as determined probability…
the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
However, Bane (US 20150139074 A1) teaches 
…the machine learning process including online learning and offline learning (para [0057] The online learning system 508 communicates with an offline learning system 512. The offline learning system 512 is discussed below with reference to FIG. 8.),
…the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively (para [0071] The data collection module 802 further sends feedback and other data to the offline learning system 512. The offline learning system 512 generates models from the received feedback and data, which is then used to generate tiles. The models and tiles are made available to the online learning system 508. The online learning system 508 loads the tiles to a cache on the front-end module 502.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of implementing devices for reinforcement learning of Misu with the method of implementing devices for reinforcement learning of Bane (para [0032]).
Doing so would allow for implementing the machine learning devices on a cloud network system (para [0037] Further, the cloud service 104 may manipulate the connection quality data 214 to implement congestion control to shape network traffic to prevent overloading particular networks 108.)
Schneegass (US 20100205974 A1) further teaches
…the short-term reward being defined as determined probability… (para [0031] The transition from a state to the sequential state is characterized by what are known as rewards R(s,a,s'), which are functions of the instantaneous state, the action and the sequential state. The rewards are defined by a reward probability distribution P.sub.R with the expected value of the reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Schneegass.
Doing so would allow for an adaptable optimality criterion (para [0022] In a particularly preferred embodiment of the inventive method the optimality criterion comprises an adjustable parameter, the change in which causes the optimality criterion to be adapted. This provides a flexible means of tailoring the inventive method to the most appropriate optimality criterion for the predetermined data record.).
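The limitation "the short-term reward being defined as a determined probability of associating a positive response" can be pictured with the following minimal Python sketch, offered for illustration only. The corpus record format (source topic, destination topic, whether the response was positive), the sample records, and the function name are hypothetical; they are not taken from Misu, Bane, Schneegass, or the instant specification.

```python
from collections import defaultdict


def short_term_rewards(corpus):
    """Estimate, per topic transition, the fraction of positive responses."""
    counts = defaultdict(lambda: [0, 0])  # transition -> [positive, total]
    for source, destination, positive in corpus:
        entry = counts[(source, destination)]
        entry[0] += 1 if positive else 0
        entry[1] += 1
    return {pair: pos / total for pair, (pos, total) in counts.items()}


# Hypothetical corpus records: (source topic, destination topic, positive?)
corpus = [
    ("weather", "travel", True),
    ("weather", "travel", True),
    ("weather", "travel", False),
    ("weather", "sports", False),
]
rewards = short_term_rewards(corpus)
```

Under this reading, the short-term reward is a relative frequency between 0 and 1 rather than a hand-assigned score such as the +23.2 and -14.7 values quoted from Misu.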
Regarding Claim 12,
Misu teaches a computer-implemented method for learning a policy for selection of an associative topic in a conversation (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction.), the method comprising: 
 	learning a policy for the selection of an associative topic using a machine learning process… (pg. 86, section 3; A dialogue policy is a function from contexts to (possibly probabilistic) decisions that the dialogue system will make in those contexts. Reinforcement Learning (RL) is a machine learning technique used to learn the policy of the system.)
in response to receiving a request by at least one user (pg. 93; Table 7: List of features used in predicting when the user will cease a session (Cease Dialogue), what the user will say next (Say Next 1), and what the user will say next after removing repeated user queries (Say Next 2). Example query 1 is “who are you named after?”; example query 2 is “are you a computer?”; example query 3 is “what do you like to do for fun?”; example query 4 is “what is artificial intelligence?”.), preparing, using a hardware processor, an expected long-term reward for a topic transition from a source topic to a destination topic in the conversation (pg. 84; The goal of RL is to learn a dialogue policy, i.e. the optimal action that the system should take at each possible dialogue state. Typically rewards depend on the domain and can include factors such as task completion, dialogue length, and user satisfaction. And pg. 88; User’s reaction: The user has to decide on one of the following. Go to the next topic (Go-on); cease the dialogue if there are no more questions in the stock of queries (Out-of-stock); rephrase the previous query (Rephrase); abandon the dialogue (Give-up) regardless of the remaining questions in the stock; generate a query based on a system recommendation, OT2 prompt (Refill). We calculate the user type dependent probability for these actions from the corpus. And pg. 89; Topic for next user query (e.g. introduction, personal, etc.): The SU selects a new topic based on user type dependent topic transition bigram probabilities estimated from the corpus. And pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.), the expected long-term reward for the topic transition including… associating a positive expression in the conversation with the topic transition in a corpus stored on a memory associated with the hardware processor as a short-term reward (pg. 86; P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function And pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).), and a discounted reward for a subsequent topic transition (pg. 86; In this paper we follow a POMDP-based approach. A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.);
implementing a dialogue between a remote computing device with at least one user based on the topic transition using an interface of an associated device (pg. 84; We analyze a corpus of interactions of museum visitors with two virtual characters that serve as guides at the Museum of Science in Boston, in order to build a realistic model of user behavior when interacting with these characters. A simulated user is built based on this model and used for learning the dialogue policy of the virtual characters using RL.) to obtain a user-provided reward in the conversation (pg. 90; As we can see, providing an OT2 as the first off-topic response is a poor action (-7.9); it is preferable to ask the user to rephrase her question (OT1) as a first attempt to recover from the error (+13.9). On the other hand, providing an OT2 prompt, after an off-topic prompt has occured in the previous system prompt, is a reasonable action (+4.2).);
updating the expected long-term reward for the topic transition (pg. 86; A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, O is a set of observations that the system can receive about the world, Z is a set of observation probabilities Z : S × A → Z(S, A), and γ a discount factor weighting long-term rewards.) by using the user-provided reward obtained from the interface of the associated device (pg. 89; So for example, responding correctly to an in-domain user question is rewarded (+23.2) whereas providing an erroneous response to a junk question, i.e. treating junk questions as if they were in-domain questions, is penalized (-14.7).); and
generating a policy using the expected long-term reward for the topic transition (pg. 86; and γ a discount factor weighting long-term rewards.), the policy indicating selection of the destination topic for the source topic as an associative topic for a current topic (pg. 89; We use the following features to optimize our dialogue policy (see section 3). We use the 6 retrieval scores of the NPCEditor (the 2 best scores for each user type ASR result), the previous system action, the ASR confidence scores, the voting scores (calculated by adding the scores of the results that agree), the system’s belief on the user type and user change, and the system’s belief on the user’s previous topic. So we need to learn a POMDP-based policy using these 42 features. And pg. 86; A POMDP is defined as a tuple (S, A, P, R, O, Z, γ, b0) where S is the set of states (representing different contexts) which the system may be in (the system’s world), A is the set of actions of the system, P : S × A → P(S, A) is the set of transition probabilities between states after taking an action, R : S × A → ℜ is the reward function, … At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si , with initial belief state b0.).
Misu does not explicitly disclose
…the machine learning process including online learning and offline learning…
…including a determined probability… as a short-term reward…
the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively.
However, Bane (US 20150139074 A1) teaches 
…the machine learning process including online learning and offline learning (para [0057] The online learning system 508 communicates with an offline learning system 512. The offline learning system 512 is discussed below with reference to FIG. 8.),
…the online learning and offline learning being implemented on separate computing systems including the remote developer computing device and the associated device, respectively (para [0071] The data collection module 802 further sends feedback and other data to the offline learning system 512. The offline learning system 512 generates models from the received feedback and data, which is then used to generate tiles. The models and tiles are made available to the online learning system 508. The online learning system 508 loads the tiles to a cache on the front-end module 502.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of implementing devices for reinforcement learning of Misu with the method of implementing devices for reinforcement learning of Bane (para [0032]).
Doing so would allow for implementing the machine learning devices on a cloud network system (para [0037] Further, the cloud service 104 may manipulate the connection quality data 214 to implement congestion control to shape network traffic to prevent overloading particular networks 108.)
Schneegass (US 20100205974 A1) further teaches
…including a determined probability… as a short-term reward… (para [0031] The transition from a state to the sequential state is characterized by what are known as rewards R(s,a,s'), which are functions of the instantaneous state, the action and the sequential state. The rewards are defined by a reward probability distribution P.sub.R with the expected value of the reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Schneegass.
Doing so would allow for an adaptable optimality criterion (para [0022] In a particularly preferred embodiment of the inventive method the optimality criterion comprises an adjustable parameter, the change in which causes the optimality criterion to be adapted. This provides a flexible means of tailoring the inventive method to the most appropriate optimality criterion for the predetermined data record.).
Regarding Claim 15,
Misu, Bane, and Schneegass teach the method of claim 1. Misu further teaches wherein the expected long-term reward for each topic transition (Q (t, t')) using the immediate reward for each topic transition (R(t, t)) while taking into account discounted reward for a subsequent topic transition is determined as follows: 
[Equation image: media_image1.png]
where γ denotes a discount factor (pg. 86; …and γ a discount factor weighting long-term rewards. At any given time step i the world is in some unobserved state si ∈ S. Because si is not known exactly, we keep a distribution over states called a belief state b, thus b(si) is the probability of being in state si , with initial belief state b0. When the system performs an action αi ∈ A based on b, following a policy π : S → A, it receives a reward ri(si , αi) ∈ ℝ and transitions to state si+1 according to P(si+1|si , αi) ∈ P. The system then receives an observation oi+1 according to P(oi+1|si+1, αi). The quality of the policy π followed by the agent is measured by the expected future reward also called Q-function, Qπ : S × A → ℝ.) for the subsequent topic transition (pg. 91; One could argue that this favors the learned policy over the baselines. Because our SU is based on general corpus statistics (probability that the user is child or male or female, number of questions the user is planning to ask, probability of moving to the next topic or ceasing the dialogue, utterance timing statistics) rather than sequential information we believe that this is acceptable. We only use sequential information when we calculate the next topic that the user will choose.).
	Schneegass further teaches 
[Equation image: media_image1.png]
(para [0037] Generally the solutions for Q functions that are smoother for sequential states of the stochastic transitions are subject to prejudice. If $s_{i+1}$ and $r_i$ are unbiased estimates of subsequent states or rewards, the expression $(Q(s_i, a_i) - \gamma V(s_{i+1}) - r_i)^2$ is not an unbiased estimation of the true quadratic Bellman residual $(Q(s,a) - (TQ)(s,a))^2$, but of $(Q(s,a) - (TQ)(s,a))^2 + (T'Q)(s,a)^2$. T and T' are defined here as follows:
$(TQ)(s,a) = E_{s'}\left(R(s,a,s') + \gamma \max_{a'} Q(s',a')\right)$
$(T'Q)(s,a)^2 = \mathrm{Var}_{s'}\left(R(s,a,s') + \gamma \max_{a'} Q(s',a')\right)$
T is also referred to as the Bellman operator.).
Regarding Claim 16,
Claim 16 is substantially similar to claim 15 and is rejected on the same grounds.
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), and Stacy et al. (US-20130288222-A1).
Regarding Claim 2,
Misu, Bane, and Schneegass teach the method of claim 1.
Misu, Bane, and Schneegass do not explicitly disclose
wherein the calculating comprises: evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward.
However, Stacy et al. teaches
wherein the calculating comprises: evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward (para [0077] Given an MDP (ignoring the partial observability aspect for the moment), the object is to construct a stationary policy $\pi: S \to A$, where $\pi(s)$ denotes the action to be executed in state s, that maximizes the expected accumulated reward over a horizon T of interest:
$E\left(\sum_{t=0}^{T} r_t\right)$).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of finding the most optimal state transition of Misu et al. with the method of finding the most optimal state transition of Stacy et al.
Doing so would allow for revision of the learning model (para [0084] Some embodiments of the learning model are able learn and revise the learning model to improve the model.).
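Illustrative sketch (not taken from Misu, Stacy, or the application): the claim 2 limitation, evaluating the maximum long-term reward over the available subsequent topic transitions to obtain the discounted reward, might be pictured as the tabular backup below. The recursion Q(t, t') = R(t, t') + γ · max over t'' of Q(t', t'') is an assumed form, and the topic names, reward values, and the function name `backup_long_term_reward` are hypothetical.

```python
# Hypothetical sketch: discounted reward taken from the best available
# subsequent topic transition. Names and numbers are illustrative assumptions.

def backup_long_term_reward(R, gamma=0.9, sweeps=100):
    """Iterate Q(t, t') = R(t, t') + gamma * max_{t''} Q(t', t'')."""
    Q = {pair: 0.0 for pair in R}
    for _ in range(sweeps):
        new_Q = {}
        for (src, dst), reward in R.items():
            # Long-term rewards of the transitions that leave the destination topic.
            followups = [q for (s, _), q in Q.items() if s == dst]
            best_next = max(followups) if followups else 0.0
            new_Q[(src, dst)] = reward + gamma * best_next
        Q = new_Q
    return Q

# Immediate rewards for a toy topic graph (assumed values).
R = {
    ("music", "album"): 0.2,
    ("music", "concert"): 0.5,
    ("album", "artist"): 1.0,
}
Q = backup_long_term_reward(R, gamma=0.5)
```

In this toy graph the transition to "album" has the lower immediate reward but the higher long-term reward, because the best follow-up transition from "album" is discounted and added in.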
Claims 3, 6, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), and Uchibe et al. (US-20170147949-A1; hereinafter Uchibe).
Regarding Claim 3,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe et al. further teaches wherein the calculating comprises:
solving, by a dynamic programming or Monte Carlo method (para [0217] In contrast, some previous methods use a stochastic gradient method or a Markov chain Monte Carlo method, which usually take time to optimize as compared with least-squares methods.), an equation representing the expected long-term reward for the topic transition (Para [0071] is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation:
$V(x) = \min_u \left[ c(x,u) + \gamma \, E_{y \sim P_T(\cdot \mid x,u)} [V(y)] \right]$ (2)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.)
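Illustrative sketch (not Uchibe's implementation): one dynamic programming approach to a Bellman equation of the form V(x) = min over u of [c(x,u) + γ · E[V(y)]] is value iteration on a small tabular model. The states, actions, costs, and transition table below are hypothetical, and the function name `value_iteration` is an assumption made for this example.

```python
# Hypothetical sketch: value iteration for
#   V(x) = min_u [ c(x, u) + gamma * sum_y P(y | x, u) * V(y) ].

def value_iteration(states, actions, cost, P, gamma=0.9, tol=1e-10):
    V = {x: 0.0 for x in states}
    while True:
        new_V = {
            x: min(
                cost[(x, u)]
                + gamma * sum(p * V[y] for y, p in P[(x, u)].items())
                for u in actions
            )
            for x in states
        }
        # Stop once the largest change between sweeps is below the tolerance.
        if max(abs(new_V[x] - V[x]) for x in states) < tol:
            return new_V
        V = new_V

states = ["a", "b"]
actions = ["stay", "move"]
cost = {("a", "stay"): 1.0, ("a", "move"): 1.5,
        ("b", "stay"): 0.0, ("b", "move"): 2.0}
# Deterministic transitions written as probability tables (assumed values).
P = {("a", "stay"): {"a": 1.0}, ("a", "move"): {"b": 1.0},
     ("b", "stay"): {"b": 1.0}, ("b", "move"): {"a": 1.0}}
V = value_iteration(states, actions, cost, P, gamma=0.5)
```

With these assumed numbers the cheapest long-run behavior from state "a" is to pay 1.5 once to move to the cost-free state "b".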
Regarding Claim 6,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe et al. further teaches wherein the method further comprises:
selecting the destination topic (Fig. 10; para [0228] Here, the state variables that define the behaviors of the user include topics of articles selected by the user while browsing each webpage. The article with the new topic that the user selects is the destination topic.) as the associative topic for the current topic using the policy (para [0226] The topic that the visitor is reading is regarded as the state and clicking the link is considered as the action. The topic the user is reading is the current topic.); 
(para [0226] The topic that the visitor is reading is regarded as the state and clicking the link is considered as the action. Then, inverse reinforcement learning according to an embodiment of the present invention can analyze the decision-making in the user's net surfing. Since the estimated cost function represents the preference of the visitor, it becomes possible to recommend a list of articles for the user. The user shows a preference for a topic by clicking a link which will lead them to a new article/topic. This preference can be represented by the cost (reward) function.); and 
updating the expected long-term reward and the policy by using the user-provided reward (para [0071] $V(x) = \min_u \left[ c(x,u) + \gamma \, E_{y \sim P_T(\cdot \mid x,u)} [V(y)] \right]$ (2). The short-term reward is used to update the long-term reward. The long-term reward is used to update the policy.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Regarding Claim 14,
Misu, Bane, and Schneegass teach the method of claim 12. Uchibe further teaches wherein the method further comprises:
selecting an associative topic for a current topic (Fig. 10; para [0228] Here, the state variables that define the behaviors of the user include topics of articles selected by the user while browsing each webpage.) as the associative topic for the current topic using the policy (para [0226] The topic that the visitor is reading is regarded as the state and clicking the link is considered as the action.); and 
observing a positive or negative actual response from the user environment to obtain the user-provided reward (para [0226] The topic that the visitor is reading is regarded as the state and clicking the link is considered as the action. Then, inverse reinforcement learning according to an embodiment of the present invention can analyze the decision-making in the user's net surfing. Since the estimated cost function represents the preference of the visitor, it becomes possible to recommend a list of articles for the user.  The cost is also called a reward as shown in para [0216] and summary. The cost is represents a preference of the user. The preference for a topic is a positive response.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), and Rasmussen et al. (US-20170061283-A1).
Regarding Claim 4,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe et al. further teaches wherein the policy base includes occurrence probability of the topic transition in the corpus, the generating comprising:
merging the occurrence probability of the policy base and the probability converted from the expected long-term reward (para [0074] $c(x,u) = q(x) + \mathrm{KL}(\pi(\cdot \mid x) \parallel p(\cdot \mid x))$ (3). The $p(\cdot \mid x)$ denotes the occurrence probability. para [0075] In this case, the Bellman equation (2) is simplified to the following equation:
$\exp(-V(x)) = \exp(-q(x)) \int p(y \mid x) \exp(-\gamma V(y))\,dy$ (4). The long-term reward is merged with the occurrence probability).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Uchibe and Misu do not explicitly disclose 
converting from the expected long-term reward to probability by using a softmax function; 
However, Rasmussen et al. teaches 
converting from the expected long-term reward to probability by using a softmax function (para [0006] Those allow for an approximation of the value of action a, which is compared to the predicted value Q(s, a) in order to compute the update to the prediction. The agent can then determine a policy by selecting the highest valued action in each state (with occasional random exploration, e.g., using e-greedy or softmax methods).);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of computing state transition of Uchibe et al. with the method of computing state transition of Rasmussen et al.
Doing so would allow for robustness against noise (para [0017] In some cases, the error module computes an error that may include an integrative discount. It is typical in RL implementations to use exponential discounting. However, in systems with uncertain noise, or that are better able to represent more linear functions, integrative discount can be more effective.).
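Illustrative sketch (not taken from Rasmussen or the application): converting expected long-term rewards into probabilities with a softmax function might look like the following. The `temperature` parameter, the topic names, and the reward values are assumptions introduced only for this example.

```python
import math

# Hypothetical sketch: softmax over expected long-term rewards.

def softmax(values, temperature=1.0):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed long-term rewards for two candidate destination topics.
long_term_rewards = {"album": 0.7, "concert": 0.5}
probs = dict(zip(long_term_rewards, softmax(list(long_term_rewards.values()))))
```

The resulting values are positive, sum to one, and preserve the ordering of the rewards, which is what allows them to be treated as a probability distribution over candidate transitions.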
Claims 5 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), and Mahmood et al. (“Emphatic Temporal-Difference Learning”).
Regarding Claim 5,
Misu, Bane, and Schneegass teach the method of claim 1, 
Uchibe et al. further discloses 
by using the policy and the expected long-term reward (para [0072] $V(x) = \min_u \left[ c(x,u) + \gamma \, E_{y \sim P_T(\cdot \mid x,u)} [V(y)] \right]$ (2)) for the topic transition as initial states (Para [0059] where q(x) and V(x) denote the cost and value function at state x and $\gamma$ represents a discount factor. p(y|x) and $\pi(y|x)$ denote the state transition probabilities before and after learning, respectively.)…based on temporal difference learning with user environment (para [0112] One of the features of the present invention is to show Eq. (11), which means the temporal difference error is zero for the optimal value function with the corresponding cost function.);
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Misu, Bane, Schneegass, and Uchibe do not explicitly disclose 
by using the policy and the expected long-term reward for the topic transition as initial states, personalizing the policy for a specific user based on temporal difference learning with user environment.
	However, Mahmood et al. (“Emphatic Temporal-Difference Learning”) teaches
wherein the method further comprises: 
personalizing the policy for a specific user (pg. 1 The idea is to emphasize and deemphasize state updates with user-specific interest in conjunction with how much other states bootstrap from that state.) based on temporal difference learning with user environment (Pg. 2; Let us start with the problem of selective updating in the simplest function approximation case: linear TD(λ) with λ = 0. Consider a Markov decision process (MDP) with a finite set S of N states and a finite set A of actions, for the discounted total reward criterion with discount rate γ ∈ [0, 1). In this setting, an agent interacts with the environment by taking an action At ∈ A at state St ∈ S according to a policy π : A × S → [0, 1] where π(a|s) ≐ P{At = a | St = s}, transitions to state St+1 ∈ S, and receives reward Rt+1 ∈ ℝ in a sequence of time steps t ≥ 0. Let Pπ ∈ ℝ^(N×N) denote the state transition probability matrix and rπ ∈ ℝ^N the expected immediate rewards from each state under π).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine reinforcement learning of Uchibe et al. with the Temporal-Difference learning of Mahmood et al.
Doing so would allow for updating state transition probabilities at various time steps (pg. 1; Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.).
Regarding Claim 7,

	Uchibe et al. teaches 
	estimating a temporal difference error (para [0112] One of the features of the present invention is to show Eq. (11), which means the temporal difference error is zero for the optimal value function with the corresponding cost function.);
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Misu does not explicitly disclose
estimating a temporal difference error defined by the user-provided reward, a current version of the expected long-term reward and a discounted long-term reward received from selection of a subsequent topic; and 
adjusting the expected long-term reward by the temporal difference error with a learning rate.
However, Mahmood et al. teaches
estimating a temporal difference error defined by the user-provided reward, a current version of the expected long-term reward and a discounted long-term reward received from selection of a subsequent topic (Pg 2; Let us start with the problem of selective updating in the simplest function approximation case: linear TD(λ) with λ = 0. Consider a Markov decision process (MDP) with a finite set S of N states and a finite set A of actions, for the discounted total reward criterion with discount rate γ ∈ [0, 1). In this setting, an agent interacts with the environment by taking an action At ∈ A at state St ∈ S according to a policy π : A × S → [0, 1] where π(a|s) ≐ P{At = a | St = s}, transitions to state St+1 ∈ S, and receives reward Rt+1 ∈ ℝ in a sequence of time steps t ≥ 0. Let Pπ ∈ ℝ^(N×N) denote the state transition probability matrix and rπ ∈ ℝ^N the expected immediate rewards from each state under π); and 
adjusting the expected long-term reward by the temporal difference error with a learning rate (pg 2. Let us start with the problem of selective updating in the simplest function approximation case: linear TD(λ) with λ = 0. Consider a Markov decision process (MDP) with a finite set S of N states and a finite set A of actions, for the discounted total reward criterion with discount rate γ ∈ [0, 1).).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine reinforcement learning of Uchibe et al. with the Temporal-Difference learning of Mahmood et al.
Doing so would allow for updating state transition probabilities at various time steps (pg. 1; Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.).
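Illustrative sketch (not Mahmood's emphatic TD algorithm and not the applicant's method): the two claim 7 steps, estimating a temporal difference error from the user-provided reward, the current long-term reward, and the discounted long-term reward of a subsequent topic, and then adjusting the long-term reward by that error with a learning rate, can be pictured with a plain tabular update. Using the maximum over subsequent transitions, the names `td_update`, `alpha`, and `gamma`, and all numeric values are assumptions for this example.

```python
# Hypothetical sketch: tabular temporal difference update for a topic transition.

def td_update(Q, src, dst, user_reward, alpha=0.1, gamma=0.9):
    # Discounted long-term reward of the best subsequent topic transition.
    followups = [q for (s, _), q in Q.items() if s == dst]
    discounted = gamma * (max(followups) if followups else 0.0)
    # TD error: observed reward plus discounted future value minus the current estimate.
    td_error = user_reward + discounted - Q[(src, dst)]
    # Move the current estimate toward the target by the learning rate.
    Q[(src, dst)] += alpha * td_error
    return td_error

# Assumed current long-term rewards.
Q = {("music", "album"): 0.5, ("album", "artist"): 1.0}
err = td_update(Q, "music", "album", user_reward=1.0, alpha=0.1, gamma=0.9)
```

Here a positive user response raises the stored long-term reward for the "music" to "album" transition by one tenth of the estimated error.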
Claims 8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), and Amitay et al. (US-20040236725-A1).
Regarding Claim 8,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe et al. further teaches wherein the obtaining comprises: 
	the probability…being used as the short-term reward for the topic transition to the destination topic (para [0066] Consequently, an immediate cost $c(x_t, u_t)$ is given from the environment and the environment makes a state transition according to a state transition probability $P_T(y \mid x_t, u_t)$ from $x_t$ to $y \in X$).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Misu, Bane, Schneegass, and Uchibe et al. do not explicitly disclose
counting an appearance of one or more positive expressions having dependency to the destination topic in the corpus
estimating probability of appearance of any one of the one or more positive expressions using a count of the appearance
However, Amitay et al. teaches 
counting an appearance of one or more positive expressions having dependency to the destination topic in the corpus (para [0111-0112] $W_t$ is the weight of each term t, as defined above, with a positive weight for on-topic terms and a negative weight for off-topic terms… wherein $N_t$ denotes the number of occurrences of the term t found by the disambiguator in the context in question.); and 
estimating probability of appearance of any one of the one or more positive expressions using a count of the appearance (para [0086] When one of the on-topic terms occurs in a context of spot 42, such as a term 50 "Music," which appears in a paragraph 48 containing the spot, it increases the likelihood that this spot is on-topic. The word "album" appearing in paragraph 48 could also be tagged as an on-topic term.),

Doing so would allow for identifying terms relevant to topics (para [0009] The methods of the present invention are particularly useful in rapidly identifying the occurrences of a term that are relevant to a topic of interest in a large, noisy corpus of documents, such as the World Wide Web.).
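Illustrative sketch (not Amitay's disambiguator and not the applicant's method): counting appearances of positive expressions tied to a destination topic and estimating an appearance probability from that count might be approximated as below. Treating "having dependency to" as same-sentence co-occurrence is a simplifying assumption, and the positive-expression list, corpus, and function name `positive_appearance_probability` are hypothetical.

```python
# Hypothetical sketch: probability that a sentence mentioning a topic also
# contains a positive expression, estimated from raw counts.

POSITIVE = {"love", "great", "enjoy"}  # assumed positive-expression list

def positive_appearance_probability(corpus, topic, positive=POSITIVE):
    mentions = 0
    positive_mentions = 0
    for sentence in corpus:
        tokens = sentence.lower().split()
        if topic in tokens:
            mentions += 1
            # Same-sentence co-occurrence stands in for a dependency relation.
            if positive.intersection(tokens):
                positive_mentions += 1
    return positive_mentions / mentions if mentions else 0.0

corpus = [
    "I love this album",
    "The album was late",
    "We enjoy the concert",
    "Great album overall",
]
p = positive_appearance_probability(corpus, "album")
```

In this toy corpus three sentences mention "album" and two of them contain a positive expression, so the estimate is two thirds.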
Regarding Claim 10,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe et al. further teaches wherein the obtaining comprises:
estimating occurrence probability…the policy base including the occurrence probability of the topic transition (para [0074] $c(x,u) = q(x) + \mathrm{KL}(\pi(\cdot \mid x) \parallel p(\cdot \mid x))$ (3). The $p(\cdot \mid x)$ denotes the occurrence probability. para [0075] In this case, the Bellman equation (2) is simplified to the following equation:
$\exp(-V(x)) = \exp(-q(x)) \int p(y \mid x) \exp(-\gamma V(y))\,dy$ (4).).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
Uchibe et al. and Misu do not explicitly disclose
counting an appearance of the destination topic around the source topic in the corpus;
	However, Amitay et al. teaches
	counting an appearance of the destination topic around the source topic in the corpus (para [0111-0112] $W_t$ is the weight of each term t, as defined above, with a positive weight for on-topic terms and a negative weight for off-topic terms… wherein $N_t$ denotes the number of occurrences of the term t found by the disambiguator in the context in question.);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the occurrence probability of Uchibe et al. with the appearance probability of Amitay et al.
Doing so would allow for identifying terms relevant to topics (para [0009] The methods of the present invention are particularly useful in rapidly identifying the occurrences of a term that are relevant to a topic of interest in a large, noisy corpus of documents, such as the World Wide Web.).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), Amitay et al. (US-20040236725-A1), and Sanchez Charles et al. (US-20180032874-A1).
Regarding Claim 9,
Misu, Bane, and Schneegass teach the method of claim 1.
Uchibe et al. teaches wherein the obtaining comprises:
the probability…being used as the short-term reward for the topic transition from the source topic to the destination topic (para [0066] Consequently, an immediate cost $c(x_t, u_t)$ is given from the environment and the environment makes a state transition according to a state transition probability $P_T(y \mid x_t, u_t)$ from $x_t$ to $y \in X$).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
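Misu, Bane, Schneegass, and Uchibe et al. do not explicitly disclose
counting an appearance of one or more positive expressions having dependency to the destination topic in the corpus;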
estimating probability of appearance of any one of the one or more positive expressions using a count of the appearance; and 
weighting the probability of the appearance by distance between the source topic and the destination topic, 
However, Amitay et al. teaches
counting an appearance of one or more positive expressions having dependency to the destination topic in the corpus (para [0111-0112] $W_t$ is the weight of each term t, as defined above, with a positive weight for on-topic terms and a negative weight for off-topic terms… wherein $N_t$ denotes the number of occurrences of the term t found by the disambiguator in the context in question.); 
estimating probability of appearance of any one of the one or more positive expressions using a count of the appearance (para [0086] When one of the on-topic terms occurs in a context of spot 42, such as a term 50 "Music," which appears in a paragraph 48 containing the spot, it increases the likelihood that this spot is on-topic. The word "album" appearing in paragraph 48 could also be tagged as an on-topic term.);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the short-term reward of Uchibe et al. with the appearance probability of Amitay et al.
Doing so would allow for identifying terms relevant to topics (para [0009] The methods of the present invention are particularly useful in rapidly identifying the occurrences of a term that are relevant to a topic of interest in a large, noisy corpus of documents, such as the World Wide Web.).
Sanchez Charles et al. teaches 
weighting the probability of the appearance (para [0032] In accordance with various embodiments of the inventive subject matter, the vectorization module 325 may use vectorization algorithms, including, but not limited to, natural language vectorization algorithms, such as Doc2Vec, Latent Dirichlet Allocation (LDA), and/or Term Frequency-Inverse Document Frequency (TF-IDF) to generate the vectorized document. Doc2Vec is an extension of Word2vec, which is a group of related models that are used to produce word embeddings. These vectorization algorithms may encode the probability distribution of words in the document along with the transition probabilities between words.) by distance between the source topic and the destination topic (para [0044] Because topics may be represented in a vector space, a difference between topics can be easily defined as the usual Euclidean distance in a vector space. Examiner note: The TF-IDF is represented as vectors. The vectors are weighted (scored) based on the distance between topics.).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of calculating a topic state change probability of Uchibe et al. with the method of calculating a distance between topics of Sanchez Charles et al.
Doing so would allow for detection of topics within the vectorized document (para [0033] The topic detection module 330 may be configured to detect one or more topics within the vectorized document.).
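Illustrative sketch (not taken from Sanchez Charles or the application): weighting an appearance probability by the distance between a source topic and a destination topic could be pictured with topic vectors and a Euclidean distance. The weighting form p / (1 + d), the vectors, and the function name `distance_weighted_probability` are assumptions chosen only to make the idea concrete; the cited references do not specify this formula.

```python
import math

# Hypothetical sketch: down-weight an appearance probability as the Euclidean
# distance between topic vectors grows. The 1 / (1 + d) form is an assumption.

def distance_weighted_probability(p_appearance, source_vec, dest_vec):
    distance = math.dist(source_vec, dest_vec)  # Euclidean distance (Python 3.8+)
    return p_appearance / (1.0 + distance)

# Assumed two-dimensional topic vectors.
near = distance_weighted_probability(0.6, (0.0, 0.0), (0.0, 0.0))
far = distance_weighted_probability(0.6, (0.0, 0.0), (3.0, 4.0))
```

With identical vectors the probability is unchanged, and at a distance of 5 it is divided by 6.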
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), Amitay et al. (US-20040236725-A1), and Dave et al. (US-20150154193-A1).
Regarding Claim 11,
Misu, Bane, and Schneegass teach the method of claim 1. Uchibe further teaches wherein the obtaining comprises:
estimating occurrence probability of the topic transition…, the policy base including the occurrence probability of the topic transition (para [0074] $c(x,u) = q(x) + \mathrm{KL}(\pi(\cdot \mid x) \,\|\, p(\cdot \mid x))$, (3) The $p(\cdot \mid x)$ denotes the occurrence probability. para [0075] In this case, the Bellman equation (2) is simplified to the following equation: 
$\exp(-V(x)) = \exp(-q(x)) \int p(y \mid x) \exp(-\gamma V(y))\,dy$ (4).).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
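The role of the occurrence probability p(y|x) in equation (4) can be illustrated with a minimal sketch. It assumes a finite set of states, so the integral becomes a sum, and it solves the equation by repeated substitution after taking the negative logarithm of both sides. The three-state example, the cost values, and the function names are hypothetical and are not drawn from Uchibe.

```python
import math


def solve_linearized_bellman(q, p, gamma, iterations=500):
    """Iterate V(x) = q(x) - log(sum_y p[x][y] * exp(-gamma * V(y))).

    This is equation (4) after taking -log of both sides, with the integral
    replaced by a sum over a finite state set (an assumption of this sketch).
    """
    n = len(q)
    v = [0.0] * n
    for _ in range(iterations):
        v = [
            q[x] - math.log(sum(p[x][y] * math.exp(-gamma * v[y]) for y in range(n)))
            for x in range(n)
        ]
    return v


# Hypothetical example: state costs q and occurrence probabilities p,
# where p[x][y] is the probability of moving from state x to state y.
q = [1.0, 0.5, 0.0]
p = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.0, 0.1, 0.9],
]
gamma = 0.9
V = solve_linearized_bellman(q, p, gamma)


def residual(x):
    """Absolute difference between the two sides of equation (4) at state x."""
    lhs = math.exp(-V[x])
    rhs = math.exp(-q[x]) * sum(p[x][y] * math.exp(-gamma * V[y]) for y in range(len(q)))
    return abs(lhs - rhs)
```

For a discount factor below 1 the update shrinks the error at every pass, so the residual of equation (4) approaches zero at every state.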
Misu, Bane, Schneegass, and Uchibe et al. do not explicitly disclose
counting co-occurrence of the destination topic with the source topic in the corpus; 
weighting a count of the co-occurrence by closeness between a position of the destination topic and a position of the source topic in a sentence;
	However, Amitay et al. teaches
	counting co-occurrence of the destination topic with the source topic in the corpus (para [0111-0112] $W_t$ is the weight of each term t, as defined above, with a positive weight for on-topic terms and a negative weight for off-topic terms… wherein $N_t$ denotes the number of occurrences of the term t found by the disambiguator in the context in question.); 
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the occurrence probability of Uchibe et al. with the co-occurrence count of Amitay et al.
Doing so would allow for identifying terms relevant to topics (para [0009] The methods of the present invention are particularly useful in rapidly identifying the occurrences of a term that are relevant to a topic of interest in a large, noisy corpus of documents, such as the World Wide Web.).
Dave et al. (US 20150154193 A1) teaches 
weighting a count of the co-occurrence by closeness between a position of the destination topic and a position of the source topic in a sentence (See paragraph [0030]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the co-occurrence count of Amitay et al. with the co-occurrence weight of Dave et al.
Doing so would allow for multiple topics to be extracted from documents (para [0028] In most cases a file may include a single topic, however a plurality of topics may also exist in a single document. Topic extraction techniques may include, for example, comparing keywords against models built with a multi-component extension of latent Dirichlet allocation (MC-LDA), among other techniques for topic identification.).
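The two limitations addressed above, counting co-occurrence and weighting the count by closeness of position within a sentence, can be illustrated with a minimal sketch. It assumes sentences are already tokenized, each topic is a single token, and closeness is the reciprocal of the token distance between the nearest pair of occurrences. These choices, the function name, and the sample corpus are assumptions for illustration and do not reproduce the method of Amitay, Dave, or the claims.

```python
def weighted_cooccurrence(sentences, source, destination):
    """Sum closeness weights over sentences that contain both topics.

    Assumptions of this sketch: each sentence is a list of tokens, topics are
    single tokens, and closeness = 1 / (token distance of the nearest pair).
    """
    if source == destination:
        raise ValueError("source and destination topics must differ")
    total = 0.0
    for tokens in sentences:
        src_positions = [i for i, tok in enumerate(tokens) if tok == source]
        dst_positions = [i for i, tok in enumerate(tokens) if tok == destination]
        if not src_positions or not dst_positions:
            continue  # the topics do not co-occur in this sentence
        gap = min(abs(i - j) for i in src_positions for j in dst_positions)
        total += 1.0 / gap
    return total


# Hypothetical three-sentence corpus.
corpus = [
    ["the", "museum", "shows", "paintings"],  # two tokens apart, weight 0.5
    ["paintings", "museum"],                  # adjacent, weight 1.0
    ["the", "museum", "is", "open"],          # no co-occurrence
]
score = weighted_cooccurrence(corpus, "museum", "paintings")
```

An unweighted count would give 2 for this corpus; the closeness weighting gives 1.5 because the first sentence separates the two topics by an intervening token.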
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Misu et al. ("Reinforcement learning of question-answering dialogue policies for virtual museum guides."; hereinafter Misu) in view of Bane et al. (US-20150139074-A1; hereinafter Bane), Schneegass et al. (US-20100205974-A1; hereinafter Schneegass), Uchibe et al. (US-20170147949-A1; hereinafter Uchibe), Amitay et al. (US-20040236725-A1), and Stacy et al. (US-20130288222-A1).
Regarding Claim 13,

analyzing data from the corpus (para [0228] Here, the state variables that define the behaviors of the user include topics of articles selected by the user while browsing each webpage.) to obtain the short-term reward for the topic transition (para [0066] Consequently, an immediate cost $c(x_t, u_t)$ is given from the environment and the environment makes a state transition according to a state transition probability $P_T(y \mid x_t, u_t)$ from $x_t$ to $y \in X$); 
calculating the expected long-term reward for the topic transition using the short-term reward for the topic transition and the discounted reward for the subsequent topic transition (para [0069-0072] where $\gamma \in [0,1)$ is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation: 
$V(x) = \min_u \left[ c(x,u) + \gamma\, \mathbb{E}_{y \sim P_T(\cdot \mid x,u)}[V(y)] \right]$ (2)
Eq. (2) is a nonlinear equation due to the min operator. V(x) is the expected long-term reward, $\gamma$ is the discounted reward, and c(x,u) is the short-term reward.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the method of reinforcement learning of Misu with the method of reinforcement learning of Uchibe.
Doing so would allow for improved reinforcement learning (para [0006] Since the likelihood of the optimal trajectory is parameterized by the cost function, the parameters of the cost can be optimized by maximizing likelihood. However, their methods require the entire trajectory data. A model-based IRL method is proposed by Dvijotham and Todorov (2010) (NPL 6) based on the framework of LMDP, in which the likelihood of the optimal state transition is represented by the value function. As opposed to path-integral approaches of IRL, it can be optimized from any dataset of state transitions.).
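The relationship in Eq. (2) between the immediate cost, the discount factor, and the long-term value can be illustrated with a minimal value-iteration sketch. It assumes a small finite set of states and actions with known transition probabilities. The two-state example, the cost values, and the function name are hypothetical and are not taken from Uchibe or Misu.

```python
def value_iteration(costs, transitions, gamma, iterations=500):
    """Iterate V(x) = min_u [ c(x,u) + gamma * sum_y P_T(y|x,u) * V(y) ].

    costs[x][u] is the immediate cost c(x,u); transitions[x][u] maps each
    next state y to the probability P_T(y|x,u).
    """
    v = {x: 0.0 for x in costs}
    for _ in range(iterations):
        v = {
            x: min(
                costs[x][u]
                + gamma * sum(prob * v[y] for y, prob in transitions[x][u].items())
                for u in costs[x]
            )
            for x in costs
        }
    return v


# Hypothetical two-state example with deterministic transitions.
costs = {
    "a": {"stay": 1.0, "move": 1.5},
    "b": {"stay": 0.0, "move": 1.5},
}
transitions = {
    "a": {"stay": {"a": 1.0}, "move": {"b": 1.0}},
    "b": {"stay": {"b": 1.0}, "move": {"a": 1.0}},
}
V = value_iteration(costs, transitions, gamma=0.5)
```

In this example staying in state "a" forever accumulates a discounted cost of 2, while moving once to the cost-free state "b" costs 1.5, so the min operator selects the move and V("a") settles at 1.5.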
Misu, Bane, Schneegass, and Uchibe et al. do not explicitly disclose 
	the short-term reward being obtained from a count of associating the positive expression in the corpus; and
evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward;
	However, Amitay et al. teaches
the short-term reward being obtained from a count of associating the positive expression in the corpus (para [0111-0112] W.sub.t is the weight of each term t, as defined above, with a positive weight for on-topic terms and a negative weight for off-topic terms… wherein N.sub.t denotes the number of occurrences of the term t found by the disambiguator in the context in question.);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the short-term reward of Uchibe et al. with the appearance probability of Amitay et al.
Doing so would allow for identifying terms relevant to topics (para [0009] The methods of the present invention are particularly useful in rapidly identifying the occurrences of a term that are relevant to a topic of interest in a large, noisy corpus of documents, such as the World Wide Web.).
Stacy et al. teaches 
evaluating a maximum long-term reward received from available subsequent topic transitions to calculate the discounted reward (para [0077] Given an MDP (ignoring the partial observability aspect for the moment), the object is to construct a stationary policy $\pi: S \to A$, where $\pi(s)$ denotes the action to be executed in state s, that maximizes the expected accumulated reward over a horizon T of interest: 
$E\left(\sum_{t=0}^{T} r_t\right)$);
It would have been obvious to a person having ordinary skill in the art before the effective filing date to combine the method of finding the optimal state transition of Uchibe et al. with the method of finding the optimal state transition of Stacy et al.
Doing so would allow for revision of the learning model (para [0084] Some embodiments of the learning model are able learn and revise the learning model to improve the model.).
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cade (US 20090299496 A1) “Controller” – This prior art discloses reinforcement learning with rewards drawn from a probability.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217.  The examiner can normally be reached on Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/HENRY NGUYEN/Examiner, Art Unit 2121


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121