DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Office Action mailed 5/25/2022, applicant has submitted an amendment filed 8/25/2022.
Claims 1-3, 5-8, and 10-11 have been amended. Claims 4, 9, and 12 have been cancelled.
Response to Arguments
Applicant’s amendments did not address some of the 35 U.S.C. 112 issues and also introduced new ones (see the 112 rejections below for full detail).

Additionally, claim 1 recites “dialog act of the apparatus” in its 5th to last line, whereas the 5th to last line of each of claims 5-6 recites “dialog act”. Applicant may, at Applicant’s discretion, amend “dialog act of the apparatus” to recite --dialog act-- for language consistency (this amendment is not required).
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-3, 5-8, and 10-11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

As per Claim 1 (and similarly claims 5-6):
“the logical expression representing the dialog act” in lines 7-8 of claim 1 lacks antecedent basis (line 3 of claim 1 recites that “the dialog act is performed by the user inputting a logical expression” but does not explicitly state that the logical expression represents the dialog act).
“the latest system” in line 9 of claim 1 lacks antecedent basis.
“the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lack antecedent basis.
“the updated state of the dialog” in line 12 of claim 1 lacks antecedent basis.
“the state of the dialog being performed with the user and the dialog apparatus” and “the dialog being performed with the user and the dialog apparatus” in lines 15-16 of claim 1 lack antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“to obtain” in the 5th to last line through the 4th to last line of claim 1 seems like it should be --obtain-- (“to” was deleted before “set a score” in the 8th to last line of claim 1 and before “update” in the 2nd to last line of claim 1).
“the state of the dialog” in the 4th to last line of claim 1 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“return” in the 3rd to last line of claim 1 (which was amended from “returns”) seems like it should be --returns-- (grammar, i.e. “a reward function that… returns”).

As per Claim 2 (and similarly claims 7 and 10):
“the knowledge” in line 2 of claim 2 is ambiguous (it can refer to either “knowledge” in line 9 of claim 1, or to “knowledge being held in advance” in lines 10-11 of claim 1).
“the state of the dialog” in line 3 of claim 2 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“the response candidate” in line 3 of claim 2 is ambiguous (there are multiple response candidates in the dialog act set in claim 1, in addition to the selected one of the response candidates, and it is not clear which one “the response candidate” in line 3 of claim 2 is supposed to refer to).
“the state of the dialog” in line 6 of claim 2 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“the response candidates included in the set of response candidates” and “the set of response candidates” lack antecedent basis (claim 1 was amended to recite where the response candidates constitute a dialog act set, and “set of response candidates” was deleted from the 7th to last line of claim 1).
“the state of the dialog” in the 3rd to last line of claim 2 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“the response candidate” in the 2nd to last line of claim 2 is ambiguous (there are multiple response candidates in the dialog act set in claim 1, in addition to the selected one of the response candidates, and it is not clear which one “the response candidate” in the 2nd to last line of claim 2 is supposed to refer to).
“the score” in the 2nd to last line of claim 2 is ambiguous (each candidate has a score in the 8th to last line of claim 1).
“the state of the dialog” in the last 2 lines of claim 2 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“the response candidates after encoding” at the end of claim 2 is unclear because it is not clear if this phrase refers to encoded response candidates or to where the original/unencoded response candidates are used to set a score “after encoding”.

As per Claim 3 (and similarly claims 8 and 11):
“the state of the dialog” in line 4 of claim 3 lacks antecedent basis (as a consequence of “the state of the dialog between the user and the dialog apparatus” and “the dialog between the user and the dialog apparatus” in lines 9-10 of claim 1 lacking antecedent basis).
“the vector” in line 4 of claim 3 lacks antecedent basis.
“the structure of a logical expression that is representing the state of the dialog” and “the state of the dialog” in lines 4-5 of claim 3 lack antecedent basis.
“the state of the dialog encoded in the vector” and “the vector” lack antecedent basis.
“and the obtained reward” in lines 4-5 of claim 3 can refer to where the obtained reward is also used to execute reinforcement learning processing (which is likely what Applicant meant to claim) or to something else that the use of the state of the dialog is “encoded in” (in addition to being “encoded in the vector”), and as claimed, it is not clear which interpretation Applicant meant to claim.

Allowable Subject Matter
Claims 1 and 5-6 would be allowable if rewritten or amended to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action.
Claims 2-3, 7-8, and 10-11 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

As per claim 1 (and similarly claims 5-6, and consequently claims 2-3, 7-8, and 10-11, which depend on claims 1, 5, and 6), the prior art of record does not teach or suggest the combination of all limitations in claim 1, including (i.e., in combination with the remaining limitations in claim 1): set a score to each of response candidates constituting the dialog act set based on the state of the dialog being performed with the user and the dialog apparatus, and a policy parameter, and refer to the set scores, to select one of the response candidates as a dialog act of the apparatus; and obtain a reward in the state of the dialog using a reward function that, as the reward, returns an evaluation of a behavior performed in a specific circumstance as a quantitatively represented numeric value, and update the policy parameter based on the obtained reward (i.e., where each of a plurality of response candidates is scored based on the dialog/conversation state and a policy parameter, and where the policy parameter is updated based on a numeric reward value which is obtained in the state of the dialog using a reward function).
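For illustration only, the combination of limitations summarized above (score every response candidate from the dialog state and a policy parameter, select one by referring to the scores, obtain a numeric reward, update the policy parameter) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the linear scoring over `features`, the softmax sampling in `select`, the toy `reward_function`, and the REINFORCE-style `update_policy` are all hypothetical choices that the claims do not recite.

```python
import math
import random


def features(state, candidate):
    # Hypothetical joint features: elementwise product of two equal-length vectors.
    return [s * c for s, c in zip(state, candidate)]


def set_scores(state, candidates, theta):
    # One score per response candidate, from the dialog state and the policy parameter.
    return [
        sum(t * f for t, f in zip(theta, features(state, c)))
        for c in candidates
    ]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select(scores, rng):
    # Refer to the set scores and sample the index of one response candidate.
    return rng.choices(range(len(scores)), weights=softmax(scores), k=1)[0]


def reward_function(state, chosen, preferred):
    # Hypothetical reward: a numeric evaluation of the chosen dialog act.
    # The state is accepted for completeness but unused in this toy example.
    return 1.0 if chosen == preferred else -1.0


def update_policy(theta, state, candidates, chosen, reward, lr=0.5):
    # REINFORCE-style step: theta += lr * reward * grad log pi(chosen | state).
    probs = softmax(set_scores(state, candidates, theta))
    feats = [features(state, c) for c in candidates]
    expected = [
        sum(p * f[i] for p, f in zip(probs, feats)) for i in range(len(theta))
    ]
    return [
        t + lr * reward * (feats[chosen][i] - expected[i])
        for i, t in enumerate(theta)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    state = [1.0, 0.5]
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    theta = [0.0, 0.0]
    for _ in range(200):
        chosen = select(set_scores(state, candidates, theta), rng)
        r = reward_function(state, chosen, preferred=2)
        theta = update_policy(theta, state, candidates, chosen, r)
    print(softmax(set_scores(state, candidates, theta)))
```

In this sketch a positive reward raises the selection probability of the chosen candidate and a negative reward lowers it, which is the qualitative behavior the parenthetical summary describes.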
JP 2012-038287 (X reference in Search Report) teaches “The score calculation unit 15 normally calculates the score before the dialog text output unit 17 outputs the dialog text (not necessarily immediately before). In addition, it is suitable for the score calculation part 15 to calculate a score whenever the user input information reception part 14 receives user input information. Here, the score calculation unit 15 is to calculate a score by, for example, an arithmetic expression “score = f (user state information, weight vector)”. For example, f is “score = user state information × weight vector”. That is, the score calculation unit 15 uses the evaluation information managed in association with the sentence pattern information and the user state information that dynamically changes in order to determine the sentence pattern information of the sentence to be output next by the dialogue apparatus 1. Are used to calculate a score for each information recommendation method” and “The reward calculation unit 273 selects a spot included in the user input information using the expected value of the degree of match calculated by the random selection match value calculation unit 271 and the match degree calculated by the selected spot match level calculation unit 272. To calculate the reward” and “The learning unit 28 uses the reward to update the weight vector corresponding to the method identifier of the dialog device 1 and the information recommendation method storage unit 12 of the dialog device 1. For example, when the reward is a positive number, the learning unit 28 updates the weight vector corresponding to the method identifier of the dialog device 1 so that the information recommendation method included in the dialog text information is more easily selected. This weight vector is a weight vector of the information recommendation method storage unit 12. 
Here, updating means that the learning unit 28 may directly rewrite the weight vector in the information recommendation method storage unit 12 or may instruct the dialog device 1 to update. When the interactive device 1 receives an update instruction, the interactive device 1 rewrites the weight vector. The method and degree by which the learning unit 28 updates the weight vector does not matter. Usually, the larger the reward is, the learning unit 28 updates the weight vector corresponding to the technique identifier of the dialogue apparatus 1 according to the magnitude of the reward so that the information recommendation technique included in the dialog sentence information is more easily selected. To do. For example, when the reward is a negative number, the learning unit 28 updates the weight vector corresponding to the technique identifier of the dialogue apparatus 1 so that the information recommendation technique included in the dialogue sentence information is more difficult to be selected. For example, the learning unit 28 updates the weight vector according to a later-described Natural Actor Critic (NAC) algorithm, which is one of natural policy gradient methods. NAC is described in “Otake Yatsuya, Masaru Sugiyama: How to make a strong robotic game player, Mainichi Communications (2008).” Since it is a well-known technique, detailed explanation is omitted. NAC is a method for optimizing policies and is one of natural policy gradient methods. In the policy gradient method, instead of directly estimating the value function for the state S or estimating the action value function Q (S, A), the reward of the dialogue episode obtained by the policy before the update is used. Update policy π directly by natural gradient method to increase” (see PE2E translation).  This reference appears to describe where a score is calculated based on a weight vector and user state information and where a reward is used to update the weight vector.  
This reference describes “The user state information storage unit 13 is information indicating a user's state, a preference vector that is information indicating a user's preference with respect to one or more determinants, and a knowledge vector indicating user's knowledge with respect to one or more determinants The user status information including The user state information may include an attribute vector that is information indicating one or more attribute values of the user. User attribute values include, for example, sex (male or female), age group (10's, 20's, 30's, baby boom junior, etc.), occupation, hometown, supporting political party, and the like” (see PE2E translation) which appears to suggest where user state information is information about a user and not information about a conversation/dialog state.  It is also not entirely clear that one score is set for each of a plurality of response candidates based on the dialog state and the weight vector.  The scores are calculated for each of “information recommendation methods” that possess “sentence pattern information” (see Google Translation) but it is not clear that the information recommendation methods or sentence pattern information are candidate responses.
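As a hedged illustration of the scoring described in JP 2012-038287, the sketch below computes “score = user state information × weight vector” for each information recommendation method and then adjusts the weight vector of the selected method according to the sign and magnitude of the reward. The reference states that the update method “does not matter” and gives the Natural Actor Critic algorithm as an example; the simple reward-scaled update used here is an assumed stand-in, not NAC, and the method identifiers and vector contents are hypothetical.

```python
def score(user_state, weight_vector):
    # score = user state information x weight vector (dot product).
    return sum(u * w for u, w in zip(user_state, weight_vector))


def scores_by_method(user_state, weights):
    # One score per information recommendation method identifier.
    return {method: score(user_state, w) for method, w in weights.items()}


def update_weight(weight_vector, user_state, reward, lr=0.1):
    # Assumed stand-in update (not NAC): move the weights along the user
    # state in proportion to the reward, so a positive reward makes the
    # method easier to select and a negative reward makes it harder.
    return [w + lr * reward * u for w, u in zip(weight_vector, user_state)]


if __name__ == "__main__":
    user_state = [0.2, 0.8, 0.5]  # hypothetical preference/knowledge entries
    weights = {"method_a": [0.1, 0.0, 0.3], "method_b": [0.4, 0.2, 0.0]}
    print(scores_by_method(user_state, weights))
    weights["method_a"] = update_weight(weights["method_a"], user_state, reward=1.0)
    print(scores_by_method(user_state, weights))
```

Note that the input to `score` here is information about the user, which is the distinction drawn above between this reference and a score based on the state of the dialog.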
M. -H. Su, K. -Y. Huang, T. -H. Yang, K. -J. Lai and C. -H. Wu, "Dialog State Tracking and action selection using deep learning mechanism for interview coaching," 2016 International Conference on Asian Language Processing (IALP), 2016, pp. 6-9, doi: 10.1109/IALP.2016.7875922. teaches receiving a dialog state and reward, updating a policy during learning, and outputting an action according to a learnt policy (Section IV B., particularly first paragraph).  This reference also describes where reward is a numerical value (page 8, left column).  This reference does not appear to specifically describe using a dialog state and the policy to set scores for each of a plurality of response candidates and selecting one of the response candidates based on the set scores.
2018/0232436 (62/459820 filed February 16, 2017 supports Specification) teaches “When the dialog mixer 130 is called, it accepts the base dialog states provided in the input. When the triggering event is new input, the dialog mixer 130 determines if the user is triggering a new dialog. A new dialog corresponds to a new dialog manager, e.g., a new schema or a new search in a dialog schema. If the user is triggering a new dialog, the dialog mixer 130 fetches the corresponding schema and initializes the dialog manager for the schema. The dialog mixer 130 then distributes the output of the natural language parser, also referred to as an analyzer, to all dialog managers. When the triggering event is a backend response, the dialog mixer 130 loads the dialog manager that corresponds with the backend response and applies the backend response to the dialog managers that request them, respectively. The dialog mixer 130 may solicit the dialog managers for backend requests and new state tokens. Each dialog manager solicited generates some kind of response, even if it is an error or failure response. In some implementations, the dialog manager 130 may also issue a backend request. The dialog mixer 130 rolls up each dialog manager's output, whether a system response or a backend request, into a response candidate. Each candidate has some combination of a system response(s) and/or a backend request(s), and a provisional dialog state. In some implementations, the dialog mixer 130 may perform second phase candidate generation. In second phase candidate generation the dialog mixer 130 may derive a composite candidate response from two or more individual schemas. 
The dialog mixer 130 provides the candidate response(s), a respective dialog state for each candidate response, and annotations for each candidate response back to the dialog host 120, where the responses are ranked, pruned, and potentially a response is triggered and provided to the input/output devices 110” (paragraph 35, see paragraph 30 of Provisional 62/459820).
2014/0272884 teaches “Thus, using the array of rankers, each ranker, or subsets of rankers, may be tuned or trained for representing a particular domain or area of knowledge. The combination of the array of rankers may thus be used to provide a question and answer ranking mechanism that is applicable to multiple domains or areas of knowledge. This leads to a question and answer system that provides high quality answer results in a multiple-domain or even open-domain environment. One key advantage in a multiple-domain or open-domain QA system is improved performance. Such improved performance is achieved by the QA system of the illustrative embodiments in that multiple rankers are utilized which have been iteratively trained based on a reward value basis where the reward value is based on the ranks of candidate answers rather than the confidence scores associated with the candidate answers. This is important in that when different rankers are used in a heterogeneous array of rankers, the confidence scores may be computed differently by each ranker and thus, the confidence scores are not comparable across the different rankers. Hence, it is more accurate to base the reward value, indicative of the correctness of the operation of the ranker, based on the ranks of the candidate answers, their correspondence with the golden answer set, and the computed quality of the ranker itself over multiple iterations of the training” (paragraph 26).
2020/0142888 teaches “In some implementations, the control model and/or the generative model can be trained at least in part based on reinforcement learning. In some of those implementations, the control model and the generative model are trained separately, but in combination with one another. In training the control model and/or the generative model based on reinforcement learning, generated variants may be submitted to a search system, and responses (and optionally lack of responses) from the search system can indicate rewards. For example, for a response, to a query variant, that is an “answer” response, a reward can be assigned that is proportional (or otherwise related) to a quality of the answer response (e.g., as indicated by a response score, provided by the search system, for the “answer” response). In some of those examples, where no response is provided in response to a query variant and/or when the response is deemed (e.g., based on output from the search system) to not be an “answer” response, no reward will be assigned. In other words, only the last “answer” response will be rewarded and intermediate actions updated based on such reward (e.g., with a Monte-Carlo Q learning approach). In this manner, Q function learning, or other reinforcement function learning, can occur based on rewards that are conditioned on responses provided by a search system that is interacted with during the reinforcement learning. In implementations of reinforcement learning described herein, the state at a given time step is indicated by one or more of the state features (e.g., such as those described above), and the action can be either a query variant (i.e., generate a further query variant) or provide an “answer” response. Each action of the action space can be paired with a string that defines the corresponding question or “answer” response” (paragraph 17).  This reference does not qualify as prior art.

	Upon further search (in response to the amendment filed 8/25/2022):
	2018/0232435 (foreign priority date precedes effective date of this application) teaches “As has been described above, actions in an SDS system may comprise two parts, a communication function a (e.g. inform, deny, confirm, etc.) and (optionally) a list of slot-value pairs s, v (e.g. food=Chinese, pricerange=expensive, etc.). Dialogue policy optimisation can be solved via Reinforcement Learning (RL), where the goal during learning is to estimate a quantity Q(b, a), for each b and a, reflecting the expected cumulative rewards of the system executing action a at belief state b. A reward function assigns a reward r given the current state and action taken. The reward function can assign a reward r given the current state and action at each turn in the dialogue. Alternatively, this may be done by determining a reward value R at the end of each dialogue (e.g. from human input) and then determining a reward r for each turn of the dialogue (i.e. corresponding to a state and action taken at that turn) from this final value R for the dialogue. Q estimates the expected value of future rewards (accumulated r), given state b and action a. The value of Q may be updated based on the r values using a Q learning update rule” (paragraph 260).
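The quoted passage of 2018/0232435 refers to “a Q learning update rule” without reciting it. For illustration, the sketch below uses the standard tabular Q-learning update, Q(b, a) ← Q(b, a) + α(r + γ max Q(b', a') − Q(b, a)), as an assumed reading of that phrase. The belief state b is treated as a hashable label for simplicity, and the names `q_update`, `alpha`, `gamma`, and the example action labels are hypothetical.

```python
from collections import defaultdict


def q_update(q, b, a, r, b_next, actions, alpha=0.5, gamma=0.9, terminal=False):
    # Standard Q-learning target: r + gamma * max over a' of Q(b_next, a').
    # At the end of a dialogue there is no next state, so the target is r.
    best_next = 0.0 if terminal else max(q[(b_next, a2)] for a2 in actions)
    q[(b, a)] += alpha * (r + gamma * best_next - q[(b, a)])
    return q[(b, a)]


if __name__ == "__main__":
    q = defaultdict(float)  # Q(b, a) estimates, initialised to zero
    actions = ["inform", "deny", "confirm"]
    q_update(q, "b0", "inform", r=1.0, b_next="b1", actions=actions)
    q_update(q, "b1", "confirm", r=0.0, b_next=None, actions=actions, terminal=True)
    print(dict(q))
```

The per-turn reward r passed to `q_update` may come either from a per-turn reward function or from a final dialogue-level value R, as the passage describes; how R is apportioned across turns is not stated in the quoted text and is not modeled here.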
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. The examiner can normally be reached M-F 12:00 PM - 8:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





EY 9/7/2022
/ERIC YEN/Primary Examiner, Art Unit 2658