DETAILED ACTION
This action is written in response to the application filed 12/5/19. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Serban. (Serban IV, Sankar C, Germain M, Zhang S, Lin Z, Subramanian S, Kim T, Pieper M, Chandar S, Ke NR, Rajeshwar S. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349. 2017 Sep 7.)
Regarding claims 1, 13 and 20, Serban discloses a method (and a related computer system and computer program product) of using data from a knowledge store to configure a reinforcement learning software agent, the method comprising:
[The Examiner notes that generic computer hardware including a computer processor and one or more computer readable storage media are inherent throughout the Serban disclosure.]
receiving, by a computer, access to the knowledge store regarding a topic for software agent support;
P. 3: “Alicebot uses a set of AIML (artificial intelligence markup language) templates to produce a response given the dialogue history and user utterance (Wallace 2009, Shawar & Atwell 2007). We use the freely available Alice kernel available at www.alicebot.org.”
Also pp. 6-7: “Movie titles, actor names, and director names are extracted from the Internet Movie Database (IMDB). Movie descriptions are taken from Google Knowledge Graph’s API. Other movie title queries are directed to the Open Movie Database (OMDB). For actor and director queries, the Wikiedata API is used. First, a search for actor and director names is done on a Wikidata JSON dump.”
using, by the computer, information from the knowledge store for training a reinforcement learning model supporting the reinforcement learning software agent;
See generally sec. 3.1: Template-based dialog models. Both Alicebot and Elizabot are template-based dialog systems.
Also p. 6, sec. 3.2: Knowledge base-based question answering.
Sec. 4 discusses the reinforcement learning-based model selection policy, which builds on the dialog systems discussed in sec. 3.
P. 2: “Further, we apply reinforcement learning—including value function and policy gradient methods—to train the system to select an appropriate response from the models in its ensemble.”
Also p. 7: “As described earlier, the model uses word embeddings to match tags. These word embeddings are trained using Word2Vec on movie plot summaries and actor biographies extracted from the IMDB database (Mikolov et al. 2013).”
testing, by the computer, the trained reinforcement learning model in a testing environment, the testing environment having limited connectivity to an external environment; and
P. 2: “The trained systems yield substantial improvements in A/B testing experiments with real-world users.”
P. 14: “Crowdsourcing: We use Amazon Mechanical Turk (AMT) to collect data for training the scoring model. We follow a setup similar to Liu et al. (2016). We show human evaluators a dialogue along with 4 candidate responses, and ask them to score how appropriate each candidate response is on a 1-5 Likert-type scale. The score 1 indicates that the response is inappropriate or does not make sense, 3 indicates that the response is acceptable, and 5 indicates that the response is excellent and highly appropriate.”
P. 16: “In total, we collected 199,678 labels. We split this into training (train), development (dev) and testing (test) datasets consisting of respectively 137,549, 23,298 and 38,831 labels each.”
The Examiner notes that the “testing environment” has “limited connectivity” in the sense that, during the testing phase, the model is deployed only to crowdworkers (i.e., Amazon Mechanical Turk workers, also known as turkers).
deploying the reinforcement learning software agent with the tested and trained reinforcement learning model within an environment to autonomously perform actions to process requests.
Abstract and introduction: the chatbot system described throughout the disclosure was designed around the Alexa chatbot system and specifically the Amazon Alexa Prize competition.
P. 14: “The dialogues are extracted from interactions between Alexa users and preliminary versions of our system.”
P. 35: “Since nearly all our system components are trainable machine learning models, the system is likely to improve greatly with more interactions and additional data.”

Regarding claims 2, 14 and 21, Serban discloses the further limitation wherein the information from the knowledge store is in a semi-structured format and includes questions with one or more corresponding answers.
The Examiner notes that “semi-structured” implies that both structured and unstructured fields are present.
Structured, p. 6: “The model has a list of entity names and tags (e.g. movie plot and release year).” Also (same page) “Movie titles, actor names, and director names are extracted from the Internet Movie Database (IMDB).”
Unstructured, p. 6, algorithm 3: “dialogue history”. Also (same page) “Movie descriptions are taken from Google Knowledge Graph’s API.” (Free-form text descriptions are inherently unstructured.)
Also p. 4, table 1: “Table 1: Example dialogues and corresponding candidate responses generated by response models. The response of the final system is marked in bold.”
Also see pp. 6-7, algorithms 3-5, each illustrating a procedure for choosing from among a plurality of template-based responses.

Regarding claims 3, 15 and 22, Serban discloses the further limitation wherein the knowledge store comprises information generated by a plurality of users, and wherein at least part of the information is curated by the plurality of users.
P. 9: “To train the logistic regression classifier, we annotated 12,000 user utterances and candidate response pairs for appropriateness on a Likert-type scale 1-5. The user utterances were extracted from interactions between Alexa users and a preliminary version of the system.” (Emphasis added.)
P. 24: “During evaluation, each dialogue history is sampled from a separate set of dialogue histories, HEval, which is disjoint from the set of dialogue histories, HTrain used at training time. This ensures that the policy is not overfitting our finite set of dialogue histories.”
P. 34: “For training the system policy, they employ a user simulator trained on real-world human-human dialogues.”

Regarding claims 4, 16 and 23, Serban discloses the further limitation comprising:
providing access to a domain simulator to execute actions by the reinforcement learning software agent in the testing environment.
P. 24: “Training Given the Abstract Discourse MDP, we are now able to learn policies directly from simulations. We use Q-learning with experience replay to learn the policy parametrized as an action-value function (Mnih et al. 2013, Lin 1993). Q-learning is a simple off-policy reinforcement learning algorithm, which has been shown to be effective for training policies parametrized by neural networks.” (Emphasis added.)
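For illustration only, the technique named in the quoted passage (Q-learning with experience replay) can be sketched as follows. This is a minimal tabular sketch under stated assumptions, not Serban's implementation: Serban parametrizes the action-value function with a neural network and samples from the Abstract Discourse MDP, whereas the two-step toy simulator `toy_step`, the function name `q_learning_with_replay`, and all hyperparameter values below are hypothetical.

```python
import random
from collections import deque


def toy_step(state, action):
    """Hypothetical simulator: action 1 advances; action 0 ends the episode unrewarded."""
    if action == 0:
        return state, 0.0, True
    next_state = state + 1
    done = next_state == 2
    return next_state, (1.0 if done else 0.0), done


def q_learning_with_replay(step, n_states, n_actions, episodes=2000,
                           gamma=0.9, alpha=0.5, epsilon=0.5,
                           buffer_size=200, batch_size=16, seed=0):
    """Tabular Q-learning; updates are drawn from a replay buffer of past transitions."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    buffer = deque(maxlen=buffer_size)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behaviour policy (Q-learning itself is off-policy).
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            buffer.append((state, action, reward, next_state, done))
            # Experience replay: update from a random minibatch of stored transitions.
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            state = next_state
    return q


# Three state indices (0, 1, terminal 2) and two actions.
q_table = q_learning_with_replay(toy_step, n_states=3, n_actions=2)
```

In this toy setting the learned values should favor action 1 in both non-terminal states, since only that action leads to the rewarded terminal state.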

Regarding claim 5, Serban discloses the further limitation wherein the knowledge store includes features that are used by the reinforcement learning model to rank the information provided in the knowledge store.
P. 3: “To generate a response, the dialogue manager follows a three-step procedure. First, it uses all response models to generate a set of candidate responses. Second, if there exists a priority response in the set of candidate responses (i.e. a response which takes precedence over other responses), this response will be returned by the system. For example, for the question "What is your name?", the response "I am an Alexa Prize socialbot" is a priority response. Third, if there are no priority responses, the response is selected by the model selection policy. For example, the model selection policy may select a response by scoring all candidate responses and picking the highest-scored response. The overall process is illustrated in Figure 1.”
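For illustration only, the three-step procedure quoted above can be sketched as follows. This is a minimal sketch under stated assumptions, not code from the Serban disclosure: the names `Candidate` and `select_response`, the one-candidate-per-model simplification, and the toy length-based scorer are hypothetical stand-ins for Serban's response models and learned model selection policy.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    text: str
    priority: bool = False  # True if the response takes precedence over others


def select_response(history: List[str],
                    response_models: List[Callable[[List[str]], Candidate]],
                    score: Callable[[List[str], str], float]) -> str:
    # Step 1: every response model proposes a candidate response.
    candidates = [model(history) for model in response_models]
    # Step 2: a priority response, if present, is returned immediately.
    for candidate in candidates:
        if candidate.priority:
            return candidate.text
    # Step 3: otherwise the selection policy scores all candidates
    # and the highest-scored response is returned.
    return max(candidates, key=lambda c: score(history, c.text)).text


# Hypothetical response models and a toy scorer (assumptions for this sketch).
generic_model = lambda history: Candidate("Tell me more.")
name_model = lambda history: Candidate("I am an Alexa Prize socialbot", priority=True)
greeting_model = lambda history: Candidate("Hello there, how are you?")
toy_score = lambda history, text: float(len(text))

priority_reply = select_response(["What is your name?"],
                                 [generic_model, name_model], toy_score)
scored_reply = select_response(["hi"], [generic_model, greeting_model], toy_score)
```

The scoring step is where the ranking recited in claim 5 would occur in this sketch; the priority check bypasses it.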

Regarding claim 6, Serban discloses the further limitation wherein the features include one or more of upvotes, downvotes, an author name, an author title, or an author status.
P. 16, fig. 5, illustrating a turker feedback form, allowing turkers to rate candidate responses on a scale of 1-5. The Examiner interprets ‘upvotes’ as encompassing scores of 4 or 5, and ‘downvotes’ as encompassing scores of 1 or 2.
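For illustration only, the Examiner's interpretation of the 1-5 ratings can be expressed as the following mapping. This is a minimal sketch: the function name `likert_to_vote` is hypothetical, and the "neutral" label for a score of 3 is an assumption, since the interpretation above addresses only scores of 1, 2, 4 and 5.

```python
def likert_to_vote(score: int) -> str:
    """Map a 1-5 Likert-type score to a vote label per the interpretation above."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    if score >= 4:
        return "upvote"    # scores of 4 or 5
    if score <= 2:
        return "downvote"  # scores of 1 or 2
    return "neutral"       # score of 3 (assumed label; not addressed above)


votes = [likert_to_vote(s) for s in (1, 2, 3, 4, 5)]
```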

Regarding claim 7, Serban discloses the further limitation wherein the reinforcement learning model ranks the information in the knowledge store based on user preferences.
P. 4, table 1, illustrating candidate responses; the final system response is marked in bold.
See also p. 16, fig. 5, illustrating an interface for obtaining human feedback on candidate responses. The system learns to improve responses by incorporating this feedback via reinforcement learning. See generally sec. 4.

Regarding claim 8, Serban discloses the further limitation comprising:
generating a query;
P. 6: generating a query from a user utterance.
matching the query to a state in a policy generated by the reinforcement learning model, wherein each state has a corresponding action; and
Pp. 10-13, sec. 4: model selection policy.
P. 6: “BoWMovies model [bag-of-words] is a template-based response model”. (Emphasis added.)
See also p. 22, fig. 7, showing a state machine. The state machine is also discussed at p. 34, sec. 7.1.
executing by the reinforcement learning software agent the action returned from the matching policy.
Pp. 10-13, sec. 4: model selection policy.

Regarding claim 9, Serban discloses the further limitation comprising:
receiving a reward based on executing the action by the reinforcement learning software agent, when the action resolves the query.
P. 10: “The dialogue manager is an agent, which takes actions in an environment in order to maximize rewards”.
See also p. 18, sec. 4.4, discussing the learned reward function.

Regarding claim 10, Serban discloses the further limitation wherein a planning decision agent and a random software agent are used to validate learned actions by the reinforcement learning software agent.
P. 26: “In addition to evaluating the five policies described earlier, we also evaluate three heuristic policies: 1) a policy selecting responses at random called Random, 2) the Alicebot policy, and 3) the Evibot + Alicebot policy. Evaluating these models will serve to validate the approximate MDP.” (Emphasis added.)

Regarding claim 11, Serban discloses the further limitation comprising:
updating a policy generated by the reinforcement learning model, at a fixed or dynamic frequency, to include questions and answers added to the knowledge store.
P. 23: “For our purpose, H is the set of all recorded dialogues between Alexa users and a preliminary version of the system. This formally makes the Abstract Discourse MDP a non-parametric model, since sampling from the model requires access to the set of recorded dialogue histories H. This set grows over time when the system is deployed in practice. This is useful, because it allows to continuously improve the policy as new data becomes available.” (Emphasis added.)
See also p. 11: Action-value parametrization (reward function).

Regarding claims 12, 19 and 25, Serban discloses the further limitation wherein the reinforcement learning software agent deployed with the trained reinforcement learning model reaches a desired goal more efficiently than a reinforcement learning software agent without a trained reinforcement learning model.
[The Examiner notes that this limitation is an intended result which does not place meaningful limitations on the recited method / system / computer program product. See MPEP 2111.04(I).]
P. 26: “In addition to evaluating the five policies described earlier, we also evaluate three heuristic policies: 1) a policy selecting responses at random called Random, 2) the Alicebot policy, and 3) the Evibot + Alicebot policy. Evaluating these models will serve to validate the approximate MDP.” (Emphasis added.)
Table 5 illustrates policy evaluation results. The Examiner notes that the reinforcement-learning-based techniques outperform other techniques, e.g. Random.

Regarding claim 17, the above rejections of claims 5 and 6 together apply equally to this claim.

Regarding claims 18 and 24, the above rejections of claims 8 and 9 together apply equally to these claims.

Additional Relevant Prior Art
The following references were identified by the Examiner as being relevant to the disclosed invention, but are not relied upon in any particular prior art rejection: 
Ilievski discloses a chatbot system employing reinforcement learning, as well as transfer learning to transfer knowledge from one domain to another (e.g. movie booking domain to restaurant domain). See especially sec. 3. (Ilievski V, Musat C, Hossmann A, Baeriswyl M. Goal-oriented chatbot dialog management bootstrapping with transfer learning. arXiv preprint arXiv:1802.00500. 2018 Feb 1.)

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Vincent Gonzales whose telephone number is (571) 270-3837. The examiner can normally be reached on Monday-Friday 7 a.m. to 4 p.m. MT.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Vincent Gonzales/
Primary Examiner, Art Unit 2124