DETAILED ACTION
Claims 1-20 are currently pending in application 16/262618, filed on 30 January 2019. All references cited in the IDS have been considered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Alexandros Komninos (“Leveraging Structure for Learning Representations of Words, Sentences and Knowledge Bases”, PhD Thesis, University of York, January 2018, pp. 1-130), hereinafter referred to as Komninos, in view of Bordes et al. (“Large-scale Simple Question Answering with Memory Networks”, http://arxiv.org/pdf/1506.02075v1.pdf, arXiv:1506.02075v1 [cs.LG], 5 June 2015, pp. 1-10), hereinafter referred to as Bordes.

In regards to claim 1, Komninos teaches a computer-implemented method for question answering using one or more processors to cause steps to be performed comprising: ([pp. 92-93, Section 6.1, Figure 6.1] Question Answering on Knowledge Bases (KBQA) is a specific QA setting requiring mapping questions expressed in natural language into queries to be executed against a Knowledge Base (KB). The questions are answered by the retrieved list of entities or in the case of complex questions by a function applied to the list, such as counting or sorting. In this work, the focus is on the simple KBQA setting using the SimpleQuestions dataset (Bordes et al., 2015), where given a knowledge base consisting of facts encoded as triples of the form (subject, relation, object), questions can be answered directly by a single fact., wherein a computer-based method performs question answering (Figure 6.1).)
generating, using a predicate learning model, a predicted predicate representation in a knowledge graph (KG) predicate embedding space for a question comprising one or more tokens; ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. … <equation 6.7> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the relation/predicate representation in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.7) using a BiLSTM neural network framework (Figure 6.1).)
generating, using a head entity learning model, a predicted head entity representation in an KG entity embedding space for the question; ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. … <equation 6.6> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the subject/head entity representation in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.6) using a BiLSTM neural network framework (Figure 6.1).)
obtaining a predicted tail entity representation, based on a relation function that relates, for a fact in KG embedding space, a head entity representation and a predicate representation to a tail entity representation, from the predicted predicate representation and the predicted head entity representation, the predicted predicate representation, the predicted head entity representation, and the predicted tail entity representation forming a predicted fact; ([p. 97, Section 6.3, p. 102, Section 6.4.2, p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] Facts in Freebase are encoded as (subject, relation, object) triples, but since only the (subject, relation) part is relevant to form a query, a useful representation is to create a larger subgraph were all the objects are aggregated into (subject, relation, [object 1, object 2, ..., object n]) tuples., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein a candidate/predicted fact is formed from the predicted subject/head entity, predicted relation/predicate, and a corresponding associated predicted object/tail entity (in which the relationship between the tail entity and the predicate/head entity pair is determined by the corresponding KG embeddings) such that this candidate/predicted fact is a candidate query for the KG that determines the answer (object) associated with the question.)   
identifying, using a head entity detection (HED) model, one or more predicted head entity names for the question, each predicted head entity name comprises one or more tokens from the question; ([pp. 102-103, Section 6.4.2, Figure 6.1] The sequence of Bernoulli probabilities provided by the attention can also act as a gate to n-gram embeddings at the corresponding word positions and is used as a pooling operation to form a weighted average of n-gram embeddings of the entity mention: <equation 6.11> This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings:…, wherein the entity mention (interpreted as the subject/head entity) in the question is identified as a weighted average over the n-gram embeddings such that this identified head entity is interpreted as corresponding to tokens from the question since it is a character based representation of the entity such that the HED model is indicated in equation 6.11.) …;  Customer No. 1192764328888-2279 (BN181205USN2)PATENT ([p. 103, Section 6.4.2, Figure 6.1] This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings: … The similarity between the question entity mention and the name of a KB entity is then given by:… This is the maximum cosine similarity between the question entity mentions and any of the known aliases of the entity in the KB., wherein an evaluation of the similarity (string) between a character representation of the (head) question entity includes an evaluation over the aliases (interpreted as synonyms) associated with a kb embedding of name descriptions of entities in the KB/KG.) 
constructing a candidate fact set comprising one or more candidate facts, … and choosing, based on a joint distance metric, one candidate fact in the candidate fact set with a minimum joint distance to the predicted fact as an answer to the question. ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity.)
However, Komninos does not explicitly teach searching … each candidate fact comprises a head entity from among the head entity synonyms.  Although Komninos teaches the formation of a candidate fact set (i.e., the set queries with evaluations in the form of equation 6.15) and the evaluation of a similarity between character representation of the (head) question entity and corresponding entities in the KG along with entity aliases/synonyms, he does not explicitly disclose a search over the KG to find the synonyms or aliases followed by the formation/augmentation of the candidate fact set using those synonyms.  
However, Bordes, in the analogous environment of performing question answering using KG embeddings, teaches identifying, … one or more … head entity names for the question, each … head entity name comprises one or more tokens from the question; searching, in the KG, head entity synonyms related to the one or more … head entity names; constructing a candidate fact set comprising one or more candidate facts, each candidate fact comprises a head entity from among the head entity synonyms; and choosing, based on a joint distance metric, one candidate fact in the candidate fact set with a minimum joint distance to the predicted fact as an answer to the question. ([p. 5, Section 3.3] Candidate generation: To generate candidate facts, we match n-grams of words of the question to aliases of Freebase entities and select a few matching entities. All facts having one of these entities as subject are scored in a second step. We first generate all possible n-grams from the question, removing those that contain an interrogative pronoun or 1-grams that belong to a list of stopwords. We only keep the n-grams which are an alias of an entity, and then discard all n-grams that are a subsequence of another n-gram, except if the longer n-gram only differs by in, of, for or the at the beginning…. Scoring is performed using an embedding model. Given two embedding matrices W_V ∈ R^(d×N_V) and W_S ∈ R^(d×N_S), which respectively contain, in columns, the d-dimensional embeddings of the words/n-grams of the vocabulary and the embeddings of the Freebase entities and relationships, the similarity between question q and a Freebase candidate fact y is computed as: S_QA(q, y) = cos(W_V g(q), W_S f(y)), with cos() the cosine similarity., wherein, given a (small) set of candidate facts/queries, aliases are generated from the entities of the question (interpreted as including the head entity) such that these aliases are synonyms with the respective entity since they can be used semantically and syntactically interchangeably with the respective entity and such that these aliases are found through a search of the KG Freebase entities and wherein the candidate facts are evaluated using the (embedded) KG (subgraph embedding) to find the best answer to the question such that this evaluation is performed jointly using a joint distance metric in the form of the cosine similarity between the question and fact embeddings with the minimum joint distance interpreted as corresponding to the maximum cosine similarity (in other words, like Komninos, Bordes teaches the selection of a candidate answer based on a joint distance metric over a candidate fact set).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes to search a KG for head entity synonyms related to the one or more predicted head entity names and to construct a candidate fact set comprising one or more candidate facts, with each candidate fact comprising a head entity from among the head entity synonyms, and with the resulting candidate fact set evaluated using a joint distance metric to compute a predicted fact as an answer to the question. The modification would have been obvious because one of ordinary skill would have been motivated to achieve excellent and scalable question answering performance by using KG subgraph embeddings to evaluate candidate answers to a question in which the candidate answers found in the KG are augmented through aliases/synonyms of question entities found in the KG (Bordes, [Abstract, p. 9, Section 7, Table 4]).
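
For illustration only, the candidate generation procedure quoted above from Bordes (Section 3.3) may be sketched as follows. The sketch is not taken from the reference: the function names, the interrogative and stopword lists, the alias table, and the entity identifiers are all hypothetical, and "subsequence of another n-gram" is assumed to mean a contiguous run inside another retained alias n-gram.

```python
# Illustrative sketch only; word lists and identifiers are assumptions, not from Bordes.
INTERROGATIVES = {"what", "who", "whom", "whose", "which", "where", "when", "how"}
STOPWORDS = {"the", "a", "an", "of", "in", "for", "is", "was"}
LEADING_WORDS = {"in", "of", "for", "the"}  # exception words named in the quoted passage


def ngrams(tokens):
    """Yield every contiguous n-gram of the token list as a tuple."""
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            yield tuple(tokens[start:end])


def is_contained(short, long):
    """True if `short` occurs as a contiguous run inside the strictly longer n-gram."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def differs_only_by_leading_word(short, long):
    """True if `long` is `short` with one of in/of/for/the prepended."""
    return (
        len(long) == len(short) + 1
        and long[0] in LEADING_WORDS
        and long[1:] == short
    )


def candidate_facts(question, alias_to_entities, facts):
    """Return every (subject, relation, object) fact whose subject matches an alias."""
    tokens = question.lower().rstrip("?").split()
    kept = []
    for gram in ngrams(tokens):
        if any(tok in INTERROGATIVES for tok in gram):
            continue  # n-gram contains an interrogative word
        if len(gram) == 1 and gram[0] in STOPWORDS:
            continue  # 1-gram that is a stopword
        if " ".join(gram) in alias_to_entities:
            kept.append(gram)  # keep only n-grams that are an entity alias
    # Discard n-grams contained in a longer kept n-gram, unless the longer one
    # differs only by a leading in/of/for/the.
    maximal = [
        g for g in kept
        if not any(
            is_contained(g, other) and not differs_only_by_leading_word(g, other)
            for other in kept
        )
    ]
    entities = {e for g in maximal for e in alias_to_entities[" ".join(g)]}
    return [fact for fact in facts if fact[0] in entities]
```

With a hypothetical alias table mapping "new york" to two entities and "york" to a third, the question "where is new york?" retains only the longer alias, so only facts whose subject is one of the two "new york" entities are passed on for scoring.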

In regards to claim 6, the rejection of claim 1 is incorporated and Komninos further teaches wherein the joint distance metric comprises distance terms representing distance between a vector in the candidate fact and a corresponding vector in the predicted fact, each term is a p norm to measure vector distance. ([p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a distance metric (|x-y|) between particular head-entity question-fact vectors as well as between particular relationship question-fact vectors with that distance metric interpreted to be a p-norm metric with p being commonly interpretable with this representation as being either 1 or 2.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes for the same reasons as pointed out for claim 1.
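
For illustration only, a joint distance metric built from p-norm terms, and the selection of the candidate fact with the minimum joint distance, may be sketched as follows. The sketch assumes an unweighted sum of one p-norm term per component (head entity, predicate, tail entity); neither the claim language quoted above nor the cited references is asserted to use exactly this form, and all names and vectors are hypothetical.

```python
def p_norm_distance(u, v, p=2):
    """p-norm of the difference between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def joint_distance(candidate, predicted, p=2):
    """Sum of per-component p-norm distances between two facts.

    Each fact is a (head, predicate, tail) tuple of embedding vectors.
    The unweighted sum is an assumption made for this sketch.
    """
    return sum(p_norm_distance(c, q, p) for c, q in zip(candidate, predicted))


def choose_answer(candidates, predicted, p=2):
    """Return the name of the candidate fact closest to the predicted fact."""
    return min(
        candidates,
        key=lambda name: joint_distance(candidates[name], predicted, p),
    )


# Hypothetical two-dimensional embeddings.
predicted_fact = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
candidate_set = {
    "fact_a": ([0.0, 0.1], [1.0, 0.0], [1.0, 0.1]),
    "fact_b": ([1.0, 1.0], [0.0, 1.0], [1.0, 2.0]),
}
best = choose_answer(candidate_set, predicted_fact)
```

Setting p to 1 or 2 gives the two readings of |x-y| noted in the rejection of claim 6.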


Claims 2 and 3 are  rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, and in further view of Miwa et al. (“End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures”, http://arxiv.org/pdf/1601.00770v3.pdf, arXiv:1601.00770 v3[cs.CL], 8 June 2016, pp. 1-13), hereinafter referred to as Miwa.

In regards to claim 2, the rejection of claim 1 is incorporated and Komninos further teaches wherein the predicate learning model has a neural network structure comprising a bidirectional recurrent neural network layer and an attention layer, the generation of the predicted predicate representation comprising: mapping the one or more tokens in the question into a sequence of word embedding vectors; ([p. 101, Section 6.4, Figure 6.1] The question is encoded into two latent representations, one for the entity mention and one for the relation mention. The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5>, wherein each word in the question is mapped into/represented as word embeddings (e_wi in equation 6.5) that is used to form a question representation input into the BiLSTM layers.)
generating, using the bidirectional recurrent neural network layer, a forward hidden state sequence and a backward hidden state sequence; concatenating the forward and backward hidden state vectors into a concatenated hidden state vector; ([p. 101, Section 6.4, Figure 6.1] A bidirectional LSTM is used to transform the sequence of input vectors x to a sequence of contextualized vectors h: …, wherein the BiLSTM processes the question word embedding representation to form a sequence of contextualized vectors consisting of the concatenation of forward and backward hidden states of the BiLSTM.)
applying, by the attention layer, an attention weight to the concatenated hidden state vector to obtain a weighted hidden state vector; … ([pp. 101-102, Section 6.4, Figure 6.1] The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. The weights are determined by the probabilities from the attention network. <equation 6.7>, wherein an attention layer (Figure 6.1) generates weights that are applied to the concatenated hidden state vector (as seen in the argument of the summation in equation 6.7).)
applying a fully connected layer to the hidden state to obtain a target vector for each token; and using a mean of all target vectors as the predicted predicate representation. ([pp. 101-102, Section 6.4, Figure 6.1] The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. The weights are determined by the probabilities from the attention network. <equation 6.7>, wherein the predicted predicate/relation representation h_q^rel (equation 6.7) is computed by forming an average (a sum over i) over the weighted hidden state representation for each (ith) word/token and then multiplied by the weight W_q^rel (with dimension k x i) which is being interpreted as a fully connected layer (i.e., fully connected to the hidden state) to generate the predicate representation and wherein the target vector is being interpreted as being the application of the weight W_q^rel transformation to the hidden state vector (in other words, equation 6.7 inherently performs the same mathematical function as one in which the (constant) weight matrix W_q^rel (corresponding to the fully connected layer) were inserted into the summation so that a set of target vectors for each word would be generated before being averaged by the summation operation).)
However, Komninos and Bordes do not explicitly teach concatenating the weighted hidden state vector with the word embedding to obtain a hidden state for each token. Although Komninos discloses weighting the concatenated forward/backward hidden state vector and using the result to generate the predicted predicate representation, he does not teach the concatenation of this hidden state vector with the word embeddings.
However, Miwa, in the analogous environment of extracting relation and entity from word sequences using LSTMs, teaches generating, using the bidirectional recurrent neural network layer, a forward hidden state sequence and a backward hidden state sequence; concatenating the forward and backward hidden state vectors into a concatenated hidden state vector; … concatenating the … hidden state vector with the word embedding to obtain a hidden state for each token. ([pp. 3-4, Section 3.2, Figure 1] The LSTM unit at t-th word receives the concatenation of word and POS embeddings as its input vector: x_t = [v_t^(w); v_t^(p)]. We also concatenate the hidden state vectors of the two directions’ LSTM units corresponding to each word (denoted as →h_t and ←h_t) as its output vector, s_t = [→h_t; ←h_t], and pass it to the subsequent layers., We stack the dependency layers (corresponding to relation candidates) on top of the sequence layer to incorporate both word sequence and dependency tree structure information into the output. The dependency-layer LSTM unit at the t-th word receives as input x_t = [s_t; v_t^(d); v_t^(e)], i.e., the concatenation of its corresponding hidden state vectors s_t in the sequence layer, dependency type embedding v_t^(d) (denotes the type of dependency to the parent), and label embedding v_t^(e) (corresponds to the predicted entity label)., wherein the embedding vector v_t for a word in a sentence (question) is processed through a BiLSTM to form a concatenation of forward and backward hidden state vectors (s_t) which is then further concatenated with various word embedding elements (v_e, v_t) before undergoing additional processing in the dependency layers to perform relation/predicate classification/prediction.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Miwa to concatenate the weighted hidden state vector with the word embedding to obtain a hidden state for each token.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters (Miwa, [Abstract, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).
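
For illustration only, the sequence of operations recited in claim 2 after the bidirectional layer (concatenate the forward and backward states, apply the attention weight, concatenate with the word embedding, apply a fully connected layer, and average the target vectors) may be sketched as follows. The sketch assumes the hidden states and attention weights have already been computed, omits the bias and any activation, and uses hypothetical names and toy values; it is not taken from Komninos or Miwa.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def predicate_representation(embeddings, forward, backward, attention, fc_weights):
    """Average of per-token target vectors, following the order of steps in claim 2.

    embeddings: one word embedding per token
    forward, backward: one hidden state per token from each direction
    attention: one scalar attention weight per token
    fc_weights: fully connected layer weights (bias omitted in this sketch)
    """
    targets = []
    for emb, fwd, bwd, weight in zip(embeddings, forward, backward, attention):
        concatenated = fwd + bwd                        # [forward; backward]
        weighted = [weight * h for h in concatenated]   # attention weight applied
        hidden = weighted + emb                         # concatenated with word embedding
        targets.append(matvec(fc_weights, hidden))      # target vector for this token
    count = len(targets)
    return [sum(column) / count for column in zip(*targets)]


# Toy values: two tokens, one-dimensional states and embeddings.
representation = predicate_representation(
    embeddings=[[1.0], [3.0]],
    forward=[[1.0], [2.0]],
    backward=[[0.0], [2.0]],
    attention=[0.5, 1.0],
    fc_weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
)
```

Because the fully connected layer here is linear, applying it to each token before averaging gives the same result as applying it once to the averaged hidden state, which is the equivalence relied upon in the discussion of equation 6.7.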

In regards to claim 3, the rejection of claim 2 is incorporated and Komninos further teaches wherein the head entity learning model have a neural network structure the same as the predicate learning model. ([pp. 101-102, Section 6.4, Figure 6.1] The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. The weights are determined by the probabilities from the attention network. <equations 6.6, 6.7>, wherein as can be seen in equations 6.6 and 6.7 as well as in Figure 6.1, the neural network structures of the predicate/relation representation model and the subject/head entity representation model are the same.)
	It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes and Miwa for the same reasons as pointed out for claims 1 and 2 respectively.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, in view of Miwa, and in further view of Dai et al. (“CFO: Conditional Focused Neural Question Answering with Large-scale Knowledge Bases”, http://arxiv.org/pdf/1606.01994v2.pdf, arXiv:1606.01994v2[cs.CL], 4 July 2016, pp. 1-11), hereinafter referred to as Dai.

In regards to claim 4, the rejection of claim 3 is incorporated and Komninos and Bordes do not further teach  wherein the predicate learning model and the head entity learning model are pre-trained using a training data set with ground truth facts via a predicate objective function and a head entity objective function respectively.  
Although Komninos teaches the pretraining of particular components of the framework ([pp. 103-104, Section 6.5]) he does not disclose the pretraining specifically for the predicate learning model and the head entity learning model components and although these particular components are understood to have been trained prior to their application, Komninos does not explicitly disclose the terms of the training or their associated loss functions. Bordes does not disclose  BiLSTM-based models.
However, Miwa, in the analogous environment of extracting relation and entity from word sequences using LSTM’s, teaches  wherein … the head entity learning model are pre-trained using a training data set with ground truth facts ….   ([p. 4, Section 3.3, p. 6, Section 4.2, Figure 1], We perform entity detection on top of the sequence layer. We employ a two-layered NN with an nhe -dimensional hidden layer h (e) and a softmax output layer for entity detection. <equation 2>, The dataset consists of 8,000 training and 2,717 test sentences, and each sentence is annotated with a relation between two given nominals. We randomly selected 800 sentences from the training set as our development set., wherein the subject/head entity prediction/learning/detection model is pretrained in the QA framework.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Miwa to pre-train the head entity learning model using a training data set with ground truth facts. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters in which the entity detection module is pretrained (Miwa, [Abstract, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).
However, Miwa does not explicitly teach the predicate learning model a… via a predicate objective function and a head entity objective function respectively. Miwa does not disclose  the training considerations associated with the predicate/dependency prediction module (equation 5) and does not disclose loss/objective function details for either model.
However, Dai, in the analogous environment of performing question answering using knowledge base embeddings, teaches wherein the predicate learning model and the head entity learning model are pre-trained using a training data set with ground truth facts via a predicate objective function and a head entity objective function respectively ([p. 1, Section 1, p. 4, Section 4.2, pp. 5-6, Section 5.1] To find the answer to a single-fact question, it suffices to identify the subject entity and relation (implicitly) mentioned by the question, and then forms a corresponding structured query. The problem can be formulated into a probabilistic form. Given a single-fact question q, finding the subject-relation pair ŝ, r̂ from the KB K which maximizes the conditional probability p(s, r|q), i.e. <equation 1>., Relation network In this work, the probability of relations given a question, p(r|q), is modeled by the following network <equation 9>… As introduced in section 3, the factor p(s|q, r) models the fitness of a subject s appearing in the question q, given the main topic is about the relation… . For simplicity, we use two additive terms to model the joint effect <equation 11>., To estimate the parameters of p_θr(r|q) and p_θs(s|r, q), MLE can be utilized to maximize the empirical (log-)likelihood of subject-relation pairs … <equations 15 and 16>., wherein (BiLSTM) model parameters for the predicate/relation prediction model (equation 9) and the subject/head entity prediction/learning model (equation 11) are obtained through training with a training set according to a log-likelihood loss objective function (equation 16) and wherein this training is interpreted as being pretraining in the sense of having occurred prior to the application of the network in an evaluation mode where it is noted that various components of those models (embeddings) are, in a more narrow sense, pretrained.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos, Bordes, and Miwa to incorporate the teachings of Dai to pre-train the predicate learning model and the head entity learning model using a training data set with ground truth facts via a predicate objective function and a head entity objective function respectively. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters (Dai, [Abstract, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).
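
For illustration only, a maximum likelihood objective of the general kind described in the passage quoted from Dai may be sketched as a mean negative log-likelihood over training examples. The sketch assumes a softmax over a small set of candidate scores and hypothetical function names; it does not reproduce Dai's equations 15 and 16, whose exact form is not quoted above. The same function could serve as either a predicate objective or a head entity objective, depending on which scores are supplied.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of candidate scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def negative_log_likelihood(batch):
    """Mean negative log-probability of the ground-truth candidate.

    batch: list of (scores, gold_index) pairs, where `scores` holds one model
    score per candidate (relation or subject) and `gold_index` marks the
    ground-truth candidate from the training fact.
    """
    total = sum(log_softmax(scores)[gold] for scores, gold in batch)
    return -total / len(batch)


# Hypothetical scores: the second example ranks the gold candidate higher.
uninformed_loss = negative_log_likelihood([([0.0, 0.0], 0)])
informed_loss = negative_log_likelihood([([2.0, 0.0], 0)])
```

Minimizing this quantity is equivalent to maximizing the empirical log-likelihood of the ground-truth candidates in the training set.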

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, and in further view of Zheng et al. (“Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme”, http://arxiv.org/pdf/1706.05075v1.pdf, arXiv:1706.05075v1[cs.CL], 7 June 2017, pp. 1-10), hereinafter referred to as Zheng.

In regards to claim 5, the rejection of claim 1 is incorporated and Komninos further teaches wherein the HED model has a neural network structure comprising a bidirectional recurrent neural network layer and a fully connected layer, the identification of the one or more predicted head entity names for the question comprising: mapping the one or more tokens in the question into a sequence of word embedding vectors; ([p. 101, Section 6.4, Figure 6.1] The question is encoded into two latent representations, one for the entity mention and one for the relation mention. The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5>, wherein each word in the question is mapped into/represented as word embeddings (e_wi in equation 6.5) that is used to form a question representation input into the BiLSTM layers.)
generating, at the bidirectional recurrent neural network layer, a forward hidden state sequence and a backward hidden state sequence; concatenating the forward and backward hidden state vectors to obtain a concatenated hidden state vector; ([p. 101, Section 6.4, Figure 6.1] A bidirectional LSTM is used to transform the sequence of input vectors x to a sequence of contextualized vectors h: …, wherein the BiLSTM processes the question word embedding representation to form a sequence of contextualized vectors consisting of the concatenation of forward and backward hidden states of the BiLSTM.)
applying the fully connected layer and a … function to the concatenated hidden state vector to obtain a target vector for each token, each target vector has two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name; and selecting one or more tokens as the head entity name …. ([pp. 102-103, Section 6.4.2, Figure 6.1] <equations 6.6, 6.7> … The intuition behind the above equations is that the output of the attention network indicates the probability p of a token being part of the entity mention, and with 1 − p the probability of being in the relation mention. These probabilities can be used to as a gate to get an entity and relation representation of the question. The token representations being averaged are the states of the biLSTM network, making them carry information about the order of the tokens in the sentence…. The sequence of Bernoulli probabilities provided by the attention can also act as a gate to n-gram embeddings at the corresponding word positions and is used as a pooling operation to form a weighted average of n-gram embeddings of the entity mention: <equation 6.11> This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings:…, wherein the system determines 2 probabilities, with a first probability p^att_i corresponding to the probability of a word/token belonging to an entity (subject/head entity) and a second probability 1-p^att_i corresponding to the probability of the word/token corresponding to a relation (i.e., non-entity) with the association of the question word/token to the head entity (subject) determined by that probability according to equation 6.6 and where it is noted that equation 6.11 is also interpreted as an entity token/word identification/selection.)
However, Komninos and Bordes do not explicitly teach softmax … and selecting one or more tokens as the head entity name based on probability value of each token belonging to entity token name. Komninos does not teach the use of a softmax function for the classification of the entity (instead he uses the attention functionality) and does not disclose an explicit selection of a word token according to a target vector derived from that probability (instead that word is identified more directly from the attention probabilities without using the hidden states). Bordes does not teach entity/non-entity detection.
However, Zheng, in the analogous environment of performing joint relation and entity extraction from word sequences using BiLSTMs, teaches  applying the fully connected layer and a Softmax function to the concatenated hidden state vector to obtain a target vector for each token, each target vector has two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name; and selecting one or more tokens as the head entity name based on probability value of each token belonging to entity token name. ([p. 3, Section 3.1, pp. 4-5, Section 3.3], Each word is assigned a label that contributes to extract the results. Tag “O” represents the “Other” tag, which means that the corresponding word is independent of the extracted results. In addition to “O”, the other tags consist of three parts: the word position in the entity, the relation type, and the relation role…. For example, the word of “United” is the first word of entity “United States” and is related to the relation “Country-President”, so its tag is “B-CP-1”. The other entity “ Trump”, which is corresponding to “United States”, is labeled as “S-CP-2”. Besides, the other words irrelevant to the final result are labeled as “O”., For each word wt , the forward LSTM layer will encode wt by considering the contextual information from word w1 to wt , which is marked as −→ht . In the similar way, the backward LSTM layer will encode wt based on the contextual information from wn to wt , which is marked as ←− ht . Finally, we concatenate ←− ht and −→ht to represent word t’s encoding information, denoted as ht = [−→ht , ←− ht ]…. 
When detecting the tag of word wt, the inputs of the decoding layer are: ht obtained from Bi-LSTM encoding layer, former predicted tag embedding T_t-1, former cell value c_t-1, and the former hidden vector in decoding layer h_t-1… The final softmax layer computes normalized entity tag probabilities based on the tag predicted vector Tt : <equations 14 and 15>, wherein words in a sentence (question) are processed through a BiLSTM to form a contextual encoding of the sentence (question) in the form of concatenated forward and backward hidden state vectors (for each word/token) such that each concatenated hidden state vector is transformed to a tag predicted (target vector) y_t (equation 14) and processed through the softmax layer/function to determine a corresponding tag probability for that word such that the probability corresponds to whether the word is a head entity or not the head entity with the word then subsequently selected to be the head entity (in a triplet fact representation) based on that probability and wherein, it is noted that the Bi-LSTM model for detecting and labeling/tagging the entities in the sentence forms a HED model.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Zheng to apply the fully connected layer and a Softmax function to the concatenated hidden state vector to obtain a target vector for each token, with each target vector having two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name and to select one or more tokens as the head entity name based on probability value of each token belonging to entity token name.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved entity extraction and tagging through joint entity and relation extraction in an end-to-end model using a bi-LSTM framework with a softmax layer to perform the entity tagging (Zheng, [Abstract, p. 5, Section 3.3, p. 6, Section 4.2, Table 1]).
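
For illustration only, the combined operation discussed for claim 5 (a fully connected layer and a softmax applied to each concatenated hidden state vector, followed by selection of tokens by entity probability) may be sketched as follows. The sketch is not taken from Komninos, Bordes, or Zheng. The function names, the 0.5 threshold, and all numeric values are hypothetical, and the hidden states are supplied directly rather than computed by a trained BiLSTM.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tag_tokens(tokens, forward_states, backward_states, weights, bias):
    """Return (token, [p_entity, p_non_entity]) for every token."""
    tagged = []
    for token, h_fwd, h_bwd in zip(tokens, forward_states, backward_states):
        h = h_fwd + h_bwd  # concatenate forward and backward hidden states
        # Fully connected layer: one logit per class (entity, non-entity).
        logits = [sum(w * x for w, x in zip(row, h)) + b
                  for row, b in zip(weights, bias)]
        tagged.append((token, softmax(logits)))
    return tagged

def select_head_entity(tagged, threshold=0.5):
    """Keep tokens whose entity probability exceeds the (assumed) threshold."""
    return [token for token, probs in tagged if probs[0] > threshold]

# Toy inputs (hypothetical values, not trained parameters).
tokens = ["where", "was", "obama", "born"]
forward_states = [[0.0], [0.0], [2.0], [0.0]]
backward_states = [[0.0], [0.0], [2.0], [0.0]]
weights = [[1.0, 1.0], [-1.0, -1.0]]  # row 0: entity, row 1: non-entity
bias = [-1.0, 1.0]

tagged = tag_tokens(tokens, forward_states, backward_states, weights, bias)
head_entity_tokens = select_head_entity(tagged)
print(head_entity_tokens)
```

With these toy values only the token "obama" receives an entity probability above the threshold, so it alone is selected as the head entity name.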

Claims 7 and 8 are  rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, and in further view of Lukovnikov et al. (“Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level”, IW3C2 WWW2017, April 2017, pp. 1211-1220), hereinafter referred to as Lukovnikov.

In regards to claim 7, the rejection of claim 6 is incorporated and Komninos further teaches  wherein the joint distance metric further comprises string similarity terms representing string similarity between name of entity in the candidate fact and the tokens classified as entity name by the HED model, …  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a string similarity metric (sim_name) between particular head-entity (HED model) question mention and corresponding fact entity and wherein Komninos also determines/classifies a correspondence between a given token in the sentence as a non-entity/predicate according to a probability.)
However, neither Komninos nor Bordes teaches and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model. Komninos does not include the character string representation of the question term relationship/predicate mention in joint distance metric.
However, Lukovnikov, in the analogous environment of performing end-to-end question answering with bi-LSTM relation, teaches  and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model ([p. 1213, Section 2.1.1, p. 1215, Section 2.1.4, p. 1215, Section 2.2, Figure 3]The mapping of a question q = {w1, . . . , wT } to its subject and predicate related vector representations r s q and r r q, respectively, is done using a single-layered unidirectional GRU based encoder network. We call this part of the model the question encoder ENCQ <equation 7> The question encoder ENCQ first uses the word representation function REPW(wt) to generate vector representations for all words wt, t = 1, . . . , T (as described in the next paragraph), which are subsequently fed to the RNN until all words have been seen., Given the question encoding vector rq = (r s q, r p q ), the latent vector representation rp of the relation, and the latent representation rs of the subject entity, we compute two matching scores: one between the question and subject entity and one between the question and predicate, as follows: <equations 14a, 14b>, Using these scoring functions, we can solve the task of finding the right subjectpredicate pair (sg, pg) (i.e. 
, retrieving triples (sg, pg, oi) ∈ G such that the set of objects in these triples constitutes the answer to question q) by picking the best scoring subject entity and predicate given a question according to Equations (1) and (2), respectively., wherein a GRU/RNN-based question encoder determines/classifies a representation of the predicate tokens in the input sentence in an embedding space (figure 3) into a subject/entity and a relation/non-entity and wherein this representation is used in a similarity computation (equations 14) that is then used to score candidate answers (i.e., the string similarity of both the head entity prediction-fact head entity and the predicted predicate-fact predicate are used to find the best answer from the KG).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Lukovnikov for the joint distance metric to further comprise string similarity terms representing string similarity between name of entity in the candidate fact and the tokens classified as entity name by the HED model, and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved question answering performance in an end-to-end ML configuration by using both character-level and word-level information for entity and predicate prediction (Lukovnikov, [Abstract, pp. 1212-1213, Section 1, p. 1219, Section 6, Table 4]).

In regards to claim 8, the rejection of claim 7 is incorporated and Komninos further teaches  wherein the joint distance metric is a weighted combination of the distance terms and the string similarity terms. ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a combination of a distance metric (|x-y|) and a string similarity metric (sim_name) such that the factor w applied to the concatenation combination of those metrics is being interpreted as a weight or a weighting function.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes and Lukovnikov for the same reasons as pointed out for claim 7.
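
For illustration only, a joint distance metric formed as a weighted combination of embedding distance terms and string similarity terms, as discussed for claims 7 and 8, may be sketched as follows. The sketch is a simplification and is not the formulation of Komninos, where equation 6.15 applies the factor w to a concatenation of features. The dictionary keys, the weights w_dist and w_str, the use of difflib as the string similarity measure, and all data values are hypothetical.

```python
import math
from difflib import SequenceMatcher

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def string_similarity(a, b):
    """Ratio in [0, 1]; a stand-in for any string similarity measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def joint_distance(predicted, candidate, w_dist=1.0, w_str=1.0):
    """Weighted distance terms minus weighted string similarity terms."""
    distance_terms = (
        euclidean(predicted["head_vec"], candidate["head_vec"])
        + euclidean(predicted["pred_vec"], candidate["pred_vec"])
        + euclidean(predicted["tail_vec"], candidate["tail_vec"])
    )
    similarity_terms = (
        # entity-classified tokens vs. name of the entity in the fact
        string_similarity(predicted["entity_tokens"], candidate["head_name"])
        # non-entity tokens vs. name of the predicate in the fact
        + string_similarity(predicted["non_entity_tokens"], candidate["pred_name"])
    )
    return w_dist * distance_terms - w_str * similarity_terms

# Toy predicted fact and candidate facts (hypothetical values).
predicted = {
    "head_vec": [1.0, 0.0], "pred_vec": [0.0, 1.0], "tail_vec": [1.0, 1.0],
    "entity_tokens": "obama", "non_entity_tokens": "where was born",
}
candidates = [
    {"fact": "A", "head_vec": [1.0, 0.0], "pred_vec": [0.0, 1.0],
     "tail_vec": [1.0, 1.0], "head_name": "Barack Obama",
     "pred_name": "place of birth"},
    {"fact": "B", "head_vec": [0.0, 1.0], "pred_vec": [1.0, 0.0],
     "tail_vec": [0.0, 0.0], "head_name": "Michelle Obama",
     "pred_name": "spouse"},
]

best = min(candidates, key=lambda c: joint_distance(predicted, c))
print(best["fact"])
```

The candidate fact with the minimum joint distance is returned as the answer. Similarity enters with a negative sign so that a closer string match lowers the distance.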

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, and in further view of Zhang et al. (“Knowledge Graph Embedding for Hyper-Relational Data”, Tsinghua Science and Technology, Vol. 22, No. 2, April 2017, pp. 185-197), hereinafter referred to as Zhang.

In regards to claim 9, the rejection of claim 6 is incorporated and Komninos and Bordes do not further teach wherein in the joint distance metric, the candidate fact has a tail entity embedding vector …, using the relation function, from a head entity embedding vector and a predicate embedding vector of the candidate fact.  Although Komninos teaches a (functional) relationship in a candidate fact in the KG embedding space between the object (tail entity) and the corresponding predicate/relationship and subject/head entity embedding vectors of that fact through the structure of the KG (i.e., given a predicate/relationship and subject/head entity in the KG, there corresponds one or more object/tail entities), he does not explicitly disclose a calculation using a relation function in relationship to the joint distance metric. Although Bordes also uses KG/subgraph embeddings that characterize relationships between the tail entity and the head-entity/relationship pair, he also does not disclose the calculation as recited.
However, Zhang, in the analogous environment of performing question answering using KG embeddings, teaches wherein in the joint distance metric, the candidate fact has a tail entity embedding vector calculated, using the relation function, from a head entity embedding vector and a predicate embedding vector of the candidate fact ([pp. 194-195, Section 5, Figure 3], Figure 3 illustrates how TransHR works in QA. First, TransHR must be trained by large-scale triples and then generates an entity-to-vector (each entity in training triples and its corresponding vector) and a relation-to-vector (each relation in training triples and its corresponding vector). For the question “where was Obama born?”, we detect the entity Obama and the relation born in the question[40] and find their corresponding vector representations from the training result of the TransHR, for which the entity-to-vector and relation-to-vector are already obtained, respectively…. The entity vector and the relation vector are taken as the inputs of TransHR and the problem of answering a question becomes a problem of link prediction. TransHR must then predict the missing t for the given (h, r). To do this, TransHR regards each entity t_i in the training triples as the missing t and calculates the score of each (h + rM_r, t_i) sequentially, then ranks the scores in ascending order to identify the top ten closest candidate answers (USA, Canada, UK, ...), which are the final results generated by TransHR for the question., wherein a relation function h+rM_r is applied to the head entity h of a candidate fact and the relation/predicate r of the candidate fact (each of which has been detected from the question) such that the corresponding tail entity is being interpreted as being estimated by that relation function so that a distance metric (h+rM_r, t_i) formed from searching the KG embedding space over candidate tail entities in the KG is computed in order to rank candidate answers (tail entities) for the question (i.e., to find t_i which minimizes that distance) and wherein it is noted that (h+rM_r − t) is also a joint distance metric since it incorporates both relation/predicate and head entity/subject predicted vectors.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Zhang to calculate the tail entity embedding vector in a candidate fact using a relation function from a head entity embedding vector and a predicate embedding vector of the candidate fact for the joint distance metric.  The modification would have been obvious because one of ordinary skill would have been motivated to perform effective question answering by exploiting the compact and generalizable representations of knowledge graph embedding and by using TransHR to effectively search that KG when there are multiple relations between entities (Zhang, [Abstract, p. 185, Section 1, p. 195, Section 6]).
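
For illustration only, the calculation of a tail entity embedding vector from a head entity embedding vector and a predicate embedding vector using a relation function of the form h + rM_r, followed by ranking of candidate tail entities in ascending order of distance, may be sketched as follows. The sketch is not the TransHR implementation of Zhang. The function names, the identity relation matrix, the Euclidean distance, and the two-dimensional embeddings are hypothetical.

```python
import math

def relation_function(head_vec, pred_vec, relation_matrix):
    """Predicted tail = h + r M_r, with r treated as a row vector."""
    projected = [
        sum(pred_vec[i] * relation_matrix[i][j] for i in range(len(pred_vec)))
        for j in range(len(relation_matrix[0]))
    ]
    return [h + p for h, p in zip(head_vec, projected)]

def rank_tails(head_vec, pred_vec, relation_matrix, tail_embeddings):
    """Rank candidate tails by ascending distance to the predicted tail."""
    predicted_tail = relation_function(head_vec, pred_vec, relation_matrix)

    def distance(name):
        tail = tail_embeddings[name]
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted_tail, tail)))

    return sorted(tail_embeddings, key=distance)

# Toy embeddings (hypothetical values, not trained vectors).
h = [1.0, 0.0]                      # head entity, e.g. "Obama"
r = [0.0, 1.0]                      # predicate, e.g. "born"
M_r = [[1.0, 0.0], [0.0, 1.0]]      # identity matrix assumed for simplicity
tails = {"USA": [1.0, 1.0], "Canada": [1.0, 2.0], "UK": [3.0, 3.0]}

ranked = rank_tails(h, r, M_r, tails)
print(ranked)
```

With these toy values the predicted tail vector is [1.0, 1.0], so the candidate "USA" ranks first, followed by "Canada" and "UK".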

Claims 10 and 17 are  rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, and in further view of Dai.

In regards to claim 10, the rejection of claim 1 is incorporated and Komninos does not further teach  wherein searching head entity synonyms in the KG related to the one or more predicted head entity names comprising: inputting entity vector for each head entity name into the KG; and searching, in the KG, head entity synonyms with corresponding token embedding, by both embedding comparison and string match, each head entity synonym has direct or partial string match to the head entity name, or has embedding similarity to the entity vector.  Although Komninos teaches the formation of a candidate fact set (i.e., the set queries with evaluations in the form of equation 6.15) and the evaluation of a similarity between character representation of the (head) question entity and corresponding entities in the KG along with entity aliases/synonyms, he does not explicitly disclose a search over the KG to find the synonyms or aliases followed by the formation/augmentation of the candidate fact set using those synonyms.  
However, Bordes, in the analogous environment of performing question answering using KG embeddings, teaches  wherein searching head entity synonyms in the KG related to the one or more predicted head entity names comprising: inputting entity vector for each head entity name into the KG; and searching, in the KG, head entity synonyms …, … and string match, each head entity synonym has direct or partial string match to the head entity name, ….  Customer No. 1192764328888-2279 (BN181205USN2)PATENT ([p. 5, Section 3.3] Candidate generation: To generate candidate facts, we match n-grams of words of the question to aliases of Freebase entities and select a few matching entities. All facts having one of these entities as subject are scored in a second step. We first generate all possible n-grams from the question, removing those that contain an interrogative pronoun or 1-grams that belong to a list of stopwords. We only keep the n-grams which are an alias of an entity, and then discard all n-grams that are a subsequence of another n-gram, except if the longer n-gram only differs by in, of, for or the at the beginning…. Scoring is performed using an embedding model. Given two embedding matrices WV ∈ R d×NV and WS ∈ R d×NS , which respectively contain, in columns, the d-dimensional embeddings of the words/n-grams of the vocabulary and the embeddings of the Freebase entities and relationships, the similarity between question q and a Freebase candidate fact y is computed as: SQA(q, y) = cos(WV g(q),WSf(y)), with cos() the cosine similarity., wherein synonyms/aliases for (head) entities are found by searching the KG (Freebase) for aliases that are based on matching (partial or full/direct) string representation of the words (i.e., n-grams, with the exclusion of possible aliases/synonyms based on particular n-gram attributes).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes to search a KG for  head entity synonyms related to the one or more predicted head entity names by searching head entity synonyms in the KG related to the one or more predicted head entity names by  inputting the entity vector for each head entity name into the KG and searching, in the KG, head entity synonyms by string match with each head entity synonym having direct or partial string match to the head entity name. The modification would have been obvious because one of ordinary skill would have been motivated to achieve excellent and scalable question answering performance by using KG subgraph embeddings to evaluate candidate answers to a question in which the candidate answers found in the KG are augmented through aliases/synonyms of question entities found in the KG in the form of similar string representations of fact entities, including the subject entity (Bordes, [Abstract, p. 9, Section 7, Table 4]).
However, Bordes does not explicitly teach … with corresponding token embedding … both embedding comparison … or has embedding similarity to the entity vector. In other words, Bordes does not determine synonyms through a search through the embedded KG space although the search over question-answer similarity does include variations of the question representation (i.e. both a KG embedded space similarity as well as a string match).
However, Dai, in the analogous environment of performing question answering using knowledge base embeddings, teaches wherein searching head entity synonyms in the KG related to the one or more predicted head entity names comprising: inputting entity vector for each head entity name into the KG; and searching, in the KG, head entity synonyms with corresponding token embedding, by both embedding comparison and string match, each head entity synonym has direct or partial string match to the head entity name, or has embedding similarity to the entity vector ([p. 3, Section 3.2, p. 4, Section 4.2, p. 5, Section 4.3] The fundamental intuition for pruning is that the subject entity must be mentioned by some textual substring (subject mention) in the question. Thus, the candidate space can be restricted to entities whose name/alias matches an n-gram of the question, as in (Yih et al., 2014; Yih et al., 2015; Bordes et al., 2015). We refer to this straight-forward method as N-Gram pruning., For simplicity, we use two additive terms to model the joint effect <equation 11> where u(s, r, q) is the subject scoring function, u(s, r, q) = g(q)^T E(s) + αh(r, s) (12) g(q) is another semantic question embedding, E(s) is a vector representation of a subject, h(r, s) is the subject-relation score, and α is the weight parameter used to trade off the two sources…. 3), which trains the embeddings of entities and relations by enforcing E(s) + E(r) = E(o) for every observed triple (s, r, o)., Intuitively, this pruning method resembles the human behavior of first identifying the subject mention with the help of context, and then using it as the key word to search the KB…. 
Finally, the match function M(s, wˆ) is simply defined as either strict match between an alias of s and wˆ, or approximate match provided by the Freebase entity suggest API 1 ., wherein an initial set of fact candidates are found by using aliases (synonyms – string matching criteria) associated with a token in a question (including a subject entity) such that the subject s is either a string match with the mention word/token in the question or else is a match obtained in an entity embedding space (interpreted to correspond to the KG embedding space – Figure 1)  through the parameterization of pk(w|k) from which the set of predicted subjects (including subjects similar in that embedding space – i.e., synonyms in a general sense) is thereby identified and wherein the representation E(s) in equation 12 also represents similar subjects/synonyms by virtue of that term being an embedded space representation of the subject which is associated with the candidate fact by virtue of TransE functionality.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Dai to search, in the KG, head entity synonyms with corresponding token embedding, by both embedding comparison and string match, with each head entity synonym having a direct or partial string match to the head entity name, or having embedding similarity to the entity vector. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters in which the search for candidate facts/answers in the KG embedding space is focused according to synonyms/similar words of the subject as determined both by string and embedding representations (Dai, [Abstract, p. 5, Section 4.3, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).
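
For illustration only, a search for head entity synonyms by both string match and embedding comparison, as discussed for claim 10, may be sketched as follows. The sketch is not the candidate generation of Bordes or the pruning method of Dai. The KG layout, the function names, the 0.9 cosine threshold, and all data values are hypothetical. Only direct matching of an alias against a question n-gram is shown, and partial string matching is omitted.

```python
import math

def ngrams(tokens):
    """All contiguous word n-grams of the question, joined by spaces."""
    return {
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def find_head_synonyms(question_tokens, entity_vector, kg, min_cosine=0.9):
    """Return KG entity ids matched by alias string OR embedding similarity."""
    grams = ngrams(question_tokens)
    matches = []
    for entity_id, entry in kg.items():
        string_hit = any(alias.lower() in grams for alias in entry["aliases"])
        embed_hit = cosine(entity_vector, entry["embedding"]) >= min_cosine
        if string_hit or embed_hit:
            matches.append(entity_id)
    return sorted(matches)

# Toy KG (hypothetical entity ids, aliases, and embeddings).
kg = {
    "e1": {"aliases": ["Obama", "Barack Obama"], "embedding": [1.0, 0.0]},
    "e2": {"aliases": ["Barack H. Obama II"], "embedding": [0.99, 0.1]},
    "e3": {"aliases": ["Canada"], "embedding": [0.0, 1.0]},
}

question_tokens = ["where", "was", "obama", "born"]
entity_vector = [1.0, 0.0]  # predicted head entity representation (assumed)

synonyms = find_head_synonyms(question_tokens, entity_vector, kg)
print(synonyms)
```

With these toy values, entity e1 is found by string match, entity e2 is found only by embedding similarity, and entity e3 is excluded. Candidate facts would then be constructed from facts having one of the returned entities as head entity.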

In regards to claim 17, Komninos teaches A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps for question answering to be performed comprising: ([pp. 92-93, Section 6.1, Figure 6.1] Question Answering on Knowledge Bases (KBQA) is a specific QA setting requiring mapping questions expressed in natural language into queries to be executed against a Knowledge Base (KB). The questions are answered by the retrieved list of entities or in the case of complex questions by a function applied to the list, such as counting or sorting. In this work, the focus is on the simple KBQA setting using the SimpleQuestions dataset (Bordes et al., 2015), where given a knowledge base consisting of facts encoded as triples of the form (subject, relation, object), questions can be answered directly by a single fact., wherein a computer-based method performs question answering (Figure 6.1).) generating a vector in a knowledge graph (KG) predicate embedding space as a predicted predicate representation for a question comprising one or more tokens; ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. 
… <equation 6.7> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the relation/predicate representation/vector in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.7) using a BiLSTM neural network framework (Figure 6.1).) generating a vector in a KG entity embedding space as a predicted head entity representation for the question; ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. … <equation 6.6> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the subject/head entity representation/vector in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.6) using a BiLSTM neural network framework (Figure 6.1).) obtaining a predicted tail entity representation, based on a relation function based upon knowledge graph (KG) embedding, from the predicted predicate representation and the predicted head entity presentation, the predicted predicate representation, and the predicted tail entity representation forming a predicted fact; ([p. 97, Section 6.3, p. 102, Section 6.4.2, p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] Facts in Freebase are encoded as (subject, relation, object) triples, but since only the (subject, relation) part is relevant to form a query, a useful representation is to create a larger subgraph were all the objects are aggregated into (subject, relation, [object 1, object 2, ..., object n]) tuples., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein a candidate/predicted fact is formed from the predicted subject/head entity, predicted relation/predicate, and a corresponding associated predicted object/tail entity (in which the relationship between the tail entity and the predicate/head entity pair is determined by the corresponding KG embeddings) such that this candidate/predicted fact is a candidate query for the KG that determines the answer (object) associated with the question.) identifying one or more predicted head entity names for the question, each predicted head entity name comprises one or more tokens from the question ([pp. 
102-103, Section 6.4.2, Figure 6.1] The sequence of Bernoulli probabilities provided by the attention can also act as a gate to n-gram embeddings at the corresponding word positions and is used as a pooling operation to form a weighted average of n-gram embeddings of the entity mention: <equation 6.11> This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings:…, wherein the entity mention (interpreted as the subject/head entity) in the question is identified as a weighted average over the n-gram embeddings such that this identified head entity is interpreted as corresponding to tokens from the question since it is a character based representation of the entity such that the HED model is indicated in equation 6.11.) …; ([p. 103, Section 6.4.2, Figure 6.1] This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings: … The similarity between the question entity mention and the name of a KB entity is then given by:… This is the maximum cosine similarity between the question entity mentions and any of the known aliases of the entity in the KB., wherein an evaluation of the similarity (string) between a character representation of the (head) question entity includes an evaluation over the aliases (interpreted as synonyms) associated with a kb embedding of name descriptions of entities in the KB/KG.)  
identifying one or more predicted head entity names for the question, each predicted head entity name comprises one or more tokens from the question… constructing a candidate fact set comprising one or more candidate facts, each candidate fact comprises a head entity among the head entity synonyms; and choosing one candidate fact in the candidate fact set with a minimum joint distance to the predicted fact based on a joint distance metric as an answer to the question.  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity.)
However, Komninos does not explicitly teach searching, in the KG, head entity synonyms to the one or more predicted head entity names by both embedding comparison and string match. Although Komninos teaches the formation of a candidate fact set (i.e., the set of queries with evaluations in the form of equation 6.15) and the evaluation of a similarity between a character representation of the (head) question entity and corresponding entities in the KG along with entity aliases/synonyms, he does not explicitly disclose a search over the KG to find the synonyms or aliases followed by the formation/augmentation of the candidate fact set using those synonyms.
However, Bordes, in the analogous environment of performing question answering using KG embeddings, teaches identifying, … one or more … head entity names for the question, each … head entity name comprises one or more tokens from the question; searching, in the KG, head entity synonyms to the one or more predicted head entity names by … and string match; constructing a candidate fact set comprising one or more candidate facts, each candidate fact comprises a head entity among the head entity synonyms; and choosing one candidate fact in the candidate fact set with a minimum joint distance. ([p. 5, Section 3.3] Candidate generation: To generate candidate facts, we match n-grams of words of the question to aliases of Freebase entities and select a few matching entities. All facts having one of these entities as subject are scored in a second step. We first generate all possible n-grams from the question, removing those that contain an interrogative pronoun or 1-grams that belong to a list of stopwords. We only keep the n-grams which are an alias of an entity, and then discard all n-grams that are a subsequence of another n-gram, except if the longer n-gram only differs by in, of, for or the at the beginning…. Scoring is performed using an embedding model. 
Given two embedding matrices $W_V \in \mathbb{R}^{d \times N_V}$ and $W_S \in \mathbb{R}^{d \times N_S}$, which respectively contain, in columns, the d-dimensional embeddings of the words/n-grams of the vocabulary and the embeddings of the Freebase entities and relationships, the similarity between question q and a Freebase candidate fact y is computed as: $S_{QA}(q, y) = \cos(W_V\, g(q), W_S\, f(y))$, with cos() the cosine similarity., wherein, given a (small) set of candidate facts/queries, aliases are generated from the entities of the question (interpreted as including the head entity) such that these aliases are synonyms with the respective entity since they can be used semantically and syntactically interchangeably with the respective entity and such that these aliases are found through a search of the KG Freebase entities according to character similarity/match and wherein the candidate facts are evaluated using the (embedded) KG (subgraph embedding) to find the best answer to the question such that this evaluation is performed jointly using a joint distance metric in the form of the cosine similarity between the question and fact embeddings with the minimum joint distance interpreted as corresponding to the maximum cosine similarity (in other words, like Komninos, Bordes teaches the selection of a candidate answer based on a joint distance metric over a candidate fact set).)
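For illustration only, the candidate generation and scoring steps quoted above from Bordes can be sketched as follows. This is a minimal Python sketch under stated assumptions and is not the reference's implementation: the function names, the toy alias table, and the precomputed embedding vectors are hypothetical, and the exception for longer n-grams beginning with in, of, for, or the is omitted for brevity.

```python
import math

def ngrams(tokens, max_n=3):
    """All contiguous word n-grams of up to max_n words."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_entities(question, alias_to_entity, stopwords=frozenset()):
    """String match of question n-grams against entity aliases (hypothetical helper)."""
    tokens = question.lower().rstrip("?").split()
    matched = [g for g in ngrams(tokens)
               if g in alias_to_entity and g not in stopwords]
    # Discard an n-gram that is contained in a longer matched n-gram.
    kept = [g for g in matched
            if not any(g != o and f" {g} " in f" {o} " for o in matched)]
    return {alias_to_entity[g] for g in kept}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_fact(question_vec, fact_vecs):
    """Candidate fact whose (assumed precomputed) embedding is most similar to the question."""
    return max(fact_vecs, key=lambda f: cosine(question_vec, fact_vecs[f]))
```

Under this sketch, selecting the maximum cosine similarity is equivalent to selecting the minimum cosine distance (one minus the similarity), which is the sense in which the similarity is read above as a distance metric.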
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Bordes to search a KG for head entity synonyms according to string matching related to the one or more predicted head entity names and to construct a candidate fact set comprising one or more candidate facts, with each candidate fact comprising a head entity from among the head entity synonyms and with the resulting candidate fact set evaluated using a joint distance metric to compute a predicted fact as an answer to the question. The modification would have been obvious because one of ordinary skill would have been motivated to achieve excellent and scalable question answering performance by using KG subgraph embeddings to evaluate candidate answers to a question in which the candidate answers found in the KG are augmented through aliases/synonyms of question entities found in the KG (Bordes, [Abstract, p. 9, Section 7, Table 4]).
However, Bordes does not explicitly teach … by both embedding comparison …. In other words, Bordes does not determine synonyms through a search through the embedded KG space although the search over question-answer similarity does include variations of the question representation (i.e. both a KG embedded space similarity as well as a string match).
However, Dai, in the analogous environment of performing question answering using knowledge base embeddings, teaches searching, in the KG, head entity synonyms to the one or more predicted head entity names by both embedding comparison and string match. ([p. 3, Section 3.2, p. 4, Section 4.2, p. 5, Section 4.3] The fundamental intuition for pruning is that the subject entity must be mentioned by some textual substring (subject mention) in the question. Thus, the candidate space can be restricted to entities whose name/alias matches an n-gram of the question, as in (Yih et al., 2014; Yih et al., 2015; Bordes et al., 2015). We refer to this straight-forward method as N-Gram pruning., For simplicity, we use two additive terms to model the joint effect <equation 11> where u(s, r, q) is the subject scoring function, $u(s, r, q) = g(q)^{\top} E(s) + \alpha\, h(r, s)$ (12), g(q) is another semantic question embedding, E(s) is a vector representation of a subject, h(r, s) is the subject-relation score, and α is the weight parameter used to trade off the two sources…. 3), which trains the embeddings of entities and relations by enforcing E(s) + E(r) = E(o) for every observed triple (s, r, o)., Intuitively, this pruning method resembles the human behavior of first identifying the subject mention with the help of context, and then using it as the key word to search the KB…. 
Finally, the match function $M(s, \hat{w})$ is simply defined as either strict match between an alias of s and $\hat{w}$, or approximate match provided by the Freebase entity suggest API., wherein an initial set of fact candidates are found by using aliases (synonyms – string matching criteria) associated with a token in a question (including a subject entity) such that the subject s is either a string match with the mention word/token in the question or else is a match obtained in an entity embedding space (interpreted to correspond to the KG embedding space – Figure 1) through the parameterization of pk(w|k) from which the set of predicted subjects (including subjects similar in that embedding space – i.e., synonyms in a general sense) is thereby identified and wherein the representation E(s) in equation 12 also represents similar subjects/synonyms by virtue of that term being an embedded space representation of the subject which is associated with the candidate fact by virtue of TransE functionality.)
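For illustration only, the two kinds of matching discussed above (a strict alias string match and a comparison in a TransE-style embedding space in which E(s) + E(r) approximates E(o)) can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names and the toy vectors are hypothetical, the L2 distance is an assumed choice, and the approximate match through the Freebase entity suggest API is not modeled.

```python
import math

def strict_match(aliases, mention):
    """String match: True when the mention equals one of the entity's aliases."""
    return mention.lower() in {a.lower() for a in aliases}

def predict_tail(e_s, e_r):
    """TransE-style relation function: E(o) is approximated by E(s) + E(r)."""
    return [a + b for a, b in zip(e_s, e_r)]

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_entity(query_vec, entity_vecs):
    """Embedding comparison: the entity whose vector lies closest to query_vec."""
    return min(entity_vecs, key=lambda e: l2_distance(query_vec, entity_vecs[e]))
```

A candidate set built under this sketch could take the union of entities passing strict_match and entities returned by nearest_entity; that combination is an assumption made for the example rather than a procedure quoted from the reference.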
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Bordes to incorporate the teachings of Dai to search, in the KG, head entity synonyms to the predicted head entity names by both embedding comparison and string match. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters in which the search for candidate facts/answers in the KG embedding space is focused according to synonyms/similar words of the subject as determined both by string and embedding representations (Dai, [Abstract, p. 5, Section 4.3, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Bordes, in view of Dai, and in further view of Zhang et al. (“Knowledge Graph Embedding for Hyper-Relational Data”, Tsinghua Science and Technology, Vol. 22, No. 2, April 2017, pp. 185-197), hereinafter referred to as Zhang, and Wang et al. (“Knowledge Graph Embedding by Translating on Hyperplanes”, Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1112-1119), hereinafter referred to as Wang.

In regards to claim 11, the rejection of claim 10 is incorporated and Komninos, Bordes, and Dai do not further teach wherein for head entity name comprising multiple tokens, the entity vector is combined from a dot product of entity vectors of each token. Neither Komninos nor Bordes discusses a KG representation of multiple head entities associated with a sentence.
However, Wang, in the analogous environment of performing link prediction using KG embeddings, teaches wherein for head entity name comprising multiple tokens, the entity vector is combined from a dot product of entity vectors of each token. ([p. 1113, Embedding by Translating on Hyperplanes, p. 1114, Section Translating on Hyperplanes, p. 1115, Link Prediction] If $\forall i \in \{0, \ldots, m\}$, $(h_i, r, t) \in \Delta$, i.e., r is a many-to-one map, then $h_0 = \ldots = h_m$. Similarly, if $\forall i$, $(h, r, t_i) \in \Delta$, i.e., r is a one-to-many map, then $t_0 = \ldots = t_m$., To overcome the problems of TransE in modeling reflexive/one-to-many/many-to-one/many-to-many relations, we propose a model which enables an entity to have distributed representations when involved in different relations. As illustrated in Figure 1, for a relation r, we position the relation-specific translation vector $d_r$ in the relation-specific hyperplane $w_r$ (the normal vector) rather than in the same space of entity embeddings. Specifically, for a triplet (h, r, t), the embedding h and t are first projected to the hyperplane $w_r$. The projections are denoted as $h_\perp$ and $t_\perp$, respectively…. Then the score function is $f_r(h, t) = \lVert (h - w_r^{\top} h\, w_r) + d_r - (t - w_r^{\top} t\, w_r) \rVert_2^2$., This task is to complete a triplet (h, r, t) with h or t missing, i.e., predict t given (h, r) or predict h given (r, t). 
Rather than requiring one best answer, this task emphasizes more on ranking a set of candidate entities from the knowledge graph., wherein, in the event that multiple head entities (in a sentence/question) are mapped to an entity (many-to-one map with multiple $h_i$'s), the representation of the head entity vector in the KG space is determined through projections onto relation-specific hyperplanes (interpreted as a dot product of entity vectors) such that the distribution of the head entity vector over the set of subject tokens is represented by the vectors over that hyperplane and wherein it is noted that this projection/dot product is used to compute a score to determine/predict the tail entity/answer.)
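For illustration only, the hyperplane projection and score function quoted above from Wang can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names and toy vectors are hypothetical, and the normal vector w_r is assumed to have unit length, as the projection formula requires.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(v, w_r):
    """Project v onto the hyperplane with unit normal w_r: v - (w_r . v) w_r."""
    s = dot(w_r, v)
    return [vi - s * wi for vi, wi in zip(v, w_r)]

def transh_score(h, t, w_r, d_r):
    """Squared L2 norm of (h_perp + d_r - t_perp); lower means a better fit."""
    h_perp = project(h, w_r)
    t_perp = project(t, w_r)
    return sum((a + d - b) ** 2 for a, d, b in zip(h_perp, d_r, t_perp))
```

In this sketch, two head vectors that differ only along w_r have the same projection, which is how a single relation can relate several distinct heads to one tail.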
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos, Bordes, and Dai to incorporate the teachings of Wang so that the entity vector is combined from a dot product of entity vectors of each token when the head entity name comprises multiple tokens. The modification would have been obvious because one of ordinary skill would have been motivated to improve link prediction using KG embeddings in situations in which the triplet components involve one-to-many, many-to-one, or many-to-many relations (Wang, [Abstract, p. 1118, Conclusion, Table 3, Figure 2]).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Dai.

In regards to claim 12, Komninos teaches  A computer-implemented method for question answering using one or more processors that cause steps to be performed comprising: ([pp. 92-93, Section 6.1, Figure 6.1] Question Answering on Knowledge Bases (KBQA) is a specific QA setting requiring mapping questions expressed in natural language into queries to be executed against a Knowledge Base (KB). The questions are answered by the retrieved list of entities or in the case of complex questions by a function applied to the list, such as counting or sorting. In this work, the focus is on the simple KBQA setting using the SimpleQuestions dataset (Bordes et al., 2015), where given a knowledge base consisting of facts encoded as triples of the form (subject, relation, object), questions can be answered directly by a single fact., wherein a computer processor-based method performs question answering (Figure 6.1).)  generating, using a predicate learning model stored in one or more memories of one or more computing devices, a predicted predicate representation for a question comprising one or more tokens in a predicate embedding space, … ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. 
… <equation 6.7> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the relation/predicate representation in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.7) using a BiLSTM neural network framework (Figure 6.1).) generating, using a head entity learning model stored in one or more memories of one or more computing devices, a predicted head entity representation for the question in an entity embedding space, … ([p. 98, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The model learns to decompose the question into an entity and relation representation to be compared with the corresponding parts of the query graph., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5> … The entity and relation mention representations are given by a weighted sum of elements of h followed by a dense layer with a tanh activation. … <equation 6.6> … The final non-linear transformation aims to project the entity and relation representation in a new common space with the corresponding KB entities and relations., wherein the subject/head entity representation in a KB sub-graph (knowledge graph) embedding space for a question is determined/predicted (equation 6.6) using a BiLSTM neural network framework (Figure 6.1).)  
identifying, using a relation function based upon knowledge graph (KG) embedding, a predicted tail entity presentation from the predicted predicate representation and the predicted head entity presentation, the predicted head entity representation, the predicted predicate representation, and the predicted tail entity representation forming a predicted fact; ([p. 97, Section 6.3, p. 102, Section 6.4.2, p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] Facts in Freebase are encoded as (subject, relation, object) triples, but since only the (subject, relation) part is relevant to form a query, a useful representation is to create a larger subgraph were all the objects are aggregated into (subject, relation, [object 1, object 2, ..., object n]) tuples., Each possible query is represented by a subgraph of the form: (subject, relation, [object 1, object 2, ..., object n]). The subgraph is encoded into a subject representation and into a relation representation that also includes the information of the answer as the objects entities., The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein a candidate/predicted fact is formed from the predicted subject/head entity, predicted relation/predicate, and a corresponding associated predicted object/tail entity (in which the relationship between the tail entity and the predicate/head entity pair is determined by the corresponding KG embeddings) such that this candidate/predicted fact is a candidate query for the KG that determines the answer (object) associated with the question.) and selecting a fact from among at least a subset of facts in the KG, based on a joint distance metric, as answer to the question, the selected fact having a minimum joint distance between it and the predicted fact according to the joint distance metric. ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity.)
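For illustration only, the selection of a fact with a minimum joint distance to a predicted fact, as recited in the limitation above, can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names, the dictionary layout of a fact, the unweighted sum, and the use of the L2 distance are all hypothetical choices made for the example and are not taken from the claims or from Komninos.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def joint_distance(predicted, fact):
    """Sum of the distances over the head, predicate, and tail representations."""
    return sum(l2_distance(predicted[part], fact[part])
               for part in ("head", "predicate", "tail"))

def select_fact(predicted, candidates):
    """Name of the candidate fact with the smallest joint distance to the predicted fact."""
    return min(candidates, key=lambda name: joint_distance(predicted, candidates[name]))
```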
However, Komninos does not explicitly teach the predicate learning model being pre-trained using training data with ground truth facts and a predicate objective function;… head entity learning model being pre- trained using training data with ground truth facts and a head entity objective function. Although Komninos teaches the pretraining of particular components of the framework ([pp. 103-104, Section 6.5]) he does not disclose the pretraining specifically for the predicate learning model and the head entity learning model components and although these particular components are understood to have been trained prior to their application, Komninos does not explicitly disclose the terms of the training or their associated loss functions. 
However, Dai, in the analogous environment of performing question answering using knowledge base embeddings, teaches the predicate learning model being pre-trained using training data with ground truth facts and a predicate objective function;… head entity learning model being pre- trained using training data with ground truth facts and a head entity objective function ([p. 1, Section 1, p. 4, Section 4.2, pp. 5-6, Section 5.1] To find the answer to a single-fact question, it suffices to identify the subject entity and relation (implicitly) mentioned by the question, and then forms a corresponding structured query. The problem can be formulated into a probabilistic form. Given a single-fact question q, finding the subjectrelation pair s, ˆ rˆ from the KB K which maximizes the conditional probability p(s, r|q), i.e. <equation 1> ., Relation network In this work, the probability of relations given a question, p(r|q), is modeled by the following network <equation 9>… As introduced in section 3, the factor p(s|q, r) models the fitness of a subject s appearing in the question q, given the main topic is about the relation… . For simplicity, we use two additive terms to model the joint effect <equation 11>  ., To estimate the parameters of pθr (r|q) and pθs (s|r, q), MLE can be utilized to maximize the empirical (log-)likelihood of subject-relation pairs .. <equations 15 and 16>. wherein (BiLSTM) model parameters for the predicate/relation prediction model (equation 9) and the subject/head entity prediction/learning model (equation 11) are obtained through training with a training set according to a log-likelihood loss objective function (equation 16) and wherein this training is interpreted as being pretraining in the sense of having occurred prior to the application of the network in an evaluation mode where it is noted that various components of those models (embeddings) are also, in a more narrow sense, pretrained.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Dai to pre-train the predicate learning model and the head entity learning model using training data with ground truth facts and respective predicate and head entity objective functions. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters (Dai, [Abstract, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Dai, and in further view of Zheng.
In regards to claim 13, the rejection of claim 12 is incorporated and Komninos further teaches  wherein the at least a subset is a candidate fact set comprising one or more candidate facts chosen from the one or more facts in the KG, each candidate fact comprises a head entity … one or more predicted head entity names … by a head entity detection (HED) model comprising at least a bidirectional recurrent neural network layer and a fully connected layer.  ([p. 101, Section 6.4, pp. 102-103, Section 6.4.2, Figure 6.1] The question is encoded into two latent representations, one for the entity mention and one for the relation mention. The input is a sequence of vectors consisting of the word embedding, the n-gram vector of each question word and the probability computed by the attention network for this word: <equation 6.5>, equations 6.6, 6.7> … The intuition behind the above equations is that the output of the attention network indicates the probability p of a token being part of the entity mention, and with 1 − p the probability of being in the relation mention. These probabilities can be used to as a gate to get an entity and relation representation of the question. The token representations being averaged are the states of the biLSTM network, making them carry information about the order of the tokens in the sentence…. 
The sequence of Bernoulli probabilities provided by the attention can also act as a gate to n-gram embeddings at the corresponding word positions and is used as a pooling operation to form a weighted average of n-gram embeddings of the entity mention: <equation 6.11> This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings: wherein each word in the question is mapped into/represented as word embeddings (e_wi in equation 6.5) that is used to form a question representation input into the BiLSTM layers and wherein the system determines 2 probabilities, with a first probability p^att_i corresponding to the probability of a word/token belonging to an entity (subject/head entity) and a second probability 1-p^att_i corresponding to the probability of the word/token corresponding to a relation (i.e., non-entity) with the association of the question word/token to the head entity (subject) determined by that probability according to equation 6.6 and where equation 6.11, it is noted also is interpreted as an entity token/word identification/selection.) 
However, Komninos does not explicitly teach as a synonym for … identified. Although Komninos teaches the formation of a candidate fact set (i.e., the set of queries with evaluations in the form of equation 6.15) and the evaluation of a similarity between a character representation of the (head) question entity and corresponding entities in the KG along with entity aliases/synonyms, he does not explicitly disclose a determination of synonyms (such as may be found in a KG) that is then used in the formation/augmentation of the candidate fact set. Komninos does not disclose an explicit identification of a word token using the hidden states of the BiLSTM.
However, Dai, in the analogous environment of performing question answering using knowledge base embeddings, teaches wherein the at least a subset is a candidate fact set comprising one or more candidate facts chosen from the one or more facts in the KG, each candidate fact comprises a head entity as a synonym for one or more predicted head entity names …. ([p. 3, Section 3.2, p. 4, Section 4.2, p. 5, Section 4.3] The fundamental intuition for pruning is that the subject entity must be mentioned by some textual substring (subject mention) in the question. Thus, the candidate space can be restricted to entities whose name/alias matches an n-gram of the question, as in (Yih et al., 2014; Yih et al., 2015; Bordes et al., 2015). We refer to this straight-forward method as N-Gram pruning., For simplicity, we use two additive terms to model the joint effect <equation 11> where u(s, r, q) is the subject scoring function, $u(s, r, q) = g(q)^{\top} E(s) + \alpha\, h(r, s)$ (12), g(q) is another semantic question embedding, E(s) is a vector representation of a subject, h(r, s) is the subject-relation score, and α is the weight parameter used to trade off the two sources…. 3), which trains the embeddings of entities and relations by enforcing E(s) + E(r) = E(o) for every observed triple (s, r, o)., Intuitively, this pruning method resembles the human behavior of first identifying the subject mention with the help of context, and then using it as the key word to search the KB…. 
Finally, the match function $M(s, \hat{w})$ is simply defined as either strict match between an alias of s and $\hat{w}$, or approximate match provided by the Freebase entity suggest API., wherein an initial set of fact candidates are found by using aliases (synonyms – string matching criteria) associated with a token in a question (including a subject entity) such that the subject s is either a string match with the mention word/token in the question or else is a match obtained in an entity embedding space (interpreted to correspond to the KG embedding space – Figure 1) through the parameterization of pk(w|k) from which the set of predicted subjects (including subjects similar in that embedding space – i.e., synonyms in a general sense) is thereby identified and wherein the representation E(s) in equation 12 also represents similar subjects/synonyms by virtue of that term being an embedded space representation of the subject which is associated with the candidate fact by virtue of TransE functionality.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Dai so that the at least a subset is a candidate fact set comprising one or more candidate facts chosen from the one or more facts in the KG, with each candidate fact comprising a head entity as a synonym for one or more predicted head entity names. The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved relation and entity detection through a joint entity and dependency bi-LSTM model with shared parameters in which the search for candidate facts/answers in the KG embedding space is focused according to synonyms/similar words of the subject as determined both by string and embedding representations (Dai, [Abstract, p. 5, Section 4.3, pp. 6-8, Section 4.3, p. 9, Section 5, Table 1]).
However, Komninos and Dai do not explicitly teach identified … Dai does not teach entity/non-entity detection. Although Dai teaches detection/identification of a subject/head entity using a BiLSTM, he does not teach this identification using a fully connected layer. 
However, Zheng, in the analogous environment of performing joint relation and entity extraction from word sequences using BiLSTMs, teaches … one or more predicted head entity names identified by a head entity detection (HED) model comprising at least a bidirectional recurrent neural network layer and a fully connected layer. ([p. 3, Section 3.1, pp. 4-5, Section 3.3] Each word is assigned a label that contributes to extract the results. Tag “O” represents the “Other” tag, which means that the corresponding word is independent of the extracted results. In addition to “O”, the other tags consist of three parts: the word position in the entity, the relation type, and the relation role…. For example, the word of “United” is the first word of entity “United States” and is related to the relation “Country-President”, so its tag is “B-CP-1”. The other entity “Trump”, which is corresponding to “United States”, is labeled as “S-CP-2”. Besides, the other words irrelevant to the final result are labeled as “O”., For each word $w_t$, the forward LSTM layer will encode $w_t$ by considering the contextual information from word $w_1$ to $w_t$, which is marked as $\overrightarrow{h_t}$. In the similar way, the backward LSTM layer will encode $w_t$ based on the contextual information from $w_n$ to $w_t$, which is marked as $\overleftarrow{h_t}$. Finally, we concatenate $\overleftarrow{h_t}$ and $\overrightarrow{h_t}$ to represent word t’s encoding information, denoted as $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$…. 
When detecting the tag of word wt, the inputs of the decoding layer are: ht obtained from Bi-LSTM encoding layer, former predicted tag embedding T_t-1, former cell value c_t-1, and the former hidden vector in decoding layer h_t-1… The final softmax layer computes normalized entity tag probabilities based on the tag predicted vector Tt : <equations 14 and 15>, wherein words in a sentence (question) are processed through a BiLSTM to form a contextual encoding of the sentence (question) in the form of concatenated forward and backward hidden state vectors (for each word/token) such that each concatenated hidden state vector is transformed to a tag predicted (target vector using W – the fully connected layer) y_t (equation 14) and processed through the softmax layer/function to determine a corresponding tag probability for that word such that the probability corresponds to whether the word is a head entity or not the head entity with the word then subsequently selected to be the head entity (in a triplet fact representation) based on that probability and wherein, it is noted that the Bi-LSTM model for detecting and labeling/tagging the entities in the sentence forms a HED model.)  
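For illustration only, the mapping from a concatenated hidden state to per-token tag probabilities discussed above can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name, the weight matrix W, the bias b, and the two-tag output (entity versus non-entity) are hypothetical, and the LSTM-based decoding layer described in Zheng is replaced here by a single fully connected layer followed by a softmax.

```python
import math

def tag_probabilities(h_forward, h_backward, W, b):
    """Concatenate the two hidden states, apply a dense layer, then a softmax."""
    h = list(h_forward) + list(h_backward)           # concatenated hidden state h_t
    logits = [sum(w * x for w, x in zip(row, h)) + bias
              for row, bias in zip(W, b)]            # one logit per tag
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this sketch a token would be labeled as part of the head entity name when the first probability exceeds the second; the decision rule is likewise an assumption made for the example.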
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Dai to incorporate the teachings of Zheng for the one or more predicted head entity names to be identified by a head entity detection (HED) model comprising at least a bidirectional recurrent neural network layer and a fully connected layer.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved entity extraction and tagging through joint entity and relation extraction in an end-to-end model using a bi-LSTM framework with a softmax layer to perform the entity tagging (Zheng, [Abstract, p. 5, Section 3.3, p. 6, Section 4.2, Table 1]).

In regards to claim 14, the rejection of claim 13 is incorporated and Komninos further teaches  wherein the one or more predicted head entity names are identified by the HED model by steps comprising: generating, using the bidirectional recurrent neural network layer, a forward hidden state sequence and a backward hidden state sequence from a sequence of word embedding vectors of the one or more tokens in the question; concatenating the forward and backward hidden state vectors into a concatenated hidden state vector; ([p. 101, Section 6.4, Figure 6.1]A bidirectional LSTM is used to transform the sequence of input vectors x to a sequence of contextualized vectors h: …, wherein the BiLSTM processes the question word embedding representation to form a sequence of contextualized vectors consisting of the concatenation of forward and backward hidden states of the BiLSTM) applying at least the fully connected layer to the concatenated hidden state vector to obtain a target vector for each token, each target vector has two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name; ([pp. 102-103, Section 6.4.2, Figure 6.1] <equations 6.6, 6.7> … The intuition behind the above equations is that the output of the attention network indicates the probability p of a token being part of the entity mention, and with 1 − p the probability of being in the relation mention. These probabilities can be used to as a gate to get an entity and relation representation of the question. The token representations being averaged are the states of the biLSTM network, making them carry information about the order of the tokens in the sentence…. 
The sequence of Bernoulli probabilities provided by the attention can also act as a gate to n-gram embeddings at the corresponding word positions and is used as a pooling operation to form a weighted average of n-gram embeddings of the entity mention: <equation 6.11> This character based representation of the entity is compared with the name descriptions of the entities, which are also encoded in the same way as averaged n-gram embeddings: wherein the system determines 2 probabilities, with a first probability p^att_i corresponding to the probability of a word/token belonging to an entity (subject/head entity) and a second probability 1-p^att_i corresponding to the probability of the word/token corresponding to a relation (i.e., non-entity) with the association of the question word/token to the head entity (subject) determined by that probability according to equation 6.6 and where equation 6.11, it is noted also is interpreted as an entity token/word identification/selection.) 
However, Komninos and Dai do not explicitly teach and selecting one or more tokens as the head entity name based on probability value of each token belonging to entity token name. Komninos does not disclose an explicit selection of a word token according to a target vector derived from that probability (instead that word is identified more directly from the attention probabilities without using the hidden states). Although Dai teaches the determination of a subject according to a probability (equation 11), he does not explicitly disclose that this corresponds to the head entity name in the question.
However, Zheng, in the analogous environment of performing joint relation and entity extraction from word sequences using BiLSTMs, teaches applying at least the fully connected layer to the concatenated hidden state vector to obtain a target vector for each token, each target vector has two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name; and selecting one or more tokens as the head entity name based on probability value of each token belonging to entity token name. ([p. 3, Section 3.1, pp. 4-5, Section 3.3], Each word is assigned a label that contributes to extract the results. Tag “O” represents the “Other” tag, which means that the corresponding word is independent of the extracted results. In addition to “O”, the other tags consist of three parts: the word position in the entity, the relation type, and the relation role…. For example, the word of “United” is the first word of entity “United States” and is related to the relation “Country-President”, so its tag is “B-CP-1”. The other entity “Trump”, which is corresponding to “United States”, is labeled as “S-CP-2”. Besides, the other words irrelevant to the final result are labeled as “O”., For each word w_t, the forward LSTM layer will encode w_t by considering the contextual information from word w_1 to w_t, which is marked as →h_t. In the similar way, the backward LSTM layer will encode w_t based on the contextual information from w_n to w_t, which is marked as ←h_t. Finally, we concatenate ←h_t and →h_t to represent word t’s encoding information, denoted as h_t = [→h_t, ←h_t]…. 
When detecting the tag of word w_t, the inputs of the decoding layer are: h_t obtained from Bi-LSTM encoding layer, former predicted tag embedding T_t-1, former cell value c_t-1, and the former hidden vector in decoding layer h_t-1… The final softmax layer computes normalized entity tag probabilities based on the tag predicted vector T_t : <equations 14 and 15>, wherein words in a sentence (question) are processed through a BiLSTM to form a contextual encoding of the sentence (question) in the form of concatenated forward and backward hidden state vectors (for each word/token) such that each concatenated hidden state vector is transformed to a tag predicted (target vector with W representing the fully connected layer) y_t (equation 14) and processed through the softmax layer/function to determine a corresponding tag probability for that word such that the probability corresponds to whether the word is a head entity or not the head entity with the word then subsequently selected to be the head entity (in a triplet fact representation) based on that probability and wherein, it is noted that the Bi-LSTM model for detecting and labeling/tagging the entities in the sentence forms a HED model.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos and Dai to incorporate the teachings of Zheng to apply the fully connected layer to the concatenated hidden state vector to obtain a target vector for each token, with each target vector having two probability values corresponding to probabilities that the token belongs to entity token name and non-entity token name.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved entity extraction and tagging through joint entity and relation extraction in an end-to-end model using a bi-LSTM framework with a softmax layer to perform the entity tagging (Zheng, [Abstract, p. 5, Section 3.3, p. 6, Section 4.2, Table 1]).
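For illustration only, the token-selection step recited in claim 14 (a fully connected layer applied to each concatenated hidden state, a two-valued target vector, and selection by entity probability) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Komninos, Dai, Zheng, or the application: the function names, the toy hidden states, the weight values, and the 0.5 selection threshold are all hypothetical, and the forward and backward hidden states are supplied directly rather than computed by a trained bidirectional recurrent network.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected(weights, bias, vector):
    """Apply one dense layer: one output score per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def select_head_entity_tokens(tokens, forward_states, backward_states,
                              weights, bias, threshold=0.5):
    """Return tokens whose entity probability exceeds the (assumed) threshold."""
    selected = []
    for token, fwd, bwd in zip(tokens, forward_states, backward_states):
        concatenated = fwd + bwd  # concatenate forward and backward states
        # Target vector: [P(entity token), P(non-entity token)]
        target = softmax(fully_connected(weights, bias, concatenated))
        if target[0] > threshold:
            selected.append(token)
    return selected


# Hypothetical toy inputs (one-dimensional hidden states per direction).
tokens = ["who", "founded", "acme"]
forward_states = [[-1.0], [-0.5], [2.0]]
backward_states = [[-1.0], [-1.0], [1.5]]
weights = [[1.0, 1.0], [-1.0, -1.0]]  # row 0: entity score, row 1: non-entity score
bias = [0.0, 0.0]

selected = select_head_entity_tokens(tokens, forward_states, backward_states,
                                     weights, bias)
print(selected)
```

In this sketch the two entries of each target vector always sum to one, so thresholding the first entry is equivalent to comparing the entity probability against the non-entity probability.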

Claims 15, 16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Komninos, in view of Dai, in view of Zheng, and in further view of Lukovnikov.

In regards to claim 15, the rejection of claim 13 is incorporated and Komninos further teaches wherein the joint distance metric comprises vector distance terms representing l2 norm of vector distance between a vector in the candidate fact and a corresponding vector in the predicted fact, ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a distance metric (|x-y|) between particular head-entity question-fact vectors as well as between particular relationship question-fact vectors with that distance metric interpreted to be a p-norm metric with p being commonly interpretable with this representation as being either 1 or 2.) and string similarity terms representing string similarity between name of entity in the candidate fact and the tokens classified as entity name by the HED model, and ….  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. 
We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a string similarity metric (sim_name) between particular head-entity (HED model) question mention and corresponding fact entity and wherein, as previously noted, Komninos also determines/classifies a given token in the sentence as a non-entity/predicate according to a probability.)
However, Komninos, Dai, and Zheng do not teach and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model. Komninos does not include the character string representation of the question term relationship/predicate mention in the joint distance metric even though he teaches the inclusion of the question term head entity-subject fact similarity metric. Zheng and Dai do not teach character similarity metrics as recited.
However, Lukovnikov, in the analogous environment of performing end-to-end question answering with bi-LSTM relation, teaches and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model ([p. 1213, Section 2.1.1, p. 1215, Section 2.1.4, p. 1215, Section 2.2, Figure 3] The mapping of a question q = {w_1, . . . , w_T} to its subject and predicate related vector representations r^s_q and r^p_q, respectively, is done using a single-layered unidirectional GRU based encoder network. We call this part of the model the question encoder ENC_Q <equation 7> The question encoder ENC_Q first uses the word representation function REP_W(w_t) to generate vector representations for all words w_t, t = 1, . . . , T (as described in the next paragraph), which are subsequently fed to the RNN until all words have been seen., Given the question encoding vector r_q = (r^s_q, r^p_q), the latent vector representation r_p of the relation, and the latent representation r_s of the subject entity, we compute two matching scores: one between the question and subject entity and one between the question and predicate, as follows: <equations 14a, 14b>, Using these scoring functions, we can solve the task of finding the right subject-predicate pair (s_g, p_g) (i.e., retrieving triples (s_g, p_g, o_i) ∈ G such that the set of objects in these triples constitutes the answer to question q) by picking the best scoring subject entity and predicate given a question according to Equations (1) and (2), respectively., wherein a GRU/RNN-based question encoder determines/classifies a representation of the predicate tokens in the input sentence in an embedding space (figure 3) into a subject/entity and a relation/non-entity and wherein this representation is used in a similarity computation (equations 14) that is then used to score candidate answers (i.e., the string similarity of both the head entity prediction-fact head entity and the predicted predicate-fact predicate are used to find the best answer from the KG).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos, Dai, and Zheng to incorporate the teachings of Lukovnikov for the joint distance metric to further comprise string similarity terms representing string similarity between name of entity in the candidate fact and the tokens classified as entity name by the HED model, and string similarity between name of the predicate in the candidate fact and the tokens classified as non entity name by the HED model.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved question answering performance in an end-to-end ML configuration by using both character-level and word-level information for entity and predicate prediction (Lukovnikov, [Abstract, pp. 1212-1213, Section 1, p. 1219, Section 6, Table 4]).

In regards to claim 16, the rejection of claim 15 is incorporated and Komninos further teaches wherein the joint distance metric is a weighted combination of the vector distance terms and the string similarity terms with a weight for each term in the joint distance metric.  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a combination of a distance metric (|x-y|) and a string similarity metric (sim_name) such that the factor w applied to the concatenation combination of those metrics is being interpreted as a weight or a weighting function.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Dai, Zheng, and Lukovnikov for the same reasons as pointed out for claim 15.
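For illustration only, a weighted combination of vector distance terms and string similarity terms of the kind recited in claims 15 and 16 can be sketched as follows. This is a minimal sketch under stated assumptions, not the formula of the application or of any cited reference: the dictionary keys, the weight names, and the sign convention (similarity terms subtracted so that a closer string match lowers the distance) are hypothetical, and the standard-library `difflib.SequenceMatcher` ratio stands in for whatever string similarity measure is actually used (it is not the sim_name feature of Komninos).

```python
import math
from difflib import SequenceMatcher


def l2_distance(u, v):
    """l2 norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def joint_distance(predicted, candidate, weights):
    """Weighted sum of distance terms minus weighted similarity terms (assumed form)."""
    return (
        weights["head_vec"] * l2_distance(predicted["head_vec"], candidate["head_vec"])
        + weights["pred_vec"] * l2_distance(predicted["pred_vec"], candidate["pred_vec"])
        - weights["entity_sim"] * string_similarity(candidate["head_name"],
                                                    predicted["entity_tokens"])
        - weights["pred_sim"] * string_similarity(candidate["pred_name"],
                                                  predicted["non_entity_tokens"])
    )


# Hypothetical predicted fact and two candidate facts.
predicted = {
    "head_vec": [0.2, 0.8], "pred_vec": [0.5, 0.1],
    "entity_tokens": "acme", "non_entity_tokens": "founded by",
}
exact = {
    "head_vec": [0.2, 0.8], "pred_vec": [0.5, 0.1],
    "head_name": "acme", "pred_name": "founded by",
}
distant = {
    "head_vec": [3.0, -2.0], "pred_vec": [-1.0, 4.0],
    "head_name": "zzz", "pred_name": "qqq",
}
weights = {"head_vec": 1.0, "pred_vec": 1.0, "entity_sim": 1.0, "pred_sim": 1.0}

# The candidate fact with the smallest joint distance is treated as the best match.
best = min([exact, distant], key=lambda c: joint_distance(predicted, c, weights))
print(best["head_name"])
```

Under this assumed sign convention the string similarity terms act against the vector distance terms, so a candidate fact whose names closely match the question tokens can offset a moderately larger embedding distance.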

In regards to claim 18, the rejection of claim 17 is incorporated and Komninos further teaches wherein the joint distance metric comprises vector distance terms representing l2 norm of vector distance between a vector in the candidate fact and a corresponding vector in the predicted fact, and string similarity terms representing string similarity between entity name of candidate fact and entity tokens in the question, and …  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a combination of a distance metric (|x-y| which is a norm commonly interpreted as either l-1 or l-2) and a string similarity metric (sim_name).)
However, Komninos, Dai, and Zheng do not teach string similarity between predicate name of candidate fact and non-entity tokens in the question. Komninos does not include the character string representation of the question term relationship/predicate mention in the joint distance metric even though he teaches the inclusion of the question term head entity-subject fact similarity metric. Zheng and Dai do not teach character similarity metrics as recited.
However, Lukovnikov, in the analogous environment of performing end-to-end question answering with bi-LSTM relation, teaches string similarity between predicate name of candidate fact and non-entity tokens in the question ([p. 1213, Section 2.1.1, p. 1215, Section 2.1.4, p. 1215, Section 2.2, Figure 3] The mapping of a question q = {w_1, . . . , w_T} to its subject and predicate related vector representations r^s_q and r^p_q, respectively, is done using a single-layered unidirectional GRU based encoder network. We call this part of the model the question encoder ENC_Q <equation 7> The question encoder ENC_Q first uses the word representation function REP_W(w_t) to generate vector representations for all words w_t, t = 1, . . . , T (as described in the next paragraph), which are subsequently fed to the RNN until all words have been seen., Given the question encoding vector r_q = (r^s_q, r^p_q), the latent vector representation r_p of the relation, and the latent representation r_s of the subject entity, we compute two matching scores: one between the question and subject entity and one between the question and predicate, as follows: <equations 14a, 14b>, Using these scoring functions, we can solve the task of finding the right subject-predicate pair (s_g, p_g) (i.e., retrieving triples (s_g, p_g, o_i) ∈ G such that the set of objects in these triples constitutes the answer to question q) by picking the best scoring subject entity and predicate given a question according to Equations (1) and (2), respectively., wherein a GRU/RNN-based question encoder determines/classifies a representation of the predicate tokens in the input sentence in an embedding space (figure 3) into a subject/entity and a relation/non-entity and wherein this representation is used in a similarity computation (equations 14) that is then used to score candidate answers (i.e., the string similarity of both the head entity prediction-fact head entity and the predicted predicate-fact predicate are used to find the best answer from the KG).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos, Dai, and Zheng to incorporate the teachings of Lukovnikov for the joint distance metric to further comprise string similarity terms representing string similarity between name of entity in the candidate fact and the tokens classified as entity name by the HED model, and string similarity between name of the predicate in the candidate fact and non-entity tokens in the question.  The modification would have been obvious because one of ordinary skill would have been motivated to achieve improved question answering performance in an end-to-end ML configuration by using both character-level and word-level information for entity and predicate prediction (Lukovnikov, [Abstract, pp. 1212-1213, Section 1, p. 1219, Section 6, Table 4]).

In regards to claim 19, the rejection of claim 18 is incorporated and Komninos further teaches wherein the joint distance metric is a weighted combination of the vector distance terms and the string similarity terms.  ([p. 103, Section 6.4.2, p. 103, Section 6.4.2, Figure 6.1] The final outcome of the model is the probability of a question being correctly mapped to a query. We obtain that by computing similarity and distance features between the encoded representations of corresponding parts between the question and possible queries….<equation 6.15>, wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a combination of a distance metric (|x-y| which is a norm commonly interpreted as either l-1 or l-2) and a string similarity metric (sim_name) such that the factor w applied to the concatenation combination of those metrics is being interpreted as a weight or a weighting function.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Dai, Zheng, and Lukovnikov for the same reasons as pointed out for claim 18.

In regards to claim 20, the rejection of claim 19 is incorporated and Komninos further teaches wherein in the joint distance metric, the string similarity terms counterweight the vector distance terms. (wherein an answer (interpreted as corresponding to the query with the highest probability from a set of candidate queries/facts which expresses the most likely subject – predicate – answer triplet) is identified by performing a joint similarity calculation (equation 6.14 and argument of w in equation 6.15) between the elements of a candidate/predicted fact including the predicted subject/head entity, the predicted relation/predicate, and (through the KG) a corresponding predicted object/tail entity such that this metric includes a combination of a distance metric (|x-y| which is a norm commonly interpreted as either l-1 or l-2) and a string similarity metric (sim_name) such that the factor w applied to the concatenation combination of those metrics is being interpreted as a weight or a weighting function and wherein the string similarity terms and distance terms are being interpreted as respective “counterweights” of one another in the sense that they provide different feature perspectives to the calculation of the probability of the association of a predicted tail entity in a candidate fact with the corresponding predicted head entity and relation/predicate (i.e., one is based on a distance in a KG embedding space while the other is based on similarity of the token embeddings).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komninos to incorporate the teachings of Dai, Zheng, and Lukovnikov for the same reasons as pointed out for claim 18.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Qiu et al. (“Joint Detection of Topic Entity and Relation for Simple Question Answering”, Knowledge Science, Engineering and Management, 11th International Conference, August, 2018, Part II, pp. 371-382) teach a BiLSTM framework in QA for detecting question topic entity and relation and scoring the topic entity and the relation relative to candidate facts in an embedded knowledge base.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124