DETAILED ACTION
This Office Action is in response to the remarks entered on 02/25/2021. Claims 1 and 15 were amended. No claims were added. No claims were cancelled.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.
In reference to Applicant’s arguments about: Rejections under 35 U.S.C. §103:
 	Applicant’s Argument: 
Claims 1-20 were rejected under 35 U.S.C. § 103 over Meij Edgar (U.S. 2016/0189047) in view of Phan ("NeuPL: Attention-based Semantic Matching and Pair-Linking for Entity Disambiguation") in view of Sun ("Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation").
Applicant respectfully traverses the rejection. Specifically, Applicant respectfully asserts that, contrary to what is stated in the Office Action, Phan fails to teach or suggest "evaluate a minimum entropy loss function using the classification." The Examiner argues that the binary cross-entropy loss function in Phan is equivalent to a minimum entropy loss function. Applicant respectfully disagrees. A cross-entropy loss function is different than a minimum entropy loss function. A cross-entropy loss function calculates the total entropy between two distributions. It is a measure of the difference in entropy between the two probability distributions for a given random event or set of events. The present claims describe something different, a "minimum entropy loss 
Indeed, paragraph [0050] of the present Specification describes that a minimum entropy loss function measures the entropy of resultant probability scores and indicates whether that entropy has reached a minimum. While both the cross-entropy loss function and the minimum entropy loss function are loss functions and both are related to entropy, they are measuring something completely different. 
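The distinction argued above can be sketched numerically. The following Python fragment is a minimal illustration only, assuming discrete probability distributions given as lists; the function names and the sample values are hypothetical and are not taken from the Specification, from Phan, or from any other reference of record. Cross-entropy compares a predicted distribution against a target distribution, whereas the entropy of the resultant scores depends on the predicted distribution alone.

```python
import math


def cross_entropy(target, predicted):
    """H(p, q) = -sum_i p_i * log(q_i); needs a target distribution p."""
    return -sum(p * math.log(q) for p, q in zip(target, predicted) if p > 0)


def entropy(predicted):
    """H(q) = -sum_i q_i * log(q_i); depends only on the predicted scores q."""
    return -sum(q * math.log(q) for q in predicted if q > 0)


# Hypothetical two-entity example: the correct entity is the first one.
target = [1.0, 0.0]
confident_but_wrong = [0.01, 0.99]  # low entropy, high cross-entropy
uncertain = [0.5, 0.5]              # high entropy, moderate cross-entropy

for name, scores in [("confident_but_wrong", confident_but_wrong),
                     ("uncertain", uncertain)]:
    print(name, round(entropy(scores), 3), round(cross_entropy(target, scores), 3))
```

In this sketch the confident but incorrect prediction has the lower entropy and the higher cross-entropy, which shows that the two quantities can move in opposite directions.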
Applicant also respectfully submits that, contrary to what is stated in the Office Action, Sun fails to teach or suggest "causing the attention model to produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized." The Examiner argues that section 2.3, second paragraph of Sun teaches the calculation of a similarity between context mention pairs and the selection of the closest one as a final result. The Examiner further argues that the calculation of the similarity of the context mention pair with each candidate entity produces a different result. While that might be true, none of that has anything to do with determining whether a minimum entropy loss function indicates that entropy in the classification has been minimized.
For these reasons, Applicant respectfully asserts that the independent claims are in condition for allowance. As to the dependent claims, these claims are also in condition for allowance for the same reasons as their respective parent independent claims. 
Examiner’s Response: 
Applicant’s arguments, see pages 1-2, filed 02/25/2021, with respect to the rejection(s) of claims 1 and 15 under Meij in view of Phan and further in view of Sun have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Phan and Meij, and further in view of Zagoruyko.
Examiner respectfully disagrees with Applicant’s argument because claims 1 and 15 are unpatentable over Phan in view of Meij and further in view of Zagoruyko. Meij in view of Zagoruyko teaches each and every limitation of claims 1 and 15. Furthermore, Zagoruyko discloses that the cross entropy is minimized, as can be seen at [Page 5], where L(W,x) is the cross-entropy loss that is being minimized. See further Figure 5, which shows that the attention and loss are performed iteratively after each group; the loss function is minimized after each group is performed. Therefore, the lowest loss values will be put into the backpropagation step. Furthermore, see [section 3.2], equations (3), (4) and (5): L(W,x) is minimized and backpropagation is then performed until the classification has been minimized; see further equation (5). For the reasons explained above, the rejections of claims 1 and 15 are maintained.
Examiner respectfully reminds Applicant that Phan in view of Meij and further in view of Zagoruyko teaches each and every limitation of independent claims 1 and 15. Therefore, the argument is not persuasive, and the rejections of the dependent claims are maintained.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Meij et al. (Pub. No. US 2016/0189047 - hereinafter, Meij) in view of Phan et al. (NeuPL: Attention-based Semantic Matching and Pair-Linking for Entity Disambiguation – School of Computer Science and Engineering, Nanyang Technological University, Singapore - hereinafter, Phan) and further in view of Zagoruyko et al. (PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER - Université Paris-Est, École des Ponts ParisTech, Paris, France – hereinafter, Zagoruyko).
Regarding claim 1, Phan teaches
[…]
for each of one or more terms in the text document, identify one or more entities to which the term potentially maps, wherein the text document includes at least one ambiguous term, an ambiguous term being a term that potentially maps to two different, but similarly named, entities (Phan, [section: Introduction, column: left, first paragraph], “Documents often contain mentions of named entities such as  [Section: abstract, column: left, first paragraph], “Entity disambiguation, also known as entity linking, is the task of mapping mentions in text to the corresponding entities in a given knowledge base, e.g., Wikipedia. Two key challenges are making use of mention’s context to disambiguate (i.e., local objective), and promoting coherence of all the linked entities (i.e., global objective).”);
extract one or more features from the text document (Phan, [Section: introduction, column: left-right, second paragraph], “Entity disambiguation is a critical task in bridging the unstructured text and structured knowledge bases. The result of entity disambiguation is beneficial for many tasks, including knowledge base population, information retrieval and extraction, question answering, and content analysis.” Furthermore, see Phan, [Section: 3.2, column: right, first-second paragraph], “Embedding models aim to generate a continuous representation for every word, such that two words that are close in meaning are also close in the embedding vector space. It assumes that words are similar if they co-occur often with the same words [31].”);
apply an attention model to the text document based on the extracted one or more features, resulting in an attention weight being applied to each of the one or more terms in the text document (Phan, [Page 1668, column: left, first to fourth paragraphs], “It is achieved through two Long Short-Term Memory (LSTM) networks modeling the context on the left- and right-sides of a mention. Furthermore, to assist the matching in ambiguous cases where noises present in mention’s context, we employ attention mechanism into the designed model. To the best of our knowledge, we are the first to build such a comprehensive neural network to model the semantic matching for entity disambiguation…We present a deep neural network model that measures the semantic similarity between a mention’s context and a target entity. Our DNN model is designed in a way to fully utilize the embedded information including mention’s position and word order. We also use the attention mechanism to highlight the informative parts in the local context.” Furthermore, see Phan, [Page: 1671, column: left - right, seventh paragraph], “Combination with prior probability. Prior probability P(e |m) is the likelihood of mention m with the given surface form being linked to entity e. It is approximated by the hyperlink statistic in Wikipedia [38]. Although P(e |m) completely ignores the 
     ϕ(m_i, e_i) = (1 − α) σ(m_i, e_i) + α P(e_i | m_i)                 (8)
In this equation, α is a weight factor, and σ(m_i, e_i) is the output of DNN given m_i’s context and entity e_i (see Figure 1).” Examiner’s note: the weight factor α is therefore applied to terms in the text document (input document).);
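Equation (8) as quoted above can be sketched as a simple weighted combination. The following Python fragment is illustrative only; the function name, the candidate entities, and all numeric values are hypothetical and are not taken from Phan, where σ(m_i, e_i) is the output of the DNN and P(e_i | m_i) is approximated from Wikipedia hyperlink statistics.

```python
def combined_score(sigma, prior, alpha):
    """phi(m, e) = (1 - alpha) * sigma(m, e) + alpha * P(e | m), per Equation (8)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed here to lie in [0, 1]")
    return (1.0 - alpha) * sigma + alpha * prior


# Hypothetical candidates for one mention: (semantic match score, prior probability).
candidates = {
    "Paris (city)": (0.80, 0.20),
    "Paris (mythology)": (0.30, 0.70),
}
alpha = 0.25  # hypothetical weight factor

scores = {entity: combined_score(sigma, prior, alpha)
          for entity, (sigma, prior) in candidates.items()}
best = max(scores, key=scores.get)
print(best)
```

With alpha set to 0 the score reduces to the DNN output alone, and with alpha set to 1 it reduces to the prior probability alone.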
encode the one or more terms based on the attention weights (Phan, [Fig.1, Page: 1670, column: right, first paragraph], “Shown in Figure 1, two separate LSTM networks are used to encode the mention’s left- and right-side contexts respectively. The left-side context is passed in forward direction … By doing so, we align mention m at the end of each sequence, so that LSTM is aware of its position. This is important because local context may contain more than one mention, and the model needs to focus on the right mention for correct linking...” Examiner’s note: as shown in Figure 1, each connection between nodes in the graph carries a weight (attention weight(s)); therefore, the mention’s left- and right-side contexts are encoded based on the weights.);
classify each of one or more ambiguous terms based on the encoded terms (Phan, [page: 1672, section: 5, column: left, first-second paragraphs], “Considering the set of decisions made by pair-linking, it results in an edge cover on entity graph where each assigned entity is forced to be coherent with at least another entity. Furthermore, Pair-linking resolves mentions in an iterative manner which encourages subsequent assignments to be consistent with previous ones. Pair-Linking procedure is detailed in Algorithm 1. Specifically, Pair-Linking maintains a priority queue Q and each element Q_{m_i,m_j} tracks the most confident linking pairs involving mentions m_i and m_j. Q_{m_i,m_j} is ∈ C_i, e_j ∈ C_j, and the confidence score of the pair assignment is the highest among C_i × C_j, according to Equation 9. After initialization, Pair-Linking iteratively retrieves the most confident pair assignment from Q (Line 7) and links the pair of mentions to the associated entities (Lines 8-9). Then, Pair-Linking updates Q, more precisely, Q_{m_k,m_i} and Q_{m_k,m_j} (Lines 10-13). For Q_{m_k,m_i}, the possible pairs of assignments between m_k and m_i are now conditioned by m_i ↦ e_i, and the same applies to Q_{m_k,m_j}.” Examiner’s note: pair assignment is considered the classifying),
the classification assigning a value to each different entity that each ambiguous term potentially maps to (Phan, [page: 1668, section: 2, column: right, third paragraph], “To perform collective linking, a mention-entity graph is constructed with (i) edges between mention and entity, weighted by score of local context matching, and (ii) edges between entity and entity, weighted by their relatedness. Hoffart et al. [23] cast the joint mapping into the problem of identifying dense subgraph that contains exactly one mention-entity edge for each mention.” Examiner’s note: the weighted score of a connection between entities and their relatedness is considered the value of each different entity);
[…]
However, Phan does not teach a system comprising: a memory; and a computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the system to: receive a text document; evaluate a minimum entropy loss function using the classification, the minimum entropy loss function measuring entropy of an iteration of the attention model and indicating whether that entropy is lower than entropy of any prior iterations of the attention model; and back-propagate results from the minimum entropy loss function to the attention model, causing the attention model to produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized.
On the other hand, Meij teaches a system comprising: a memory (Meij, [Par. 0101, lines 7-8], “Tangible non-transitory "storage" type media include any or all of the memory”);
and a computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the system to (Meij, [Par. 0102, lines 14-20], “The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.”):
receive a text document (Meij, [Par. 0044, lines 6-11], “The entity linking engine 300 links the received text string to entities by segmenting the text string into segments and identifying a set of entities linked to the segments in accordance with a probabilistic model based on surface form information associated with the entities”);
Phan and Meij are analogous art because they are in the same field of endeavor: classifying entities for disambiguation using machine learning techniques.

However, neither Phan nor Meij teaches evaluate a minimum entropy loss function using the classification, the minimum entropy loss function measuring entropy of an iteration of the attention model and indicating whether that entropy is lower than entropy of any prior iterations of the attention model; and back-propagate results from the minimum entropy loss function to the attention model, causing the attention model to produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized.
On the other hand, Zagoruyko teaches evaluate a minimum entropy loss function using the classification, the minimum entropy loss function measuring entropy of an iteration of the attention model and indicating whether that entropy is lower than entropy of any prior iterations of the attention model (Zagoruyko, the cross entropy is being minimized, as can be seen at [Page 5, third paragraph], “Without loss of generality, we assume that transfer losses are placed between student and teacher attention maps of same spatial resolution, but, if needed, attention maps can be interpolated to match their shapes. Let S, T and W_S, W_T denote student, teacher and their weights correspondingly, and let L(W, x) denote a standard cross entropy loss. Let also I denote the indices of all teacher-student activation layer pairs for which we want to transfer attention maps. Then we can define the following total loss:” L(W,x) is the cross-entropy loss that is being minimized. See further Figure 5, which shows that the attention and loss are performed iteratively after each group; the loss function is minimized after each group is performed. Examiner interprets minimizing the entropy loss value after each group is performed as corresponding to choosing the lowest entropy loss values after each group is performed.);
and back-propagate results from the minimum entropy loss function to the attention model, causing the attention model to produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized (Zagoruyko, [Page 6, section 3.2], “

    [Image: media_image1.png]
”
 Therefore, the lowest loss values will be put into the backpropagation step. Furthermore, see [section 3.2], equations (3), (4) and (5): L(W,x) is minimized and backpropagation is then performed until the classification has been minimized.).
Phan, Meij and Zagoruyko are analogous art because they are in the same field of endeavor: classifying entities for disambiguation using machine learning techniques.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Phan and Meij, further in view of Zagoruyko, by evaluating the entropy loss value, choosing the lowest loss value to input into the back-propagation, and having the attention model produce different results each time it is applied until the minimum entropy loss 
Regarding claim 8, it is rejected for the same reasons as claim 1.
Regarding claim 15, it is rejected for the same reasons as claim 1.
Additionally, Meij teaches a non-transitory machine-readable storage medium comprising instructions, which when implemented by one or more machines, cause the one or more machines to perform operations comprising (Meij, [Par. 0012, lines 1-5], “In one example, a non-transitory machine readable medium having information recorded thereon for entity linking is disclosed. The recorded information, when read by the machine, causes the machine to perform a series of processes. A text string is received.”):
Regarding claim 2, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the encoding is performed using a recurrent neural network (Phan, [Page: 1669, section: 3.3, first paragraph, column: right], “Long Short-Term Memory (LSTM) is a specific Recurrent Neural Network that has been widely used to model variable-length sequence [22, 25]. In ,
However, Phan and Meij do not teach wherein the back-propagating includes back-propagating the results from the minimum entropy loss function to the recurrent neural network.
On the other hand, Zagoruyko teaches wherein the back-propagating includes back-propagating the results from the minimum entropy loss function to the recurrent neural network (Zagoruyko, [Page 6, section 3.2], “
    [Image: media_image2.png]

 ” ).

Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Phan and Meij, further in view of Zagoruyko, by evaluating the entropy loss value, choosing the lowest loss value to input into the back-propagation, and having the attention model produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the performance of a CNN (Zagoruyko, [Abstract section], “In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures.”).
Regarding claim 9, it is rejected for the same reasons as claim 2.
Regarding claim 16, it is rejected for the same reasons as claim 2.
Regarding claim 3, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the classifying is performed using a feedforward neural network (Phan, [page: 1670, Fig. 1, section: 4, column: right], “in Figure 1. Shown in the figure, two separate LSTM networks are used to .
However, Phan and Meij do not teach wherein the back-propagating includes back-propagating the results from the minimum entropy loss function to the feedforward neural network.
On the other hand, Zagoruyko teaches wherein the back-propagating includes back-propagating the results from the minimum entropy loss function to the feedforward neural network (Zagoruyko, [Page 6, section 3.2], “

    [Image: media_image2.png]
” ).

Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Phan and Meij, further in view of Zagoruyko, by evaluating the entropy loss value, choosing the lowest loss value to input into the back-propagation, and having the attention model produce different results each time it is applied until the minimum entropy loss function indicates that entropy in the classification has been minimized. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the performance of a CNN (Zagoruyko, [Abstract section], “In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures.”).
Regarding claim 10, it is rejected for the same reasons as claim 3.
Regarding claim 17, it is rejected for the same reasons as claim 3.
Regarding claim 4, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the attention model produces the different results each time it is applied by varying attention weights applied to the one or more features (Phan, [Pages: 1669-1670, section: 3.3, first paragraph], 

    [Image: media_image3.png]

In above equations, i, f, o, C are input gate, forget gate, output gate, and cell memory, respectively. s is sigmoid function. W*, U* and b* are network parameters to be learned during training. Instead of taking the last hidden state h_n, we use the max-pooling result of all hidden states over time steps i.e., max (h_1, ..., h_n) as the final representation for sequence S.” Examiner’s note: W* is therefore considered the attention weight(s). The LSTM produces a different output value at each time t.).
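The max-pooling step described in the quoted passage can be sketched as follows. This Python fragment is illustrative only and assumes that max (h_1, ..., h_n) denotes an element-wise maximum over the hidden-state vectors; that reading, the function name, and the sample values are assumptions and are not confirmed by the quoted text.

```python
def max_pool_over_time(hidden_states):
    """Element-wise maximum over hidden-state vectors h_1..h_n (assumed reading)."""
    if not hidden_states:
        raise ValueError("at least one hidden state is required")
    # zip(*...) groups the values of each dimension across all time steps.
    return [max(dimension) for dimension in zip(*hidden_states)]


# Hypothetical hidden states for a three-step sequence with two dimensions each.
h = [
    [0.1, -0.3],
    [0.4, -0.5],
    [0.2, 0.0],
]
print(max_pool_over_time(h))  # [0.4, 0.0]
```

Under this reading the final representation can draw each dimension from a different time step, rather than depending on the last hidden state alone.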
Regarding claim 11, it is rejected for the same reasons as claim 4.
Regarding claim 18, it is rejected for the same reasons as claim 4.
Regarding claim 5, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the attention model produces the different results each time it is applied by varying what the one or more features are (Phan, [Pages: 1669, section: 3.1, second-fourth paragraphs], “We train Gradient Boosted Regression Trees model [13] as a candidate ranker to reduce the size 
Prior probability P(e|m). P(e|m) is the likelihood that the mention with surface form m being mapped to entity e. P(e|m) is pre-calculated based on the hyperlinks in Wikipedia.
String similarity. We use several string similarity measures including: (i) edit distance, (ii) whether mention m exactly matches entity e’s name, (iii) whether m is a prefix or suffix of the entity name, and (iv) whether m is an abbreviation of the entity name. Note that the string similarity features are calculated for the original mention as well as the boundary-corrected mention and the expanded mention described earlier.” Examiner’s note: the mapping result for each (mention, candidate) pair is therefore based on features (prefix, suffix, and abbreviation of the entity name).).
Regarding claim 12, it is rejected for the same reasons as claim 5.
Regarding claim 19, it is rejected for the same reasons as claim 5.
Regarding claim 6, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the one or more features include term frequency-inverse document frequency (TF-IDF) (Phan, [page: 1671, column: left, fifth paragraph], “where V* and v are attention parameters to be learned during training. By the formulas, h_t will be given more weight if it is more relevant to the attention vector p. In our context, the attention mechanism works like tf-idf weighting scheme that emphasizes the embedded information in local context (i.e., h_i) that are more relevant to the target entity. It helps to eliminate noise hence to improve semantic matching.”).
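The weighting behavior described in the quoted passage, in which a hidden state more relevant to the attention vector p receives more weight, can be sketched generically. The Python fragment below is not Phan's formula: the quoted text does not reproduce it, so a plain dot-product relevance score followed by a softmax is swapped in for illustration, and the learned parameters V* and v are omitted. All names and values are hypothetical.

```python
import math


def attention_weights(hidden_states, attention_vector):
    """Softmax over dot-product relevance scores (a stand-in scoring function)."""
    scores = [sum(h_k * p_k for h_k, p_k in zip(h, attention_vector))
              for h in hidden_states]
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, attention_vector):
    """Weighted sum of the hidden states using the attention weights."""
    weights = attention_weights(hidden_states, attention_vector)
    dim = len(hidden_states[0])
    return [sum(w * h[k] for w, h in zip(weights, hidden_states))
            for k in range(dim)]


# Hypothetical hidden states and attention vector.
states = [[1.0, 0.0], [0.0, 1.0]]
p = [1.0, 0.0]
weights = attention_weights(states, p)
context = attend(states, p)
print(weights, context)
```

In this sketch the first hidden state aligns with p and therefore receives the larger weight, which is the emphasis effect that the quoted passage likens to tf-idf weighting.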
Regarding claim 13, it is rejected for the same reasons as claim 6.
Regarding claim 20, it is rejected for the same reasons as claim 6.
Regarding claim 7, Phan, as modified in view of Meij and further in view of Zagoruyko, teaches the system of claim 1, wherein the one or more features include one or more features extracted from a member profile of an author of the text document (Phan, [page: 1670, section: 4, column: right, last paragraph], “Entity profile. To build entity profiles, we exploit the pre-trained embeddings of entities in Section 3.2. Because we treat entities similar to words in training, an entity embedding encodes not only semantic information but also syntactic knowledge about how the associated entity is mentioned. For example, entities about geographic location will be more likely to be placed close to prepositions ‘in’ or ‘at’ in the embedding space. We further complete entity profiles by including their descriptions. For each entity, we take the first 150 words from its Wikipedia page as its description.”).
Regarding claim 14, it is rejected for the same reasons as claim 7.


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any 
The prior art made of record on the PTO-892 and not relied upon is considered pertinent to applicant’s disclosure.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EM N TRIEU whose telephone number is (571)272-5747. The examiner can normally be reached on 7:30 - 5:00 M-TH.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access 

/E.T./ Examiner, Art Unit 2126
/BABOUCARR FAAL/Primary Examiner, Art Unit 2184