DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is a second non-final rejection on the merits. Prosecution is being reopened because a review of the application indicated that a rejection under 35 U.S.C. 101 for software per se is required. A new ground of rejection is set forth below.
Claims 1-20 are pending in this office action.

Response to Amendment
This office action is in response to applicant’s communication filed on March 8th, 2022. The applicant’s remarks and amendments to the claims were considered, with the results that follow.
In response to the last Office Action, claims 1, 7, and 15 are amended. As a result, claims 1-20 are pending in this office action.

Response to Arguments
In the argument filed on March 8th, 2022 (see pg. 8), with respect to the rejection of independent claims 1, 7, and 15 as amended under 35 U.S.C. 103, the applicant asserts that the prior art does not teach or suggest the newly amended limitations, "a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query and (b) a set of post-context pretrained word vectors for words following the query segment in the natural language query, wherein the third artificial neural network is trained using a set of pre-context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment; and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment, wherein the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system" as recited in amended independent claims 1, 7, and 15. The examiner agrees that the applied references, DeFelice and Chen, do not teach or suggest the above limitations; the argument has therefore been fully considered and is persuasive. The rejection based on Chen has been withdrawn for independent claims 1, 7, and 15. However, upon further consideration, a new ground of rejection is made in view of U.S. Patent 9,053,431 issued to Michael Lamport Commons (hereinafter “Commons”), which is shown to teach the amended limitations.

Commons teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query (Commons: Col 22, lines 17-23; The pattern recognizers may be statistically based, rule based, or the like, and extract the “object” having an unrecognized pattern from the input space of the ANN system. Advantageously, the unrecognized pattern may be presented to a knowledge base as a query, which will then return either an “identification” of the object, or information related to the object. Col 22, lines 42-45; In a more general sense, this technique permits a vast and dynamic knowledge base to be integrated into the neural network scheme, and thus avoid a need for retraining of the neural network as the environment changes. Col 23, lines 7-12; In some cases, the object is readily identified, and based on that identification, processed within the same level. For example, in a semantic network, a new word may be encountered. Reference to a knowledge base may produce a synonym, which the neural network can then process. Col 38, lines 4-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods),

(b) a set of post-context pretrained word vectors for words following the query segment in the natural language query (Commons: Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 18-23; Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence); and

a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment (Commons: Col 35, lines 28-34; Referring to FIG. 1, a hierarchical stacked neural network 10 of the present invention comprises a plurality of up to O architecturally distinct, ordered neural networks 20, 22, 24, 26, etc., of which only four (Nm, Nm+1, Nm+2, Nm(O−1)) are shown. The number of neural networks in hierarchical stacked neural network 10 is the number of consecutive stages/orders needed to complete the task assigned. Col 36, lines 20-26; The output from neural network 24 is input into neural network 26, which processes the output from neural network 24 with stage/order 5 actions. The output from neural network 26 is input into neural network 28, which processes the output from neural network 26 with stage/order 6 actions. Neural network 28 is the highest neural network in the hierarchical stack and produces output 62. Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack), wherein

the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system (Commons: Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack. Col 37, lines 20-26; In the case of unsupervised training the neural network continues to learn, adapt, and alter its actions throughout the course of its operation. It can respond to new patterns not presented during the initial training and assignment of weights. This capacity allows a network to learn from new external stimuli in a manner similar to how learning takes place in the real world. Col 37, lines 32-35; This type of training constitutes a transfer of learning from one neural network to another; the new neural network does not have to be independently trained, thereby saving time and resources. Col 38, lines 1-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods).

	As such, Commons teaches the amended limitations as discussed above. 

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-6 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter, specifically software per se.
As to Claim 1, the claim recites a “computing system” comprising a first, second, and third “artificial neural network system” and a “prediction mixing system”. These are described as software systems in paragraph [0040] of the applicant’s specification, which states, “FIG. 1 schematically depicts example enhanced tagging computing system 100. System 100 uses named entity recognition system 102 and includes first artificial neural network system 112, second artificial neural network system 118, third artificial neural network system 126, and prediction mixing system 130. Each of these systems 102, 112, 118, 126, and 130 can be implemented on one or more computer systems”. That is, these software systems are implemented on one or more computers. Nothing in the applicant’s specification explicitly identifies the “artificial neural network” systems or the “prediction mixing system” as hardware.
Software products alone are not patent eligible because they do not fall within any of the four statutory categories of patentable subject matter. Therefore, when the broadest reasonable interpretation of a claim covers software per se, the claim must be rejected under 35 U.S.C. 101 as covering non-statutory subject matter.
Claims 2-6 are rejected by virtue of their dependency and for failing to cure the deficiencies of independent claim 1.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-7, 11-15, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 2019/0236148 to Michael DeFelice (hereinafter "DeFelice") in view of U.S. Patent 9,053,431 issued to Michael Lamport Commons (hereinafter “Commons”).

	Regarding claim 1, DeFelice teaches a computing system comprising: a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions generated for the query segment by a named entity recognition system (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text....A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199. In addition, the decoder RNN 1184 takes two additional inputs), wherein the first artificial neural network system is trained using a first training data set comprising a set of baseline tag predictions (DeFelice: [0061]; Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199);

a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment based on a set of one or more pretrained word vectors obtained for the query segment (DeFelice: [0061]; the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text...inform the probabilities of each particular assertion ascertainable within the text....A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199), wherein the second artificial neural network system is trained using a second training data set comprising a set of one or more pretrained word embeddings from the set of pretrained word embeddings for a respective sample query segment (DeFelice: [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch));

	Although DeFelice teaches multiple neural networks that retrieve inputs and provide their respective functions (See DeFelice: [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199), DeFelice does not explicitly teach a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query and (b) a set of post-context pretrained word vectors for words following the query segment in the natural language query, wherein the third artificial neural network is trained using a set of pre-context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment; and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment, wherein the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system.

	However, Commons teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query (Commons: Col 22, lines 17-23; The pattern recognizers may be statistically based, rule based, or the like, and extract the “object” having an unrecognized pattern from the input space of the ANN system. Advantageously, the unrecognized pattern may be presented to a knowledge base as a query, which will then return either an “identification” of the object, or information related to the object. Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 38, lines 4-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods) and

(b) a set of post-context pretrained word vectors for words following the query segment in the natural language query (Commons: Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 18-23; Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence), wherein

the third artificial neural network is trained using a set of pre-context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment (Commons: Col 22, lines 42-45; In a more general sense, this technique permits a vast and dynamic knowledge base to be integrated into the neural network scheme, and thus avoid a need for retraining of the neural network as the environment changes. Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 15-23; neural network 114 analyzes patterns output by neural network 112 and determines logical stopping places for strings of text, such as spaces, punctuation marks, or ends of lines. Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence); and

a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment (Commons: Col 35, lines 28-34; Referring to FIG. 1, a hierarchical stacked neural network 10 of the present invention comprises a plurality of up to O architecturally distinct, ordered neural networks 20, 22, 24, 26, etc., of which only four (Nm, Nm+1, Nm+2, Nm(O−1)) are shown. The number of neural networks in hierarchical stacked neural network 10 is the number of consecutive stages/orders needed to complete the task assigned. Col 36, lines 20-26; The output from neural network 24 is input into neural network 26, which processes the output from neural network 24 with stage/order 5 actions. The output from neural network 26 is input into neural network 28, which processes the output from neural network 26 with stage/order 6 actions. Neural network 28 is the highest neural network in the hierarchical stack and produces output 62. Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack), wherein

 the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system (Commons: Col 37, lines 20-26; In the case of unsupervised training the neural network continues to learn, adapt, and alter its actions throughout the course of its operation. It can respond to new patterns not presented during the initial training and assignment of weights. This capacity allows a network to learn from new external stimuli in a manner similar to how learning takes place in the real world. Col 37, lines 32-35; This type of training constitutes a transfer of learning from one neural network to another; the new neural network does not have to be independently trained, thereby saving time and resources. Col 38, lines 1-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (which teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (which teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See Commons: Col 9, lines 51-56). In addition, the references (DeFelice and Commons) are analogous art in the same field of endeavor, as both are directed to neural networks utilized to deliver results according to recognitions.

Regarding claim 5, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed, and DeFelice further teaches the query segment comprises a word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213....named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein

the second artificial neural network system is capable of generating the second set of enhanced tag predictions for the query segment based on a pretrained word vector obtained for the word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).

	Regarding claim 6, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed, and DeFelice further teaches the query segment comprises a plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein

the second artificial neural network system is capable of generating the second set of enhanced tag predictions for the query segment based on a respective pretrained word vector obtained for each word of the plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).

	Regarding claim 7, DeFelice teaches a method comprising: a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions generated for the query segment by a named entity recognition system (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text....A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. 
[0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199. In addition, the decoder RNN 1184 takes two additional inputs), wherein the first artificial neural network system is trained using a first training data set comprising a set of baseline tag predictions (DeFelice: [0061]; Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199);

Inventor(s): Zhang, Xiaohai et al. Examiner: Ho, Andrew N. Application No.: 16/455,389 Art Unit: 2162

a second artificial neural network system generating a second set of enhanced tag predictions for the query segment based on a set of one or more pretrained word vectors obtained for the query segment (DeFelice: [0061]; the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text...inform the probabilities of each particular assertion ascertainable within the text....A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. 
These components take the embeddings 1103 and create an output that is embodied in output text 1199), wherein the second artificial neural network system is trained using a second training data set comprising a set of one or more pretrained word embeddings from the set of pretrained word embeddings for a respective sample query segment (DeFelice: [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch));

Although DeFelice teaches multiple neural networks that retrieve inputs and provide their respective functions according to the neural networks (See DeFelice: [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199), DeFelice does not explicitly teach a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query and (b) a set of post-context pretrained word vectors for words following the query segment in the natural language query, wherein the third artificial neural network is trained using a set of pre-context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment; and a prediction mixing system mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment, wherein the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system.

	However, Commons teaches a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query (Commons: Col 22, lines 17-23; The pattern recognizers may be statistically based, rule based, or the like, and extract the “object” having an unrecognized pattern from the input space of the ANN system. Advantageously, the unrecognized pattern may be presented to a knowledge base as a query, which will then return either an “identification” of the object, or information related to the object. Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 38, lines 4-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods) and (b) a set of post-context pretrained word vectors for words following the query segment in the natural language query (Commons: Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. 
Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 18-23; Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence), wherein
the third artificial neural network is trained using a set of pre-context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment (Commons: Col 22, lines 42-45; In a more general sense, this technique permits a vast and dynamic knowledge base to be integrated into the neural network scheme, and thus avoid a need for retraining of the neural network as the environment changes. Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 15-23; neural network 114 analyzes patterns output by neural network 112 and determines logical stopping places for strings of text, such as spaces, punctuation marks, or ends of lines. Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence); and 

a prediction mixing system mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment (Commons: Col 35, lines 28-34; Referring to FIG. 1, a hierarchical stacked neural network 10 of the present invention comprises a plurality of up to O architecturally distinct, ordered neural networks 20, 22, 24, 26, etc., of which only four (Nm, Nm+1, Nm+2, Nm(O−1)) are shown. The number of neural networks in hierarchical stacked neural network 10 is the number of consecutive stages/orders needed to complete the task assigned. Col 36, lines 20-26; The output from neural network 24 is input into neural network 26, which processes the output from neural network 24 with stage/order 5 actions. The output from neural network 26 is input into neural network 28, which processes the output from neural network 26 with stage/order 6 actions. Neural network 28 is the highest neural network in the hierarchical stack and produces output 62. Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack), wherein 

the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system (Commons: Col 37, lines 20-26; In the case of unsupervised training the neural network continues to learn, adapt, and alter its actions throughout the course of its operation. It can respond to new patterns not presented during the initial training and assignment of weights. This capacity allows a network to learn from new external stimuli in a manner similar to how learning takes place in the real world. Col 37, lines 32-35; This type of training constitutes a transfer of learning from one neural network to another; the new neural network does not have to be independently trained, thereby saving time and resources. Col 38, lines 1-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods).  
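For illustration only, the claimed "set of mixing weights" can be pictured as a weighted average over the three sets of tag predictions. The following is a minimal sketch under stated assumptions: the names `softmax`, `mix_predictions`, and the sample tag probabilities are hypothetical, the weights are fixed here rather than learned, and nothing in it is taken from the application, DeFelice, or Commons.

```python
import math


def softmax(logits):
    """Normalize raw mixing logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(prediction_sets, mixing_logits):
    """Combine several tag-probability dicts into one weighted average.

    In a trained system the mixing logits would be learned parameters;
    here they are supplied by hand (an assumption for illustration).
    """
    weights = softmax(mixing_logits)
    tags = set().union(*prediction_sets)
    return {
        tag: sum(w * p.get(tag, 0.0) for w, p in zip(weights, prediction_sets))
        for tag in tags
    }


# Hypothetical outputs of the first, second, and third networks.
first = {"DATE": 0.6, "NUMBER": 0.4}
second = {"DATE": 0.8, "NUMBER": 0.2}
third = {"DATE": 0.5, "NUMBER": 0.5}

# Equal logits give each network the same weight.
mixed = mix_predictions([first, second, third], [0.0, 0.0, 0.0])
```

With equal logits the mixer reduces to a plain average; raising one logit shifts the output toward that network's predictions.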

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (which teaches a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (which teaches a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need for retraining (See: Commons Col 9, lines 51-56). In addition, the references (DeFelice and Commons) are analogous art in the same field of endeavor, as both DeFelice and Commons are directed to neural networks being utilized to deliver results according to recognitions.
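For illustration only, the pre-context and post-context limitation discussed above can be pictured as splitting the query around the segment and looking up a pretrained vector for each surrounding word. This is a minimal sketch under stated assumptions: `split_context`, `lookup_vectors`, the toy `EMBEDDINGS` table, and the sample query are hypothetical and are not drawn from the application or from either reference.

```python
def split_context(query_words, start, end):
    """Split a query into (pre_context, segment, post_context).

    The segment is query_words[start:end]; everything before it is the
    pre-context and everything after it is the post-context.
    """
    return query_words[:start], query_words[start:end], query_words[end:]


def lookup_vectors(words, embeddings, dim):
    """Return one vector per word, using a zero vector for unknown words."""
    zero = [0.0] * dim
    return [embeddings.get(word, zero) for word in words]


# Toy stand-in for a pretrained embedding table (assumed values).
EMBEDDINGS = {
    "show": [0.1, 0.3],
    "sales": [0.7, 0.2],
    "for": [0.0, 0.1],
    "by": [0.2, 0.0],
}

query = "show sales for last quarter by region".split()
pre, segment, post = split_context(query, 3, 5)

# Inputs a context-based network could consume for the segment.
pre_vectors = lookup_vectors(pre, EMBEDDINGS, 2)
post_vectors = lookup_vectors(post, EMBEDDINGS, 2)
```

Here the segment is "last quarter", the pre-context is "show sales for", and the post-context is "by region".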

	Regarding claim 11, the modification of DeFelice and Commons teaches claimed invention substantially as claimed, and DeFelice further teaches
the query segment comprises a word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213....named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein

 the method further comprises the second artificial neural network system generating the second set of enhanced tag predictions for the query segment based on a pretrained word vector obtained for the word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. 
[0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).  

	Regarding claim 12, the modification of DeFelice and Commons teaches claimed invention substantially as claimed, and DeFelice further teaches
the query segment comprises a plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein the method further comprises the second artificial neural network system generating the second set of enhanced tag predictions for the query segment based on a respective pretrained word vector obtained for each word of the plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. 
In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).  

	Regarding claim 13, the modification of DeFelice and Commons teaches claimed invention substantially as claimed, and Commons further teaches jointly training the first artificial neural network, the second artificial neural network, and the third artificial neural network based on a set of training examples (Commons: Col 35, lines 28-34; Referring to FIG. 1, a hierarchical stacked neural network 10 of the present invention comprises a plurality of up to O architecturally distinct, ordered neural networks 20, 22, 24, 26, etc., of which only four (Nm, Nm+1, Nm+2, Nm(O−1)) are shown. The number of neural networks in hierarchical stacked neural network 10 is the number of consecutive stages/orders needed to complete the task assigned. Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack. Col 36, lines 20-26; The output from neural network 24 is input into neural network 26, which processes the output from neural network 24 with stage/order 5 actions. The output from neural network 26 is input into neural network 28, which processes the output from neural network 26 with stage/order 6 actions. Neural network 28 is the highest neural network in the hierarchical stack and produces output 62).  

	Regarding claim 14, the modification of DeFelice and Commons teaches claimed invention substantially as claimed, and DeFelice further teaches a particular enhanced tag prediction of the output set of enhanced tag predictions corresponds to a particular enhanced tag that is a temporal refinement of a particular baseline tag (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. 
[0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199. In addition, the decoder RNN 1184 takes two additional inputs. First, attention network 1186 takes as input the internal state of the encoder network (outputs 0 . . . n of each layer of the encoder network, corresponding to each output 1109 as shown in FIG. 11b ) and itself provides an input to decoder RNN 1184).  

	Regarding claim 15, DeFelice teaches a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions generated for the query segment by a named entity recognition system (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text....A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0090]-[0091]; First, attention network 1186 takes as input the internal state of the encoder network (outputs 0 . . . n of each layer of the encoder network, corresponding to each output 1109 as shown in FIG. 
11b ) and itself provides an input to decoder RNN 1184), wherein the first artificial neural network system is trained using a first training data set comprising a set of baseline tag predictions (DeFelice: [0061]; Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199);

Inventor(s): Zhang, Xiaohai et al. Examiner: Ho, Andrew N. Application No.: 16/455,389 Art Unit: 2162

a second artificial neural network system generating a second set of enhanced tag predictions for the query segment based on a set of one or more pretrained word vectors obtained for the query segment (DeFelice: [0061]; the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text...inform the probabilities of each particular assertion ascertainable within the text.... A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199), wherein

the second artificial neural network system is trained using a second training data set comprising a set of one or more pretrained word embeddings from the set of pretrained word embeddings for a respective sample query segment (DeFelice: [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. [0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch)); 

	DeFelice does not explicitly teach one or more non-transitory computer-readable media storing instructions which, when executed by one or more processors, causes the one or more processors to perform: a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query and (b) a set of post-context pretrained word vectors for words following the query segment in the natural language query, wherein the third artificial neural network is trained using a set of pre- context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment; and a prediction mixing system mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment, wherein the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system.

	However, Commons teaches one or more non-transitory computer-readable media storing instructions which (Commons: Col 52, lines 33-35; Such instructions may be read into main memory 406 from another machine-readable medium, such as storage device 410), when executed by one or more processors, causes the one or more processors to perform (Commons: Col 52, lines 30-33; According to one embodiment of the invention, those techniques are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406):

a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both: (a) a set of pre-context pretrained word vectors obtained for words preceding the query segment in the natural language query (Commons: Col 22, lines 17-23; The pattern recognizers may be statistically based, rule based, or the like, and extract the “object” having an unrecognized pattern from the input space of the ANN system. Advantageously, the unrecognized pattern may be presented to a knowledge base as a query, which will then return either an “identification” of the object, or information related to the object. Col 22, lines 42-45; In a more general sense, this technique permits a vast and dynamic knowledge base to be integrated into the neural network scheme, and thus avoid a need for retraining of the neural network as the environment changes. Col 23, lines 7-12; In some cases, the object is readily identified, and based on that identification, processed within the same level. For example, in a semantic network, a new word may be encountered. Reference to a knowledge base may produce a synonym, which the neural network can then process. Col 38, lines 4-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods) and

(b) a set of post-context pretrained word vectors for words following the query segment in the natural language query (Commons: Col 23, lines 4-7; The neural network at each level preferably includes logic for formulating an external search of an appropriate database or databases in dependence on the type of information and/or context, and for receiving and interpreting the response. Col 30, lines 12-17; The third neural network organizes the characters in the text of the message into meaningful strings of characters, such as words, phrases, sentences, paragraphs, etc., and either provides an output or stores an indicia representing the meaningful strings of characters. Col 41, lines 18-23; Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence), wherein 

the third artificial neural network is trained using a set of pre- context pretrained word embeddings and a set of post-context pretrained word embeddings for a respective sample query segment (Commons: Col 22, lines 42-45; In a more general sense, this technique permits a vast and dynamic knowledge base to be integrated into the neural network scheme, and thus avoid a need for retraining of the neural network as the environment changes. Col 23, lines 7-12; In some cases, the object is readily identified, and based on that identification, processed within the same level. For example, in a semantic network, a new word may be encountered. Reference to a knowledge base may produce a synonym, which the neural network can then process. Col 38, lines 4-5; Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods. Col 41, lines 15-23; neural network 114 analyzes patterns output by neural network 112 and determines logical stopping places for strings of text, such as spaces, punctuation marks, or ends of lines. Neural network...is trained by inputting patterns of words and sentences that it needs to identify. When neural network...associates a pattern with a word or a sentence, the network outputs to neural network...the pattern's classification as a word or a sentence, as well as the position in the text as a whole of the word or the sentence); and 

a prediction mixing system mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment (Commons: Col 35, lines 28-34; Referring to FIG. 1, a hierarchical stacked neural network 10 of the present invention comprises a plurality of up to O architecturally distinct, ordered neural networks 20, 22, 24, 26, etc., of which only four (Nm, Nm+1, Nm+2, Nm(O−1)) are shown. The number of neural networks in hierarchical stacked neural network 10 is the number of consecutive stages/orders needed to complete the task assigned. Col 36, lines 20-26; The output from neural network 24 is input into neural network 26, which processes the output from neural network 24 with stage/order 5 actions. The output from neural network 26 is input into neural network 28, which processes the output from neural network 26 with stage/order 6 actions. Neural network 28 is the highest neural network in the hierarchical stack and produces output 62. Col 36, lines 40-43; The actions and tasks in each successive neural network are a combination, reordering and transforming the tasks of the immediately preceding neural network in the hierarchical stack), wherein

the prediction mixing system comprises a sub-prediction mixer that includes a set of mixing weights that are learned during a training of the second artificial neural network system (Commons: Col 37, lines 20-26; In the case of unsupervised training the neural network continues to learn, adapt, and alter its actions throughout the course of its operation. It can respond to new patterns not presented during the initial training and assignment of weights. This capacity allows a network to learn from new external stimuli in a manner similar to how learning takes place in the real world. Col 37, lines 32-35; This type of training constitutes a transfer of learning from one neural network to another; the new neural network does not have to be independently trained, thereby saving time and resources. Col 38, lines 1-5; The stage/order at which a stacked neural network begins and ends and the number of neural networks in a hierarchical stack depend on the nature of the problem to be solved. Moreover, each neural network in a hierarchical stack may use different architectures, algorithms, and training methods).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (which teaches a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (which teaches a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Commons Col 9, lines 51-56). In addition, the references (DeFelice and Commons) are analogous art directed to the same field of endeavor, as both DeFelice and Commons are directed to neural networks being utilized to deliver results according to recognitions.
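
For orientation only, the pre-context and post-context limitation discussed above can be pictured with a minimal sketch. It is not taken from the application, DeFelice, or Commons. The toy embedding table, the helper names (split_context, average), and the choice to average each context and concatenate the two averages are all assumptions made purely for illustration.

```python
# Illustrative sketch only: every name and value below is hypothetical.
# Toy stand-in for a pretrained word-embedding table (2 dimensions).
EMBEDDINGS = {
    "show": [1.0, 0.0],
    "flights": [0.0, 1.0],
    "to": [1.0, 1.0],
    "tomorrow": [2.0, 2.0],
}
DIM = 2


def split_context(tokens, start, end):
    """Split a tokenized query into pre-context, segment, and post-context."""
    return tokens[:start], tokens[start:end], tokens[end:]


def average(words):
    """Average the vectors of the given words; unknown words map to zeros."""
    if not words:
        return [0.0] * DIM
    vectors = [EMBEDDINGS.get(w, [0.0] * DIM) for w in words]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


query = ["show", "flights", "to", "new", "york", "tomorrow"]
pre, segment, post = split_context(query, 3, 5)  # segment is "new york"

# Assumed feature layout: averaged pre-context vector followed by the
# averaged post-context vector. A third network would consume this input.
context_features = average(pre) + average(post)
```

In this sketch the segment's own vectors are deliberately excluded, so the features depend only on the words before and after the query segment.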

	Regarding claim 19, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed, and DeFelice further teaches
the query segment comprises a word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213....named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein

the instructions, when executed by the one or more processors, causes the one or more processors to perform the second artificial neural network system generating the second set of enhanced tag predictions for the query segment based on a pretrained word vector obtained for the word (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. 
[0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).  

	Regarding claim 20, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed, and DeFelice further teaches the query segment comprises a plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input); and wherein

the instructions, when executed by the one or more processors, causes the one or more processors to perform the second artificial neural network system generating the second set of enhanced tag predictions for the query segment based on a respective pretrained word vector obtained for each word of the plurality of words (DeFelice: [0061]; As it is expected that much of the data received from the Internet 250 will be in the form of free text, FIGS. 4a-4d show the operation of the Named Entity Recognition component 320 according to one embodiment. Information in the form of text is received from the Internet 250 and is passed (through intermediaries as necessary) to the named entity recognition component 320 according to arrow 213. In this embodiment, the named entity recognition component receives each sentence or group of sentences and uses a processor to tag the words according to the part of speech (4 a) and identify particular noun phrases (4 b) within the input. [0062]-[0063]; The input of a disambiguation component 330 is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text. In one embodiment, this is done as a form of supervised learning, where known information (or information that has a high-enough likelihood of being correct) is used to inform the probabilities of each particular assertion ascertainable within the text. For example, if the prospect refers to “the University,” this is ambiguous without greater context. However, grouping information according to geography would indicate geographic proximity to the University of Missouri. A tweet (retrieved from the Internet 250) that refers to “AR” may be interpreted better as “Accounts Receivable” instead of “Augmented Reality” when the prospect's background as a small business owner is taken into account. 
[0088]; The model takes a set of input documents 1101 (in this case, each sentence of source text 201), preprocesses and converts the words into corresponding embeddings 1103 based upon an existing trained embedding model (e.g. word2vec, Glove, Conceptnet Numberbatch). [0090]-[0091] The forward/backward decoder networks 1172 a-e each create a prediction at 1177 a-e and outputs 1179 a-e and the final prediction output 1179 e is a representation of the highest-likelihood next token considering the full context of the sentence based upon the latent representation learned by the encoder 1160. These components take the embeddings 1103 and create an output that is embodied in output text 1199).

Claims 2, 4, 8, 10, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 2019/0236148 issued to Michael DeFelice (hereinafter as “DeFelice”) in view of U.S. Patent 9,053,431 issued to Michael Lamport Commons (hereinafter as “Commons”) in further view of U.S. Patent Application Publication 2018/0181592 issued to Chen et al. (hereinafter as “Chen”).

Regarding claim 2, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value.

	Chen teaches each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0065]; network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking. Network trainer 220 can back-propagate the calculated loss to each neural network using a stochastic gradient descent method. In implementations where both positive and negative samples are employed, the loss can be determined using binary-classification where the ground truth for a positive sample is set to 1 and for a negative sample is set to 0).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (which teaches a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (which teaches a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (which teaches each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Chen are each directed to neural networks being utilized to deliver results according to recognitions.

Regarding claim 4, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions; and mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment.

Chen teaches mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking); and

mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system of generating a first set of enhanced tag predictions for query segment of a natural language query based on a set of baseline tag predictions, a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors and post-context pretrained vector for words following the query segment in the natural language query and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (teaches mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions and mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination of providing better results in improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) teach features that are directed to analogous art and they are directed to the same field of endeavor as DeFelice, Commons, and Chen are directed to neural networks being utilized to deliver results according to recognitions.
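
For orientation only, the two-stage mixing recited in claim 4 can be pictured with a minimal sketch. It is not taken from the application or from any cited reference. The tag names, the probability values, the helper names (softmax, mix), and the use of a softmax to normalize the mixing weights are assumptions. In a trained system the raw weights would be learned, whereas fixed placeholder values are used here.

```python
import math

# Illustrative sketch only: every name and value below is hypothetical.


def softmax(raw_weights):
    """Normalize raw weights so they are positive and sum to 1."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def mix(pred_a, pred_b, raw_weights):
    """Weighted combination of two tag-probability dictionaries."""
    w_a, w_b = softmax(raw_weights)
    tags = set(pred_a) | set(pred_b)
    return {t: w_a * pred_a.get(t, 0.0) + w_b * pred_b.get(t, 0.0) for t in tags}


# Hypothetical per-tag probabilities from the three networks.
first = {"PERSON": 0.7, "ORG": 0.3}
second = {"PERSON": 0.5, "ORG": 0.5}
third = {"PERSON": 0.9, "ORG": 0.1}

# Stage 1: sub-prediction mixer combines the first and second sets.
# Equal raw weights stand in for weights that would be learned in training.
intermediate = mix(first, second, [0.0, 0.0])

# Stage 2: the intermediate set is mixed with the third set.
output = mix(intermediate, third, [0.0, 0.0])
```

With equal placeholder weights each stage reduces to a plain average, so the intermediate PERSON probability is 0.6 and the output PERSON probability is 0.75.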

	Regarding claim 8, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value.

	Chen teaches each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0065]; network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking. Network trainer 220 can back-propagate the calculated loss to each neural network using a stochastic gradient descent method. In implementations where both positive and negative samples are employed, the loss can be determined using binary-classification where the ground truth for a positive sample is set to 1 and for a negative sample is set to 0).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (which teaches a first artificial neural network system generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (which teaches a third artificial neural network system generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (which teaches each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Chen are each directed to neural networks being utilized to deliver results according to recognitions.

	Regarding claim 10, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach the prediction mixing system mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions; and the prediction mixing system mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment.

	Chen teaches the prediction mixing system mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking); and 

Inventor(s): Zhang, Xiaohai et al. Examiner: Ho, Andrew N. Application No.: 16/455,389 Art Unit: 2162
the prediction mixing system mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking).  
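
For illustration only, the two-stage mixing recited in claim 10 may be sketched as follows. This is a minimal Python sketch under stated assumptions and is not taken from DeFelice, Commons, or Chen: the function names mix and two_stage_mix are hypothetical, the mixing is assumed to be a convex combination controlled by a single scalar weight per stage, and the weights are passed in as fixed numbers although the claim recites weights that are learned during training.

```python
def mix(a, b, weight):
    """Blend two equal-length prediction lists as weight*a + (1 - weight)*b.

    The convex-combination form is an assumption made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("prediction sets must have the same length")
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def two_stage_mix(first, second, third, w_inner, w_outer):
    """Mix first with second, then mix that intermediate result with third."""
    # Stage 1: intermediate set from the first and second prediction sets.
    intermediate = mix(first, second, w_inner)
    # Stage 2: output set from the intermediate set and the third prediction set.
    return mix(intermediate, third, w_outer)


if __name__ == "__main__":
    # Hypothetical per-tag scores from three networks for one query segment.
    print(two_stage_mix([0.8, 0.2], [0.4, 0.6], [0.5, 0.5], 0.5, 0.5))
```

In this sketch the order of mixing matters: the third prediction set is blended only once, against the already-combined first and second sets, which mirrors the intermediate step recited in the claim.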

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (teaches mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions and mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Chen are each directed to neural networks being utilized to deliver results according to recognitions.

	Regarding claim 16, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value.

	Chen teaches each prediction of the set of baseline tag predictions is a probability value, and each prediction of the first set of enhanced tag predictions is a probability value (Chen:[0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0065]; network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking. Network trainer 220 can back-propagate the calculated loss to each neural network using a stochastic gradient descent method. In implementations where both positive and negative samples are employed, the loss can be determined using binary-classification where the ground truth for a positive sample is set to 1 and for a negative sample is set to 0).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (teaches mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions and mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Chen are each directed to neural networks being utilized to deliver results according to recognitions.

	Regarding claim 18, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach the prediction mixing system mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions; and the prediction mixing system mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment.

	Chen teaches the prediction mixing system mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions (Chen: [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking. Network trainer 220 can back-propagate the calculated loss to each neural network using a stochastic gradient descent method. In implementations where both positive and negative samples are employed, the loss can be determined using binary-classification where the ground truth for a positive sample is set to 1 and for a negative sample is set to 0); and

the prediction mixing system mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment (Chen: [0021]-[0022]; For example, a search for “Golden Gate Bridge” could focus on “Bridge” and “Landmark,” whereas a search for “Ocean Sunset” could focus on “Sea,” “Dawn,” and “Sun” for the same image. A keyword modality refers to a language modality which evaluates keywords, such as keyword tags, of an image. Captions of images may describe the images in a different manner than keywords. By using separate modalities, the ranking network can capture these differences in evaluating the images against queries, which can improve the accuracy of image rank rankings. [0063]; Multi-modal network model 322 can reweight the ranking features of each modality model using its corresponding weight score and combine (e.g., average) the reweighted modality ranking features to generate multi-modal ranking features (e.g., embeddings). One example of multi-modal network model 322 will later be described in additional detail with respect to FIG. 4C. [0065]; As indicated above, the neural networks of language network model 314, language network model 316, visual network model 320, and multi-modal network model 322 can be jointly trained end-to-end. In training the neural networks, network trainer 220 may calculate the loss function for prediction based on the sample used as an input to ranking network 350, which resulted in the multi-modal ranking).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Chen (teaches mixing the first set of enhanced tag predictions with the second set of enhanced tag predictions based on a set of learned mixing weights to generate an intermediate set of enhanced tag predictions and mixing the intermediate set of enhanced tag predictions with the third set of enhanced tag predictions to generate the output set of enhanced tag predictions for the query segment). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results by improving the training of the ranking network through feedback without the need to retrain data (See: Chen: [0049]). In addition, the references (DeFelice, Commons, and Chen) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Chen are each directed to neural networks being utilized to deliver results according to recognitions.

Claims 3, 9, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 2019/0236148 issued to Michael DeFelice (hereinafter as “DeFelice”) in view of U.S. Patent 9,053,431 issued to Michael Lamport Commons (hereinafter as “Commons”) in further view of U.S. Patent Application Publication 2019/0130251 issued to Lao et al. (hereinafter as “Lao”). 
 
	Regarding claim 3, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value.

	Lao teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value (Lao: [0070]; At step 740, the neural question answering system generates a logit for each possible output in the vocabulary of possible outputs. The neural question answering system may generate the logit for the possible outputs using the calculated similarity measure between the initial output vector and the respective encoded representations for possible outputs in the vocabulary of possible outputs. [0072]; For example, the system can select the valid output having the highest logit or set the logits for invalid outputs to negative infinity, apply a softmax the logits for the possible outputs to generate a respective probability for each possible output).  
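
For illustration only, the relationship between the logit values and the probability values described in the cited passage of Lao (setting the logits of invalid outputs to negative infinity and applying a softmax) may be sketched as follows. This is a minimal Python sketch under stated assumptions and is not code from Lao: the function name masked_softmax and the boolean validity list are hypothetical, and the subtraction of the maximum logit is a common numerical-stability step added for this sketch.

```python
import math


def masked_softmax(logits, valid):
    """Turn logits into probabilities, giving invalid outputs probability 0."""
    # Invalid outputs have their logit replaced with negative infinity.
    masked = [x if ok else float("-inf") for x, ok in zip(logits, valid)]
    peak = max(masked)
    if peak == float("-inf"):
        raise ValueError("at least one output must be valid")
    # Subtracting the peak keeps exp() from overflowing; exp(-inf) is 0.0.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical logits for three candidate outputs; the third is invalid.
    print(masked_softmax([0.0, 0.0, 5.0], [True, True, False]))
```

The sketch shows why a logit and a probability are distinct representations of the same prediction: the logits are unbounded scores, and the softmax maps them to non-negative values that sum to one.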

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Lao (teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results in addressing the correct answer according to the given input, for better function leading to an efficient neural system (See: Lao: [0032]). In addition, the references (DeFelice, Commons, and Lao) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Lao are each directed to neural networks being utilized to deliver results according to recognitions.

	Regarding claim 9, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value.

	Lao teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value (Lao: [0070]; At step 740, the neural question answering system generates a logit for each possible output in the vocabulary of possible outputs. The neural question answering system may generate the logit for the possible outputs using the calculated similarity measure between the initial output vector and the respective encoded representations for possible outputs in the vocabulary of possible outputs. [0072]; For example, the system can select the valid output having the highest logit or set the logits for invalid outputs to negative infinity, apply a softmax the logits for the possible outputs to generate a respective probability for each possible output).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Lao (teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results in addressing the correct answer according to the given input, for better function leading to an efficient neural system (See: Lao: [0032]). In addition, the references (DeFelice, Commons, and Lao) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Lao are each directed to neural networks being utilized to deliver results according to recognitions.

	Regarding claim 17, the modification of DeFelice and Commons teaches the claimed invention substantially as claimed; however, the modification of DeFelice and Commons does not explicitly teach each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value.

	Lao teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value (Lao: [0070]; At step 740, the neural question answering system generates a logit for each possible output in the vocabulary of possible outputs. The neural question answering system may generate the logit for the possible outputs using the calculated similarity measure between the initial output vector and the respective encoded representations for possible outputs in the vocabulary of possible outputs. [0072]; For example, the system can select the valid output having the highest logit or set the logits for invalid outputs to negative infinity, apply a softmax the logits for the possible outputs to generate a respective probability for each possible output).  

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify DeFelice (teaches a first artificial neural network system capable of generating a first set of enhanced tag predictions for a query segment of a natural language query based on a set of baseline tag predictions, and a second artificial neural network system capable of generating a second set of enhanced tag predictions for the query segment) with the teachings of Commons (teaches a third artificial neural network system capable of generating a third set of enhanced tag predictions for the query segment based on both a set of pre-context pretrained word vectors for words preceding the query segment and a set of post-context pretrained word vectors for words following the query segment in the natural language query, and a prediction mixing system capable of mixing the first set of enhanced tag predictions, the second set of enhanced tag predictions, and the third set of enhanced tag predictions to generate an output set of enhanced tag predictions for the query segment) to further include the teachings of Lao (teaches each prediction of the set of baseline tag predictions is a logit value, and each prediction of the first set of enhanced tag predictions is a logit value). One of ordinary skill in the art would have been motivated to make such a combination in order to provide better results in addressing the correct answer according to the given input, for better function leading to an efficient neural system (See: Lao: [0032]). In addition, the references (DeFelice, Commons, and Lao) are analogous art directed to the same field of endeavor, as DeFelice, Commons, and Lao are each directed to neural networks being utilized to deliver results according to recognitions.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
U.S. Patent Application Publication 2018/0018585 issued to Marin et al. (hereinafter as “Marin”) teaches an evaluation platform that receives a data set, predicts results of trends, recognizes patterns, and evaluates options according to specific criteria.
U.S. Patent Application Publication 2018/0300400 issued to Romain Paulus teaches inputting token embeddings of a document through an encoder to produce hidden states and applying decoder hidden states to produce a vector. 

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANDREW N HO whose telephone number is (571)270-0590. The examiner can normally be reached M-F 10:30 -7.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital, can be reached at (571)272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
6/21/2022
/ANDREW N HO/Examiner
Art Unit 2162    


/PIERRE M VITAL/Supervisory Patent Examiner, Art Unit 2162