Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-2, 4-8, 10-14, and 16-18 are pending.
Claims 3, 9, and 15 are cancelled.

Claim Rejections - 35 USC § 112
	The amended claims were received on 06/16/2022. The amended claims are acceptable.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 7, 10, 13, and 16 are rejected under 35 U.S.C. 103 over Jaech (US 20180349477 A1) in view of Huang (Huang, 2016, “Instance-aware Image and Sentence Matching with Selective Multimodal LSTM”), in view of Phan (Phan, 2016, “Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks”), and further in view of Athavale (Athavale et al, 2016, “Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity”).

Regarding claim 1, Jaech teaches a processor implemented method, comprising:  
obtaining by a Bidirectional Long-Short Term Memory (BiLSTM)-Siamese network based classifier, via one or more hardware processors ([Jaech, 0139, the first sentence] “In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the hardware processor), one or more user queries, wherein the one or more user queries comprises of a sequence of words, wherein the BiLSTM-Siamese network based classifier comprises a Siamese model and a classification model, and wherein each of the Siamese model and the classification model comprise a common base network that includes an embedding layer, a single BiLSTM layer and a Time Distributed Dense (TDD) Layer ([Jaech, 0006, line 3-7] “The social-networking system may receive a search query comprising a plurality of query terms from a client system. The social-networking system may generate a query match matrix for the search query”, teaches the query, and that the query comprises a sequence of words. [Jaech, 0090, line 16-30] “Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states … ” discloses the bi-LSTM structure that forms part of the BiLSTM-Siamese network classifier. [Jaech, entire 0094] “The secondary neural network may begin with the match-tensor and may apply a convolutional layer … Finally, the model may apply 2-D max-pooling to coalesce the peaks from the ReLU into a single fixed size vector. This may be fed into a fully-connected layer and through a sigmoid to produce a single probability of relevance on the output of the model”, discloses the secondary neural network. 
[Jaech, Figure 6] shows that the base network, comprising the bi-LSTM, the embedding layer, and a Time Distributed Layer (Figure 6, 635a Linear Projection), is shared by both branches of the Siamese model and is connected to both the classification network and the Siamese network. The process from the beginning through 640 corresponds to the Siamese model, and the network after 640 corresponds to the classification model. Jaech therefore still teaches that each of the Siamese model and the classification model comprises a common base network, as they are all connected); 
iteratively performing: 
representing in the embedding layer of the common base network, the one or more user queries as a sequence of vector representation of each word learnt using a word to vector model ([Jaech, 0077, line 14-last line - following page line 11] “The social-networking system 160 may perform an iterative process for a number of iterations. The number of iterations may be greater or equal to the number of the pairs. The social-networking system 160 may, as a first step of the iterative process, select a pair of a search query and an object in order from the prepared set. The social-networking system 160 may, as a second step of the iterative process, construct a three-dimensional tensor by taking an element-wise product of the query match-matrix for the selected search query and the object match-matrix for the selected object. The social-networking system 160 may, as a third step of the iterative process, compute a relevance score based on the tensor for the selected pair. The social-networking system 160 may, as a fourth step of the iterative process, compare the computed relevance score with the known desired relevance score. The social-networking system 160 may, as a fifth step of the iterative process, adjust the non-zero value based on the comparison. The social-networking system 160 may repeat the iterative processes until the difference between the computed relevance score and the known desired relevance score is within a predetermined value for all the prepare pairs” shows the iteratively performing representing in the embedding layer, and [Jaech, 0090] “To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings. The word-embedding table may be itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and may be held fixed during the training of the Match-Tensor network. 
In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors” shows that embedding lookup layer performs word2vec operation), wherein the sequence of words is replaced by corresponding vectors initialized using the word to vector model, wherein the corresponding vectors are continually updated during training of the BiLSTM-Siamese network based classifier ([Jaech, 0090] “To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings. The word-embedding table may be itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and may be held fixed during the training of the Match-Tensor network. In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors. This linear projection may allow the size of the embeddings to be varied and tuned as a hyperparameter without relearning the embeddings from scratch each time. Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states. The bi-LSTM states may capture separate representations in vector form of the query and the document, respectively, that may reflect their sequential structure, looking beyond the granularity of a word to phrases of arbitrary size. 
During hyperparameter tuning, the models may use a linear projection layer inside the bi-LSTM recurrent connection, as defined in Sak et al. [38]. In particular embodiments, a separate linear projection after the bi-LSTM to establish the same number k of dimensions in the representation of query and document (e.g., k=50) may be applied. Thus, at the end, each token in the query and the document may be represented as a k-dimensional vector”), 
inputting, to the single BiLSTM layer of the common base network wherein the vector representation of each word is inputted in at least one of a forward order and a reverse order ([Jaech, 0084; Figure 6] “At step 625, the social-networking system 160 may perform linear projection on the query term-embeddings 602a to transform the query term-embeddings 602a into a reduced query term-embeddings 603a. At step 630a, the social-networking system 160 may encode the reduced term-embeddings 603a with a bi-LSTM network to produce a query match-matrix 604a. At step 635a, the social-networking system 160 may adjust the size of the query match-matrix 604a by performing a linear projection on the query match-matrix 604a and produce an adjusted query match-matrix 605a”, shows vector representation of each word (query term-embeddings) goes into the bi-LSTM layer, 
and [Jaech, 0006, left column line 27 of paragraph 0006  – right column line 4] “In particular embodiments, the social-networking system may use a bi-directional Long Short-Term Memory (LSTM) network as the neural network for encoding the generated term-embeddings. A bi-LSTM may comprise a series of states connected in forward and backward directions. Each state of the bi-LSTM may take a term embedding for a respective term in the search query as an input and may produce an encoded term embedding as an output by processing input term embedding and signals from both neighboring states. The output encoded term embedding may represent the contextual meaning of the corresponding term in the search query”, teaches the biLSTM layer with forward and reverse direction); 
processing through the Time Distributed Dense (TDD) Layer of the common base network, an output obtained from the BiLSTM layer to obtain a sequence of vector ([Jaech, 0084, line 13-20; Fig 6] “At step 630a, the social-networking system 160 may encode the reduced term-embeddings 603a with a bi-LSTM network to produce a query match-matrix 604a. At step 635a, the social-networking system 160 may adjust the size of the query match-matrix 604a by performing a linear projection on the query match-matrix 604a and produce an adjusted query match-matrix 605a”, the 604a is the result of the process 630a, which is the Bi-LSTM. The process 635a receives the 604a, which is the output from bi-LSTM 630a, and produce an adjusted query matrix (i.e. sequence of vector));
obtaining, using a maxpool layer of the classification model, dimension-wise maximum value of the sequence of vector to form a final vector ([Jaech, entire paragraph of 0082] “In particular embodiments, the social-networking system 160 may construct a vector of a predetermined size by performing a max-pooling procedure on the second three-dimensional matrix. The social-networking system 160 may prepare memory space for the vector. The size of the vector may be equal to the number of the convolution layers on the second three-dimensional matrix. In particular embodiments, the social-networking system 160 may choose, as a first step of the max-pooling procedure, for each convolution layer of the third three-dimensional matrix, a maximum value. In particular embodiments, the social-networking system 160 may fill, as a second step of the max-pooling procedure, a corresponding element of the vector with the chosen value. As an example and not by way of limitation, the social-networking system 160 may have a 20-by-80-20 second convolution matrix. The social-networking system 160 may prepare a memory space for a vector of size 20. The social-networking system 160 may choose a maximum value from each convolution layer on the second convolution matrix and fill the value to the corresponding element of the vector. Although this disclosure describes generating a vector using a max-pooling procedure in a particular manner, this disclosure contemplates generating a vector using a max-pooling procedure in any suitable manner”, 
[Jaech, 0084] “At step 670, the social-networking system 160 may create a vector 610 by performing max-pooling on the second convolution matrix 609. At step 675, the social-networking system 160 may produce a relevance score 611 by performing sigmoid activation on the vector 610”, the max-pooling layer is embedded in the classification model, [Jaech, 0084, the last sentence] “At step 670, the social-networking system 160 may create a vector 610 by performing max-pooling on the second convolution matrix 609”); 
Jaech does not specifically disclose inputting in the LSTM layer the sequence of word vector representation of each word to generate 't' hidden states at every timestep, and determining at least one target class by a softmax layer of the classification model, at least one target class of the one or more queries based on the final vector and outputting a response to the one or more data based on the determined target class, and wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors.
Huang teaches inputting in the LSTM layer the sequence of word vector representation of each word to generate 't' hidden states at every timestep ([Huang, page 3, middle of the left column, 3.1 Instance Candidate Extraction, the first paragraph] “For a sentence, its underlying instances mostly exist in word-level or phrase level, e.g., “dog” and “man”. So we simply tokenlize and split the sentence into words, and then obtain their representations by sequentially processing them with a bidirectional LSTM (BLSTM) [30], where two sequences of hidden states with different directions (forward and backward) are learnt”, [Huang, page 2, left column, 2nd paragraph] “During multiple timesteps, the sm-LSTM exploits hidden states to capture different local similarities of selected pairwise image-sentence instances”, teaches hidden states at every timestep).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention and having the teachings of both Huang and Jaech, to use the process of generating hidden states at every timestep of Huang to implement the BiLSTM-Siamese network based classifier of Jaech. The suggestion and/or motivation for doing so is to process time-series data (a collection of observations obtained through repeated measurements over time) efficiently. 
Jaech in view of Huang does not specifically teach determining at least one target class by a softmax layer of the classification model, at least one target class of the one or more queries based on the final vector and outputting a response to the one or more data based on the determined target class, and wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors.
Phan teaches determining at least one target class by a softmax layer of the classification model, at least one target class of the one or more data based on the final vector and outputting a response to the one or more data based on the determined target class ([Phan, Figure 1; page 2, right column, 2.2.3 Softmax layer] “As a result, with the feature map induced by convolving one of the filters on an input signal, we only select the most prominent feature. The prominent features produced by all filters are finally concatenated and presented to the final softmax layer for classification” and [Phan, page 2, right column, 2.2.3 Softmax layer, the first sentence] “The fixed-size feature vector after the pooling layer is subsequently presented to the standard softmax layer to compute the predicted probability over the class labels” teaches determining the target class (class labels) by a softmax layer based on the final vector (the fixed-size feature vector after the pooling layer)).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention and having the teachings of Phan, Huang, and Jaech, to use the process of determining the target class using the softmax layer of Phan to implement the BiLSTM-Siamese network based classifier of Huang and Jaech. The suggestion and/or motivation for doing so is to perform multi-class classification. A softmax layer returns a probability for each class, which makes multi-class classification possible.
Jaech in view of Huang, and further in view of Phan does not specifically disclose wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors.
Athavale teaches wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors ([Athavale, page 3, left column, third paragraph, line 1-4; Figure 1] “In the second stage, as illustrated in Figure 1, we use the deep-learning based models. We initialize their embedding layers with the wordvectors for every word”, [Athavale, page 4, right column, line 1-4] “For the embedding layer, it is initialized with the concatenation of the wordvector and the one-hot vector indicating its POS Tag”, Figure 1 shows the One-hot POS vector inputs to the Embedding layers. [Athavale, page 3, right column, entire paragraph 3.1 Generating Word Embeddings for Hindi – page 4, left column, entire first paragraph] “Word2Vec based approaches use the idea that words which occur in similar context are similar … However, for Hindi language we train using above mentioned methods(Word2Vec and GloVe) and generate word vectors. We start with One hot encoding for the words and random initializations for their wordvectors and then train them to finally arrive at the word vectors. We use the Hindi text from LTRC IIIT Hyderabad Corpus for training. The data is 385 MB in size and the encoding used is the UTF-8 format (The unsupervised training corpus contains 27 million tokens and 500,000 distinct tokens). The training Hindi word embeddings were trained using a window of context size of 5. The trained model is then used to generate the embeddings for the words in the vocabulary …”); 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention and having the teachings of Athavale, Phan, Huang, and Jaech, to use the process of Athavale of initializing embedding layers that receive 1-hot encoded words to implement the BiLSTM-Siamese network based classifier of Phan, Huang, and Jaech. The suggestion and/or motivation for doing so is to avoid unpredictable output, because uninitialized variables or layers can lead to unpredictable output if used in operations.
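For illustration only, the dimension-wise max-pooling and softmax classification steps discussed in the rejection of claim 1 can be sketched as follows. This is a minimal sketch in plain Python and is not code from the application or from any cited reference; the function names, the toy vectors, and the classifier weights are all hypothetical.

```python
import math

def dimensionwise_max(sequence):
    """Collapse a sequence of equal-length vectors into one vector by
    taking the maximum over timesteps in each dimension."""
    if not sequence:
        raise ValueError("sequence must contain at least one vector")
    return [max(column) for column in zip(*sequence)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sequence, weights, biases):
    """Pool the sequence, apply a linear layer, and return
    (predicted class index, class probabilities)."""
    final_vector = dimensionwise_max(sequence)
    scores = [
        sum(w * x for w, x in zip(row, final_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)), probs

# Hypothetical per-word vectors (3 words, 2 dimensions), standing in for
# the sequence of vectors produced by a time-distributed dense layer.
tdd_output = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]
# Hypothetical classifier parameters for two target classes.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]

predicted_class, probabilities = classify(tdd_output, weights, biases)
```

In this toy input the pooled final vector takes its first component from the second word and its second component from the first word, which is the sense in which the pooling is dimension-wise rather than selecting a single timestep.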

Regarding claim 7, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions ([Jaech, 0139, the first sentence] “In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the processor). Claim 7 is a system claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

Regarding claim 13, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors ([Jaech, 0139, the first sentence] “In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the processor). Claim 13 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

Regarding claim 4, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches the processor implemented method of claim 1, further comprising determining, during training of the BiLSTM-Siamese network based classifier, one or more errors pertaining to a set of queries, wherein the one or more errors comprise one or more target classes being determined for the set of queries ([Jaech, 0062] “The Match-Tensor architecture may help address this problem of mismatch between query intent and retrieved results … The social-networking system 160 may produce a 4-by-m-by-k Match-Tensor for the query and the article by taking an element-wise product of the query match-matrix and the article match-matrix ... The social-networking system may determine that the article has low relevance to the given query in this example based on the exact-match channel. After adding the exact-match channel to the tensor, the size of the Match-Tensor may become 4-by-m-by-k+1. The social-networking system 160 may compute a relevance score reflecting a degree of relevance of the article to the query by processing the Match-Tensor with a downstream neural network. The produced relevance score may be low even though the query and the article have a number of common words”, discloses the process of using Match-Tensor architecture to determine wrong classification for the set of queries); and iteratively training, the Siamese model outputting responses for one or more subsequent queries, wherein one or more weights of the common base network are shared with the Siamese model and the Classification model during the training of the BiLSTM-Siamese network based classifier ([Jaech, 0077, line 5-14 of the page 7] “The social-networking system 160 may, as a fifth step of the iterative process, adjust the non-zero value based on the comparison. 
The social-networking system 160 may repeat the iterative processes until the difference between the computed relevance score and the known desired relevance score is within a predetermined value for all the prepare pairs. Although this disclosure describes a particular example backpropagation process, this disclosure contemplates any backpropagation process for training a neural network”, teaches the iterative training process, [Jaech, 0089-0090, line 1-19] “Input to the Match-Tensor Layer: To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings … In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors … Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states”, the linear projection matrix, which is the shared matrix of the Match-Tensor model, corresponds to the shared matrix of the Siamese model. [Jaech, 0015; Fig 6] “FIG. 6 illustrates an example process of computing a relevance score of an object for a query with the Match-Tensor model”, 635a and 635b of the Figure 6 are the Linear Projection Matrix).
Jaech does not specifically teach generating a set of misclassified query-query pairs based on the one or more errors; and using the set of misclassified query-query pairs along with one or more correct pairs for determining a target class.
Huang teaches generating a set of misclassified data pairs based on the one or more errors ([Huang, page 5, 3.5 Model Learning, line 1-11] “The proposed sm-LSTM can be trained with a structured objective function that encourages the matching scores of matched images and sentences to be larger than those of mismatched ones … We empirically set the total number of mismatched pairs for each matched pair as 100 in our experiments”); 
and using the set of misclassified data pairs along with one or more correct pairs for determining a target class ([Huang, page 5, 3.5 Model Learning, line 1-9] “The proposed sm-LSTM can be trained with a structured objective function that encourages the matching scores of matched images and sentences to be larger than those of mismatched ones … where m is a tuning parameter, and sii is the score of matched i-th image and i-th sentence. s_ik is the score of mismatched i-th image and k-th sentence, and vice-versa with s_ki”, Huang uses both matched pairs and mismatched pairs to calculate matching score. [Huang, page 1, Abstract, the last two sentences] “By similarly measuring multiple local similarities within a few timesteps, the sm-LSTM sequentially aggregates them with hidden states to obtain a final matching score as the desired global similarity. Extensive experiments show that our model can well match image and sentence with complex content, and achieve the state-of-the-art results on two public benchmark datasets”, shows Huang matches the image and sentence by using matching score, which corresponds to the process of determining a target class).
Claim 10 is a system claim having limitations similar to those of claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
Claim 16 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
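For illustration only, the kind of structured objective described in the Huang passages quoted for claim 4, which scores matched pairs above mismatched pairs by a margin m, can be sketched as follows. The quoted passage elides the formula itself, so the hinge form used here is an assumption based on commonly used margin-based ranking losses; the function name, the default margin, and the score matrix are hypothetical.

```python
def ranking_loss(scores, margin=0.2):
    """Margin-based ranking loss over a square score matrix.

    scores[i][k] is the matching score of item i with item k. Diagonal
    entries are matched pairs; off-diagonal entries are mismatched pairs.
    A mismatched pair adds to the loss only when its score comes within
    `margin` of (or exceeds) the corresponding matched score.
    """
    n = len(scores)
    loss = 0.0
    for i in range(n):
        matched = scores[i][i]
        for k in range(n):
            if k == i:
                continue
            # Mismatched pair (i, k) and its mirror (k, i).
            loss += max(0.0, margin - matched + scores[i][k])
            loss += max(0.0, margin - matched + scores[k][i])
    return loss

# Hypothetical scores: pair (1, 0) scores 0.8, higher than matched pair (1, 1).
example_scores = [[0.9, 0.1],
                  [0.8, 0.7]]
example_loss = ranking_loss(example_scores, margin=0.2)
```

Only the mismatched pairs that violate the margin contribute, so a score matrix whose matched pairs already dominate by at least the margin yields a loss of zero.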

Claims 2, 8, and 14 are rejected under 35 U.S.C. 103 over Jaech (US 20180349477 A1) in view of Huang (Huang, 2016, “Instance-aware Image and Sentence Matching with Selective Multimodal LSTM”), in view of Phan (Phan, 2016, “Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks”), in view of Athavale (Athavale et al, 2016, “Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity”), and further in view of Hold-Geoffroy (US 20180359416 A1).

Regarding claim 2, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches the processor implemented method of claim 1, wherein a Loss Function is applied to the sequence of vector to optimize the classification model ([Jaech, 0120] “To evaluate the sensitivity of the model performance to the amount of training data, for each of the NN architectures we sub-sampled the training set, retrained models (keeping the hyperparameters fixed), and computed the test-loss. FIG. 10 shows the test loss of each model as a function of its final accuracy. Each considered architecture benefits from the availability of large training sets, and the accuracy improves substantially as the size of the training set increases”).
Jaech in view of Huang, in view of Phan, and further in view of Athavale does not specifically teach calculating loss using Square root Kullback-Leibler divergence (KLD).
Hold-Geoffroy teaches calculating loss using Square root Kullback-Leibler divergence (KLD) ([Hold-Geoffroy, 0070; Figure 5] “As mentioned, the fully connected layer 504 of the CNN splits into two heads 506a and 506b. The first head 506a registers a first output 508 (e.g., vector) describing the sun position made up of 160 elements representing a probability distribution on the discretized sky hemisphere, and the second head 506b registers a second output 510 (e.g., vector) made up of five elements describing three sky parameters and two camera parameters. As described above, the Kullback-Leibler divergence is used as the loss for the first head 506a while a Euclidean norm (also called $\ell^2$) is used for the second head 506b”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention and having the teachings of Hold-Geoffroy, Athavale, Phan, Huang, and Jaech, to use the process of calculating loss using the Kullback-Leibler divergence of Hold-Geoffroy to implement the BiLSTM-Siamese network based classifier of Jaech, Huang, Phan, and Athavale. The suggestion and/or motivation for doing so is to test the model performance and improve the accuracy of the classification model.
Claim 8 is a system claim having limitations similar to those of claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
Claim 14 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
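For illustration only, the standard Kullback-Leibler divergence between two discrete probability distributions, as used for a loss in the cited Hold-Geoffroy passage, can be sketched as follows. This is a minimal sketch in plain Python with hypothetical names. The square-root variant recited in claim 2 is not defined in this section, so it is deliberately not sketched here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing. q_i is floored at `eps`
    (an assumption made here to avoid division by zero).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical target distribution and predicted distribution over 3 classes.
target = [1.0, 0.0, 0.0]
predicted = [0.5, 0.25, 0.25]
loss = kl_divergence(target, predicted)
```

With a one-hot target, the divergence reduces to the negative log of the probability assigned to the correct class, which is why it behaves like a classification loss.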

Claims 5-6, 11-12, and 17-18 are rejected under 35 U.S.C. 103 over Jaech (US 20180349477 A1) in view of Huang (Huang, 2016, “Instance-aware Image and Sentence Matching with Selective Multimodal LSTM”), in view of Phan (Phan, 2016, “Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks”), in view of Athavale (Athavale et al, 2016, “Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity”), and further in view of Tsatsin (US 20170357896 A1).

Regarding claim 5, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches the processor implemented method of claim 4, further comprising: obtaining, using the one or more shared weights, a plurality of query embeddings by passing the one or more queries through the Siamese model ([Jaech, 0089-0090, line 1-19] “Input to the Match-Tensor Layer: To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings … In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors … Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states”, the linear projection matrix, which is the shared matrix of the Match-Tensor model, corresponds to the shared matrix of the Siamese model. [Jaech, 0015; Fig 6] “FIG. 6 illustrates an example process of computing a relevance score of an object for a query with the Match-Tensor model”, 635a and 635b of the Figure 6 are the Linear Projection Matrix); 
Jaech in view of Huang, in view of Phan, and further in view of Athavale does not specifically teach applying a contrastive divergence loss on the plurality of data to optimize the Siamese model and updating one or more parameters of the BiLSTM-Siamese network based classifier.
Tsatsin teaches applying a contrastive divergence loss on the plurality of data to optimize the Siamese model ([Tsatsin, 0041, the last two sentences] “A Siamese network can compute an embedding vector for each of its input images and then computes a measure of similarity (or dissimilarity) between, for example, two embedding vectors. This similarity (or dissimilarity) can then be used to form a loss function. The loss function can be used to train a neural network to compute similar embedding vectors for similar images and dissimilar embedding vectors for dissimilar images. In other words, the loss function can be used to further train the neural network to be able to distinguish between similar pairs of data and pairs of data that are not similar”); and updating one or more parameters of the BiLSTM-Siamese network based classifier ([Tsatsin, 0099, the last two sentences] “This way, the weights of the neural network Net can be continually adjusted based on the loss L. In other words, the back propagation repeatedly adjusts parameters of the neural network until a sum of differences calculated from (i) a distance between the vector $y^t$ and the vector $y$ and (ii) distances between the vector $y^t$ and each of the vectors $y_i^u$ satisfies a predetermined criteria”, Tsatsin shows the weights adjusted based on the loss function).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Tsatsin, Athavale, Phan, Huang and Jaech, to use the process of Tsatsin, which computes a divergence loss to optimize the Siamese model and adjusts the weights of the neural network, to implement the BiLSTM-Siamese network based classifier of Athavale, Phan, Huang and Jaech. The suggestion and/or motivation for doing so is to improve the performance of the BiLSTM-Siamese network based classifier, as computing the divergence provides a performance score of the classifier that can be used to improve the classifier.
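For illustration only, the weight sharing and parameter update discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Jaech, Tsatsin, or the claimed classifier: the function names `project`, `siamese_embed`, and `sgd_step`, the plain gradient-descent update rule, and the learning rate are all hypothetical.

```python
def project(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def siamese_embed(shared_weights, query_a, query_b):
    """Embed two inputs with the SAME weight matrix (the shared weights)."""
    return project(shared_weights, query_a), project(shared_weights, query_b)


def sgd_step(weights, grads, lr=0.1):
    """Assumed update rule: plain gradient descent on the shared weights.

    `grads` holds the gradient of some loss with respect to each weight;
    how that gradient is obtained is outside the scope of this sketch.
    """
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


if __name__ == "__main__":
    shared = [[1, 0], [0, 2]]
    print(siamese_embed(shared, [1, 1], [2, 0]))
```

Because both inputs pass through the same matrix, a single update to that matrix changes both embeddings at once.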
Claim 11 is a system claim having similar limitations to claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.
Claim 17 is a non-transitory machine readable information storage medium claim having similar limitations to claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

Regarding claim 6, Jaech in view of Huang, in view of Phan, and further in view of Athavale teaches the processor implemented method of claim 5, wherein the step of applying a contrastive divergence loss comprises: calculating, Euclidean distance between the plurality of query embeddings ([Jaech, 0054] “As another example and not by way of limitation, a similarity metric of $\vec{v}_1$ and $\vec{v}_2$ may be a Euclidean distance $\|\vec{v}_1 - \vec{v}_2\|$. A similarity metric of two vectors may represent how similar the two objects or n-grams corresponding to the two vectors, respectively, are to one another, as measured by the distance between the two vectors in the vector space 300. As an example and not by way of limitation, vector 310 and vector 320 may correspond to objects that are more similar to one another than the objects corresponding to vector 310 and vector 330, based on the distance between the respective vectors. Although this disclosure describes calculating a similarity metric between vectors in a particular manner, this disclosure contemplates calculating a similarity metric between vectors in any suitable manner”, Jaech discloses calculating Euclidean distance between two vectors v1 and v2, and v1 and v2 are the embedded queries); 
Jaech in view of Huang, in view of Phan, and further in view of Athavale does not specifically mention computing ‘divergence’ using Euclidean distance.
Tsatsin teaches computing the contrastive divergence loss based on the calculated Euclidean distance ([Tsatsin, 0041, the last two sentences] “A Siamese network can compute an embedding vector for each of its input images and then computes a measure of similarity (or dissimilarity) between, for example, two embedding vectors. This similarity (or dissimilarity) can then be used to form a loss function. The loss function can be used to train a neural network to compute similar embedding vectors for similar images and dissimilar embedding vectors for dissimilar images. In other words, the loss function can be used to further train the neural network to be able to distinguish between similar pairs of data and pairs of data that are not similar”, teaches the loss function is used to optimize the Siamese model, and [Tsatsin, 0056, the last sentence] “In the context of a metric space it means that it is possible to define a metric (or distance) between those objects, which allows the set of all such objects to be treated as a metric space. Vector spaces allow the use of a variety of standard measures of distance (divergence) including the Euclidean distance”, teaches the divergence computation is based on the Euclidean distance).
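For illustration only, a loss computed from the Euclidean distance between two embedding vectors can be sketched as follows. The sketch uses the widely used margin-based contrastive loss as an assumption; neither the claims nor the quoted passages of Tsatsin give a formula, and the function names and the default margin of 1.0 are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance ||u - v|| between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Assumed margin-based contrastive loss for one pair of embeddings.

    Similar pairs are penalized by their squared distance; dissimilar
    pairs are penalized only when they are closer than `margin`.
    """
    d = euclidean_distance(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    print(contrastive_loss([0, 0], [3, 4], similar=True))
    print(contrastive_loss([0, 0], [3, 4], similar=False))
```

Under this assumed form, minimizing the loss pulls embeddings of similar pairs together and pushes embeddings of dissimilar pairs at least `margin` apart.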
Claim 12 is a system claim having similar limitations to claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.
Claim 18 is a non-transitory machine readable information storage medium claim having similar limitations to claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.

Response to Arguments
Applicant’s arguments filed 06/16/2022 have been fully considered but they are not persuasive.
Regarding claims 1, 7, and 13, the applicant argues that the combination of the cited references Jaech, Huang, and Phan fails to disclose or suggest ‘wherein each of the Siamese model and the classification model comprise a common base network that includes an embedding layer’, and ‘wherein the sequence of words is replaced by corresponding vectors initialized using the word to vector model, wherein the corresponding vectors are continually updated during training of the BiLSTM-Siamese network based classifier, and wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors’.
The examiner respectfully disagrees. Jaech teaches that each of the Siamese model and the classification model comprise a common base network that includes an embedding layer, because the Siamese model and the classification model in Jaech are connected, and it is reasonable to interpret that both models comprise the common base network including the embedding layer. Jaech also teaches wherein the sequence of words is replaced by corresponding vectors initialized using the word to vector model, wherein the corresponding vectors are continually updated during training of the BiLSTM-Siamese network based classifier [Jaech, 0090] “To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings. The word-embedding table may be itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and may be held fixed during the training of the Match-Tensor network. In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors. This linear projection may allow the size of the embeddings to be varied and tuned as a hyperparameter without relearning the embeddings from scratch each time. Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states. The bi-LSTM states may capture separate representations in vector form of the query and the document, respectively, that may reflect their sequential structure, looking beyond the granularity of a word to phrases of arbitrary size. 
During hyperparameter tuning, the models may use a linear projection layer inside the bi-LSTM recurrent connection, as defined in Sak et al. [38]. In particular embodiments, a separate linear projection after the bi-LSTM to establish the same number k of dimensions in the representation of query and document (e.g., k=50) may be applied. Thus, at the end, each token in the query and the document may be represented as a k-dimensional vector”
Applicant’s arguments with respect to the 35 U.S.C. 103 rejection of ‘wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors’ of claims 1, 7, and 13 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The new reference Athavale is relied upon to teach the limitation ‘wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors’ of claims 1, 7, and 13: [Athavale, page 3, left column, third paragraph, line 1-4; Figure 1] “In the second stage, as illustrated in Figure 1, we use the deep-learning based models. We initialize their embedding layers with the wordvectors for every word”, [Athavale, page 4, right column, line 1-4] “For the embedding layer, it is initialized with the concatenation of the wordvector and the one-hot vector indicating its POS Tag”, Figure 1 shows the One-hot POS vector inputs to the Embedding layers. [Athavale, page 3, right column, entire paragraph 3.1 Generating Word Embeddings for Hindi] “Word2Vec based approaches use the idea that words which occur in similar context are similar … However, for Hindi language we train using above mentioned methods(Word2Vec and GloVe) and generate word vectors. We start with One hot encoding for the words and random initializations for their wordvectors and then train them to finally arrive at the word vectors. We use the Hindi text from LTRC IIIT Hyderabad Corpus for training. The data is 385 MB in size and the encoding used is the UTF-8 format (The unsupervised training corpus contains 27 million tokens and 500,000 distinct tokens). 
The training Hindi word embeddings were trained using a window of context size of 5. The trained model is then used to generate the embeddings for the words in the vocabulary …”
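For illustration only, an embedding layer whose weights are initialized from pretrained word vectors and which takes 1-hot encoded words as input can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Athavale or the application: the two-word vocabulary, the vector values, and the function names are hypothetical, and the training step that would later update the matrix is omitted.

```python
def one_hot(index, vocab_size):
    """1-hot encode a word index as a vector of length vocab_size."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def embed(one_hot_vector, embedding_matrix):
    """Multiply a 1-hot vector by the embedding matrix.

    The product selects the matrix row for the encoded word, so the
    layer output is that word's current vector.
    """
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vector[i] * embedding_matrix[i][j]
            for i in range(len(embedding_matrix)))
        for j in range(dim)
    ]


# Hypothetical pretrained word vectors standing in for a word2vec model.
pretrained = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
vocab = sorted(pretrained)

# Initialize the embedding layer weights with the pretrained vectors.
# Copies are made so that later training could modify the layer weights
# without altering the pretrained table.
embedding_matrix = [list(pretrained[word]) for word in vocab]

if __name__ == "__main__":
    print(embed(one_hot(vocab.index("world"), len(vocab)), embedding_matrix))
```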

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Regarding Siamese model,
Koch, 2015, “Siamese Neural Networks for One-Shot Image Recognition”

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ABDULLAH KAWSAR can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/JUN KWON/
Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/
Supervisory Patent Examiner, Art Unit 2127