Detailed Action

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-2, 5-8, 11-14, and 17-18 are pending.
Claims 3-4, 9-10, and 15-16 are canceled.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 9/20/2022 has been entered.
 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-7, 11-13, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Jaech (US 20180349477 A1) in view of Geng (Geng et al, 2016, “Deep Transfer Learning for Person Re-identification”), in view of Neculoiu (Neculoiu et al, 2016, “Learning Text Similarity with Siamese Recurrent Networks”), and further in view of Kottur (Kottur et al, 2016, “Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes”).

Regarding claim 1, Jaech teaches a processor implemented method, comprising:  
obtaining by a Bidirectional Long-Short Term Memory (BiLSTM)-Siamese network based classifier, via one or more hardware processors ([Jaech, 0139, the first sentence] ”In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the hardware processor), one or more user queries, wherein the one or more user queries comprises of a sequence of words, wherein the BiLSTM-Siamese network based classifier comprises a Siamese model and a classification model, and wherein each of the Siamese model and the classification model comprise a common base network that includes an embedding layer, a single BiLSTM layer and a Time Distributed Dense (TDD) Layer ([Jaech, 0006, line 3-7] “The social-networking system may receive a search query comprising a plurality of query terms from a client system. The social-networking system may generate a query match matrix for the search query”, teaches the query and query comprises of a sequence of words. [Jaech, 0090, line 16-30] “Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states … ” discloses the bi-LSTM structure which composes the BiLSTM-Siamese network classifier. [Jaech, entire 0094] “The secondary neural network may begin with the match-tensor and may apply a convolutional layer … Finally, the model may apply 2-D max-pooling to coalesce the peaks from the ReLU into a single fixed size vector. This may be fed into a fully-connected layer and through a sigmoid to produce a single probability of relevance on the output of the model”, discloses the secondary neural network. 
[Jaech, Figure 6] shows that the base network, comprising the embedding layer, the bi-LSTM, and a Time Distributed Layer (Figure 6, 635a Linear Projection), is shared by both branches of the Siamese model and is connected to both the classification network and the Siamese network. The process from the beginning through 640 corresponds to the Siamese model, and the network after 640 corresponds to the classification model. Jaech therefore still teaches that each of the Siamese model and the classification model comprises a common base network, as they are all connected); 
iteratively performing: 
representing in the embedding layer of the common base network, the one or more user queries as a sequence of vector representation of each word learnt using a word to vector model ([Jaech, 0077, line 14-last line - following page line 11] “The social-networking system 160 may perform an iterative process for a number of iterations. The number of iterations may be greater or equal to the number of the pairs. The social-networking system 160 may, as a first step of the iterative process, select a pair of a search query and an object in order from the prepared set. The social-networking system 160 may, as a second step of the iterative process, construct a three-dimensional tensor by taking an element-wise product of the query match-matrix for the selected search query and the object match-matrix for the selected object. The social-networking system 160 may, as a third step of the iterative process, compute a relevance score based on the tensor for the selected pair. The social-networking system 160 may, as a fourth step of the iterative process, compare the computed relevance score with the known desired relevance score. The social-networking system 160 may, as a fifth step of the iterative process, adjust the non-zero value based on the comparison. The social-networking system 160 may repeat the iterative processes until the difference between the computed relevance score and the known desired relevance score is within a predetermined value for all the prepare pairs” shows the iteratively performing representing in the embedding layer, and [Jaech, 0090] “To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings. The word-embedding table may be itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and may be held fixed during the training of the Match-Tensor network. 
In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors” shows that embedding lookup layer performs word2vec operation), wherein the sequence of words is replaced by corresponding vectors initialized using the word to vector model, wherein the corresponding vectors are continually updated during training of the BiLSTM-Siamese network based classifier ([Jaech, 0090] “To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings. The word-embedding table may be itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and may be held fixed during the training of the Match-Tensor network. In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors. This linear projection may allow the size of the embeddings to be varied and tuned as a hyperparameter without relearning the embeddings from scratch each time. Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states. The bi-LSTM states may capture separate representations in vector form of the query and the document, respectively, that may reflect their sequential structure, looking beyond the granularity of a word to phrases of arbitrary size. 
During hyperparameter tuning, the models may use a linear projection layer inside the bi-LSTM recurrent connection, as defined in Sak et al. [38]. In particular embodiments, a separate linear projection after the bi-LSTM to establish the same number k of dimensions in the representation of query and document (e.g., k=50) may be applied. Thus, at the end, each token in the query and the document may be represented as a k-dimensional vector”), 
inputting, to the single BiLSTM layer of the common base network wherein the vector representation of each word is inputted in at least one of a forward order and a reverse order and wherein the vector representation retains context of other words both on a left hand side and a right hand side as a result at each word in the one or more user queries: ([Jaech, 0084; Figure 6] “At step 625, the social-networking system 160 may perform linear projection on the query term-embeddings 602a to transform the query term-embeddings 602a into a reduced query term-embeddings 603a. At step 630a, the social-networking system 160 may encode the reduced term-embeddings 603a with a bi-LSTM network to produce a query match-matrix 604a. At step 635a, the social-networking system 160 may adjust the size of the query match-matrix 604a by performing a linear projection on the query match-matrix 604a and produce an adjusted query match-matrix 605a”, shows vector representation of each word (query term-embeddings) goes into the bi-LSTM layer, 
and [Jaech, 0006, left column line 27 of paragraph 0006  – right column line 4] “In particular embodiments, the social-networking system may use a bi-directional Long Short-Term Memory (LSTM) network as the neural network for encoding the generated term-embeddings. A bi-LSTM may comprise a series of states connected in forward and backward directions. Each state of the bi-LSTM may take a term embedding for a respective term in the search query as an input and may produce an encoded term embedding as an output by processing input term embedding and signals from both neighboring states. The output encoded term embedding may represent the contextual meaning of the corresponding term in the search query”, teaches the biLSTM layer with forward and reverse direction with the output be contextual meaning of the corresponding term in the search query.); 
determining, during training of the BiLSTM-Siamese network based classifier, one or more errors pertaining to a set of queries, wherein the one or more errors comprise one or more target classes being determined for the one or more user queries ([Jaech, 0062] “The Match-Tensor architecture may help address this problem of mismatch between query intent and retrieved results … The social-networking system 160 may produce a 4-by-m-by-k Match-Tensor for the query and the article by taking an element-wise product of the query match-matrix and the article match-matrix ... The social-networking system may determine that the article has low relevance to the given query in this example based on the exact-match channel. After adding the exact-match channel to the tensor, the size of the Match-Tensor may become 4-by-m-by-k+1. The social-networking system 160 may compute a relevance score reflecting a degree of relevance of the article to the query by processing the Match-Tensor with a downstream neural network. The produced relevance score may be low even though the query and the article have a number of common words”, discloses the process of using Match-Tensor architecture to determine wrong classification for the set of queries); 
iteratively training, the Siamese model, wherein one or more weights of the common base network are shared with the Siamese model and the classification model during the training of the BiLSTM-Siamese network based classifier ([Jaech, 0077, line 5-14 of the page 7] “The social-networking system 160 may, as a fifth step of the iterative process, adjust the non-zero value based on the comparison. The social-networking system 160 may repeat the iterative processes until the difference between the computed relevance score and the known desired relevance score is within a predetermined value for all the prepare pairs ... this disclosure contemplates any backpropagation process for training a neural network”, teaches the iterative training process, 
[Jaech, 0089-0090, line 1-19] “Input to the Match-Tensor Layer: To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings … In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors … Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states”, the linear projection matrix is the shared matrix of the Match-Tensor corresponds to the shared matrix of the Siamese model. [Jaech, 0015; Fig 6] “FIG. 6 illustrates an example process of computing a relevance score of an object for a query with the Match-Tensor model”, 635a and 635b of the Figure 6 are the Linear Projection Matrix. )
processing through the Time Distributed Dense (TDD) Layer of the common base network, an output obtained from the BiLSTM layer to obtain a sequence of vector ([Jaech, 0084, line 13-20; Fig 6] “At step 630a, the social-networking system 160 may encode the reduced term-embeddings 603a with a bi-LSTM network to produce a query match-matrix 604a. At step 635a, the social-networking system 160 may adjust the size of the query match-matrix 604a by performing a linear projection on the query match-matrix 604a and produce an adjusted query match-matrix 605a”, the 604a is the result of the process 630a, which is the Bi-LSTM. The process 635a receives the 604a, which is the output from bi-LSTM 630a, and produce an adjusted query matrix (i.e. sequence of vector));
obtaining, using a maxpool layer of the classification model, dimension-wise maximum value of the sequence of vector to form a final vector ([Jaech, entire paragraph of 0082] “In particular embodiments, the social-networking system 160 may construct a vector of a predetermined size by performing a max-pooling procedure on the second three-dimensional matrix. The social-networking system 160 may prepare memory space for the vector. The size of the vector may be equal to the number of the convolution layers on the second three-dimensional matrix. In particular embodiments, the social-networking system 160 may choose, as a first step of the max-pooling procedure, for each convolution layer of the third three-dimensional matrix, a maximum value. In particular embodiments, the social-networking system 160 may fill, as a second step of the max-pooling procedure, a corresponding element of the vector with the chosen value. As an example and not by way of limitation, the social-networking system 160 may have a 20-by-80-20 second convolution matrix. The social-networking system 160 may prepare a memory space for a vector of size 20. The social-networking system 160 may choose a maximum value from each convolution layer on the second convolution matrix and fill the value to the corresponding element of the vector. Although this disclosure describes generating a vector using a max-pooling procedure in a particular manner, this disclosure contemplates generating a vector using a max-pooling procedure in any suitable manner”, 
[Jaech, 0084] “At step 670, the social-networking system 160 may create a vector 610 by performing max-pooling on the second convolution matrix 609. At step 675, the social-networking system 160 may produce a relevance score 611 by performing sigmoid activation on the vector 610”, the max-pooling layer is embedded in the classification model, [Jaech, 0084, the last sentence] “At step 670, the social-networking system 160 may create a vector 610 by performing max-pooling on the second convolution matrix 609”); 
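For illustration only, the claimed "dimension-wise maximum value of the sequence of vector to form a final vector" can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the application or of Jaech: the function name `dimensionwise_max` and the sample values are hypothetical.

```python
from typing import List, Sequence


def dimensionwise_max(sequence: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a sequence of equal-length vectors into one vector.

    Element j of the result is the largest value found at position j
    across all vectors in the sequence (hypothetical helper).
    """
    if not sequence:
        raise ValueError("sequence must contain at least one vector")
    width = len(sequence[0])
    if any(len(vec) != width for vec in sequence):
        raise ValueError("all vectors must have the same length")
    # zip(*sequence) regroups the values by dimension instead of by timestep.
    return [max(column) for column in zip(*sequence)]


# Three timesteps, each a 3-dimensional vector (made-up numbers).
final_vector = dimensionwise_max([
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.0],
    [-0.5, 0.7, 0.8],
])
print(final_vector)  # [0.4, 0.9, 0.8]
```

The output length equals the vector width, not the sequence length, which is why the result can serve as a fixed-size input to a later layer.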
Jaech does not specifically disclose inputting in the LSTM layer the sequence of word vector representation of each word to generate 't' hidden states at every timestep, and determining at least one target class by a softmax layer of the classification model, at least one target class of the one or more queries based on the final vector and outputting a response to the one or more data based on the determined target class, and wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors and wherein the weights of the embedding layer are updated through back-propagation; generating a set of misclassified query-query pairs based on the one or more errors; iteratively training, the Siamese model using the set of misclassified query-query pairs along with one or more correct pairs.
Geng teaches wherein the BiLSTM-Siamese network based classifier comprises a Siamese model and a classification model, and wherein each of the Siamese model and the classification model comprise a common base network (Both Jaech and Geng teach this limitation of the claim. [Geng, page 4, Figure 1] The Base network is interpreted as the Siamese network, and the Classification Subnet is interpreted as the classification model. For more information, see [Geng, page 3, right column, last paragraph, Base network – page 4, 1st paragraph], and [Geng, page 4, right column, paragraph Person ID classification subnet].)
iteratively training, the Siamese model, wherein one or more weights of the common base network are shared with the Siamese model and the classification model during the training of the BiLSTM-Siamese network based classifier (Both Jaech and Geng teach this limitation of the claim. [Geng, page 4, Figure 1] The Base network is interpreted as the Siamese network, and the Classification Subnet is interpreted as the classification model. For more information, see [Geng, page 3, right column, last paragraph, Base network – page 4, 1st paragraph], and [Geng, page 4, right column, paragraph Person ID classification subnet]. [Geng, page 6, right column, 2nd paragraph, lines 7-11] “With these soft-labels, another round of self-training of the deep model is carried out and the updated base network then produces input vectors and new graph for the subspace learning model. This iterative process normally converges after 2-3 iterations.” The paragraph discloses the iterative training process.); 
determining at least one target class by a softmax layer of the classification model, at least one target class of the one or more data based on the final vector and outputting a response to the one or more data based on the determined target class ([Geng, page 4, right column, paragraph Person ID classification subnet] “Person ID classification subnet The person ID classification part learns a softmax classifier with a cross-entropy loss that distinguishes different people from each other. After the features are extracted from the base network and the random dropout is applied, a softmax layer with N nodes are then connected, where N is the unique person number in the training set.”
[Geng, page 4, right column, paragraph Pairwise verification subnet] “After a fully connected (FC) layer, the last layer of the verification network is a softmax layer with two output nodes, corresponding to whether or not the input image pair contains the same person.” ).
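As a hedged illustration of how a softmax layer can yield a target class from a final vector of scores, the following sketch converts raw scores into probabilities and selects the most probable class. The function names and the sample scores are assumptions made for this example; they are not taken from Geng or from the application.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits: Sequence[float]) -> int:
    """Return the index of the class with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for three target classes.
scores = [0.2, 2.5, -1.0]
probabilities = softmax(scores)
target_class = predict_class(scores)
print(target_class)  # 1
```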
generating a set of misclassified query-query pairs ([Geng, page 7, left column, paragraph 5.2. Implementation Details, line 23-27] “For pair generation, we first exhaustively generative all the positive and negative pairs according to person identity and then randomly duplicate the positive pairs till the numbers of the positive and negative pairs are equal, i.e., balanced.” The negative pair is interpreted as misclassified pairs. [Geng, page 6, left column, paragraph Self-training, line 15-20] “In a self-training strategy, the fine-tuned network will produce an updated mapping function e- which will be used to generate another set of soft labels for retraining. Model drift is thus a big problem: the errors in the soft labels will be propagated with the iterations and quickly magnified.”); training, the Siamese model using the set of misclassified query-query pairs along with one or more correct pairs ([Geng, page 7, left column, paragraph 5.2. Implementation Details, line 23-27] “For pair generation, we first exhaustively generative all the positive and negative pairs according to person identity and then randomly duplicate the positive pairs till the numbers of the positive and negative pairs are equal, i.e., balanced.” The negative pair is interpreted as misclassified pairs, and the negative pairs will be inputted to the network as training data. For detailed training process, see [Geng, page 7, right column, paragraph Training setting].
[Geng, page 6, left column, paragraph Self-training, line 15-20] “In a self-training strategy, the fine-tuned network will produce an updated mapping function e- which will be used to generate another set of soft labels for retraining. Model drift is thus a big problem: the errors in the soft labels will be propagated with the iterations and quickly magnified.” This paragraph also teaches the negative pairs (the errors in the soft labels) inputted to the network.)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention and having the teachings of both Geng and Jaech, to use the softmax-layer determination of at least one target class taught by Geng in implementing the BiLSTM-Siamese network based classifier of Jaech. The suggestion and/or motivation for doing so is to perform multi-class classification: a softmax layer returns a probability for each class, which makes multi-class classification possible.
Jaech in view of Geng does not specifically disclose inputting in the LSTM layer the sequence of word vector representation of each word to generate 't' hidden states at every timestep; wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors; generating a set of misclassified query-query pairs based on the one or more errors.
Neculoiu teaches inputting in the LSTM layer the sequence of word vector representation of each word to generate 't' hidden states at every timestep ([Neculoiu, page 150, left column, 3 Siamese Recurrent Neural Network, line 1-9] “Recurrent Neural Networks (RNN) are neural networks adapted for sequence data (x1, . . . , xT ). At each time step t 2 {1, . . . , T}, the hidden-state vector ht is updated by the equation ht = a(Wxt + Uht−1), in which xt is the input at time t, W is the weight matrix from inputs to the hidden-state vector and U is the weight matrix on the hidden-state vector from the previous time step ht−1.”
[Neculoiu, page 151, left column, last paragraph – right column, line 1-2] “The network used in this study contains four BLSTM layers with 64-dimensional hidden vectors ht and memory ct. There are connections at each time step between the layers. The outputs of the last layer are averaged over time and this 128-dimensional vector is used as input to a dense feedforward layer. The input strings are padded to produce a sequence of 100 characters, with the input string randomly placed in this sequence.”); 
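The recurrence quoted from Neculoiu, ht = a(Wxt + Uht−1), produces one hidden state per timestep. A minimal sketch of that recurrence follows. It assumes tanh for the activation a and a zero initial hidden state, and it uses made-up 2-by-2 weight matrices; none of these choices is stated in the cited passage. It shows a plain RNN step only, not an LSTM or BLSTM cell.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_hidden_states(inputs: Sequence[Vector], W: Matrix, U: Matrix) -> List[Vector]:
    """Apply h_t = tanh(W x_t + U h_{t-1}) and return every h_t.

    The initial hidden state is assumed to be all zeros.
    """
    hidden = [0.0] * len(W)
    states = []
    for x_t in inputs:
        pre_activation = [a + b for a, b in zip(matvec(W, x_t), matvec(U, hidden))]
        hidden = [math.tanh(p) for p in pre_activation]
        states.append(hidden)
    return states


# Hypothetical weights and a 3-step input sequence of 2-dimensional vectors.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.0, 0.1], [0.2, 0.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_hidden_states(inputs, W, U)
print(len(states))  # 3, one hidden state per timestep
```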
generating a set of misclassified query-query pairs based on the one or more errors; training, the Siamese model using the set of misclassified query-query pairs along with one or more correct pairs ([Neculoiu, page 150, right column, 2nd paragraph, line 6-12] “The training set for a Siamese network consists of triplets (x1, x2, y), where x1 and x2 are character sequences and y 2 {0, 1} indicates whether x1 and x2 are similar (y = 1) or dissimilar (y = 0). The aim of training is to minimize the distance in an embedding space between similar pairs and maximize the distance between dissimilar pairs.” 
[Neculoiu, page 150, right column, 3.1 Contrastive loss function, 2nd paragraph – page 151, left column, entire 1st paragraph and right column, 1st paragraph] “Let fW(x1) and fW(x2) be the projections of x1 and x2 in the embedding space computed by the network function fW. We define the energy of the model EW to be the cosine similarity between the embeddings of x1 and x2: … Figure 2 gives a geometric perspective on the loss function, showing the positive and negative components separately. Note that the positive loss is scaled down to compensate for the sampling ratios of positive and negative pairs (see below) … The input strings are padded to produce a sequence of 100 characters, with the input string randomly placed in this sequence. The parameters of the model are optimized using the Adam method (Kingma and Ba, 2014) and each model is trained until convergence. We use the dropout technique (Srivastava et al., 2014)” 
Neculoiu receives dissimilar query-query pairs and optimizes the parameters of the model (the training process) using the result. Neculoiu does not specifically teach an iterative training process, but the combination of Jaech, Geng, and Neculoiu teaches the iterative training process.);
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention and having the teachings of Neculoiu, Geng, and Jaech, to use Neculoiu's generating of a set of misclassified query-query pairs and training of the Siamese model using the set of misclassified pairs in implementing the BiLSTM-Siamese network based classifier of Geng and Jaech. The suggestion and/or motivation for doing so is to improve the accuracy of the classifier. 
Jaech, in view of Geng, and further in view of Neculoiu does not specifically disclose wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors and wherein the weights of the embedding layer are updated through back-propagation.
Kottur teaches wherein the word to vector model is used to initialize weights of the embedding layer which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs encoded sequence of the corresponding vectors and wherein the weights of the embedding layer are updated through back-propagation ([Kottur, page 4988, left column, 2nd paragraph, line 1-2] “Initialization: We initialize the projection matrix parameters WI with those from training w2v on large text corpora … ii) Training on a large corpus gives us good coverage in terms of the vocabulary. Further, since the gradients during backpropagation only affect parameters/embeddings for words seen during training, one can view vis-w2v as augmenting w2v with visual information when available”); 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention and having the teachings of Kottur, Neculoiu, Geng, and Jaech, to use Kottur's process of initializing an embedding layer that receives 1-hot encoded words in implementing the BiLSTM-Siamese network based classifier of Neculoiu, Geng, and Jaech. The suggestion and/or motivation for doing so is to avoid unpredictable output, because uninitialized variables or layers can lead to unpredictable output if used in operations.
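To make the embedding-layer limitation concrete, the sketch below treats the embedding layer as a weight matrix initialized from pretrained word vectors, feeds it 1-hot encoded words, and then applies one gradient step to a single row. It is an illustrative assumption only: the vocabulary, the vector values, the gradient, and the helper names are all hypothetical and do not come from Kottur or the application.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def one_hot(index: int, vocab_size: int) -> Vector:
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def embed(one_hot_vec: Vector, weights: Matrix) -> Vector:
    """Multiply a 1-hot row vector by the embedding weight matrix."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[i] * weights[i][j] for i in range(len(weights)))
        for j in range(dim)
    ]


def sgd_update(weights: Matrix, index: int, grad: Vector, lr: float) -> None:
    """Update only the row of the word that was seen in training."""
    weights[index] = [w - lr * g for w, g in zip(weights[index], grad)]


vocab = {"reset": 0, "my": 1, "password": 2}           # hypothetical vocabulary
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # stand-in for word2vec vectors
weights = [row[:] for row in pretrained]               # initialize layer from them

query = ["reset", "password"]
encoded = [embed(one_hot(vocab[w], len(vocab)), weights) for w in query]
print(encoded)  # [[0.1, 0.2], [0.5, 0.6]]

# One made-up gradient step on the row for "password".
sgd_update(weights, vocab["password"], grad=[0.2, -0.2], lr=0.5)
```

Multiplying by a 1-hot vector selects exactly one row of the matrix, so initializing the weights from pretrained vectors and later updating them changes the output for that word only.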

Regarding claim 7, Jaech in view of Geng, in view of Neculoiu, and further in view of Kottur teaches a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions ([Jaech, 0139, the first sentence] “In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the processor). Claim 7 is a system claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

Regarding claim 13, Jaech in view of Geng, in view of Neculoiu, and further in view of Kottur teaches one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors ([Jaech, 0139, the first sentence] “In particular embodiments, computer system 1200 includes a processor 1202, memory 1204, storage 1206, an input/output (I/O) interface 1208, a communication interface 810, and a bus 812” teaches the processor). Claim 13 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

Regarding claim 5, Jaech teaches processor implemented method of claim 4, further comprising: obtaining, using the one more shared weights, a plurality query embeddings by passing the one or more queries through the Siamese model ([Jaech, 0089-0090, line 1-19] “Input to the Match-Tensor Layer: To begin, a word-embedding lookup layer may convert query and document terms into separate sequences of word-embeddings … In particular embodiments, word-embeddings may be 256-dimensional vectors of floating point numbers. The word embeddings may be then passed through a linear projection layer to a reduced l-dimensional space (e.g., l=40); the same linear projection matrix may be applied to both the query and the document word vectors … Two Recurrent Neural Networks, specifically bi-directional LSTMs (bi-LSTMs) [11, 16] may then encode the query (respectively document) word-embedding sequence into a sequence of LSTM states”, the linear projection matrix is the shared matrix of the Match-Tensor corresponds to the shared matrix of the Siamese model. [Jaech, 0015; Fig 6] “FIG. 6 illustrates an example process of computing a relevance score of an object for a query with the Match-Tensor model”, 635a and 635b of the Figure 6 are the Linear Projection Matrix); 
Jaech does not specifically mention applying a contrastive divergence loss on the plurality of data to optimize the Siamese model and updating one or more parameters of the BiLSTM-Siamese network based classifier.
Jaech in view of Geng in view of Neculoiu, and further in view of Kottur teaches applying a contrastive divergence loss on the plurality of data to optimize the Siamese model ([Geng, page 4, right column, paragraph Pairwise verification subnet, line 5-10] “After a fully connected (FC) layer, the last layer of the verification network is a softmax layer with two output nodes, corresponding to whether or not the input image pair contains the same person. Note that for pairwise verification, the margin based contrastive loss is much widely used beyond Re-ID [47].”); and updating one or more parameters of the BiLSTM-Siamese network based classifier ([Geng, 0099, the last two sentences] “”, Geng shows the weights adjusted based on the loss function).
Claim 11 is a system claim having limitations similar to those of claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.
Claim 17 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.
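For context on the loss discussed in claim 5, the sketch below shows one widely used textbook form of a margin-based contrastive loss over a pair of embeddings, using Euclidean distance. It is offered as an assumption for illustration: neither the application's "contrastive divergence loss" nor the loss in Geng or Neculoiu is stated here to take exactly this form, and the function names, margin, and sample vectors are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a: Sequence[float], b: Sequence[float],
                     similar: bool, margin: float = 1.0) -> float:
    """Margin-based contrastive loss for one pair (textbook form).

    Similar pairs are penalized by their squared distance; dissimilar
    pairs are penalized only while they are closer than the margin.
    """
    distance = euclidean_distance(a, b)
    if similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A similar pair that is far apart incurs a large loss.
same = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True)
# A dissimilar pair beyond the margin incurs no loss.
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False, margin=1.0)
# A dissimilar pair inside the margin is pushed apart.
too_close = contrastive_loss([0.0, 0.0], [0.6, 0.8], similar=False, margin=2.0)
```

Minimizing this quantity pulls embeddings of similar pairs together and pushes embeddings of dissimilar pairs apart until they are at least the margin away.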

Regarding claim 6, Jaech in view of Geng, in view of Neculoiu, and further in view of Kottur teaches the processor implemented method of claim 5, wherein the step of applying a contrastive divergence loss comprises: calculating, Euclidean distance between the plurality of query embeddings ([Jaech, 0054] “As another example and not by way of limitation, a similarity metric of $\vec{v_1}$ and $\vec{v_2}$ may be a Euclidean distance $\|\vec{v_1} - \vec{v_2}\|$. A similarity metric of two vectors may represent how similar the two objects or n-grams corresponding to the two vectors, respectively, are to one another, as measured by the distance between the two vectors in the vector space 300. As an example and not by way of limitation, vector 310 and vector 320 may correspond to objects that are more similar to one another than the objects corresponding to vector 310 and vector 330, based on the distance between the respective vectors. Although this disclosure describes calculating a similarity metric between vectors in a particular manner, this disclosure contemplates calculating a similarity metric between vectors in any suitable manner”, Jaech discloses calculating Euclidean distance between two vectors v1 and v2, and v1 and v2 are the embedded queries); 
Jaech does not specifically mention computing ‘divergence’ using Euclidean distance.
Jaech in view of Geng in view of Neculoiu, and further in view of Kottur teaches computing the contrastive divergence loss based on the calculated Euclidean distance ([Geng, page 5, left column, paragraph 3.2.Model Training and Testing, line 19-24] “when any probe comes in, we compute its feature output and compare with the gallery output vectors using a simple Euclidean distance, which is about 3 magnitude faster in our model than entering the verification subnet and computing the softmax score as the distance.”).
Claim 12 is a system claim having limitations similar to those of claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.
Claim 18 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.
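
For illustration only, the relationship between a Euclidean distance and a margin based contrastive loss discussed for claim 6 can be sketched as follows. This is a sketch under stated assumptions: the loss shown is one commonly used margin based form (squared distance for matching pairs, squared hinge on the margin for non-matching pairs), and it is not asserted to be the exact formula of the application, Jaech, or Geng. The function names and the default margin of 1.0 are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return ||v1 - v2||, the Euclidean distance between two embeddings."""
    if len(v1) != len(v2):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def contrastive_loss(
    v1: Sequence[float],
    v2: Sequence[float],
    same_class: bool,
    margin: float = 1.0,  # hypothetical default, chosen only for the example
) -> float:
    """Margin based contrastive loss computed from the Euclidean distance.

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only when they are closer than the margin.
    """
    distance = euclidean_distance(v1, v2)
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```

Under this form, a non-matching pair whose embeddings are already farther apart than the margin contributes no loss, so training effort concentrates on pairs that are still confusable.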

Claim 2, 8, and 14 is/are rejected under 35 U.S.C. 103 over Jaech (US 20180349477 A1) in view of Geng (Geng et al, 2016, “Deep Transfer Learning for Person Re-identification”), in view of Neculoiu (Neculoiu et al, 2016, “Learning Text Similarity with Siamese Recurrent Networks”), in view of Kottur (Kottur et al, 2016, “Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes”), and further in view of Hold-Geoffroy (US 20180359416 A1).

Regarding claim 2, Jaech in view of Geng, in view of Neculoiu, and further in view of Kottur teaches the processor implemented method of claim 1, wherein a Loss Function is applied to the sequence of vector to optimize the classification model ([Jaech, 0120] “To evaluate the sensitivity of the model performance to the amount of training data, for each of the NN architectures we sub-sampled the training set, retrained models (keeping the hyperparameters fixed), and computed the test-loss. FIG. 10 shows the test loss of each model as a function of its final accuracy. Each considered architecture benefits from the availability of large training sets, and the accuracy improves substantially as the size of the training set increases”).
Jaech in view of Geng, in view of Neculoiu, and further in view of Kottur does not specifically teach calculating loss using Square root Kullback-Leibler divergence (KLD).
Hold-Geoffroy teaches calculating loss using Square root Kullback-Leibler divergence (KLD) ([Hold-Geoffroy, 0070; Figure 5] “As mentioned, the fully connected layer 504 of the CNN splits into two heads 506a and 506b. The first head 506a registers a first output 508 (e.g., vector) describing the sun position made up of 160 elements representing a probability distribution on the discretized sky hemisphere, and the second head 506b registers a second output 510 (e.g., vector) made up of five elements describing three sky parameters and two camera parameters. As described above, the Kullback-Leibler divergence is used as the loss for the first head 506a while a Euclidean norm (also called $\ell^2$) is used for the second head 506b”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, having the teachings of Hold-Geoffroy, Kottur, Neculoiu, Geng, and Jaech, to use the Kullback-Leibler divergence loss calculation of Hold-Geoffroy in implementing the BiLSTM-Siamese network based classifier of Jaech, Geng, Neculoiu, and Kottur. The suggestion and/or motivation for doing so is to test the model performance and improve the accuracy of the classification model.
Claim 8 is a system claim having limitations similar to those of claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
Claim 14 is a non-transitory machine readable information storage medium claim having limitations similar to those of claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
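
For illustration only, a Kullback-Leibler divergence between a target distribution and a predicted distribution, and a loss taken as its square root, can be sketched as follows. This is a sketch under stated assumptions: reading "Square root KLD" as the square root of the KL divergence is an assumption made for the example, and the function names and the `eps` clamp are hypothetical and do not appear in the application or in Hold-Geoffroy.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Return KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Terms with p_i == 0 contribute zero; q_i is clamped to eps (an assumption
    of this sketch) so that the logarithm stays finite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def sqrt_kld_loss(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the KL divergence (assumed reading of 'Square root KLD')."""
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(kl_divergence(target, predicted), 0.0))
```

For a one-hot target, the divergence reduces to the negative log of the probability assigned to the correct class, so the loss is zero only when the prediction places all probability on that class.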


Response to Arguments
Applicant asserts that Athavale teaches only one-hot encoding of the words and random initialization of their word vectors, and that nowhere does Athavale mention that a word to vector model is used to initialize the weights of the embedding layer, which takes the one or more user queries as a sequence of 1-hot encoded word vectors and outputs an encoded sequence of the corresponding vectors, wherein the weights of the embedding layer are updated through back-propagation.
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Applicant asserts that Huang teaches only instance extraction of candidates from a sentence using two directional hidden states at the same timestep, and that nowhere does Huang mention the sequence of vector representation of each word being used to generate 't' hidden states at every timestep, wherein the vector representation of each word is inputted in at least one of a forward order and a reverse order, and wherein the vector representation consequently retains context of other words on both a left hand side and a right hand side of each word in the one or more user queries.
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The new combination of Jaech in view of Geng, and in view of Neculoiu, is relied upon to teach the sequence of vector representation of each word to generate 't' hidden states at every timestep.
Furthermore, Jaech discloses that the vector representation retains context of other words on both a left hand side and a right hand side of each word in the queries, in Figure 9 and in paragraph 0006 (left column, line 27 through right column, line 4). Fig. 9 of Jaech shows a pair of neural networks sharing weights, and the statement “The output encoded term embedding may represent the contextual meaning of the corresponding term in the search query” shows that the output of the neural networks represents the contextual meaning of the search query.
Applicant’s arguments filed 09/20/2022 have been fully considered but they are not persuasive.

Applicant asserts that Huang teaches only empirically setting the number of mismatched pairs for each matched pair to 100, and that nowhere does Huang mention generating a set of misclassified query-query pairs based on the one or more errors, or iteratively training the Siamese model using the set of misclassified query-query pairs along with one or more correct pairs.
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Regarding generation of mismatched pairs:
US-20110145290-A1
US-20110131205-A1
Szoke et al, 2016, “COPING WITH CHANNEL MISMATCH IN QUERY-BY-EXAMPLE - BUT QUESST 2014”
Zmolikova et al, 2016, “Data selection by sequence summarizing neural network in mismatch condition training”
Hsieh & Chen, 1993, “A Neural Network Model which Combines Unsupervised and Supervised Learning”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached M-F 7:30AM – 4:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JUN KWON/
Examiner, Art Unit 2127

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127