DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 12/01/2017 and claims benefit of provisional application 62/431,224 filed on 12/07/2016.
Claims 1-20 are pending and have been examined.
Information Disclosure Statement

The information disclosure statement (IDS) was submitted 03/06/2018.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections

Claims 7, 14, and 20 are objected to because of the following informality:
The word “reparamaterization” should be replaced with “reparameterization”. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2, 9, 14, 16, 18 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In claim 2, the claim limitation “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” lacks clarity.  It is not clear if the new text is only added to the corpus of text or if the new text is also added to the smoothed model.   For examination purposes, “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” has been interpreted as “monitoring the semantic use of the word based on the smoothed model and the new text added to the corpus of text”.
In claim 9, the claim limitation “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” lacks clarity.  It is not clear if the new text is only added to the corpus of text or if the new text is also added to the smoothed model.   For examination purposes, “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” has been interpreted as “monitoring the semantic use of the word based on the smoothed model and the new text added to the corpus of text”.
Claim 14 recites the limitation “wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step”.  There is insufficient antecedent basis for this claim. Claim 11 does not recite “variational inference operation”.  Claim 8 recites “variational inference operation”.  For examination purposes, claim 14 is assumed to be a dependent claim of claim 8.  
In claim 16, the claim limitation “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” lacks clarity.  It is not clear if the new text is only added to the corpus of text or if the new text is also added to the smoothed model.   For examination purposes, “monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model” has been interpreted as “monitoring the semantic use of the word based on the smoothed model and the new text added to the corpus of text”.
Each dependent claim of claim 16 is rejected under the same rationale as the claim from which it depends. 
Claim 18 recites the limitation “wherein deriving the machine learning data model further comprises”.  There is insufficient antecedent basis for this claim. Claim 16 does not recite “deriving the machine learning data model”.  Claim 15 recites “deriving, based on a corpus of electronic text, a machine learning data model”.  For examination purposes, claim 18 is assumed to be a dependent claim of claim 15.  
Claim 20 recites the limitation “wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step”.  There is insufficient antecedent basis for this claim. Claim 16 does not recite “variational inference operation”.  Claim 15 recites “variational inference operation”.  For examination purposes, claim 20 is assumed to be a dependent claim of claim 15.  
Claim Rejections - 35 USC § 103
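In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.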
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6, 8, 10, 13, 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Trask et al. (US 2016/0247061 A1) in view of Cotterell et al. (“Morphological Smoothing and Extrapolation of Word Embeddings”) and in further view of Hamilton et al. (“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”).
Any limitation that recites “at least one of” has been interpreted as requiring one of the alternatives and not all of the alternatives.
Regarding Claim 1,
	Trask et al. teaches a method comprising: deriving, based on a corpus of electronic text, a machine learning data model that associates words with corresponding usage contexts over a window of time, according to a diffusion process (p. 1 [0003] “Natural Language Processing (NLP) systems seek to automate the extraction of useful information from sequences of symbols in human language.  Some NLP systems may encounter difficulty due to the complexity and sparsity of information in natural language.  Neural network language models (NNLMs) may overcome limitations of the performance of traditional systems.  A NNLM may learn distributed representations for words, and may embed a vocabulary into a smaller dimensional linear space that models a probability for word sequences, expressed in terms of those representations”  teaches a neural network language model (NNLM) as a machine learning data model and teaches a corpus (vocabulary) of electronic text (sequence of symbols); p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships” teaches a machine learning data model (neural network language model) that associates words (word’s dense vector embedding) with corresponding usage contexts (within different context windows) over a window of time (movement of context window is time dependent); p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships. 
These distributed representations encodes shades of meaning across their dimensions, allowing for two words to have multiple, real-valued relationships encoded in a single representation. This feature flows from the distributional hypothesis: words that appear in similar contexts have similar meaning. Words that appear in similar contexts will experience similar training examples, training outcomes, and converge to similar weights” teaches a diffusion process based on distribution of similar weights to dense vector embeddings of similar words), 
wherein the machine learning data model comprises a plurality of skip-gram models, wherein each skip-gram model comprises a word embedding vector and a context embedding vector for a respective time step associated with the respective skip gram model (p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding …” and p. 4 [0045] “In some embodiments, a neural network of the present disclosure can be trained using a “skip-gram” training style.  The skip-gram training style optimizes the following objective function
 
[equation image: media_image1.png]
where $c_j^i$
is the location specific representation (partition j) for the word at window position j relative to the focus word w” teaches a word embedding vector (ordered set of weights associated with a focus word) and a context embedding vector (ordered set of weights associated with word at a position relative to focus word) comprising a skip-gram model for a respective time step (moving local context window position j); p. 3 [0033] “In some embodiments, the neural network has a layer of hidden nodes 220.  In some embodiments, this layer of hidden nodes 220 may be divided into partitions.  Where a layer of hidden nodes is divided into partitions, it is referred to as a partitioned embedding neural network (PENN).  In some embodiments, each partition relating to a position, or window, in a phrase (one word before the focus term, one word after the focus term, etc).  This can be referred to as windowed embedding.  FIG. 4 depicts a windowed partitioned embodiment having two partitions, one for the word immediately preceding the focus term (p=+1) and one for the word immediately following the focus term (p=-1). The network is shown here, analyzing the phrase, “SEE SPOT RUN,” where the focus term is “SPOT”. Here, three hidden nodes are used for the p=+1 partition, and three hidden nodes are used for the p=-1 partition.  Again, as a result of inputting “SEE” to the p=+1 partition, and “RUN” to the p=-1 partition, the network predicts that the focus term is “SPOT”” and p. 4 [0046] “Thus, in the example listed above, the neural network would consist of two neural networks, one modeling the probability that the focus term is a given value based on the position one-ahead of the focus term, and one modeling the probability that the focus term is a given value based on the word in the position one after the focus term … After training, to arrive at a final result, the outputs of the various output partitions can be added together, or otherwise combined to produce a final probability of the focus term” teaches a plurality of skip-gram models (one neural network focused on word/context pair where context word is in the position one-ahead of the word and the other neural network focused on word/context pair where context word is in the position one after the word) used to predict a focus term (word embedding vector) based on neighboring words (context embedding vectors)).
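For illustration only, the moving local-context window relied upon above can be sketched as the enumeration of (focus word, context word, relative position) triples over a phrase. The function name `skipgram_pairs`, the `window` parameter, and the sign convention for the offset (negative meaning before the focus word) are assumptions introduced for this sketch and are not taken from Trask et al.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str, int]]:
    """Enumerate (focus, context, offset) triples for a moving context window.

    offset is the context word's position relative to the focus word
    (negative = before, positive = after). This convention is an
    illustrative assumption, not the notation of the cited reference.
    """
    pairs = []
    for i, focus in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the focus word itself and positions outside the phrase.
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            pairs.append((focus, tokens[j], offset))
    return pairs


# The phrase quoted from paragraph [0033], with one word of context per side.
pairs = skipgram_pairs(["SEE", "SPOT", "RUN"], window=1)
for focus, context, offset in pairs:
    print(focus, context, offset)
```

With the focus term “SPOT”, the sketch yields one pair for the word before it and one for the word after it, which corresponds to the two position-specific partitions described in the quoted passage.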
	Trask et al. does not appear to explicitly teach generating a smoothed model by applying a variational inference operation over the machine learning data model.
	Cotterell et al. teaches generating a smoothed model by applying a variational inference operation over the machine learning data model (pp. 1655-1656, section 8, paragraph 4, “ … Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section I, paragraph 3 “ Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC… some embeddings are noisy or missing, due to sparse training data.  We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words.  Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches applying a variational inference operation (jointly modeling the embeddings of morphologically related words using a Gaussian graphical model) to smooth noisy word embeddings observed by WORD2VEC (skip-gram model) output). 
	Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate generating a smoothed model by applying a variational inference operation over the machine learning data model as taught by Cotterell et al. to the disclosed method of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources to smooth vectors for observed words and extrapolate vectors for new words.  (Cotterell et al. p. 1659, section 9, paragraph 1).
Trask et al. in view of Cotterell et al. does not appear to explicitly teach identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time.
Hamilton et al. teaches identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches measurement of contextual diversity (semantic p. 12, section A, paragraph 2 “ … to improve the computational efficiency of SGNS (which works with text streams and not co-occurrence counts), we downsampled the larger years in the Google N-Grams data to have at most $10^9$ tokens …” and p. 4, section 2.4, paragraph 1 “… we can measure how an individual word’s embedding shifts over time…” teaches identifying a change in semantic use of a word over time based on the corpus of electronic text (text stream)). 
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
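For illustration only, the semantic displacement referred to in the motivation above can be sketched as the cosine distance between a word's embedding at two time steps. The function names and the toy vectors are hypothetical, and the sketch assumes the embeddings from the two time steps already lie in a common (aligned) vector space.

```python
import math
from typing import Sequence


def cos_sim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def semantic_displacement(vec_t: Sequence[float], vec_t_delta: Sequence[float]) -> float:
    """Cosine distance between a word's embedding at time t and at time t + delta."""
    return 1.0 - cos_sim(vec_t, vec_t_delta)


# Hypothetical two-dimensional embeddings of one word at two time steps.
unchanged = semantic_displacement([1.0, 0.0], [1.0, 0.0])
shifted = semantic_displacement([1.0, 0.0], [0.0, 1.0])
print(unchanged, shifted)
```

A displacement near zero indicates that the word's usage context is stable over the interval, while a larger value indicates a change in semantic use.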
Regarding Claim 3, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the method of claim 1.
	Hamilton et al. further teaches wherein each time step is of a plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each time step in the time period (t, …, t + Δ) as a plurality of time steps), 
wherein the word embedding vectors comprise word embeddings for each word in the corpus of text, wherein the context embedding vectors comprise context embeddings (p. 3, section 2.1.3, paragraph 1 [image: media_image2.png] teaches every word $w_i$ in a corpus being represented by word (embedding) vector and context (embedding) vector), 
wherein the method further comprises segmenting each text element in the corpus into a respective time step of the plurality of time steps based on a respective timestamp of each element (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each text element in the corpus ($w_i^{(t)}$ or $w_j^{(t)}$) segmented into a time step in the plurality of time steps (t, …, t + Δ) based on the timestamp (which is the (t) value)).  
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
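It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitations of claim 3 addressed above as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.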
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
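For illustration only, segmenting text elements into time steps based on their timestamps, as discussed for claim 3 above, can be sketched as a simple bucketing operation. The function name `segment_by_time_step`, the ten-year step size, and the sample texts are hypothetical and are not drawn from the application or the cited references.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def segment_by_time_step(
    elements: Iterable[Tuple[date, str]], years_per_step: int = 10
) -> Dict[int, List[str]]:
    """Group (timestamp, text) elements into time steps.

    Each time step is keyed by its first year, e.g. 1900 for 1900-1909
    when years_per_step is 10.
    """
    if years_per_step <= 0:
        raise ValueError("years_per_step must be positive")
    steps: Dict[int, List[str]] = defaultdict(list)
    for stamp, text in elements:
        start_year = (stamp.year // years_per_step) * years_per_step
        steps[start_year].append(text)
    return dict(steps)


# Hypothetical timestamped text elements.
corpus = [
    (date(1901, 5, 1), "the broadcast of seed"),
    (date(1908, 1, 9), "sowing by broadcast"),
    (date(1995, 3, 2), "a live television broadcast"),
]
steps = segment_by_time_step(corpus, years_per_step=10)
for start_year in sorted(steps):
    print(start_year, steps[start_year])
```

Each resulting group could then serve as the training text for the embedding model of the corresponding time step.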
Regarding Claim 6,
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the method of claim 1.
	Hamilton et al. further teaches wherein the variational inference operation comprises a filtering algorithm applied to the text of each time step (p. 13  section C “ … We removed stop words and proper nouns by (i) removing all stop-words from the available lists in Python’s NLTK package … and (ii) restricting our analysis to words with part-of-speech (POS) tags corresponding to four main linguistic categories (common nouns, verbs, adverbs, and adjectives) …” teaches filtering of stop words from all text in the available lists), wherein the filtering algorithm comprises: 
	initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size); 
	receiving data for a first time step of the plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches receiving data ($w_i^{(t)}$ or $w_j^{(t)}$ for a first time step (t) in the plurality of time steps (t, …, t + Δ)); 
preprocessing the text corpus to generate a positive count matrix for the first time step (pp. 2-3, section 2.1.1, paragraph 1
[Image: media_image3.png]
[Image: media_image4.png]
teaches the Positive Pointwise Mutual Information matrix as the positive count matrix generated within each time step (sliding window j of text containing word $w_i$ and surrounding context $c_j$))
and optimizing the plurality of variational parameters for the word embedding vectors and the plurality of variational parameters for the context embedding vectors using stochastic gradient descent (p. 3, section 2.1.3, paragraph 1
[Image: media_image2.png]
teaches optimization of word and context embedding vectors using stochastic gradient descent).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the variational inference operation comprises a filtering algorithm applied to the text of each time step, wherein the filtering algorithm comprises: initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors; receiving data for a first time step of the plurality of time steps; preprocessing the text corpus to generate a positive count matrix for the first time step; and optimizing the plurality of variational parameters for the word embedding vectors and the plurality of variational parameters for the context embedding vectors using stochastic gradient descent as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Regarding Claim 8,
	Trask et al. teaches a non-transitory computer readable storage medium having computer-readable program code embodied therewith, the computer-readable code executable to perform an operation (pp. 10-11 [0098] “ … computer-readable storage media can include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-storage instructions, data structures, program modules, or other data … Computer-readable storage media as described herein does not include transitory signals” ) comprising: deriving, based on a corpus of electronic text, a machine learning data model that associates words with corresponding usage contexts over a window of time, according to a diffusion process (p. 1 [0003] “Natural Language Processing (NLP) systems seek to automate the extraction of useful information from sequences of symbols in human language.  Some NLP systems may encounter difficulty due to the complexity and sparsity of information in natural language.  Neural network language models (NNLMs) may overcome limitations of the performance of traditional systems.  A NNLM may learn distributed representations for words, and may embed a vocabulary into a smaller dimensional linear space that models a probability for word sequences, expressed in terms of those representations”  teaches a neural network language model (NNLM) as a machine learning data model; p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships” teaches a machine learning data model (neural network language model) that associates words (word’s dense vector embedding) with corresponding usage contexts (within different context windows) over a window of time (movement of context window is time dependent); p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships. These distributed representations encodes shades of meaning across their dimensions, allowing for two words to have multiple, real-valued relationships encoded in a single representation. This feature flows from the distributional hypothesis: words that appear in similar contexts have similar meaning. Words that appear in similar contexts will experience similar training examples, training outcomes, and converge to similar weights” teaches a diffusion process based on distribution of similar weights to dense vector embeddings of similar words), 
wherein the machine learning data model comprises a plurality of skip-gram models, wherein each skip-gram model comprises a word embedding vector and a context embedding vector for a respective time step associated with the respective skip gram model (p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding …” and p. 4 [0045] “In some embodiments, a neural network of the present disclosure can be trained using a “skip-gram” training style.  The skip-gram training style optimizes the following objective function
[Image: media_image1.png]
where $c_j^i$ is the location specific representation (partition j) for the word at window position j relative to the focus word w” teaches a word embedding vector (ordered set of weights associated with a focus word) and a context embedding vector (ordered set of weights associated with word at a position relative to focus word) comprising a skip-gram model for a respective time step (moving local context window position j);  p. 3 [0033] “In some embodiments, the neural network has a layer of hidden nodes 220.  In some embodiments, this layer of hidden nodes 220 may be divided into partitions.  Where a layer of hidden nodes is divided into partitions, it is referred to as a partitioned embedding neural network (PENN).  In some embodiments, each partition relating to a position, or window, in a phrase (one word before the focus term, one word after the focus term, etc).  This can be referred to as windowed embedding.  FIG. 4 depicts a windowed partitioned embodiment having two partitions, one for the word immediately preceding the focus term (p=+1) and one for the word immediately following the focus term (p=-1). The network is shown here, analyzing the phrase, “SEE SPOT RUN,” where the focus term is “SPOT”. Here, three hidden nodes are used for the p=+1 partition, and three hidden nodes are used for the p=-1 partition.  Again, as a result of inputting “SEE” to the p=+1 partition, and “RUN” to the p=-1 partition, the network predicts that the focus term is “SPOT”” and p. 4 [0046] “Thus, in the example listed above, the neural network would consist of two neural networks, one modeling the probability that the focus term is a given value based on the position one-ahead of the focus term, and one modeling the probability that the focus term  is a given value based on the word in the position one after the focus term … After training, to arrive at a final result, the outputs of the various output partitions can be added together, or otherwise combined to produce a final probability of the focus term”  teaches a plurality of skip-gram models (one neural network focused on word/context pair where context word is in the position one-ahead of the word and the other neural network focused on word/context pair where context word is in the position one after the word) used to predict a focus term (word embedding vector) based on neighboring words (context embedding vectors)).
	Trask et al. does not appear to explicitly teach generating a smoothed model by applying a variational inference operation over the machine learning data model.
	Cotterell et al. teaches generating a smoothed model by applying a variational inference operation over the machine learning data model (pp. 1655-1656, section 8, paragraph 4, “ … Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section I, paragraph 3 “ Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC… some embeddings are noisy or missing, due to sparse training data.  We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words.  Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches applying a variational inference operation (jointly modeling the embeddings of morphologically related words) over the machine learning data model (the WORD2VEC skip-gram embeddings) to generate a smoothed model (the smoothed embeddings)).
	Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate generating a smoothed model by applying a variational inference operation over the machine learning data model as taught by Cotterell et al. to the disclosed method of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources to smooth vectors for observed words and extrapolate vectors for new words.  (Cotterell et al. p. 1659, section 9, paragraph 1).
Trask et al. in view of Cotterell et al. does not appear to explicitly teach identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time.
Hamilton et al. teaches identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches measurement of contextual diversity (semantic change) using a smoothed model; p. 12, section A, paragraph 2 “ … to improve the computational efficiency of SGNS (which works with text streams and not co-occurrence counts), we downsampled the larger years in the Google N-Grams data to have at most $10^9$ tokens …”  and  p. 4, section 2.4, paragraph 1 “… we can measure how an individual word’s embedding shifts over time…” teaches identifying a change in semantic use of a word over time based on the corpus of electronic text (text stream)). 
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
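For illustration only, the kind of measurement cited from Hamilton et al. (how a word’s embedding relationship shifts over time) can be sketched as a cosine-similarity time series. This is a minimal sketch, not code from any cited reference: the function names, the dictionary layout, and the toy two-dimensional vectors are all hypothetical.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_time_series(embeddings_by_step, word_i, word_j):
    """Return [(t, cos-sim(w_i^(t), w_j^(t)))] ordered by time step t."""
    return [
        (t, cos_sim(vectors[word_i], vectors[word_j]))
        for t, vectors in sorted(embeddings_by_step.items())
    ]

# Hypothetical toy embeddings: one {word: vector} dictionary per time step.
toy_embeddings = {
    1900: {"word_a": [1.0, 0.0], "word_b": [1.0, 0.0]},
    1990: {"word_a": [0.0, 1.0], "word_b": [1.0, 0.0]},
}
series = similarity_time_series(toy_embeddings, "word_a", "word_b")
```

In this toy data the similarity between the two words falls between the two time steps, which is the kind of shift a semantic-change measurement would flag.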
Regarding Claim 10, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the computer-readable storage medium of claim 8.
	Hamilton et al. further teaches wherein each time step is of a plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each time step in the time period (t, …, t + Δ) as a plurality of time steps), 
wherein the word embedding vectors comprise word embeddings for each word in the corpus of text, wherein the context embedding vectors comprise context embeddings for each word in the corpus of text (p. 3, section 2.1.3, paragraph 1
[Image: media_image2.png]
teaches each word $w_i$ in a corpus being represented by a word (embedding) vector and a context (embedding) vector), 
wherein the method further comprises segmenting each text element in the corpus into a respective time step of the plurality of time steps based on a respective timestamp of each element (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each text element in the corpus ($w_i^{(t)}$ or $w_j^{(t)}$) segmented into a time step in the plurality of time steps (t, …, t + Δ) based on the timestamp (which is the (t) value)).  
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein each time step is of a plurality of time steps, wherein the word embedding vectors comprise word embeddings for each word in the corpus of text, wherein the context embedding vectors comprise context embeddings for each word in the corpus of text, wherein the method further comprises segmenting each text element in the corpus into a respective time step of the plurality of time steps based on a respective timestamp of each element as taught by Hamilton et al. to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
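For illustration only, segmenting text elements into time steps by timestamp, as recited in the limitation above, might look like the following. This is a minimal sketch under stated assumptions: the function name, the ten-year step width, and the sample documents are hypothetical and are not drawn from the cited references or the claims.

```python
from collections import defaultdict
from datetime import date

def segment_by_time_step(elements, step_years=10):
    """Group (timestamp, text) pairs into bins keyed by the bin's start year.

    step_years is an assumed bin width; any fixed-width step works the same way.
    """
    bins = defaultdict(list)
    for stamp, text in elements:
        start_year = (stamp.year // step_years) * step_years
        bins[start_year].append(text)
    return dict(bins)

# Hypothetical timestamped text elements.
docs = [
    (date(1901, 5, 1), "first text"),
    (date(1909, 1, 1), "second text"),
    (date(1910, 1, 1), "third text"),
]
steps = segment_by_time_step(docs)
```

Each resulting bin then holds the text on which a per-time-step embedding model could be trained.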
Regarding Claim 13,
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the computer-readable storage medium of claim 8.
	Hamilton et al. further teaches wherein the variational inference operation comprises a filtering algorithm applied to the text of each time step (p. 13  section C “ … We removed stop words and proper nouns by (i) removing all stop-words from the available lists in Python’s NLTK package … and (ii) restricting our analysis to words with part-of-speech (POS) tags corresponding to four main linguistic categories (common nouns, verbs, adverbs, and adjectives) …” teaches filtering of stop words from all text in the available lists), wherein the filtering algorithm comprises: 
	initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size); 
	receiving data for a first time step of the plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches receiving data ($w_i^{(t)}$ or $w_j^{(t)}$) for a first time step (t) in the plurality of time steps (t, …, t + Δ)); 
preprocessing the text corpus to generate a positive count matrix for the first time step (pp. 2-3, section 2.1.1, paragraph 1
[Image: media_image3.png]
[Image: media_image4.png]
teaches the Positive Pointwise Mutual Information matrix as the positive count matrix generated within each time step (sliding window j of text containing word $w_i$ and surrounding context $c_j$))
and optimizing the plurality of variational parameters for the word embedding vectors and the plurality of variational parameters for the context embedding vectors using stochastic gradient descent (p. 3, section 2.1.3, paragraph 1
[image: media_image2.png]

teaches optimization of word and context embedding vectors using stochastic gradient descent).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the variational inference operation comprising a filtering algorithm applied to the text of each time step, as taught by Hamilton et al., to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Regarding Claim 15,
	Trask et al. teaches a system, comprising: a computer processor; and a memory containing a program which when executed by the processor performs an operation (p. 10 [0097] “ … The computing system includes a computer 700 that can be configured to perform one or more functions associated with the present disclosed technology. The computer 700 includes a processing unit 702, a system memory 704, and a system bus 706 that couples the memory 704 to the processing unit 702”) comprising: deriving, based on a corpus of electronic text, a machine learning data model that associates words with corresponding usage contexts over a window of time, according to a diffusion process (p. 1 [0003] “Natural Language Processing (NLP) systems seek to automate the extraction of useful information from sequences of symbols in human language.  Some NLP systems may encounter difficulty due to the complexity and sparsity of information in natural language.  Neural network language models (NNLMs) may overcome limitations of the performance of traditional systems.  A NNLM may learn distributed representations for words, and may embed a vocabulary into a smaller dimensional linear space that models a probability for word sequences, expressed in terms of those representations”  teaches a neural network language model (NNLM) as a machine learning data model and teaches a corpus (vocabulary) of electronic text (sequence of symbols); p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships” teaches a machine learning data model (neural network language model) that associates words (word’s dense vector embedding) with corresponding usage contexts (within different context windows) over a window of time (movement of context window is time dependent); p. 
1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding.  The result is a vector space model that encodes semantic and syntactic relationships. These distributed representations encodes shades of meaning across their dimensions, allowing for two words to have multiple, real-valued relationships encoded in a single representation. This feature flows from the distributional hypothesis: words that appear in similar contexts have similar meaning. Words that appear in similar contexts will experience similar training examples, training outcomes, and converge to similar weights” teaches a diffusion process based on distribution of similar weights to dense vector embeddings of similar words), 
wherein the machine learning data model comprises a plurality of skip-gram models, wherein each skip-gram model comprises a word embedding vector and a context embedding vector (p. 1 [0004] “ NNLMs may generate word embeddings by training a symbol prediction task over a moving local-context window.  The ordered set of weights associated with each word becomes the word’s dense vector embedding …” and p. 4 [0045] “In some embodiments, a neural network of the present disclosure can be trained using a “skip-gram” training style.  The skip-gram training style optimizes the following objective function
 
[image: media_image1.png]
where $c_j^i$
                     is the location specific representation (partition j) for the word at window position j relative to the focus word w” teaches a word embedding vector (ordered set of weights associated with a focus word) and a context embedding vector (ordered set of weights associated with word at a position relative to focus word) comprising a skip-model for a respective time step (moving local context window position j);  p. 3 [0033] “In some embodiments, the neural network has a layer of hidden nodes 220.  In some embodiments, this layer of hidden nodes 220 may be divided into partitions.  Where a layer of hidden nodes is divided into partitions, it is referred to as a partitioned embedding neural network (PENN).  In some embodiments, each partition relating to a position, or window, in a phrase (one word before the focus term, one word after the focus term, etc).  This can be referred to as windowed embedding.  FIG. 4 depicts a windowed partitioned embodiment having two partitions, one for the word immediately preceding the focus term (p=+1) and one for the word immediately following the focus term (p=-1). The network is shown here, analyzing the phrase, “SEE SPOT RUN,” where the focus term is “SPOT”. Here, three hidden nodes are used for the p=+1 partition, and three hidden nodes are used for the p=-1 partition.  Again, as a result of inputting “SEE” to the p=+1 partition, and “RUN” to the p=-1 partition, the network predicts that the focus term is “SPOT” and p. 
4 [0046] “Thus, in the example listed above, the neural network would consist of two neural networks, one modeling the probability that the focus term is a given value based on the position one-ahead of the focus term, and one modeling the probability that the focus term  is a given value based on the word in the position one after the focus term … After training, to arrive at a final result, the outputs of the various output partitions can be added together, or otherwise combined to produce a final probability of the focus term”  teaches a plurality of skip-gram models (one neural network focused on word/context pair where context word is in the position one-ahead of the word and the other neural network focused on word/context pair where context word is in the position one after the word) used to predict a focus term (word embedding vector) based on neighboring words (context embedding vectors)).
	Trask et al. does not appear to explicitly teach generating a smoothed model by applying a variational inference operation over the machine learning data model.
	Cotterell et al. teaches generating a smoothed model by applying a variational inference operation over the machine learning data model (pp. 1655-1656, section 8, paragraph 4, “ … Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section I, paragraph 3 “ Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC… some embeddings are noisy or missing, due to sparse training data.  We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words.  Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches applying a variational inference operation (jointly modeling the embeddings of morphologically related words using a Gaussian graphical model) to smooth noisy word embeddings observed by WORD2VEC (skip-gram model) output). 
	Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate generating a smoothed model by applying a variational inference operation over the machine learning data model as taught by Cotterell et al. to the disclosed system of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources to smooth vectors for observed words and extrapolate vectors for new words.  (Cotterell et al. p. 1659, section 9, paragraph 1).
Trask et al. in view of Cotterell et al. does not appear to explicitly teach identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time.
Hamilton et al. teaches identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches measurement of contextual diversity (semantic change) using a smoothed model; p. 12, section A, paragraph 2 “ … to improve the computational efficiency of SGNS (which works with text streams and not co-occurrence counts), we downsampled the larger years in the Google N-Grams data to have at most 109 tokens …”  and  p. 4, section 2.4, paragraph 1 “… we can measure how an individual word’s embedding shifts over time…” teaches identifying a change in semantic use of a word over time based on the corpus of electronic text (text stream)). 
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate identifying, based on the smoothed model and the corpus of electronic text, a change in a semantic use of a word over at least a portion of the window of time as taught by Hamilton et al. to the disclosed system of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
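For illustration only, the semantic displacement referred to above can be sketched as the cosine distance between one word's embedding at two time steps. This is a minimal sketch and not the implementation of any cited reference: the function names and the three-dimensional vectors are hypothetical, and the sketch assumes the two embeddings already share a common coordinate space.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_displacement(embedding_then, embedding_now):
    """Cosine distance between one word's embeddings at two time steps."""
    return 1.0 - cosine_similarity(embedding_then, embedding_now)


# Hypothetical embeddings of a single word at time t and at time t + delta.
word_at_t = [1.0, 0.0, 0.0]
word_at_t_plus_delta = [0.0, 1.0, 0.0]
displacement = semantic_displacement(word_at_t, word_at_t_plus_delta)
```

A displacement near 0 would indicate stable usage across the two time steps, and a larger value would indicate a larger shift.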
Regarding Claim 17, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the system of claim 15.
	Hamilton et al. further teaches wherein each time step is of a plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each time step in the time period (t, …, t + Δ) as a plurality of time steps), 
wherein the word embedding vectors comprise word embeddings for each word in the corpus of text, wherein the context embedding vectors comprise context embeddings (p. 3, section 2.1.3, paragraph 1
[image: media_image2.png]
teaches every word $w_i$ in a corpus being represented by a word (embedding) vector and a context (embedding) vector), 
wherein the method further comprises segmenting each text element in the corpus into a respective time step of the plurality of time steps based on a respective timestamp of each element (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches each text element in the corpus ($w_i^{(t)}$ or $w_j^{(t)}$) segmented into a time step in the plurality of time steps (t, …, t + Δ) based on the timestamp (which is the (t) value)).  
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
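It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the above limitations of claim 17 as taught by Hamilton et al. to the disclosed system of Trask et al. in view of Cotterell et al.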
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
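For illustration only, the two operations discussed for claim 17 can be sketched as follows: grouping timestamped text elements into time steps, and computing the similarity time-series $s^{(t)}(w_i, w_j)$ from per-time-step embeddings. This is a minimal sketch under stated assumptions: the function names, the ten-unit step size, the sample documents, and the two-dimensional vectors are all hypothetical and are not taken from any cited reference.

```python
import math
from collections import defaultdict


def segment_by_time_step(documents, step_size):
    """Group (timestamp, text) pairs into time steps of width step_size."""
    buckets = defaultdict(list)
    for timestamp, text in documents:
        step_start = (timestamp // step_size) * step_size
        buckets[step_start].append(text)
    return dict(buckets)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_time_series(embeddings_by_step, word_i, word_j):
    """Map each time step t to cos-sim of the two words' embeddings at t."""
    return {
        step: cosine_similarity(vectors[word_i], vectors[word_j])
        for step, vectors in sorted(embeddings_by_step.items())
    }


# Hypothetical timestamped text elements, segmented into ten-unit time steps.
documents = [(1901, "first text"), (1905, "second text"), (1912, "third text")]
segments = segment_by_time_step(documents, step_size=10)

# Hypothetical embeddings for two words at two time steps.
embeddings_by_step = {
    1900: {"broadcast": [1.0, 0.0], "seed": [1.0, 0.0]},
    1990: {"broadcast": [0.0, 1.0], "seed": [1.0, 0.0]},
}
series = similarity_time_series(embeddings_by_step, "broadcast", "seed")
```

In this sketch the timestamp alone determines the time step of each text element, which mirrors the reading above in which the (t) value acts as the timestamp.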
Regarding Claim 19,
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the system of claim 15.
	Hamilton et al. further teaches wherein the variational inference operation comprises a filtering algorithm applied to the text of each time step (p. 13  section C “ … We removed stop words and proper nouns by (i) removing all stop-words from the available lists in Python’s NLTK package … and (ii) restricting our analysis to words with part-of-speech (POS) tags corresponding to four main linguistic categories (common nouns, verbs, adverbs, and adjectives) …” teaches filtering of stop words from all text in the available lists), wherein the filtering algorithm comprises: 
	initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size); 
	receiving data for a first time step of the plurality of time steps (p. 4, section 2.4, paragraph 2 “ … We quantify shifts by computing the similarity time-series $s^{(t)}(w_i, w_j) = \text{cos-sim}(w_i^{(t)}, w_j^{(t)})$ … between two words $w_i$ and $w_j$ over a time period (t, …, t + Δ)” teaches receiving data ($w_i^{(t)}$ or $w_j^{(t)}$) for a first time step (t) in the plurality of time steps (t, …, t + Δ)); 
preprocessing the text corpus to generate a positive count matrix for the first time step (pp. 2-3, section 2.1.1, paragraph 1
[image: media_image3.png]
[image: media_image4.png]
teaches the Positive Pointwise Mutual Information matrix being the positive count matrix being generated within each time step (sliding window j of text containing word $w_i$ and surrounding context $c_j$))
and optimizing the plurality of variational parameters for the word embedding vectors and the plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 1
[image: media_image2.png]
teaches optimization of word and context embedding vectors using stochastic gradient descent).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the variational inference operation comprises a filtering algorithm applied to the text of each time step, wherein the filtering algorithm comprises: initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors; receiving data for a first time step of the plurality of time steps; preprocessing the text corpus to generate a positive count matrix for the first time step; and optimizing the plurality of variational parameters for the word embedding vectors and the plurality of variational parameters for the context embedding vectors, as taught by Hamilton et al., to the disclosed system of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
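For illustration only, the positive count matrix discussed above can be sketched as a plain Positive Pointwise Mutual Information (PPMI) computation over word/context co-occurrence counts for a single time step, where PPMI(w, c) = max(log(p(w, c) / (p(w) p(c))), 0). This is a minimal sketch under stated assumptions: it omits any smoothing, the function names and the one-word window are hypothetical, and the sample tokens are chosen only for the example.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=1):
    """Count (word, context) pairs within a symmetric context window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Unsmoothed PPMI score for every observed (word, context) pair."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in counts.items():
        pmi = math.log(n * total / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(pmi, 0.0)  # clip negative values to zero
    return scores


# Hypothetical token sequence standing in for the text of one time step.
tokens = "see spot run see spot".split()
scores = ppmi(cooccurrence_counts(tokens))
```

Clipping at zero is what makes the matrix "positive": pairs that co-occur less often than chance would predict contribute nothing rather than a negative score.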
Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Trask et al. (US 2016/0247061 A1) in view of Cotterell et al. (“Morphological Smoothing and Extrapolation of Word Embeddings”) and in view of Hamilton et al. (“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”) and in further view of Zelevinsky et al. (US 2016/0085853 A1).
Regarding Claim 2, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the method of claim 1.
	Hamilton et al. further teaches monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches the requirement of smoothed model prior to monitoring (examining the word’s neighborhood);  p. 4, section 3.2, paragraph 1 “We evaluate the diachronic validity of the methods on two historical semantic tasks: detecting known shifts and discovering shifts from data.   For both these tasks, we performed detailed evaluations on a small set of examples … Using these reasonably-sized evaluation sets allowed the authors to evaluate each case rigorously using existing literature and historical corpora”  teaches monitoring the semantic use of the word (discovering, detecting, evaluating) based on new text (historical data) being added to the corpus of text (existing literature)); identifying the change in the semantic use of the word based on at least one of: (i) a distance between two of the word embedding vectors, and (ii) a distance between two of the context embedding vectors ( 

[image: media_image5.png] teaches identification of change in semantic use of the word based on the cosine-distance between two word embedding vectors ($w_t$ and $w_{t+\Delta}$)); generating an indication of the semantic use of the word (p. 7, section 4.1, paragraph 3 [image: media_image6.png] [image: media_image7.png] teaches the indication of change (equation 7) based on the effects of frequency and polysemy); and outputting the indication ([image: media_image8.png]).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model; generating an indication of the semantic use of the word; and outputting the indication as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
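The cosine-distance comparison cited from Hamilton et al. can be illustrated with a minimal sketch. It assumes that cosine distance means one minus cosine similarity; the function name `cosine_distance` and the two-dimensional example vectors are hypothetical and are not taken from any cited reference.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical embeddings of one word at time t and at time t + delta.
w_t = [1.0, 0.0]
w_t_plus_delta = [0.0, 1.0]

# A larger value suggests a larger change in the word's semantic use.
displacement = cosine_distance(w_t, w_t_plus_delta)
```

Under this reading, a displacement near zero would indicate that the word's embedding kept its direction between the two time steps.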
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word. 
Zelevinsky et al. teaches prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word (p. 6 [0001] “One embodiment is a system for performing semantic search.  The system receives an electronic text corpus and separates the text corpus into a plurality of sentences. The system parses and converts each sentence into a sentence tree.  The system receives a search query and matches the search query with one or more of the sentence trees” teaches a request to monitor the semantic use of the word).
Trask et al., Cotterell et al., Hamilton et al. and Zelevinsky et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word as taught by Zelevinsky et al. to the disclosed method of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to provide improved technical solutions to the problem of searching massive volumes of electronic documents for useful information (Zelevinsky et al. p.1 [0016]).
Regarding Claim 9, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the computer-readable storage medium of claim 8.
	Hamilton et al. further teaches monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches the requirement of smoothed model prior to monitoring (examining the word’s neighborhood);  p. 4, section 3.2, paragraph 1 “We evaluate the diachronic validity of the methods on two historical semantic tasks: detecting known shifts and discovering shifts from data.   For both these tasks, we performed detailed evaluations on a small set of examples … Using these reasonably-sized evaluation sets allowed the authors to evaluate each case rigorously using existing literature and historical corpora”  teaches monitoring the semantic use of the word (discovering, detecting, evaluating) based on new text (historical data) being added to the corpus of text (existing literature)); identifying the change in the semantic use of the word based on at least one of: (i) a distance between two of the word embedding vectors, and (ii) a distance between two of the context embedding vectors ( 

[image: media_image5.png] teaches identification of change in semantic use of the word based on the cosine-distance between two word embedding vectors ($w_t$ and $w_{t+\Delta}$)); generating an indication of the semantic use of the word (p. 7, section 4.1, paragraph 3 [image: media_image6.png] [image: media_image7.png] teaches the indication of change (equation 7) based on the effects of frequency and polysemy); and outputting the indication ([image: media_image8.png]).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model; generating an indication of the semantic use of the word; and outputting the indication as taught by Hamilton et al. to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word. 
Zelevinsky et al. teaches prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word (p. 6 [0001] “One embodiment is a system for performing semantic search.  The system receives an electronic text corpus and separates the text corpus into a plurality of sentences. The system parses and converts each sentence into a sentence tree.  The system receives a search query and matches the search query with one or more of the sentence trees” teaches a request to monitor the semantic use of the word).
Trask et al., Cotterell et al., Hamilton et al. and Zelevinsky et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word as taught by Zelevinsky et al. to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to provide improved technical solutions to the problem of searching massive volumes of electronic documents for useful information (Zelevinsky et al. p.1 [0016]).
Regarding Claim 16, 
	Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the system of claim 15.
	Hamilton et al. further teaches monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model (p. 8, section 4.4, paragraph 3, “We measure a word’s contextual diversity … by examining its neighborhood in an empirical co-occurrence network… words are connected to each other if they co-occur more than one would expect by chance (after smoothing) …” teaches the requirement of smoothed model prior to monitoring (examining the word’s neighborhood);  p. 4, section 3.2, paragraph 1 “We evaluate the diachronic validity of the methods on two historical semantic tasks: detecting known shifts and discovering shifts from data.   For both these tasks, we performed detailed evaluations on a small set of examples … Using these reasonably-sized evaluation sets allowed the authors to evaluate each case rigorously using existing literature and historical corpora”  teaches monitoring the semantic use of the word (discovering, detecting, evaluating) based on new text (historical data) being added to the corpus of text (existing literature)); identifying the change in the semantic use of the word based on at least one of: (i) a distance between two of the word embedding vectors, and (ii) a distance between two of the context embedding vectors ( 

[image: media_image5.png] teaches identification of change in semantic use of the word based on the cosine-distance between two word embedding vectors ($w_t$ and $w_{t+\Delta}$)); generating an indication of the semantic use of the word (p. 7, section 4.1, paragraph 3 [image: media_image6.png] [image: media_image7.png] teaches the indication of change (equation 7) based on the effects of frequency and polysemy); and outputting the indication ([image: media_image8.png]).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate monitoring the semantic use of the word based on new text added to the corpus of text and the smoothed model; generating an indication of the semantic use of the word; and outputting the indication as taught by Hamilton et al. to the disclosed system of Trask et al. in view of Cotterell et al.
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word. 
Zelevinsky et al. teaches prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word (p. 6 [0001] “One embodiment is a system for performing semantic search.  The system receives an electronic text corpus and separates the text corpus into a plurality of sentences. The system parses and converts each sentence into a sentence tree.  The system receives a search query and matches the search query with one or more of the sentence trees” teaches a request to monitor the semantic use of the word).
Trask et al., Cotterell et al., Hamilton et al. and Zelevinsky et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate prior to identifying the change in the semantic use of the word, receiving a request to monitor the semantic use of the word as taught by Zelevinsky et al. to the disclosed system of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to provide improved technical solutions to the problem of searching massive volumes of electronic documents for useful information (Zelevinsky et al. p.1 [0016]).
Claims 4-5, 11-12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Trask et al. (US 2016/0247061 A1) in view of Cotterell et al. (“Morphological Smoothing and Extrapolation of Word Embeddings”) and in view of Hamilton et al. (“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”) and in further view of Schwartz et al. (“Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction”).
Regarding Claim 4, 
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the method of claim 3.
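Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step.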
Schwartz et al. teaches wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P.  The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon).  In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then $M_{i,j} = M_{j,i} = 1 + 3 + 7 = 11$. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word $w_i$ (denoted by $v_i$) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout corpus of text).
Trask et al., Cotterell et al., Hamilton et al. and Schwartz et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
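It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step as taught by Schwartz et al. to the disclosed method of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.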
One of ordinary skill in the art would have been motivated to make this modification in order to generate pattern-based word embeddings (Schwartz et al. p. 261, section 3, paragraph 1).
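The co-occurrence and PPMI steps quoted from Schwartz et al. for Claim 4 can be sketched as follows. This is only an illustration under stated assumptions: the function names `symmetric_cooccurrence` and `ppmi`, the dictionary layout, and the use of the natural logarithm are hypothetical choices, and the marginal probabilities are assumed to come from row sums of the symmetric matrix.

```python
import math
from collections import defaultdict

def symmetric_cooccurrence(pattern_counts):
    """Sum ordered-pair counts over all patterns into a symmetric table M.

    pattern_counts maps (pattern_id, first_word, second_word) -> count.
    """
    m = defaultdict(float)
    for (_pattern, a, b), count in pattern_counts.items():
        m[(a, b)] += count
        if a != b:
            m[(b, a)] += count  # mirror the count so that M[a, b] == M[b, a]
    return dict(m)

def ppmi(m):
    """Return max(0, log(P(a, b) / (P(a) * P(b)))) for every stored pair."""
    total = sum(m.values())
    row = defaultdict(float)
    for (a, _b), count in m.items():
        row[a] += count  # row sums equal column sums because M is symmetric
    scores = {}
    for (a, b), count in m.items():
        if count <= 0:
            continue  # PMI is undefined for pairs that were never observed
        pmi = math.log(count * total / (row[a] * row[b]))
        scores[(a, b)] = max(0.0, pmi)
    return scores

# Worked example from the quoted passage: 1 + 3 + 7 = 11.
counts = {
    ("p1", "w_i", "w_j"): 1,
    ("p5", "w_i", "w_j"): 3,
    ("p9", "w_j", "w_i"): 7,
}
m = symmetric_cooccurrence(counts)  # both (w_i, w_j) and (w_j, w_i) hold 11
scores = ppmi(m)
```

In this sketch the vector for a word would be read off as the row of PPMI scores keyed by that word, which mirrors the statement that the representation of the word is the ith row in M*.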
Regarding Claim 5, 
Trask et al., in view of Cotterell et al. and in view of Hamilton et al. and in further view of Schwartz et al. teaches the method of claim 4.
	Schwartz et al. further teaches wherein deriving the machine learning model further comprises: generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P.  The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon).  In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then $M_{i,j} = M_{j,i} = 1 + 3 + 7 = 11$. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word $w_i$ (denoted by $v_i$) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout corpus of text; pp. 262-263, section 4, paragraphs 6-7 “ … we present a variant of our model, which is designed to assign dissimilar vector representations to antonyms. We define two new matrices: MSP and MAP, which are computed similarly to M* … only with different SP sets.  MSP is computed using the original set of SPs excluding the two antonym patterns, while MAP is computed using the two antonym patterns only.  Then, we define an antonym-sensitive co-occurrence matrix M+AN to be M+AN = MSP – β * MAP, where β is a weighting parameter.  Similarly to M*, the antonym-sensitive word representation of the ith word is the ith row in M+AN” teaches the negative count matrix being the antonym-sensitive co-occurrence matrix M+AN, which is based on a plurality of rejected pairs of words (antonyms) in a second training corpus of text (symmetric patterns consisting only of antonym patterns)).
Trask et al., Cotterell et al., Hamilton et al. and Schwartz et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein deriving the machine learning model further comprises: generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text as taught by Schwartz et al. to the disclosed method of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
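One of ordinary skill in the art would have been motivated to make this modification in order to generate pattern-based word embeddings (Schwartz et al. p. 261, section 3, paragraph 1).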
Regarding Claim 11, 
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the computer-readable storage medium of claim 10.
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step.
Schwartz et al. teaches wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P. The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon). In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then $M_{i,j} = M_{j,i} = 1 + 3 + 7 = 11$. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word $w_i$ (denoted by $v_i$) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout the corpus of text).
Trask et al., Cotterell et al., Hamilton et al. and Schwartz et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step as taught by Schwartz et al. to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to generate pattern-based word embeddings (Schwartz et al. p. 261, section 3, paragraph 1).
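For illustration only, the co-occurrence example quoted from Schwartz et al. in this rejection (counts of 1, 3 and 7 summing to 11) can be sketched in Python as follows. The function names, the `(i, j, count)` input layout and the two-word lexicon are hypothetical, and the PPMI step uses the standard textbook definition, max(0, log(P(i,j) / (P(i) P(j)))), which is assumed here rather than taken from the reference.

```python
import math

def symmetric_cooccurrence(observations, vocab_size):
    """Sum per-pattern counts of ordered word pairs into a symmetric matrix M.

    observations: iterable of (i, j, count), one entry per ordered pair and
    pattern (a hypothetical input layout chosen for this sketch).
    """
    M = [[0.0] * vocab_size for _ in range(vocab_size)]
    for i, j, count in observations:
        M[i][j] += count
        if i != j:
            M[j][i] += count  # the pair (j, i) shares the same total
    return M

def ppmi(M):
    """Standard PPMI: max(0, log(count * total / (row_sum * col_sum)))."""
    total = sum(sum(row) for row in M)
    row_sums = [sum(row) for row in M]
    col_sums = [sum(col) for col in zip(*M)]
    out = [[0.0] * len(M) for _ in M]
    for i, row in enumerate(M):
        for j, count in enumerate(row):
            if count > 0:  # zero counts are left at 0
                pmi = math.log(count * total / (row_sums[i] * col_sums[j]))
                out[i][j] = max(pmi, 0.0)
    return out

# (w_i, w_j) seen 1 time in p_1 and 3 times in p_5; (w_j, w_i) seen 7 times in p_9.
observations = [(0, 1, 1), (0, 1, 3), (1, 0, 7)]
M = symmetric_cooccurrence(observations, vocab_size=2)
M_star = ppmi(M)  # row i of M_star plays the role of the vector v_i
```

With these inputs both off-diagonal entries of `M` equal 11, matching the quoted example.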
Regarding Claim 12, 
Trask et al., in view of Cotterell et al. and in view of Hamilton et al. and in further view of Schwartz et al. teaches the computer-readable storage medium of claim 11.
Schwartz et al. further teaches wherein deriving the machine learning model further comprises: generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P. The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon). In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then $M_{i,j} = M_{j,i} = 1 + 3 + 7 = 11$. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word $w_i$ (denoted by $v_i$) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout the corpus of text; pp. 262-263, section 4, paragraphs 6-7 “… we present a variant of our model, which is designed to assign dissimilar vector representations to antonyms. We define two new matrices: MSP and MAP, which are computed similarly to M* … only with different SP sets. MSP is computed using the original set of SPs excluding the two antonym patterns, while MAP is computed using the two antonym patterns only. Then, we define an antonym-sensitive co-occurrence matrix M+AN to be M+AN = MSP – β * MAP, where β is a weighting parameter. Similarly to M*, the antonym-sensitive word representation of the ith word is the ith row in M+AN” teaches the negative count matrix being the antonym-sensitive co-occurrence matrix M+AN, which is based on a plurality of rejected pairs of words (antonyms) in a second training corpus of text (symmetric patterns consisting only of antonym patterns)).
Trask et al., Cotterell et al., Hamilton et al. and Schwartz et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
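It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein deriving the machine learning model further comprises: generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text as taught by Schwartz et al. to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.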
One of ordinary skill in the art would have been motivated to make this modification in order to generate pattern-based word embeddings (Schwartz et al. p. 261, section 3, paragraph 1).
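For illustration only, the antonym-sensitive combination quoted from Schwartz et al. in this rejection, M+AN = MSP – β * MAP, can be sketched in Python as an element-wise operation on two equally sized matrices. The function name, the matrix entries and the value of β below are hypothetical and are not taken from the reference.

```python
def antonym_sensitive(M_sp, M_ap, beta):
    """Element-wise M_sp - beta * M_ap for two matrices of the same shape."""
    return [
        [sp - beta * ap for sp, ap in zip(row_sp, row_ap)]
        for row_sp, row_ap in zip(M_sp, M_ap)
    ]

# Hypothetical 2 x 2 inputs: M_sp built from the non-antonym symmetric patterns,
# M_ap built from the antonym patterns only.
M_sp = [[0.0, 2.0], [2.0, 0.0]]
M_ap = [[0.0, 0.5], [0.5, 0.0]]
M_an = antonym_sensitive(M_sp, M_ap, beta=2.0)  # row i is the vector for word i
```

A larger β penalizes word pairs that co-occur in antonym patterns more heavily, which lowers the corresponding entries of the resulting matrix.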
Regarding Claim 18, 
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the system of claim 15.
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step; and generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text.
Schwartz et al. teaches wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P. The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon). In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then $M_{i,j} = M_{j,i} = 1 + 3 + 7 = 11$. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word $w_i$ (denoted by $v_i$) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout the corpus of text) and generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text (p. 261, section 2, paragraph 10 “Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity…” and p. 262, section 3.2, paragraph 1 “In order to generate word embeddings, our model requires a large corpus C, and a set of SPs P. The model first computes a symmetric matrix M of size $V \times V$ (where V is the size of the lexicon). In this matrix, $M_{i,j}$ is the co-occurrence count of both $w_i, w_j$ and $w_j, w_i$ in all patterns $p \in P$. For example, if $w_i, w_j$ co-occur 1 time in $p_1$ and 3 times in $p_5$, while $w_j, w_i$ co-occur 7 times in $p_9$, then
                            
                                
                                    M
                                
                                
                                    i
                                    ,
                                    j
                                
                            
                            =
                            
                                
                                    M
                                
                                
                                    j
                                    ,
                                    i
                                
                            
                            =
                            1
                            +
                            3
                            +
                            7
                            =
                            11
                        
                    . We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*. The vector representation of the word                         
                            
                                
                                    w
                                
                                
                                    i
                                
                            
                        
                    (denoted by                         
                            
                                
                                    v
                                
                                
                                    i
                                
                            
                        
                    ) is the ith row in M*” teaches the Positive Pointwise Mutual Information matrix M* being the positive count matrix, incorporating counts of observed pairs of words (symmetric patterns) in each respective pair throughout corpus of text; pp. 262-263, section 4, paragraphs 6-7 “ … we present a variant of our model, which is designed to assign dissimilar vector representations to antonyms. We define two new matrices: MSP and MAP, which are computed similarly to M* … only with different SP sets.  MSP is computed using the original set of SPs excluding the two antonym patterns, while MAP is computed using the two antonym patterns only.  Then, we define an antonym-sensitive co-occurrence matrix M+AN to be M+AN = MSP – β * MAP, where β is a weighting parameter.  Similarly to M*, the antonym-sensitive word representation of the ith word is the ith row in M+AN” teaches the negative count matrix being the antonym-sensitive co-occurrence matrix M+AN, which is based on a plurality of rejected pairs of words (antonyms) in a second training corpus of text (symmetric patterns consisting only of antonym patterns)).
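For illustration only, the counting scheme quoted above can be sketched as follows. This is a minimal Python sketch and not code from Schwartz et al.: the function names (symmetric_counts, ppmi, antonym_sensitive), the three-word toy vocabulary, the toy antonym matrix, and the value chosen for β are all assumptions made for the example. The sketch sums pattern counts in both directions so that M_{i,j} = M_{j,i}, applies PPMI to obtain M*, and forms M+AN = MSP – β * MAP.

```python
import numpy as np

def symmetric_counts(pattern_counts, n_words):
    """Sum co-occurrence counts over all patterns, in both word orders."""
    M = np.zeros((n_words, n_words))
    for (i, j), count in pattern_counts:
        M[i, j] += count
        M[j, i] += count
    return M

def ppmi(M):
    """Positive pointwise mutual information of a count matrix."""
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts contribute no information
    return np.maximum(pmi, 0.0)    # keep only positive associations

def antonym_sensitive(M_sp, M_ap, beta):
    """M+AN = MSP - beta * MAP (beta is a weighting parameter)."""
    return M_sp - beta * M_ap

# Quoted example: (w_i, w_j) once in p_1 and 3 times in p_5,
# (w_j, w_i) 7 times in p_9; indices 0 and 1 stand for w_i and w_j.
observations = [((0, 1), 1), ((0, 1), 3), ((1, 0), 7)]
M = symmetric_counts(observations, n_words=3)
M_star = ppmi(M)

# Hypothetical antonym-pattern matrix and beta, for illustration only.
M_ap = np.zeros((3, 3))
M_ap[0, 1] = M_ap[1, 0] = 1.0
M_an = antonym_sensitive(M_star, M_ap, beta=0.5)
```

With these toy inputs, M[0, 1] and M[1, 0] both equal 11, matching the worked arithmetic in the quoted passage.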
Trask et al., Cotterell et al., Hamilton et al. and Schwartz et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein deriving the machine learning data model further comprises: generating a positive count matrix, wherein the positive count matrix specifies, for each of a plurality of pairs of words in the corpus of text, a respective count of observed occurrences of the words in each respective pair within a predefined window of words in the text of each time step; and generating a negative count matrix based on a plurality of rejected pairs of words in a second training corpus of text as taught by Schwartz et al. to the disclosed system of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
Schwartz et al. p. 261, section 3, paragraph 1).
Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Trask et al. (US 2016/0247061 A1) in view of Cotterell et al. (“Morphological Smoothing and Extrapolation of Word Embeddings”) and in view of Hamilton et al. (“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”) and in further view of Boffy (“Large scale Singular Value Decomposition and applications in Machine Learning”).
Regarding Claim 7,
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the method of claim 1.
Cotterell et al. further teaches wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step (pp. 1655-1656, section 8, paragraph 4, “ … Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section 1, paragraph 3 “Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC … some embeddings are noisy or missing, due to sparse training data.  We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words.  Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches a smoothing algorithm applied to all text in a training corpus), wherein the smoothing algorithm comprises: … sampling from a variational distribution (p. 1653, section 4, paragraphs 3-4 “Our system’s output will be a guess of all of the w_i. Our system’s input consists of noisy estimates v_i for some of the w_i, as provided by a black-box word embedding system run on some large corpus … We assume that the black-box system would have recovered the “true” w_i if given enough data, but instead it gives a noisy small-sample estimate [equation image: media_image9.png] where n_i is the count of word i in training corpus … This formula is inspired by the central limit theorem, which guarantees that v_i’s distribution would approach (4) (as n_i → ∞) if it were estimated by averaging a set of n_i noisy vectors drawn IID from any distribution with mean w_i (the truth) and covariance matrix i’” teaches sampling of noisy word embedding vectors from any distribution).
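As a hedged illustration of the central-limit argument in the quoted passage, the sketch below averages n_i noisy draws around a known vector and measures how far the average lands from the truth. It is not taken from Cotterell et al.: the function name mean_estimate_error, the choice of standard normal noise, the two-dimensional toy vector, and the sample sizes are assumptions made for the example.

```python
import numpy as np

def mean_estimate_error(true_vec, n_samples, n_trials=200, seed=0):
    """Average distance between the truth and the mean of n_samples noisy draws."""
    rng = np.random.default_rng(seed)
    # Assumed noise model: IID standard normal around the true vector.
    noise = rng.standard_normal((n_trials, n_samples, true_vec.size))
    estimates = (true_vec + noise).mean(axis=1)   # one estimate per trial
    return float(np.linalg.norm(estimates - true_vec, axis=1).mean())

w_true = np.array([1.0, -2.0])                    # stands in for the "true" w_i
err_small = mean_estimate_error(w_true, n_samples=10)
err_large = mean_estimate_error(w_true, n_samples=1000)
```

Under these assumptions the estimate built from 1000 draws sits closer to w_true on average than the one built from 10 draws, which is the sense in which a word with a larger count n_i yields a less noisy estimate v_i.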
	Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step, wherein the smoothing algorithm comprises: … sampling from a variational distribution as taught by Cotterell et al. to the disclosed method of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources to smooth vectors for observed words and extrapolate vectors for new words.  (Cotterell et al. p. 1659, section 9, paragraph 1).
Trask et al., in view of Cotterell et al. does not appear to explicitly teach preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors. 
Hamilton et al. teaches preprocessing the text corpus to generate a positive count matrix for each time step (pp. 2-3, section 2.1.1, paragraph 1 [images: media_image3.png, media_image4.png] teaches the Positive Pointwise Mutual Information matrix being the positive count matrix being generated within each time step (sliding window j of text containing word w_i and surrounding context c_j));
initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
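It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.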
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient.
Boffy teaches optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix (p. 11, section 1.3.1, paragraph 1 “ … let’s simply address the case of a document-term matrix … These matrices are very common in natural language processing to represent documents as mathematical objects (matrices).  Each row of such a matrix simply represents a document, while each column represents a word.  If a given term appears in a document, the corresponding entry is non-zero, and the entry is zero otherwise” teaches a document-term matrix being analogous to word-context matrix (positive count matrix); p. 13, section 2.1 paragraphs 1-2 “The most common method used to perform SVD is the one implemented in LAPACK (Linear Algebra PACKage) and in most popular SVD algorithms. For instance, svd command of Matlab uses LAPACK routines.  The basis of these methods lies in the reduction of the original matrix X to a bidiagonal form (i.e., matrix where only the main diagonal and superdiagonal entries are non-zero) by using orthogonal transformations called Householder reflections … It is then easy to compute the SVD of this bidiagonal matrix by using common methods for the computation of eigenvalues of symmetric matrices” teaches the process of creating a bidiagonal matrix as a necessary step towards computing an eigenvalue of a symmetric matrix using singular value decomposition (SVD); p. 18, section 2.2.2, paragraph 2 [image: media_image10.png] teaches variational parameters of a random vector u_0 (word embedding vector) optimized to converge to eigenvector u_k (word embedding vector) associated with an eigenvalue of a first symmetric matrix AA^T (eigenvalue of AA^T can be determined using SVD via a first bidiagonal matrix) and teaches variational parameters of a random vector v_0 (context embedding vector) optimized to converge to eigenvector v_k associated with an eigenvalue of a second symmetric matrix A^TA (eigenvalue of A^TA can be determined using SVD via a second bidiagonal matrix), wherein A is the positive count matrix); and … computing a reparameterization gradient (p. 46, section 4.2.2, paragraph 2 [image: media_image11.png] teaches the gradient with respect to word and context embedding vectors being the first derivatives of an error function; p. 47, section 4.2.2, paragraph 5 “ … To perform one iteration of the gradient descent algorithm, one needs to compute all the derivatives of the error function, which form the full gradient …” and p. 47, section 4.2.2, paragraph 7 “The Stochastic Gradient Descent is usually a good alternative to the standard gradient descent for Machine Learning applications when data sets are particularly large… instead of computing the gradient by taking all of them into account, we update some parameters after each training example…” teaches computation of a gradient via stochastic gradient descent).
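The figure cited from Boffy at p. 18 is not reproduced in this text, so the following is only a hedged sketch of a standard power iteration that is consistent with the characterization above: a random starting vector is repeatedly multiplied by a symmetric matrix and renormalized until it converges to the dominant eigenvector. The function name power_iteration, the toy matrix A, and the iteration count are assumptions made for the example and are not drawn from the reference.

```python
import numpy as np

def power_iteration(S, n_iter=500, seed=0):
    """Dominant eigenpair of a symmetric matrix S from a random start vector."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(S.shape[0])   # random starting vector (u_0 or v_0)
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        x = S @ x                         # multiply by the symmetric matrix
        x /= np.linalg.norm(x)            # renormalize to unit length
    eigenvalue = float(x @ S @ x)         # Rayleigh quotient at convergence
    return eigenvalue, x

# Toy stand-in for a count matrix A (3 rows, 2 columns).
A = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
lam_u, u = power_iteration(A @ A.T)       # eigenvector of AA^T
lam_v, v = power_iteration(A.T @ A)       # eigenvector of A^TA
top_singular_value = np.linalg.svd(A, compute_uv=False)[0]
```

For this toy matrix, both dominant eigenvalues equal the square of the largest singular value of A, which is the link between the two symmetric matrices and the SVD of A.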
Trask et al., Cotterell et al., Hamilton et al. and Boffy are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient as taught by Boffy to the disclosed method of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to approximate the SVD of a large matrix by using sampling methods (Boffy, p. 23, section 2.4, paragraph 1).
Regarding Claim 14,
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the computer-readable storage medium of claim 8.
Cotterell et al. further teaches wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step (pp. 1655-1656, section 8, paragraph 4, “ … Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section 1, paragraph 3 “Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC … some embeddings are noisy or missing, due to sparse training data.  We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words.  Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches a smoothing algorithm applied to all text in a training corpus), wherein the smoothing algorithm comprises: … sampling from a variational distribution (p. 1653, section 4, paragraphs 3-4 “Our system’s output will be a guess of all of the w_i. Our system’s input consists of noisy estimates v_i for some of the w_i, as provided by a black-box word embedding system run on some large corpus … We assume that the black-box system would have recovered the “true” w_i if given enough data, but instead it gives a noisy small-sample estimate [equation image: media_image9.png] where n_i is the count of word i in training corpus … This formula is inspired by the central limit theorem, which guarantees that v_i’s distribution would approach (4) (as n_i → ∞) if it were estimated by averaging a set of n_i noisy vectors drawn IID from any distribution with mean w_i (the truth) and covariance matrix i’” teaches sampling of noisy word embedding vectors from any distribution).
	Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step, wherein the smoothing algorithm comprises: … sampling from a variational distribution as taught by Cotterell et al. to the disclosed method of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources to smooth vectors for observed words and extrapolate vectors for new words.  (Cotterell et al. p. 1659, section 9, paragraph 1).
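Trask et al., in view of Cotterell et al. does not appear to explicitly teach preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors.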
Hamilton et al. teaches preprocessing the text corpus to generate a positive count matrix for each time step (pp. 2-3, section 2.1.1, paragraph 1 [images: media_image3.png, media_image4.png] teaches the Positive Pointwise Mutual Information matrix being the positive count matrix being generated within each time step (sliding window j of text containing word w_i and surrounding context c_j));
initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size).
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors as taught by Hamilton et al. to the disclosed method of Trask et al. in view of Cotterell et al.
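One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).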
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient.
Boffy teaches optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix (p. 11, section 1.3.1, paragraph 1 “ … let’s simply address the case of a document-term matrix … These matrices are very common in natural language processing to represent documents as mathematical objects (matrices).  Each row of such a matrix simply represents a document, while each column represents a word.  If a given term appears in a document, the corresponding entry is non-zero, and the entry is zero otherwise” teaches a document-term matrix being analogous to word-context matrix (positive count matrix); p. 13, section 2.1 paragraphs 1-2 “The most common method used to perform SVD is the one implemented in LAPACK (Linear Algebra PACKage) and in most popular SVD algorithms. For instance, svd command of Matlab uses LAPACK routines.  The basis of these methods lies in the reduction of the original matrix X to a bidiagonal form (i.e., matrix where only the main diagonal and superdiagonal entries are non-zero) by using orthogonal transformations called Householder reflections … It is then easy to compute the SVD of this bidiagonal matrix by using common methods for the computation of eigenvalues of symmetric matrices” teaches the process of creating a bidiagonal matrix as a necessary step towards computing an eigenvalue of a symmetric matrix using singular value decomposition (SVD); p. 18, section 2.2.2, paragraph 2 [image: media_image10.png] teaches variational parameters of a random vector u_0 (word embedding vector) optimized to converge to eigenvector u_k (word embedding vector) associated with an eigenvalue of a first symmetric matrix AA^T (eigenvalue of AA^T can be determined using SVD via a first bidiagonal matrix) and teaches variational parameters of a random vector v_0 (context embedding vector) optimized to converge to eigenvector v_k associated with an eigenvalue of a second symmetric matrix A^TA (eigenvalue of A^TA can be determined using SVD via a second bidiagonal matrix), wherein A is the positive count matrix); and … computing a reparameterization gradient (p. 46, section 4.2.2, paragraph 2 [image: media_image11.png] teaches the gradient with respect to word and context embedding vectors being the first derivatives of an error function; p. 47, section 4.2.2, paragraph 5 “ … To perform one iteration of the gradient descent algorithm, one needs to compute all the derivatives of the error function, which form the full gradient …” and p. 47, section 4.2.2, paragraph 7 “The Stochastic Gradient Descent is usually a good alternative to the standard gradient descent for Machine Learning applications when data sets are particularly large… instead of computing the gradient by taking all of them into account, we update some parameters after each training example…” teaches computation of a gradient via stochastic gradient descent).
Trask et al., Cotterell et al., Hamilton et al. and Boffy are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
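It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient as taught by Boffy to the disclosed computer-readable storage medium of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.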
One of ordinary skill in the art would have been motivated to make this modification in order to approximate the SVD of a large matrix by using sampling methods (Boffy, p. 23, section 2.4, paragraph 1).
Regarding Claim 20,
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. teaches the system of claim 15.
Cotterell et al. further teaches wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step (pp. 1655-1656, section 8, paragraph 4, “… Given a finite training corpus … and a lexicon … we generate embeddings … for all word types … using the GENSIM implementation … of the WORD2VEC hierarchical softmax skip-gram model …” and pp. 1651-1652, section I, paragraph 3 “Our proposed method runs a fast post-processor on the output of any existing tool that constructs word embeddings, such as WORD2VEC … some embeddings are noisy or missing, due to sparse training data. We correct these problems by using a Gaussian graphical model that jointly models the embeddings of morphologically related words. Inference under this model can smooth the noisy embeddings that were observed in the WORD2VEC output …” teaches a p. 1653, section 4, paragraphs 3-4 “Our system’s output will be a guess of all of the w_i. Our system’s input consists of noisy estimates v_i for some of the w_i, as provided by a black-box word embedding system run on some large corpus … We assume that the black-box system would have recovered the “true” w_i if given enough data, but instead it gives a noisy small-sample estimate
[Image: media_image9.png]
where n_i is the count of word i in training corpus … This formula is inspired by the central limit theorem, which guarantees that v_i’s distribution would approach (4) (as n_i → ∞) if it were estimated by averaging a set of n_i noisy vectors drawn IID from any distribution with mean w_i (the truth) and covariance matrix i’” teaches sampling of noisy word embedding vectors from any distribution).
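The central limit theorem argument in the quoted passage can be illustrated with a minimal sketch: averaging n_i noisy vectors drawn IID around a true vector w_i gives an estimate v_i whose error shrinks as n_i grows. The uniform noise distribution, the three-dimensional vector, and the names `noisy_estimate` and `error` are assumptions chosen for illustration; they are not the model or the data of Cotterell et al.

```python
import math
import random


def noisy_estimate(w, n, rng):
    # Average n noisy vectors, each equal to w plus zero-mean uniform noise.
    total = [0.0] * len(w)
    for _ in range(n):
        for k, w_k in enumerate(w):
            total[k] += w_k + rng.uniform(-1.0, 1.0)
    return [t / n for t in total]


def error(v, w):
    # Euclidean distance between the estimate v and the true vector w.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


rng = random.Random(0)
w_true = [0.5, -1.0, 2.0]  # hypothetical "true" embedding w_i
err_small = error(noisy_estimate(w_true, 10, rng), w_true)      # n_i = 10
err_large = error(noisy_estimate(w_true, 10000, rng), w_true)   # n_i = 10000
```

Comparing `err_small` with `err_large` shows the small-sample estimate for a rare word is typically much further from the true vector than the estimate for a frequent word.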
Trask et al. and Cotterell et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the variational inference operation comprises a smoothing algorithm applied to the text of each time step, wherein the smoothing algorithm comprises: … sampling from a variational distribution as taught by Cotterell et al. to the disclosed system of Trask et al.
One of ordinary skill in the art would have been motivated to make this modification in order to exploit lexical relations documented in existing morphological resources (Cotterell et al., p. 1659, section 9, paragraph 1).
Trask et al., in view of Cotterell et al. does not appear to explicitly teach preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors. 
Hamilton et al. teaches preprocessing the text corpus to generate a positive count matrix for each time step (pp. 2-3, section 2.1.1, paragraph 1
[Image: media_image3.png]
[Image: media_image4.png]
teaches the Positive Pointwise Mutual Information matrix being the positive count matrix being generated within each time step (sliding window j of text containing word w_i and surrounding context c_j));
initializing a plurality of variational parameters for the word embedding vectors; initializing a plurality of variational parameters for the context embedding vectors (p. 3, section 2.1.3, paragraph 2 “SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t – Δ …”  and  p. 3, section 2.2, paragraph 2 “We follow the recommendations of … in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings … we used symmetric context windows of size 4 (on each side).  For SGNS … we used embeddings of size 300…” teach the initializing of the variational parameters (hyperparameters) of the word and context embeddings at time t for skip-gram negative sampling (SGNS) using context window of a particular size and embedding of a particular size).
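For orientation, a Positive Pointwise Mutual Information (PPMI) matrix of the kind referred to above can be computed from word-context co-occurrence counts as sketched below. This is the standard unsmoothed definition, PPMI(w_i, c_j) = max(0, log(p(w_i, c_j) / (p(w_i) p(c_j)))); any smoothing or shift applied in Hamilton et al. is omitted, and the function name `ppmi_matrix` and the toy counts are assumptions for illustration. The function would be applied once per time step to that time step's counts.

```python
import math


def ppmi_matrix(counts):
    # counts[i][j] = number of times word i co-occurs with context j
    # within the sliding window, for one time step.
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                # A pair that never co-occurs gets PPMI 0 by convention.
                out_row.append(0.0)
                continue
            # p(w, c) / (p(w) * p(c)) simplifies to n_ij * total / (row * col).
            pmi = math.log(n_ij * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(pmi, 0.0))
        result.append(out_row)
    return result


# Hypothetical co-occurrence counts for a single time step.
counts_t = [[2, 0], [0, 2]]
ppmi_t = ppmi_matrix(counts_t)
```

Every entry of the result is non-negative, which is the sense in which the matrix is "positive".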
Trask et al., Cotterell et al. and Hamilton et al. are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
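It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate preprocessing the text corpus to generate a positive count matrix for each time step; initializing a plurality of variational parameters for the word embedding vectors; and initializing a plurality of variational parameters for the context embedding vectors as taught by Hamilton et al. to the disclosed system of Trask et al. in view of Cotterell et al.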
One of ordinary skill in the art would have been motivated to make this modification in order to estimate the semantic displacement that a word has undergone during a certain time period (Hamilton et al. p. 4 section 2.4, paragraph 3).
Trask et al., in view of Cotterell et al. and in further view of Hamilton et al. does not appear to explicitly teach optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient.
Boffy teaches optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix (p. 11, section 1.3.1, paragraph 1 “… let’s simply address the case of a document-term matrix … These matrices are very common in natural language processing to represent documents as mathematical objects (matrices). Each row of such a matrix simply represents a document, while each column represents a word. If a given term appears in a document, the corresponding entry is non-zero, and the entry is zero otherwise” teaches a document-term matrix being analogous to word-context matrix (positive count matrix); p. 13, section 2.1, paragraphs 1-2 “The most common method used to perform SVD is the one implemented in LAPACK (Linear Algebra PACKage) and in most popular SVD algorithms. For instance, svd command of Matlab uses LAPACK routines. The basis of these methods lies in the reduction of the original matrix X to a bidiagonal form (i.e., matrix where only the main diagonal and superdiagonal entries are non-zero) by using orthogonal transformations called Householder reflections … It is then easy to compute the SVD of this bidiagonal matrix by using common methods for the computation of eigenvalues of symmetric matrices” teaches the process of creating a bidiagonal matrix as a necessary step towards computing an eigenvalue of a symmetric matrix using singular value decomposition (SVD); p. 18, section 2.2.2, paragraph 2

[Image: media_image10.png]

teaches variational parameters of a random vector u_0 (word embedding vector) optimized to converge to eigenvector u_k (word embedding vector) associated with an eigenvalue of a first symmetric matrix AA^T (the eigenvalue of AA^T can be determined using SVD via a first bidiagonal matrix) and teaches variational parameters of a random vector v_0 (context embedding vector) optimized to converge to eigenvector v_k associated with an eigenvalue of a second symmetric matrix A^T A (the eigenvalue of A^T A can be determined using SVD via a second bidiagonal matrix), wherein A is the positive count matrix); and … computing a reparameterization gradient (p. 46, section 4.2.2, paragraph 2

[Image: media_image11.png]

teaches the gradient with respect to word and context embedding vectors being the first derivatives of an error function; p. 47, section 4.2.2, paragraph 5 “… To perform one iteration of the gradient descent algorithm, one needs to compute all the derivatives of the error function, which form the full gradient …” and p. 47, section 4.2.2, paragraph 7 “The Stochastic Gradient Descent is usually a good alternative to the standard gradient descent for Machine Learning applications when data sets are particularly large … instead of computing the gradient by taking all of them into account, we update some parameters after each training example …” teaches computation of a gradient via stochastic gradient descent).
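The behavior described above, in which a random starting vector u_0 converges to an eigenvector u_k of AA^T and a random v_0 converges to an eigenvector v_k of A^T A, is characteristic of power iteration. The sketch below assumes plain power iteration on a small toy matrix; whether the cited passage of Boffy uses exactly this iteration is an assumption, and the sketch does not perform the bidiagonal reduction discussed in connection with LAPACK. The names `power_iteration`, `matvec`, `transpose`, and `normalize` are hypothetical.

```python
import math
import random


def matvec(M, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(x):
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def power_iteration(A, steps=100, seed=0):
    rng = random.Random(seed)
    At = transpose(A)
    # Random positive starting vectors u_0 and v_0 (positivity is a
    # simplification so the start is not orthogonal to the target here).
    u = normalize([rng.uniform(0.1, 1.0) for _ in A])
    v = normalize([rng.uniform(0.1, 1.0) for _ in At])
    for _ in range(steps):
        u = normalize(matvec(A, matvec(At, u)))  # u <- A A^T u, renormalized
        v = normalize(matvec(At, matvec(A, v)))  # v <- A^T A v, renormalized
    # Rayleigh quotient u^T (A A^T) u equals |A^T u|^2 for unit-length u.
    eigenvalue = sum(c * c for c in matvec(At, u))
    return u, v, eigenvalue


# Hypothetical stand-in for a (tiny) positive count matrix A.
A = [[3.0, 0.0], [0.0, 1.0]]
u_k, v_k, lam = power_iteration(A)
```

For this diagonal toy matrix the dominant eigenvalue of AA^T is the square of the largest singular value of A, which is the link between the two symmetric eigenproblems and the SVD of A.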
Trask et al., Cotterell et al., Hamilton et al. and Boffy are considered analogous art because they are directed towards efficient methods of extraction of useful information from electronic text.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate optimizing the plurality of variational parameters for the word embedding vectors using a first bidiagonal matrix; optimizing the plurality of variational parameters for the context embedding vectors using a second bidiagonal matrix; and … computing a reparameterization gradient as taught by Boffy to the disclosed system of Trask et al. in view of Cotterell et al. and in further view of Hamilton et al.
One of ordinary skill in the art would have been motivated to make this modification in order to approximate the SVD of a large matrix by using sampling methods (Boffy, p. 23, section 2.4, paragraph 1).

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:  Li et al. (“Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective”) teaches the skip-gram negative sampling model as a representation learning model and an explicit matrix factorization of a co-occurrence matrix (positive count matrix) that is directly obtained from a corpus of text.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHIAKA CHUKWUMA OKOROH whose telephone number is (571)272-3710.  The examiner can normally be reached on M - F 7:30 AM - 4:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on 571-272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHIAKA CHUKWUMA OKOROH/Examiner, Art Unit 2125   
                                         
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125