DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08/21/2018 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The subject matter of this application admits of illustration by a drawing to facilitate understanding of the invention.  Applicant is required to furnish a drawing under 37 CFR 1.81(c).  No new matter may be introduced in the required drawing.  Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d).
Specification
35 U.S.C. 112(a) or pre-AIA 35 U.S.C. 112, requires the specification to be written in “full, clear, concise, and exact terms.” The specification is replete with terms which are not clear, concise and exact. The specification should be revised carefully in order to comply with 35 U.S.C. 112(a) or pre-AIA 35 U.S.C. 112. Examples of some unclear, inexact or verbose terms used in the specification are:

[Equation image: media_image1.png]
equation (5) on p.7: parameters in the LHS of the conditional distribution and RHS of the softmax function are unclear.


[Equation image: media_image2.png]
equation (6) on p.7: parameters in the LHS of the probabilistic score and RHS of the softmax function are unclear.

equation (7) on p.7: the function and parameters in RHS of the log function are unclear.

[Equation image: media_image3.png]


equation (8) on p.8: functions and parameters in RHS of the two log functions are unclear.

[Equation image: media_image4.png]



[Equation image: media_image5.png]
equation (10) on p.8: the whole equation is very blurry. 


Claim Objections
Claims 1-33 are objected to because of the following informalities: 
In claims 2-3, 5, 8, 10-13, 15, 18-19, 21, 24, 26-29 and 31, the claims recite illegible equations. These claims are examined under their broadest reasonable interpretation in view of the illegible text. (Note: the examiner copied the equations from the specification for the 103 rejection. The equations in the specification are a little more legible.)
In claim 1 line 5, claim 17 line 6, and claim 33 line 5, “learning a vector representation for each of the classes” should be “learning a vector representation for each of the plurality of classes”
In claim 1 line 7, claim 17 line 8, and claim 33 line 7, “training the class vectors and words vectors” should be “training the class vectors and the words vectors”
In claim 1 line 8 and claim 33 line 8, “performing class vector based scoring” should be “performing a class vector based scoring”
In claim 1 line 9, claim 17 line 10, and claim 33 line 9, “performing feature selection based on class vectors” should be “performing the feature selection based on the class vectors”
In claims 2-16 and 17-32 (in the preamble of all the dependent claims), “… for text classification using class vectors as claimed…” should be “for the text classification using the class vectors as claimed”
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-33 are rejected under 35 U.S.C. 112(b)  or pre-AIA  35 U.S.C. 112, second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA  the applicant regards as the invention.
Claim 1 recites the limitation “comprising the steps of” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted “the steps” to be “a plurality of steps.”
Claims 1, 17 and 33 recite the limitation “in the same embedding space” in claim 1 line 6, claim 17 line 7 and claim 33 line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted “in the same embedding space” to be “within an embedding space.”
Claims 2 and 18 recite the limitations “the parameters of model, the prediction probability of the co-occurrence of words, the number of words in the sentence, the likelihood of the observed data, the current word, and the context word.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 3 and 19 recite the limitations “the softmax classifier, the dictionary.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 5 and 21 recite the limitations “the negative sampling, the sigmoid function.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 9 and 25 recite the limitations “the learning of multiple vectors, the document.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 10 and 26 recite the limitation “the K possible vectors.” There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted this limitation as one that does not require antecedent basis.
Claims 11 and 27 recite the limitation “the conversion of class vector and word vector similarity.” There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted this limitation as one that does not require antecedent basis.
Claims 12 and 28 recite the limitation “the maximum score.” There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted this limitation as one that does not require antecedent basis.
Claims 13 and 29 recite the limitations “the difference of the probability score of the class vectors, the matrix vector of the words.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 14 and 30 recite the limitation “the similarity score.” There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted this limitation as one that does not require antecedent basis.
Claims 15 and 31 recite the limitations “the approach, the expression.” There is insufficient antecedent basis for those elements in the claim. For examination purposes examiner has interpreted those elements as ones that do not require antecedent basis.
Claims 16 and 32 recite the limitation “the document frequency of word.” There is insufficient antecedent basis for this limitation in the claim. For examination purposes examiner has interpreted this limitation as one that does not require antecedent basis.
Claims 16 and 32 recite the limitation "information theoretic criteria such as conditional entropy and mutual information." The phrase "such as" renders the claim indefinite because it is unclear whether the limitations following the phrase are part of the claimed invention.  See MPEP § 2173.05(d). For examination purposes examiner has interpreted the criteria to be “mutual information” according to the equation recited in the claim.
All dependent claims are also rejected due to their dependency on a rejected claim.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 7-10, 13, 17, 23-26, 29 and 33 are rejected under 35 U.S.C. 103 as being unpatentable over Tang ("PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks") in view of Forman ("An Extensive Empirical Study of Feature Selection Metrics for Text Classification").
In regard to claims 1, 17 and 33, Tang teaches: A method for text classification and feature selection using class vectors, comprising the steps of: (see below Tang for text classification using class vectors, and Forman for feature selection using class vectors)
receiving a text/training corpus including a plurality of training features representing a plurality of objects from a plurality of classes; (Tang, p. 1167 "our goal is to learn a representation of text that is optimized for a given text classification task."; p. 1169 "We select two types of text corpora [a text/training corpus], which consist of either long or short documents."; p. 1167 "... Definition 3. (Word-Label Network) Word-label network, denoted as Gwl = (V ∪ L; Ewl), is a bipartite network that captures category-level word co-occurrences [class level]. L is a set of class labels [a plurality of classes] and V a set of words... Definition 4. (Heterogeneous Text Network) The heterogeneous text network is the combination of word-word, word-document, and word-label networks constructed from both unlabeled and labeled text data. It captures different levels of word co-occurrences [training features representing objects] and contains both labeled and unlabeled information.")
learning a vector representation for each of the classes along with word vectors in the same embedding space; (Tang, p. 1167 "The basic idea is to incorporate both the labeled and unlabeled information when learning the text embeddings. To achieve this, it is desirable to first have an unified representation to encode both types of information. In this paper, we propose different types of networks to achieve this, including word-word co-occurrence networks [word level], word-document networks, and word-label networks [class level]..."; p. 1168 "Definition 5. (Predictive Text Embedding) Given a large collection of text data with unlabeled and labeled information, the problem of predictive text embedding aims to learn low dimensional representations of words by embedding the heterogeneous text network constructed from the collection into a low dimensional vector space [the same embedding space]."; Word-label networks capture a vector representation for each of the classes. Word-word networks capture word vectors. That information is encoded in a unified vector space.)
training the class vectors and words vectors jointly using skip-gram approach; and (Tang, p. 1166 “The proposed method naturally extends our previous work of unsupervised information network embedding [27] and first learns a low dimensional embedding for words through a heterogeneous text network. The network encodes different levels of co-occurrence information between words and words, words and documents, and words and labels.”; p. 1169 "We call this approach joint training."; p. 1167 "Definition 1 (Word-Word Network) [skip-gram approach] ... Word-word cooccurrence network, denoted as Gww = (V;Eww), captures the word co-occurrence information in local contexts... Definition 3. (Word-Label Network) Word-label network, denoted as Gwl = (V ∪ L; Ewl), is a bipartite network that captures category-level word co-occurrences [class level]..."; p. 1168 "where ui is the embedding vector of vertex vi in VA… To learn the embeddings of the heterogeneous text network... which can be achieved by minimizing the following objective function…")

[Equation image: media_image6.png]
(The objective function of the skip-gram model is defined by Mikolov, and elsewhere in the literature, as Mikolov Eq(1), which is used to predict the distribution of the context given a word vector. Thus, the word-word network of Tang can be seen as a skip-gram approach because the two have the same problem definition and objective function.)

[Equation image: media_image7.png]
(In the specification, from the bottom of p. 6 to the top of p. 7, training class vectors and word vectors jointly using the skip-gram approach is defined in Eq (4), whose first term is the skip-gram objective.)

[Equation image: media_image8.png]
(Tang Eq(5) teaches the first term of the claimed Eq(4) i.e. training word-word network teaches skip-gram with word vectors, Tang Eq(7) teaches the second term of the claimed Eq(4) i.e. training word-label network teaches training with class and word vectors, and they are trained jointly. See claim 8 for more details. Therefore, Tang teaches training class vectors and word vectors jointly using skip-gram.)

Tang does not teach, but Forman teaches: performing class vector based scoring for a particular feature; and (Forman, p. 1291 "Scoring [class-vector-based-scoring] involves counting the occurrences of a feature in training positive- and negative-class training examples separately, and then computing a function of these.")
performing feature selection based on class vectors. (Forman, p. 1291 "The overall feature selection procedure is to score each potential feature according to a particular feature selection metric, and then take the best k features [performing feature selection]"; p. 1294 "PR: (Log) Probability Ratio is the sample estimate probability of the word given the positive class divided by the sample estimate probability of the word given the negative class."; Forman compares 12 feature selection methods and PR metric/score is based on positive and negative class vectors.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the input feature vector of Tang to incorporate the teachings of Forman by including the feature selection procedure. Doing so would allow the method to score each potential feature according to a particular feature selection metric.
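For illustration only, the scoring-then-selection procedure quoted from Forman can be sketched as follows. This is a minimal sketch and not code from either reference: the function names, the toy documents, the document-frequency estimate of each probability, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def log_probability_ratio_scores(pos_docs, neg_docs, smoothing=1.0):
    """Score each word by log(P(word | positive) / P(word | negative)).

    Assumption: each probability is estimated as the smoothed fraction of
    documents in the class that contain the word; smoothing avoids log(0).
    """
    pos_counts = Counter(w for doc in pos_docs for w in set(doc))
    neg_counts = Counter(w for doc in neg_docs for w in set(doc))
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        p_pos = (pos_counts[word] + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (neg_counts[word] + smoothing) / (len(neg_docs) + 2 * smoothing)
        # log(a / b) == log(a) - log(b)
        scores[word] = math.log(p_pos) - math.log(p_neg)
    return scores

def select_top_k(scores, k):
    """Return the k highest-scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Hypothetical toy corpus, one token list per document.
pos_docs = [["great", "movie"], ["great", "acting"], ["fine", "movie"]]
neg_docs = [["dull", "movie"], ["dull", "plot"], ["poor", "acting"]]
scores = log_probability_ratio_scores(pos_docs, neg_docs)
top_features = select_top_k(scores, 2)
```

In this toy corpus a word that appears equally often in both classes, such as "acting", receives a score of zero, while words concentrated in the positive class rank first.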

Claims 17 and 33 recite substantially the same limitations as claim 1, therefore the rejection applied to claim 1 also applies to claims 17 and 33. In addition, Tang teaches: (claim 17) a processor arrangement, (claim 33) A non-transitory computer-readable medium having computer executable instructions (Tang, p. 1172 "… on a single machine with 1T memory, 40 CPU cores at 2.0GHZ.")

In regard to claims 7 and 23, Tang and Forman teach: The method for text classification using class vectors as claimed in claim 1, wherein during the training, each class vector is represented by an id and every word in the sentence of that class co-occurs with its class vector. (Tang, p. 1167 "Definition 3. (Word-Label Network) Word-label network, denoted as Gwl = (V ∪ L; Ewl), is a bipartite network that captures category-level word co-occurrences [co-occurring with its class/category vector]. L is a set of class labels and V a set of words... The weight wij of the edge between word vi and class cj [class id]...")
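As a hedged illustration of the two kinds of co-occurrence discussed above, the sketch below builds word-word pairs inside a context window and class-word pairs that tie a class id to every word of a labeled sentence. The function names, the window size, and the class id string are hypothetical and appear in neither the claims nor Tang.

```python
def skipgram_pairs(sentence, window):
    """Return (current_word, context_word) pairs within +/- window positions."""
    pairs = []
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((word, sentence[j]))
    return pairs

def class_pairs(sentence, class_id):
    """Pair the class id with every word of a sentence labeled with that class."""
    return [(class_id, word) for word in sentence]

# Hypothetical labeled sentence.
sentence = ["the", "plot", "was", "dull"]
word_pairs = skipgram_pairs(sentence, window=1)
label_pairs = class_pairs(sentence, class_id="CLASS_NEG")
```

Training both pair lists against one shared embedding table is what places class vectors and word vectors in a single space in this sketch.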


[Equation image: media_image7.png]
In regard to claims 8 and 24, Tang and Forman teach: The method for text classification using class vectors as claimed in claim 7, wherein each class id has a window length of the number of words in that class with objective function as, where Nc is the number of classes, Nj is the number of words in classj, cj is the class id of the classj. (Tang, p. 1167 "Definition 1... V is a vocabulary of words… The weight wij of the edge between word vi and vj is defined as the number of times that the two words co-occur in the context windows of a given window size… Definition 3... between word vi and class cj [class id]... where ndi is the term frequency of word vi [the number of words] in document d, and ld is the class label [classes] of document d."; p. 1168 "To learn the embeddings of the heterogeneous text network... which can be achieved by minimizing the following objective function:"; The function, e.g. Eq (4), being maximized or minimized is called the objective function. In mathematics, maximizing a function is equivalent to minimizing its negative, i.e. the argument that maximizes f is the argument that minimizes -f. Eq (5) p (v|v) is p (w|w), Eq (7) p(v|l) is p(w|c), and Eq (4) is the sum of those two. "v_i and v_j are words co-occurring in a given window size" corresponds to "w_i+c and w_i, where c ϵ [-w, w], c≠0, the context words within a window size.")
[Equation image: media_image8.png]

In regard to claims 9 and 25, Tang and Forman teach: The method for text classification using class vectors as claimed in claim 1, wherein the learning of multiple vectors per class includes considering of each word in the documents of the corresponding class followed by estimating a conditional probability distribution...  conditioned on the current word (wi). (Tang, p.1168 "For each vertex vj in VB, Eq (1) defines a conditional distribution p(·|vj) over all the vertices in the set VA"; vj is the current word wi. Also see Tang2, p. 1070 for more details: "For each vertex vi, Eqn. (4) actually defines a conditional distribution p2(·|vi) over the contexts, i.e., the entire set of vertices in the network"; The examiner examines the claim under its BRI because the equation is blurry.)


[Equation image: media_image1.png]
In regard to claims 10 and 26, Tang and Forman teach: The method for text classification using class vectors as claimed in claim 1, wherein class vector (...) is sampled among the K possible vectors according conditional distribution as: where zi is a discrete random variable corresponding to the class vector is the kth class vector of the jth class. (Tang, p. 1168 "The objective (3) can be optimized with stochastic gradient descent using the techniques of edge sampling [27] and negative sampling [18]. In each step, a binary edge e = (i, j) is sampled with the probability proportional to its weight wij, and meanwhile multiple negative edges (i; j) are sampled from a noise distribution pn(j)."; "Algorithm 1: Joint training. Data: Gww; Gwd; Gwl, number of samples T, number of negative samples K. [T or K can be K possible vectors], sample an edge from Ewl... and update the word and label embeddings"; Edge sampling is based on Eq (3), which is calculated according to the conditional distribution p(·|vj). In Definition 3, label_l is the class, therefore j in the edge e(i,j) in Gwl corresponds to the kth class vector of the jth class. The examiner examines the claim under its BRI because the equation is blurry.)
[Equation image: media_image9.png]


[Equation image: media_image4.png]
In regard to claims 13 and 29, Tang and Forman teach: The method for text classification using class vectors as claimed in claim 1, wherein the prediction for the class of test data include step of: calculating the difference of the probability score of the class vectors and Logistic Regression classifier (CV-LR) as: where “w” is the matrix vector of the words in vocabulary. (Forman, p. 1294 "PR: (Log) Probability Ratio is the sample estimate probability of the word given the positive class divided by the sample estimate probability of the word given the negative class."; log (a/b) = log(a) - log(b); p. 1289 "In text classification... each position in the input feature vector corresponds to a given word or phrase."; feature vector is the matrix vector of the words)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the input feature vector of Tang to incorporate the teachings of Forman by including the feature selection procedure. Doing so would allow the method to score each potential feature according to a particular feature selection metric.

Claims 2-5 and 18-21 are rejected under 35 U.S.C. 103 as being unpatentable over Tang in view of Forman in view of Doquire in further view of Mikolov ("Distributed Representations of Words and Phrases and their Compositionality").

[Equation image: media_image10.png]
In regard to claims 2 and 18, Tang and Forman do not teach, but Mikolov teaches: The method for text classification using class vectors as claimed in claim 1, wherein under the skip-gram approach, the parameters of model are learnt to maximize the prediction probability of the co-occurrence of words vide function: where corpus is represented as w1, w2, w3,…, wn; N8 is the number of words in the sentence (corpus); L denotes the likelihood of the observed data; and (Mikolov, p. 2 "The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words [the co-occurrence of words] in a sentence or a document. More formally, given a sequence of training words w1, w2, w3,… wT [corpus / the number of words], the objective of the Skip-gram model is to maximize the average log probability [the likelihood of the observed data]...")

[Equation image: media_image6.png]

wi denotes the current word, while wi+c is the context word within a window of size w. (Mikolov, p. 2 "where c is the size of the training context [a window of size w] (which can be a function of the center word wt [wi current word / context word]).")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the prediction of the co-occurrence information between words and words (Word-Word Network), as taught by Tang, by a Skip-gram model and its objective function, as taught by Mikolov. Doing so would provide an efficient method for learning high-quality distributed vector representations. (Mikolov, p.1 "The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships.")


[Equation image: media_image11.png]
[Equation image: media_image12.png]
In regard to claims 3 and 19, Tang, Forman, Doquire and Mikolov teach: The method for text classification using class vectors as claimed in claim 1, wherein the prediction probability is calculated using the softmax classifier as: where T is number of unique words selected from corpus in the dictionary; and v'w is the vector representation of the context word. (Mikolov, p. 3 "The basic Skip-gram formulation defines p(wt+j|wt) using the softmax function:
where vw and v′w are the 'input' and 'output' vector representations of w, and W is the number of words in the vocabulary.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the prediction of the co-occurrence information between words and words (Word-Word Network), as taught by Tang, by a Skip-gram model and its objective function, as taught by Mikolov. Doing so would provide an efficient method for learning high-quality distributed vector representations.
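For illustration, the softmax form quoted from Mikolov, in which the probability of a context word is the exponentiated inner product of its "output" vector with the current word's "input" vector, normalized over the vocabulary, can be sketched as below. The three-word vocabulary and all vector values are hypothetical.

```python
import math

def softmax_probability(input_vec, output_vecs, target):
    """p(target | input) = exp(v'_target . v_input) / sum_w exp(v'_w . v_input)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = {w: dot(v_out, input_vec) for w, v_out in output_vecs.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(z - max_logit) for w, z in logits.items()}
    return exps[target] / sum(exps.values())

# Hypothetical "output" vectors for a three-word vocabulary.
output_vecs = {"movie": [1.0, 0.0], "plot": [0.0, 1.0], "dull": [-1.0, 0.0]}
v_input = [2.0, 0.0]  # hypothetical "input" vector of the current word
probs = {w: softmax_probability(v_input, output_vecs, w) for w in output_vecs}
```

The denominator runs over every word in the vocabulary, which is the cost that the hierarchical softmax and negative sampling passages of Mikolov are directed at reducing.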

In regard to claims 4 and 20, Tang, Forman, Doquire and Mikolov teach: The method for text classification using class vectors as claimed in claim 1, wherein Hierarchical Softmax function is used to speed up training by constructing a binary Huffman tree to compute probability distribution which gives logarithmic speedup. (Mikolov, p. 3 "A computationally efficient approximation of the full softmax is the hierarchical softmax... The main advantage is that instead of evaluating W output nodes in the neural network to obtain the probability distribution, it is needed to evaluate only about log2(W) nodes [logarithmic speedup]... In our work we use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training.")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the prediction of the co-occurrence information between words and words (Word-Word Network), as taught by Tang, by a Skip-gram model and its objective function, as taught by Mikolov. Doing so would provide an efficient method for learning high-quality distributed vector representations.
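The Huffman-tree point in the Mikolov passage can be illustrated with a short sketch that computes only the code length (tree depth) of each word, which is the number of inner nodes evaluated for that word under a hierarchical softmax. The word counts are hypothetical, and the sketch does not reproduce Mikolov's implementation.

```python
import heapq
import itertools

def huffman_code_lengths(word_counts):
    """Return {word: code length} for a binary Huffman tree built on word counts."""
    tie = itertools.count()  # tie-breaker so heap entries never compare dicts
    heap = [(count, next(tie), {word: 0}) for word, count in word_counts.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {word: 1 for word in word_counts}
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf in them gets one level deeper.
        c1, _, depths1 = heapq.heappop(heap)
        c2, _, depths2 = heapq.heappop(heap)
        merged = {w: d + 1 for w, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

# Hypothetical word frequencies.
counts = {"the": 50, "movie": 20, "plot": 15, "dull": 10, "acting": 5}
lengths = huffman_code_lengths(counts)
```

Frequent words end up near the root and therefore have short codes, consistent with the quoted statement that the Huffman tree "assigns short codes to the frequent words".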

[Equation image: media_image13.png]
[Equation image: media_image14.png]
In regard to claims 5 and 21, Tang, Forman, Doquire and Mikolov teach: The method for text classification using class vectors as claimed in claim 1, wherein the negative sampling which approximates is carried out using formula: where... is the sigmoid function and the word wj is sampled from probability distribution over words... (Mikolov, p. 3 "We define Negative sampling (NEG) by the objective… Thus the task is to distinguish the target word Wo from draws from the noise distribution Pn(w) [probability distribution over words] using logistic regression [sigmoid]…")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the prediction of the co-occurrence information between words and words (Word-Word Network), as taught by Tang, by a Skip-gram model and its objective function, as taught by Mikolov. Doing so would provide an efficient method for learning high-quality distributed vector representations.
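For illustration, a negative-sampling objective of the kind quoted from Mikolov rewards a high sigmoid score for the observed context word and a low sigmoid score for each word drawn from the noise distribution. The sketch below evaluates that quantity for one training pair; the vectors are hypothetical, and the negatives are fixed by hand rather than actually sampled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_objective(v_input, v_target, v_negatives):
    """log sigmoid(v_target . v_input) + sum_k log sigmoid(-v_neg_k . v_input)."""
    positive_term = math.log(sigmoid(dot(v_target, v_input)))
    negative_term = sum(
        math.log(sigmoid(-dot(v_neg, v_input))) for v_neg in v_negatives
    )
    return positive_term + negative_term

v_input = [1.0, 0.5]                       # hypothetical current-word vector
v_target = [0.8, 0.4]                      # hypothetical observed context word
v_negatives = [[-0.6, 0.1], [0.0, -0.9]]   # stand-ins for draws from the noise distribution
good = negative_sampling_objective(v_input, v_target, v_negatives)
bad = negative_sampling_objective(v_input, [-0.8, -0.4], v_negatives)
```

The objective is higher when the target vector aligns with the input vector (`good`) than when it points the opposite way (`bad`), and each evaluation touches only the target and the few negatives rather than the whole vocabulary.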

Claims 6 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Tang in view of Forman in view of Doquire in further view of Collobert (US 20110301942 A1).
In regard to claims 6 and 22, Tang and Forman do not teach, but Collobert teaches: The method for text classification using class vectors as claimed in claim 1, wherein the word vectors are updated by maximizing the likelihood (L) using stochastic gradient ascent. (Collobert, Stochastic Gradient [0046] "The log-likelihood of equation (7) can be maximized using stochastic gradient ascent...")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the stochastic gradient descent of Tang with the stochastic gradient ascent of Collobert. Doing so would allow the method to use ascent instead of descent because they are equivalents (minimizing negative log likelihood and maximizing log likelihood).
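The equivalence relied on above can be shown with a small sketch: an ascent step on a function L and a descent step on -L produce the same parameter update. The quadratic objective, learning rate, and step count are assumptions chosen for the example, and the updates are deterministic rather than stochastic to keep the sketch short.

```python
def gradient_ascent_step(theta, grad_L, lr):
    """Move theta uphill on L."""
    return theta + lr * grad_L(theta)

def gradient_descent_step(theta, grad_neg_L, lr):
    """Move theta downhill on -L."""
    return theta - lr * grad_neg_L(theta)

# Toy concave objective L(theta) = -(theta - 3)^2, maximized at theta = 3.
def grad_L(theta):
    return -2.0 * (theta - 3.0)

def grad_neg_L(theta):
    return 2.0 * (theta - 3.0)

theta_up = theta_down = 0.0
for _ in range(200):
    theta_up = gradient_ascent_step(theta_up, grad_L, lr=0.1)
    theta_down = gradient_descent_step(theta_down, grad_neg_L, lr=0.1)
```

Both trajectories coincide at every step and converge to the maximizer of L, which is also the minimizer of -L.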

Claims 11-12, 14, 27-28 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Tang in view of Forman in view of Doquire in further view of Tang2 ("LINE: Large-scale Information Network Embedding").

[Equation image: media_image15.png]
[Equation image: media_image2.png]
In regard to claims 11 and 27, Tang and Forman do not teach, but Tang2 teaches: The method for text classification using class vectors as claimed in claim 1, wherein the conversion of class vector and word vector similarity to probabilistic score using softmax function as: where... are the inner un-normalized jth class vector and ith word vector respectively. (Tang2, p. 1070 "The second-order proximity assumes that vertices sharing many connections to other vertices are similar to each other. In this case, each vertex is also treated as a specific 'context' and vertices with similar distributions over the 'contexts' are assumed to be similar. Therefore, each vertex plays two roles: the vertex itself [word] and a specific 'context' [class] of other vertices. We introduce two vectors u_i and u_i′, where u_i [word vector] is the representation of v_i when it is treated as a vertex while u_i ′ [class vector] is the representation of v_i when it is treated as a specific 'context'. For each directed edge (i, j), we first define the probability [probabilistic score]... where |V | is the number of vertices or 'contexts.'")
(The equation calculates the similarity of the word vector and the class vector, i.e. the more connections two vertices share, the more similar they are. Because it is a softmax function, its output lies in the range [0,1], i.e. it is a probability score. Further, Tang2 does not mention that the vectors are normalized, therefore those vectors are un-normalized. The examiner examines the claim under its BRI because the text of the equation is blurry.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Tang and Tang2 because they are in the same field of endeavor and Tang is based on Tang2, i.e. PTE (Tang) extends the LINE model (Tang2).


[Equation image: media_image3.png]
In regard to claims 12 and 28, Tang, Forman, Doquire and Tang2 teach: The method for text classification using class vectors as claimed in claim 1, wherein the prediction for the class of test data include step of: performing summation of probability score is done for all the words in sentence for each class and predict the class with the maximum score (CV Score) as

[Equation image: media_image16.png]
(Tang2, "By learning {u_i} i = 1..|V| and {u′_i} i = 1..|V| that minimize this objective, we are able to represent every vertex v_i with a d-dimensional vector u_i."; Eq(6) is the summation of probability scores. In mathematics, maximizing a function is equivalent to minimizing its negative, i.e. the argument that maximizes f is the argument that minimizes -f.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Tang and Tang2 because they are in the same field of endeavor and Tang is based on Tang2, i.e. PTE (Tang) extends the LINE model (Tang2).
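For illustration only, one possible reading of the claim language "summation of probability score ... for all the words in sentence for each class and predict the class with the maximum score" is sketched below. Because the claimed equation is illegible, this sketch is an assumption and not the applicant's formula: it sums softmax probabilities directly, whereas the actual equation may sum logarithms instead, and all vectors and names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_probabilities(word_vec, class_vecs):
    """Softmax over classes of the word-vector / class-vector inner products."""
    logits = {c: dot(v, word_vec) for c, v in class_vecs.items()}
    max_logit = max(logits.values())
    exps = {c: math.exp(z - max_logit) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def predict_class(sentence, word_vecs, class_vecs):
    """Sum per-word class scores and return the class with the maximum total."""
    totals = {c: 0.0 for c in class_vecs}
    for word in sentence:
        if word not in word_vecs:
            continue  # skip out-of-vocabulary words
        for c, p in class_probabilities(word_vecs[word], class_vecs).items():
            totals[c] += p
    return max(totals, key=totals.get), totals

# Hypothetical vectors in a shared two-dimensional space.
class_vecs = {"pos": [1.0, 0.0], "neg": [-1.0, 0.0]}
word_vecs = {"great": [2.0, 0.0], "dull": [-2.0, 0.0], "movie": [0.0, 1.0]}
predicted, totals = predict_class(["great", "movie", "great"], word_vecs, class_vecs)
```

A word orthogonal to both class vectors, such as "movie" here, contributes equally to every class and so does not affect the prediction.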

In regard to claims 14 and 30, Tang, Forman, Doquire and Tang2 teach: The method for text classification using class vectors as claimed in claim 1, wherein the similarity between class vectors and word vectors is computed after normalizing them by their l2-norm and (Tang2, p. 1072 "All the embedding vectors are finally normalized by setting ||w||_2 = 1 [l2-norm]") using the difference between the similarity score as features in bag of words model (norm CV-LR). (Forman, p. 1292, 2.1 Metrics Considered "Here we enumerate the feature selection metrics we evaluated…" p. 1294 "PR: (Log) Probability Ratio is the sample estimate probability of the word given the positive class divided by the sample estimate probability of the word given the negative class."; p. 1289 "In text classification, one typically uses a ‘bag of words’ model: each position in the input feature vector corresponds to a given word or phrase..."; log(a/b) = log(a) - log(b) is the difference between the similarity scores, which is used to select features for a bag-of-words model.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the input feature vector of Tang to incorporate the teachings of Forman by including the feature selection procedure. Doing so would allow the method to score each potential feature according to a particular feature selection metric.
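(Illustrative sketch only, not taken from Tang2, Forman, or the application: the following shows l2-normalization of vectors, a difference of similarity scores used as a single feature value, and the identity log(a/b) = log(a) - log(b) relied on above. It assumes the similarity is an inner product of normalized vectors. The names l2_normalize and similarity_difference and all numeric values are hypothetical.)

```python
import math

def l2_normalize(v):
    """Scale v so that its l2-norm equals 1 (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_difference(word_vec, pos_class_vec, neg_class_vec):
    """Similarity to the positive class minus similarity to the negative class."""
    w = l2_normalize(word_vec)
    p = l2_normalize(pos_class_vec)
    n = l2_normalize(neg_class_vec)
    return dot(w, p) - dot(w, n)

# Hypothetical vectors: the word normalizes to [0.6, 0.8].
feature = similarity_difference([3.0, 4.0], [1.0, 0.0], [0.0, 1.0])

# Log probability ratio written as a difference of logs (hypothetical estimates).
a, b = 0.3, 0.06
log_ratio = math.log(a / b)
log_difference = math.log(a) - math.log(b)
```

(In this sketch the feature value is 0.6 - 0.8, and log_ratio and log_difference agree up to floating-point rounding.)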

Claims 15 and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Tang in view of Forman in view of Doquire in further view of Rousu ("Kernel-Based Learning of Hierarchical Multilabel Classification Models").

[Equation image: media_image17.png]
In regard to claims 15 and 31, Tang and Forman do not teach, but Rousu teaches: The method for text classification using class vectors as claimed in claim 1, wherein in order to extend the approach for multiclass and multilabel classification, feature vector for each class is constructed and for class 1, the expression becomes, (Rousu, p. 1601 Abstract "We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time... The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification [multiclass and multilabel classification] learning algorithms."; p. 1602 "A vector y = (y1,...,yk) ∈ Y is called the multilabel and the components yj are called the microlabels"; The examiner examines the claim under its broadest reasonable interpretation (BRI) because the text of the equation is blurry.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to extend the concept of class of Tang to incorporate the teachings of Rousu by including multicategory or multilabel classification. Doing so would allow the text classification to be based on multicategory class vectors.

Claims 16 and 32 are rejected under 35 U.S.C. 103 as being unpatentable over Tang in view of Forman in further view of Doquire ("Mutual information-based feature selection for multilabel classification").

[Equation image: media_image18.png]
In regard to claims 16 and 32, Tang and Forman do not teach, but Doquire teaches:
[Equation image: media_image19.png]
The method for text classification using class vectors as claimed in claim 1, wherein the feature selection in the corpus is selected by information theoretic criteria such as conditional entropy and mutual information I(C;w) for each word as ... where p(w) is calculated from the document frequency of word. (Doquire, p. 149 "In such problems, the probability distribution of the (discrete) class variable Y can be estimated as p(y=yl) = nl/N, with nl the number of points whose class value is yl. Rewriting the estimated MI in terms of entropies, it gives as...")

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the feature selection of Forman to incorporate the teachings of Doquire by including mutual information. Doing so would allow the method to take the joint relevance and redundancy of features into account during the feature selection process.
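(Illustrative sketch only, not taken from Doquire or from the application: the following estimates the mutual information I(C;w) between a class variable and the presence of a word, with all probabilities estimated from counts such as p(y = yl) = nl/N and p(w) taken from the document frequency of the word. The standard definition of mutual information is assumed because the claimed equation is available only as an image. The function name mutual_information and the toy corpus are hypothetical.)

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """I(C;w) over classes and word presence/absence, estimated from counts."""
    n = len(docs)
    class_counts = Counter(labels)                        # n_l for each class
    present_counts = Counter(word in doc for doc in docs)  # document frequency
    joint = Counter((lab, word in doc) for doc, lab in zip(docs, labels))
    mi = 0.0
    for (lab, present), count in joint.items():
        p_joint = count / n
        p_class = class_counts[lab] / n
        p_word = present_counts[present] / n
        # Only observed (class, presence) pairs appear, so p_joint > 0 here.
        mi += p_joint * math.log(p_joint / (p_class * p_word))
    return mi

# Hypothetical corpus: each document is a set of words with one class label.
docs = [{"great", "movie"}, {"great", "plot"}, {"dull", "movie"}, {"dull", "plot"}]
labels = ["pos", "pos", "neg", "neg"]
mi_great = mutual_information(docs, labels, "great")
mi_movie = mutual_information(docs, labels, "movie")
```

(In this toy corpus "great" perfectly separates the two classes and therefore carries the maximum information, while "movie" occurs equally in both classes and carries none, so a selection step based on this criterion would keep "great" before "movie".)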

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.C./Examiner, Art Unit 2122                 


/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122