DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee. Authorization for this examiner’s amendment was given in an interview with Michal Brandt on 24 June 2022.
The application has been amended as follows:

	1.	(Currently Amended) One or more non-transitory computer-readable media storing instructions, which when executed by one or more hardware processors, cause:
generating a respective feature vector for each respective document in a training dataset of documents, wherein the respective feature vector is generated based at least in part on occurrence, in the respective document, for each respective token in a vocabulary;
training a machine learning model to estimate unknown labels for documents based at least in part on the feature vector for each respective document in the training dataset;
receiving a new document with an unknown label;
identifying a first set of one or more known tokens and a second set of one or more unknown tokens within the new document;
determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token, wherein the set of one or more known tokens for a respective unknown token is determined based at least in part on a distance between the respective unknown token and different clusters of known tokens;
generating a feature vector for the new document based at least in part on an occurrence metric for a particular known token in the vocabulary, wherein the occurrence metric for the particular known token is determined based at least in part on how many times the particular known token occurs in both (a) the first set of one or more known tokens and (b) the one or more known tokens in the vocabulary that represent each respective unknown token in the second set of one or more unknown tokens; and
estimating, by the trained machine learning model, the unknown label for the new documents based at least in part on the feature vector for the new document.
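
For illustration only, the identifying, determining, and generating steps recited in claim 1 might be sketched in Python as follows. The function name, the toy two-dimensional token vectors, the use of Euclidean distance, and the choice of a single representative known token per cluster are all assumptions made for this sketch. The claim itself requires only that the known tokens be determined based at least in part on a distance between the unknown token and different clusters of known tokens.

```python
import math
from collections import Counter


def build_feature_vector(tokens, vocabulary, token_vectors, clusters):
    """Return one occurrence count per vocabulary token.

    clusters is a list of (centroid, representative_known_token) pairs
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for token in tokens:
        if token in vocabulary:
            # Known token: count it directly.
            counts[token] += 1
        elif token in token_vectors:
            # Unknown token: find the cluster whose centroid is nearest and
            # count that cluster's representative known token instead.
            _, representative = min(
                clusters,
                key=lambda c: math.dist(token_vectors[token], c[0]),
            )
            counts[representative] += 1
        # Assumption: unknown tokens without a vector are skipped.
    return [counts[v] for v in vocabulary]


vocabulary = ["good", "bad", "movie"]
token_vectors = {"great": (0.9, 0.1), "awful": (0.1, 0.9)}
clusters = [((1.0, 0.0), "good"), ((0.0, 1.0), "bad")]
tokens = ["great", "movie", "good", "awful", "great"]

print(build_feature_vector(tokens, vocabulary, token_vectors, clusters))
# prints [3, 1, 1]
```

In this toy input the count for "good" is 3 because the token occurs once as a known token and twice more as the representative of the unknown token "great", which corresponds to parts (a) and (b) of the occurrence metric.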

2.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token comprises determining, for each respective unknown token in the second set of one or more unknown tokens, a first respective vector representation for the respective unknown token; and identifying a second respective vector representation for a respective known token that is closest to the first respective vector representation.

3.	(Original) The one or more non-transitory computer-readable media of Claim 2, wherein determining which vector representation is closest to the first respective vector representation is based on characteristics determined from documents outside of a domain associated with the training dataset of documents.

4.	(Original) The one or more non-transitory computer-readable media of Claim 1, further comprising:
generating a set of clusters, wherein each cluster includes a subset of one or more known tokens from the vocabulary;
wherein determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token comprises identifying a subset of one or more clusters from the set of clusters that are closest to the unknown token; and selecting at least one known token from at least one cluster of the subset of one or more clusters. 
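
As a non-authoritative sketch of the cluster-based selection recited in claim 4, the Python below ranks clusters by the distance from an unknown token's vector to each cluster centroid, keeps the k closest clusters, and selects the nearest known token from each. The helper names, the mean-of-members centroid, the parameter k, and the toy vectors are assumptions introduced for illustration; the claim does not specify how clusters are generated or how distance is measured.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def closest_cluster_tokens(unknown_vec, clusters, known_vecs, k=1):
    """Pick one known token from each of the k clusters nearest to unknown_vec."""
    ranked = sorted(
        clusters,
        key=lambda members: math.dist(
            unknown_vec, centroid([known_vecs[t] for t in members])
        ),
    )
    return [
        min(members, key=lambda t: math.dist(unknown_vec, known_vecs[t]))
        for members in ranked[:k]
    ]


known_vecs = {
    "good": (1.0, 0.0),
    "fine": (0.8, 0.2),
    "bad": (0.0, 1.0),
    "poor": (0.2, 0.8),
}
clusters = [["good", "fine"], ["bad", "poor"]]

print(closest_cluster_tokens((0.95, 0.05), clusters, known_vecs, k=1))
# prints ['good']
```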

5.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token is performed based at least in part on linguistic context determined based at least in part on out-of-domain characteristics associated with the second set of one or more unknown tokens.

6.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein the second set of one or more unknown tokens includes words that were not present in the training dataset of documents.

7.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein each respective token in the vocabulary is associated with a weight that is inversely related to the frequency of the respective token in the training dataset of documents.
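
One common way to obtain a weight that is inversely related to token frequency, as recited in claim 7, is a smoothed inverse document frequency. The sketch below assumes that weighting scheme; the claim does not name a specific formula, and the function name and the toy documents are hypothetical.

```python
import math


def inverse_document_frequency(documents, vocabulary):
    """Map each vocabulary token to a smoothed IDF weight.

    documents is a list of token lists. The +1 terms are a smoothing
    assumption that avoids division by zero for tokens seen in no document.
    """
    n_docs = len(documents)
    weights = {}
    for token in vocabulary:
        doc_freq = sum(1 for doc in documents if token in doc)
        weights[token] = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    return weights


documents = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
weights = inverse_document_frequency(documents, ["the", "cat"])
print(weights["the"] < weights["cat"])
# prints True
```

A token that appears in every training document ("the") receives the lowest weight, while a rarer token ("cat") receives a higher one.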

8.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein the feature vector for the new document is equal in length to the number of tokens in the vocabulary.

9.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein the instructions further cause triggering an automated social media post if the unknown label satisfies a set of criteria associated with the social media post.

10.	(Original) The one or more non-transitory computer-readable media of Claim 1, wherein the instructions further cause:
training a plurality of models using different vocabulary parameters;
determining estimation errors for each model in the plurality of models; and
selecting vocabulary parameters based at least in part on which model of the plurality of models has a lowest estimation error.
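
The model-selection steps recited in claim 10 can be pictured with the minimal sketch below. The function name, the representation of vocabulary parameters as tuples, and the stand-in error table are assumptions made for illustration; in practice the scoring callback would train a model with the given parameters and return its estimation error on held-out documents.

```python
def select_vocabulary_parameters(candidates, train_and_score):
    """Train one model per candidate and return the lowest-error candidate.

    train_and_score is a caller-supplied callback (hypothetical here) that
    trains a model with the given parameters and returns its estimation error.
    """
    errors = {params: train_and_score(params) for params in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# Stand-in errors keyed by (max_vocabulary_size, min_token_count); these
# numbers are invented solely to exercise the function.
toy_errors = {(1000, 1): 0.31, (5000, 2): 0.24, (20000, 5): 0.27}
best, errors = select_vocabulary_parameters(list(toy_errors), toy_errors.get)
print(best)
# prints (5000, 2)
```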

11.	(Currently Amended) One or more non-transitory computer-readable media storing instructions, which when executed by one or more hardware processors, cause:
generating a respective feature vector for each respective document in a training dataset of documents, wherein the respective feature vector is generated based at least in part on how frequently each respective token in a vocabulary occurs in the respective document and a vector representation for each respective token;
training a machine learning model to estimate unknown labels for documents based at least in part on the feature vector for each respective document in the training dataset;
receiving a new document with an unknown label;
identifying a set of tokens within the new document;
mapping unknown tokens in the set of tokens to a respective vector representation;
generating a feature vector for the new document based at least in part on the respective vector representation for the unknown tokens in the set of tokens and how often tokens in the vocabulary occur in the set of tokens, wherein the feature vector for the new document includes at least (a) a first part that is generated as a function of the respective vector representation for the unknown tokens and (b) a second part that is generated as a function of how often tokens in the vocabulary occur in the set of tokens; and
estimating, by the trained machine learning model, the unknown label for the new documents based at least in part on the feature vector for the new document.
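
For illustration only, the two-part feature vector recited in claim 11 might be assembled as in the Python sketch below: a first part computed from the vector representations of the unknown tokens, and a second part computed from how often vocabulary tokens occur. Averaging the unknown-token vectors (compare claim 13) and concatenating the two parts (compare claim 16) are choices made for this sketch, and the function name, the dim parameter, and the toy word vectors are assumptions.

```python
from collections import Counter


def hybrid_feature_vector(tokens, vocabulary, word_vectors, dim):
    """Concatenate an averaged unknown-token vector with vocabulary counts."""
    vocab_set = set(vocabulary)
    counts = Counter(t for t in tokens if t in vocab_set)
    # Vectors for tokens outside the vocabulary (skipped if no vector exists).
    unknown = [
        word_vectors[t] for t in tokens if t not in vocab_set and t in word_vectors
    ]
    if unknown:
        first_part = [sum(v[i] for v in unknown) / len(unknown) for i in range(dim)]
    else:
        # Assumption: use a zero vector when the document has no unknown tokens.
        first_part = [0.0] * dim
    second_part = [float(counts[v]) for v in vocabulary]
    return first_part + second_part


vocabulary = ["good", "movie"]
word_vectors = {"great": (1.0, 0.0), "superb": (0.0, 1.0)}
tokens = ["great", "movie", "superb", "good", "movie"]

print(hybrid_feature_vector(tokens, vocabulary, word_vectors, dim=2))
# prints [0.5, 0.5, 1.0, 2.0]
```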

12.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein mapping unknown tokens in the set of tokens to a respective vector representation comprises:
determining a respective word vector for each unknown token in the set of tokens; and
aggregating the respective word vectors for unknown tokens in the set of tokens.

13.	(Original) The one or more non-transitory computer-readable media of Claim 12, wherein aggregating the respective word vector comprises averaging the word vectors.

14.	(Original) The one or more non-transitory computer-readable media of Claim 12, wherein the respective word vector for each unknown token is determined, based at least in part, on a linguistic context learned from a different corpus of documents than the training dataset of documents.

15.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein the vocabulary is a reduced vocabulary that is generated by removing at least one token from a full vocabulary of tokens extracted from the training dataset.
	
16.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein generating the feature vector for the new document comprises concatenating the vector representation with a second vector that is generated based at least in part on an occurrence metric of each respective token in the vocabulary.

17.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein each respective token in the vocabulary is associated with a weight that is inversely related to the frequency of the respective token in the training dataset of tokens.

18.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein the instructions further cause triggering an automated social media post if the unknown label satisfies a set of criteria associated with the social media post.

19.	(Original) The one or more non-transitory computer-readable media of Claim 11, wherein the instructions further cause:
training a plurality of models using different combinations of tokens in the reduced vocabulary;
determining estimation errors for each model in the plurality of models; and
selecting a reduced vocabulary based at least in part on which model of the plurality of models has a lowest estimation error.

20.	(Currently Amended) A system comprising:
one or more hardware processors;
one or more non-transitory computer-readable media storing instructions, which when executed by the one or more hardware processors, cause:
generating a respective feature vector for each respective document in a training dataset of documents, wherein the respective feature vector is generated based at least in part on occurrence, in the respective document, for each respective token in a vocabulary;
training a machine learning model to estimate unknown labels for documents based at least in part on the feature vector for each respective document in the training dataset;
receiving a new document with an unknown label;
identifying a first set of one or more known tokens and a second set of one or more unknown tokens within the new document;
determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token, wherein the set of one or more known tokens for a respective unknown token is determined based at least in part on a distance between the respective unknown token and different clusters of known tokens;
generating a feature vector for the new document based at least in part on an occurrence metric for a particular known token in the vocabulary, wherein the occurrence metric for the particular known token is determined based at least in part on how many times the particular known token occurs in both (a) the first set of one or more known tokens and (b) the one or more known tokens in the vocabulary that represent each respective unknown token in the second set of one or more unknown tokens; and
estimating, by the trained machine learning model, the unknown label for the new documents based at least in part on the feature vector for the new document.


Allowable Subject Matter
Claims 1-20 are allowed.  The following is an examiner’s statement of reasons for allowance: 

Regarding claim 1, the closest prior art of record, Saputra et al. (An Ensemble Approach to Handle Out of Vocabulary in Multilabel Document Classification), teaches one or more non-transitory computer-readable media storing instructions, which when executed by one or more hardware processors (Section VI, results, computational memory and processing time), cause:
generating a respective feature vector for each respective document in a training dataset of documents, wherein the respective feature vector is generated based at least in part on occurrence, in the respective document, for each respective token in a vocabulary (Section IIIA, processing training documents to generate feature vectors);
training a machine learning model to estimate unknown labels for documents based at least in part on the feature vector for each respective document in the training dataset (Section IIIA, training the model);
receiving a new document with an unknown label (Section IIIA, Out of Vocabulary words);
identifying a first set of one or more known tokens and a second set of one or more unknown tokens within the new document (Section IIIB, processing documents with OOV words);
generating a feature vector for the new document based at least in part on an occurrence metric for a particular known token in the vocabulary (Section IIIB); and
estimating, by the trained machine learning model, the unknown label for the new documents based at least in part on the feature vector for the new document (Section IIIB, classifying document).
Saputra does not teach mapping unknown tokens to known tokens and using the result to generate the feature vector. These features are taught by Mohandas et al. (U.S. Patent No. 11,222,031) at col. 31, lines 40-65. It would have been obvious to combine the references in order to improve document classification (11,222,031, Background).
However, the prior art of record does not teach or fairly suggest the limitations of “determining, for each respective unknown token in the second set of one or more unknown tokens, one or more known tokens in the vocabulary to represent the unknown token, wherein an unknown token in the set of one or more unknown tokens is mapped to a known token based at least in part on a computed similarity between the unknown token and the known token;
generating a feature vector for the new document based at least in part on an occurrence metric of each respective token for a particular known token in the vocabulary, wherein the occurrence metric for the particular known token is determined based at least in part on how many times the particular known token occurs in both (a) the first set of one or more known tokens and (b) the one or more known tokens in the vocabulary that represent each respective unknown token in the second set of one or more unknown tokens” when combined with each and every other limitation of the claim. Therefore, claim 1 is allowable.

Claims 11 and 20 contain similar limitations as claim 1 and therefore are allowable as well.

Claims 2-10 and 12-19 depend on and further limit claims 1 and 11 and therefore are allowable as well.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached on (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

DOUGLAS GODBOLD
Examiner
Art Unit 2655



/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655