DETAILED ACTION
	This Office Action is in response to an amendment filed 02/03/2022.
	Claims 1-5, 9-15, and 18-20 are pending.
	Claims 6-8 and 16-17 have been cancelled.
	Claims 1, 13 and 19 are independent claims.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

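Claims 1-5, 9-10, 12-15 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (hereinafter Kumar, U.S. Patent Application Publication No. 2020/0034749 A1, filed 07/26/2018, published 01/30/2020) in view of Enuka et al. (hereinafter Enuka, U.S. Patent Application Publication No. 2020/0311414 A1), and in further view of Heckel et al. (hereinafter Heckel, U.S. Patent No. 10,169,315 B1), and in further view of Rujan.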
Regarding independent claim 1, Kumar teaches:
A method, performed by one or more computing devices, for automated determination of semantic overlap between classes (at least Abstract; p. 2, [0022], [0025] → Kumar teaches a method for semi-autonomous training corpus refinement by applying to the training corpus “inter-class” overlap and noise reduction treatments), the method comprising:
receiving a data set comprising a plurality of documents and a plurality of classes (at least p. 2, [0022]; [0035]; Figure 1 → Kumar teaches providing, to a “Corpus Advisor”, a training corpus 102 of data (e.g., documents, see p. 2, [0022]) that is an initial seed including training samples for respective classes. The initial corpus is provided via a subject matter expert (SME) or other entity presenting correctly-defined classes), wherein the plurality of documents comprise text content (at least p. 2, [0022]; p. 3, [0035]-[0036]; Figure 1 → Kumar teaches that a refined training corpus may be used to train a classifier, such as text classification models that provide, for example, response variations in a multi-turn dialog), and wherein the data set comprises indications of which of the plurality of classes have been assigned to which of the plurality of documents (at least p. 3, [0035]-[0036]; Figure 1 → Kumar teaches receipt, by a “Corpus Advisor”, of a training corpus 102 of data that contains text-based training samples for respective classes);
Kumar fails to explicitly teach:
for each document of the plurality of documents, filtering the document to remove stop words, wherein filtering the document to remove stop words comprises filtering words that are less than a threshold length.
However, Enuka teaches:
for each document of the plurality of documents, filtering the document to remove stop words, wherein filtering the document to remove stop words comprises filtering words that are less than a threshold length (at least p. 3, [0027]-[0036]; Figure 2 → Enuka teaches preprocessing of a document that includes one or more of removing HTML, XML and/or other programming language tags, removing excess whitespace, removing short words (e.g., words comprising fewer than 3 or 4 characters), removing numeric characters, and/or word stemming or lemmatization).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Enuka with those of Kumar as both inventions relate to aspects of preparing content for use as training data. Adding the teaching of Enuka provides Kumar with methods of simplifying the classification of documents by, for example, removing certain “short words” such that they do not interfere with the classification process.
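For illustration only, the length-based filtering described above can be sketched as follows. The function name `remove_short_words`, the whitespace tokenization, and the default threshold of 3 characters are assumptions made for this sketch; they are not taken from Enuka or from the claims.

```python
def remove_short_words(text: str, min_length: int = 3) -> str:
    """Drop whitespace-delimited words shorter than min_length characters.

    min_length is a hypothetical threshold; the cited passage mentions
    words of fewer than 3 or 4 characters as examples of "short words".
    """
    kept = [word for word in text.split() if len(word) >= min_length]
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_short_words("it is a test of the filter"))
```

With the default threshold, only words of three or more characters survive, so short function words such as "it", "is" and "of" are dropped before any vectorization takes place.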
Kumar and Enuka fail to explicitly teach:
for each document of the plurality of documents, filtering the document to replace certain terms with tokens, wherein the certain terms comprise email addresses, dates, numbers, and URLs
However, Heckel teaches:
for each document of the plurality of documents, filtering the document to replace certain terms with tokens, wherein the certain terms comprise email addresses, dates, numbers, and URLs (at least col. 2, line 38 through col. 5, line 15; Figures 1A-B; col. 6, line 9 through col. 7, line 37; Figure 2 → Heckel teaches identification and replacement of email addresses, dates and numbers in training data with labels (e.g., tokens). Here, the Examiner groups dates and numbers, as a date can comprise a sequence of numbers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Heckel with those of Kumar and Enuka as all three inventions relate to aspects of preparing content for use as training data. Adding the teaching of Heckel provides Kumar and Enuka with methods of de-personalizing documents intended to be used as training data.
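For illustration only, replacement of such terms with tokens can be sketched with regular expressions. The patterns, the token strings (`<url>`, `<email>`, `<date>`, `<number>`), and the function name `replace_terms` are hypothetical and deliberately simple; they are not the labels or the detection method of Heckel.

```python
import re

# Hypothetical patterns. Order matters: URLs, e-mail addresses and dates are
# replaced before the generic number pattern can match the digits inside them.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "<date>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<number>"),
]


def replace_terms(text: str) -> str:
    """Replace each matched term with its placeholder token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


if __name__ == "__main__":
    print(replace_terms("Mail bob@example.com on 02/03/2022 about 42 items"))
```

Applying the date pattern before the number pattern reflects the grouping noted above: a date is a sequence of numbers, so a number-first pass would break it into separate number tokens.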
Kumar, Enuka and Heckel fail to explicitly teach:
generating a single vector representation for the document after the document has been filtered, wherein each single vector representation has a same number of elements.
However, Rujan teaches:
for each document of the plurality of documents, generating a single vector representation for the document, wherein each single vector representation has a same number of elements (at least Abstract; col. 2, line 7 through col. 5, line 35; col. 6, lines 20-62 → Rujan teaches representing each of said plurality of documents by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Rujan with those of Kumar, Enuka and Heckel as all these inventions are related to the classification of documents. Adding the teaching of Rujan provides Kumar, Enuka and Heckel with the ability of vectorizing documents.
Kumar further teaches:
for each class of the plurality of classes: generating a single aggregated vector from the single vectors that represent the documents of the class, wherein the single aggregated vector has the same number of elements (at least p. 3, [0036]-[0037]; Figure 1 → Kumar teaches that feature vectors are generated for each class; the class feature vectors comprising elements comprising different words/entities (tokens) from the samples present in each class (e.g., aggregated vectors for each class)).
for each pair of classes of the plurality of classes:
generating an overlap value for the pair of classes, wherein the overlap value represents textual overlap between the pair of classes indicating how much semantic overlap is present between the pair of classes (at least pp. 3-4, [0038]; Figure 1 → Kumar teaches determination of class overlap between pairs of class feature vectors), and
wherein the overlap value for the pair of classes is generated based on the single aggregated vectors for the pair of classes (at least p. 4, [0041]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 into which an aggregated feature space vectorized model 114 is fed. A cosine similarity algorithm calculates the cosine overlap between each pair of feature vectors. If the angle between two feature vectors is less than a threshold, the two feature vectors are significantly overlapping); and
outputting a representation of the overlap values for each pair of classes (at least p. 4, [0041] → Kumar teaches an overlap treatment unit 138 that generates an interactive classifier dashboard 126 that allows a user to select or deselect classes represented in the corpus to visualize the inter-class effect and analyze the risk in terms of overlap or anomaly present in selected class(es)).
wherein the method is performed as a pre-processing operation before the plurality of classes are used for machine learning modeling (at least Abstract → Kumar teaches a method of training corpus refinement by applying to the training corpus overlap and noise reduction techniques).

Regarding dependent claim 2, Kumar, Enuka and Heckel fail to explicitly teach:
generating the single vector representation for each document comprises: calculating a term-frequency inverse-document-frequency (tf-idf) representation for each document.
However, Rujan teaches:
generating the single vector representation for each document comprises: calculating a term-frequency inverse-document-frequency (tf-idf) representation for each document (at least Abstract; col. 2, line 7 through col. 5, line 35; col. 6, lines 20-62 → Rujan teaches representing each of said plurality of documents by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Rujan with those of Kumar, Enuka and Heckel as all these inventions are related to the classification of documents. Adding the teaching of Rujan provides Kumar, Enuka and Heckel with the ability of vectorizing documents.
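For illustration only, a fixed-length term-weighted vector per document can be sketched as follows. The function name `tfidf_vectors`, the whitespace tokenization, and the particular weighting (raw term count multiplied by log(N/df)) are assumptions for this sketch; tf-idf has several common variants, and neither Rujan nor the claims are asserted to use this one.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors); every vector has len(vocabulary) elements."""
    tokenized = [doc.lower().split() for doc in docs]
    # One shared, sorted vocabulary gives every document the same dimensions.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(["reset password", "reset router"])
    print(vocab)
    print(vectors)
```

Because the vocabulary is built once over all documents, each vector has one element per distinct word in the collection, and a term that appears in every document receives a weight of zero under this variant.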

Regarding dependent claim 3, Kumar, Enuka and Heckel fail to explicitly teach:
the same number of elements represents a number of words in the plurality of documents.
However, Rujan teaches:
the same number of elements represents a number of words in the plurality of documents (at least Abstract; col. 2, line 7 through col. 5, line 35; col. 6, lines 20-62 → Rujan teaches representing each of said plurality of documents by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Rujan with those of Kumar, Enuka and Heckel as all these inventions are related to the classification of documents. Adding the teaching of Rujan provides Kumar, Enuka and Heckel with the ability of vectorizing documents.

Regarding dependent claim 4, Kumar teaches:
the overlap values for each pair of classes are generated by calculating a scalar product from the single aggregated vectors of the pair of classes (at least pp. 3-4, [0037]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 that calculates the cosine overlap between each pair of feature vectors. A cosine angle matrix 140 of order NxN is determined, where N is the number of classes in the training corpus).

Regarding dependent claim 5, Kumar teaches:
the overlap values for each pair of classes are generated by calculating a pair-wise overlap matrix from the single aggregated vectors of the plurality of classes (at least pp. 3-4, [0037]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 that calculates the cosine overlap between each pair of feature vectors. A cosine angle matrix 140 of order NxN is determined, where N is the number of classes in the training corpus).
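For illustration only, an N x N pairwise matrix of this kind can be sketched as scalar products between unit-length class vectors, which equal the cosines of the angles between them. Aggregating a class by summing its document vectors, and the names `aggregate` and `overlap_matrix`, are assumptions for this sketch rather than the method of Kumar or of the application.

```python
import math


def aggregate(vectors):
    """Element-wise sum of equal-length document vectors, scaled to unit length."""
    summed = [sum(column) for column in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed] if norm else summed


def overlap_matrix(class_vectors):
    """Return (class names, N x N matrix of scalar products between classes)."""
    names = sorted(class_vectors)
    units = [aggregate(class_vectors[name]) for name in names]
    matrix = [[sum(a * b for a, b in zip(u, v)) for v in units] for u in units]
    return names, matrix


if __name__ == "__main__":
    names, matrix = overlap_matrix({
        "billing": [[1, 0, 0], [1, 0, 0]],
        "login": [[0, 1, 0]],
        "payments": [[1, 0, 1]],
    })
    for name, row in zip(names, matrix):
        print(name, [round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal for non-empty classes, so only the entries above the diagonal carry information about distinct pairs of classes.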

Regarding dependent claim 9, Kumar teaches:
for each pair of classes of the plurality of classes: comparing the overlap value to a semantic threshold; and when the overlap value is above the semantic threshold, identifying the pair of classes as having significant semantic overlap (at least p. 4, [0041]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 into which an aggregated feature space vectorized model 114 is fed. A cosine similarity algorithm calculates the cosine overlap between each pair of feature vectors. If the angle between two feature vectors is less than a threshold, the two feature vectors are significantly overlapping).
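For illustration only, the relationship between an angle falling below a threshold and a cosine value rising above a threshold can be sketched as follows. Because the cosine decreases as the angle grows from 0 to 180 degrees, the two tests agree when the thresholds correspond. The function names and the 45-degree threshold are hypothetical and are not taken from Kumar.

```python
import math


def overlaps_by_value(cosine: float, min_cosine: float) -> bool:
    """Flag a pair when its cosine overlap value is above the threshold."""
    return cosine > min_cosine


def overlaps_by_angle(cosine: float, max_angle_deg: float) -> bool:
    """Flag a pair when the angle between its vectors is below the threshold."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
    return angle < max_angle_deg


if __name__ == "__main__":
    max_angle = 45.0  # hypothetical threshold
    min_cosine = math.cos(math.radians(max_angle))
    for cosine in (0.95, 0.5, 0.0):
        print(cosine, overlaps_by_value(cosine, min_cosine),
              overlaps_by_angle(cosine, max_angle))
```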

Regarding dependent claim 10, Kumar teaches:
based at least in part on the overlap values, modifying the plurality of classes to reduce the semantic overlap between them (at least p. 4, [0039]; Figure 1 → Kumar teaches that, from the aggregated feature space vectorized model 112, a visualization unit 120 calculates the maximum TFICF value for each feature vector 122, and a threshold a is set by dividing the maximum TFICF by a controlling factor n. In some embodiments, an interactive classifier dashboard 126 is generated that contains information about the class and the most impactful tokens along with their TFICF values. In an example, a bubble chart is created, where the size of each bubble is proportionate to the TFICF value, and different colors are used for the different classes. This dashboard gives the flexibility to select or deselect classes represented in the corpus to visualize the inter-class effect and analyze the risk in terms of overlap or anomaly present in selected class(es). That is, classes are modified by selecting or deselecting them).

Regarding dependent claim 12, Kumar teaches:
using the modified plurality of classes to perform machine learning modeling (at least Abstract → Kumar teaches a method of training corpus refinement by applying to the training corpus overlap and noise reduction techniques to produce a refined training corpus).

Regarding claims 13, 14, 15, 18 and 20, these claims merely recite a system for carrying out the method of claims 1, 2, 4, 9 and 10, respectively. Thus, Kumar in view of Enuka, Heckel and Rujan teaches every limitation of claims 13, 14, 15, 18 and 20, and provides proper motivation, as indicated in the rejections of claims 1, 2, 4, 9 and 10.

Regarding independent claim 19, Kumar teaches:
One or more computer-readable storage media storing computer-executable instructions for execution on one or more computing devices to perform operations for automated determination of semantic overlap between classes (at least Abstract; p. 2, [0022], [0025] → Kumar teaches a method for semi-autonomous training corpus refinement by applying to the training corpus “inter-class” overlap and noise reduction treatments), the operations comprising:
receiving a plurality of documents, a plurality of classes, and associations between the plurality of documents and the plurality of classes, wherein the plurality of documents comprise text content (at least p. 2, [0022]; [0035]; Figure 1 → Kumar teaches providing, to a “Corpus Advisor”, a training corpus 102 of data (e.g., documents, see p. 2, [0022]) that is an initial seed including training samples for respective classes. The initial corpus is provided via a subject matter expert (SME) or other entity presenting correctly-defined classes), wherein the plurality of documents comprise text content (at least p. 2, [0022]; p. 3, [0035]-[0036]; Figure 1 → Kumar teaches that a refined training corpus may be used to train a classifier, such as text classification models that provide, for example, response variations in a multi-turn dialog), and wherein the data set comprises indications of which of the plurality of classes have been assigned to which of the plurality of documents (at least p. 3, [0035]-[0036]; Figure 1 → Kumar teaches receipt, by a “Corpus Advisor”, of a training corpus 102 of data that contains text-based training samples for respective classes);
Kumar fails to explicitly teach:
for each document of the plurality of documents: filtering the document to remove stop words, wherein filtering the document to remove stop words comprises filtering words that are less than a threshold length.
However, Enuka teaches:
for each document of the plurality of documents: filtering the document to remove stop words, wherein filtering the document to remove stop words comprises filtering words that are less than a threshold length (at least p. 3, [0027]-[0036]; Figure 2 → Enuka teaches preprocessing of a document that includes one or more of removing HTML, XML and/or other programming language tags, removing excess whitespace, removing short words (e.g., words comprising fewer than 3 or 4 characters), removing numeric characters, and/or word stemming or lemmatization).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Enuka with those of Kumar as both inventions relate to aspects of preparing content for use as training data. Adding the teaching of Enuka provides Kumar with methods of simplifying the classification of documents by, for example, removing certain “short words” such that they do not interfere with the classification process.
Kumar and Enuka fail to explicitly teach:
filtering the document to replace certain terms with tokens, wherein the certain terms comprise email addresses, dates, numbers, and URLs.
However, Heckel teaches:
filtering the document to replace certain terms with tokens, wherein the certain terms comprise email addresses, dates, numbers, and URLs (at least col. 2, line 38 through col. 5, line 15; Figures 1A-B; col. 6, line 9 through col. 7, line 37; Figure 2 → Heckel teaches identification and replacement of email addresses, dates and numbers in training data with labels (e.g., tokens). Here, the Examiner groups dates and numbers, as a date can comprise a sequence of numbers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Heckel with those of Kumar and Enuka as all three inventions relate to aspects of preparing content for use as training data. Adding the teaching of Heckel provides Kumar and Enuka with methods of de-personalizing documents intended to be used as training data.
Kumar, Enuka and Heckel fail to explicitly teach:
generating a single vector representation for the document after the document has been filtered, wherein each single vector representation has a same number of elements.
However, Rujan teaches:
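for each document of the plurality of documents, generating a single vector representation for the document, wherein each single vector representation has a same number of elements (at least Abstract; col. 2, line 7 through col. 5, line 35; col. 6, lines 20-62 → Rujan teaches representing each of said plurality of documents by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space).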
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Rujan with those of Kumar, Enuka and Heckel as all these inventions are related to the classification of documents. Adding the teaching of Rujan provides Kumar, Enuka and Heckel with the ability of vectorizing documents.
Kumar further teaches:
for each class of the plurality of classes:
generating a single aggregated vector from the single vectors that represent the documents of the class, wherein the single aggregated vector has the same number of elements (at least p. 3, [0036]-[0037]; Figure 1 → Kumar teaches that feature vectors are generated for each class; the class feature vectors comprising elements comprising different words/entities (tokens) from the samples present in each class (e.g., aggregated vectors for each class));
for each pair of classes of the plurality of classes:
generating an overlap value for the pair of classes, wherein the overlap value represents textual overlap between the pair of classes indicating how much semantic overlap is present between the pair of classes (at least pp. 3-4, [0038]; Figure 1 → Kumar teaches determination of class overlap between pairs of class feature vectors), and
wherein the overlap value for the pair of classes is generated based on the single aggregated vectors for the pair of classes (at least p. 4, [0041]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 into which an aggregated feature space vectorized model 114 is fed. A cosine similarity algorithm calculates the cosine overlap between each pair of feature vectors. If the angle between two feature vectors is less than a threshold, the two feature vectors are significantly overlapping);
comparing the overlap value to a semantic threshold; and when the overlap value is above the semantic threshold, identifying the pair of classes as having significant semantic overlap; and outputting an indication of the pairs of classes that have been identified as having significant semantic overlap (at least p. 4, [0041]-[0043]; Figure 1 → Kumar teaches an overlap treatment unit 138 into which an aggregated feature space vectorized model 114 is fed. A cosine similarity algorithm calculates the cosine overlap between each pair of feature vectors. If the angle between two feature vectors is less than a threshold, the two feature vectors are significantly overlapping);
wherein the operations are performed as pre-processing before the plurality of classes are used for machine learning modeling (at least Abstract → Kumar teaches a method of training corpus refinement by applying to the training corpus overlap and noise reduction techniques).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Enuka, and in further view of Heckel, and in further view of Rujan, and in further view of Devin et al. (hereinafter Devin, U.S. Patent Application No. 2014/0247978 A1, filed 03/04/2013, published 09/04/2014).
Regarding dependent claim 11, Kumar, Enuka, Heckel and Rujan fail to explicitly teach:
modifying the plurality of classes comprises combining at least two of the plurality of classes into a single class.
However, Devin teaches:
modifying the plurality of classes comprises combining at least two of the plurality of classes into a single class (at least pp. 2-3, [0011]; p. 7, [0079]-[0080]; p. 9, Table 1 → Devin teaches the merging of overlapped classes to reduce class overlap).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Devin with those of Kumar, Enuka, Heckel and Rujan as all these inventions are related to the classification of documents. Adding the teaching of Devin provides Kumar, Enuka, Heckel and Rujan with the option of reducing class overlap by merging similar classes.
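For illustration only, combining two overlapping classes into a single class can be sketched as a relabeling of the documents assigned to either class. The function name `merge_classes` and the example labels are hypothetical; this is not asserted to be the merging procedure of Devin.

```python
def merge_classes(labels, pair, merged_name):
    """Relabel every document in either class of `pair` with merged_name."""
    return [merged_name if label in pair else label for label in labels]


if __name__ == "__main__":
    labels = ["billing", "payments", "login"]
    print(merge_classes(labels, ("billing", "payments"), "billing_payments"))
```

After relabeling, the per-class aggregation and overlap computation would be repeated over the reduced set of classes.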

Response to Arguments

Regarding the previous rejection of independent claim 1, Applicant argues that the prior art of Kumar fails to teach or suggest at least the limitation, as amended, of:

for each document of the plurality of documents:
filtering the document to replace certain terms with tokens, wherein the certain terms comprise email addresses, dates, numbers, and URLs;

as recited by claim 1.
	
The Examiner agrees that Kumar fails to teach this limitation.
	
However, Heckel et al. (U.S. Patent No. 10,169,315 B1, hereinafter Heckel) teaches pre-processing training data used to train a Personally Identifiable Information (PII) model.
	The pre-processing includes identifying, in original texts (e.g., documents), PII including email addresses (see col. 3, lines 24-27), dates (a date being a sequence of numbers), and numbers (see col. 3, lines 5-17; e.g., street addresses, phone numbers, social security numbers, credit card numbers, etc.), and replacing them with tokens (e.g., labels such as <phone_number> and <street_address>; see col. 3, lines 18-43)
(see col. 2, line 38 through col. 5, line 15; Figures 1A-B; col. 6, line 9 through col. 7, line 37; Figure 2).
Applicant further argues that Kumar does not teach or suggest the limitation, as amended, of:

for each document of the plurality of documents:
filtering the document to remove stop words, wherein filtering the document to remove stop words comprises filtering words that are less than a threshold length;

as recited by claim 1.

	The Examiner agrees that the prior art of Kumar fails to teach this filtering step.

However, Enuka et al. (US PGPUB 2020/0311414 A1, hereinafter Enuka) teaches (citing p. 3, [0027]-[0036]; Figure 2, and in particular p. 3, [0033], step 210) a pre-processing of documents that includes the removal of so-called “short words”, which are words comprising fewer than 3 or 4 characters.

	It is noted that the preprocessing is done after the retrieval of documents (see p. 3, [0027]-[0032]) and prior to generation of a document feature vector for the document (see pp. 3-4, [0037]-[0043]).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James H Blackwell whose telephone number is (571)272-4089. The examiner can normally be reached M-F 04:30AM - 12:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula, can be reached at 571-272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/James H. Blackwell/
03/22/2022

/CESAR B PAULA/Supervisory Patent Examiner, Art Unit 2177