DETAILED ACTION
This office action is in response to the above identified application filed on April 19, 2020. The application contains claims 1-20. 
Claims 1-20 are pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on April 19, 2020. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 3, 7, 8, 15, and 19 are objected to because of the following informalities:
Claim 3, line 2: “the score” ought to read “the statistics score” to be consistent with the terminology of claim 1 and avoid confusion
Claim 7, line 6: “the rule match scores” ought to read “the set of rule match scores” to be consistent with the terminology
Claim 8, line 3: “the of” in front of “the first data category” appears to be a typo
Claim 15, line 2: “the score” ought to read “the statistics score” to be consistent with the terminology of claim 13 and avoid confusion
Claim 19, line 6: “the of” in front of “the first data category” appears to be a typo
Claim 19, line 12: “the rule match scores” ought to read “the set of rule match scores” to be consistent with the terminology
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-11 and 13-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claim 1 recites the limitation "the category" in line 6.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 1 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 3 recites the limitation "the first category" in lines 1-2. There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 3 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 7 recites the limitation "the rate of matches" in line 2.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 7 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 9 recites the limitation "the values" in line 2.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 9 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 10 recites the limitation "the classification" in line 5.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 10 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 13 recites the limitation "the category" in line 8.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 13 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 15 recites the limitation "the first category" in lines 1-2. There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 15 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 19 recites the limitation "the rate of matches" in line 2.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 19 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 19 recites the limitation "the values" in line 7.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 19 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 19 recites the limitation "the classification" in line 13.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 19 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 20 recites the limitation "the categories of neighboring data of the sample data" in line 6.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 20 is indefinite and rejected under 35 U.S.C. 112(b).
Claim 20 recites the limitation "the known probability of matches of the first data category" in line 7.  There is insufficient antecedent basis for this limitation in the claim. Therefore, claim 20 is indefinite and rejected under 35 U.S.C. 112(b).
Dependent claims 2-11 and 14-20 are also rejected for inheriting the deficiency from their corresponding independent claims 1 and 13, respectively.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless -
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-10 and 13-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Butler et al. (US 20200380212 A1).

With regard to claim 1,
	Butler teaches
a method for classifying examined data in a computerized database (Abstract: a system implements a method for classifying a field in source data into a classification label that identifies the field’s semantic meaning, wherein data in the field being classified corresponds to “examined data”. [0046]: data sources that store source data can include databases, i.e., “a computerized database”), the method comprising: 
calculating statistics of the examined data ([0035]; [0051]: determine statistical attribute(s) of the data fields in received source data, and generate profile data including those statistical attributes, wherein the statistics about the data values can include a maximum value, a minimum value, a standard deviation, a mean, and so forth of the values that are included in each of the data fields (if the data are numerical) indicates “calculating statistics”); 
comparing the statistics of the examined data with known statistics of a first data category to provide a statistics score (Fig. 2C; [0062]; [0040]: the testing module 106 compares the attributes of the data fields with attributes of candidate labels received from the data dictionary database 114 to determine which of those candidate labels is the most closely associated with the attributes of the data fields determined by the data profile, wherein attributes of the profile data, which include statistical attributes as discussed above, correspond to “the statistics of the examined data”, each candidate label corresponds to “a first data category”, and attributes associated with each candidate label correspond to “known statistics”. Fig. 3C; [0106]: the testing module 106 performs the pattern match analysis 410 and proposes a Date of Birth label proposal 318 with a score of 0.80, wherein a score of 0.80 is "a statistics score" that is provided based on the comparison discussed above); and
determining a probability that the category of the examined data matches the first data category based on the statistics score (Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on the counts for each proposed label, scores, and weights, wherein Category: Match 290 indicates "a probability that the category ... matches the first data category" and is derived based on the statistics score).
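For illustration only, the comparison of calculated statistics with known statistics of a category may be sketched as follows. Neither the claim language nor the cited paragraphs of Butler specify a scoring formula; the function names `column_statistics` and `statistics_score`, the relative-difference measure, and the sample values are assumptions made for this sketch and are not attributed to Butler or to applicant.

```python
import statistics

def column_statistics(values):
    """Return basic statistics of a numeric column."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def statistics_score(examined, known):
    """Score in [0, 1]; 1.0 means every statistic matches (assumed measure)."""
    similarities = []
    for name, known_value in known.items():
        diff = abs(examined[name] - known_value)
        # Scale by the larger magnitude so the ratio is unit-free.
        scale = max(abs(examined[name]), abs(known_value), 1e-12)
        similarities.append(1.0 - min(diff / scale, 1.0))
    return sum(similarities) / len(similarities)

# Hypothetical known statistics for an "age" category.
known_age_stats = {"mean": 40.0, "stdev": 10.0, "min": 18.0, "max": 90.0}
examined = column_statistics([25, 31, 47, 52, 38, 64])
score = statistics_score(examined, known_age_stats)
```

Under these assumptions, a score closer to 1.0 would indicate that the examined column is statistically closer to the known category.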

With regard to claim 2,
	Butler teaches
the method of claim 1, wherein the examined data is all of the same category, and wherein the examined data is all within the same column in the computerized database (Fig. 3A; [0098]-[0099]: classify each of the fields of the first and second tables 302a, 302b, wherein the data in each of the fields is all within the same table column and all of the same category).

With regard to claim 3,
Butler teaches
the method of claim 1, comprising determining that the examined data is of the first category if the score is higher than a threshold ([0086]: select the match category when any dissenting labels were below a score threshold, wherein dissenting labels below a score threshold is equivalent to an agreeing or matching score higher than a threshold).

With regard to claim 4,
Butler teaches
the method of claim 1, comprising: 
obtaining a true classification of the examined data ([0060]: obtain a true classification if a user selects the existing label as presented by the results corroboration module 108); and 
if the true classification of the examined data equals the first data category, then adjusting the known statistics of the first data category based on the statistics of the examined data ([0060]: re-classify the field by the classification module 105 and re-test the field by the testing module 106 to confirm that the label is accurate and potentially update the label attributes of that label in the data dictionary database 114, wherein updating the label attributes in the data dictionary database 114 includes “adjusting the known statistics” of the first data category as discussed above).
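For illustration only, adjusting known statistics once a true classification is obtained may be sketched as a pooled-mean update. The cited paragraph of Butler is relied upon only for updating label attributes; the dictionary layout, the function name `adjust_known_statistics`, and the pooled-mean formula are assumptions made for this sketch.

```python
def adjust_known_statistics(known, examined_values, true_label, category):
    """Fold a confirmed column into the stored mean of a category.

    `known` is a hypothetical record {"mean": float, "count": int}.
    The record is returned unchanged when the true label differs.
    """
    if true_label != category or not examined_values:
        return known
    new_count = known["count"] + len(examined_values)
    total = known["mean"] * known["count"] + sum(examined_values)
    return {"mean": total / new_count, "count": new_count}

known = {"mean": 40.0, "count": 100}
updated = adjust_known_statistics(known, [30.0, 50.0, 70.0, 50.0], "age", "age")
unchanged = adjust_known_statistics(known, [30.0, 50.0], "zip code", "age")
```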

With regard to claim 5,
Butler teaches
the method of claim 1, wherein the calculated statistics are selected from the list consisting of: average, median, variance, minimum, maximum, standard deviation and correlation ([0051]: the statistics about the data values can include a maximum value, a minimum value, a standard deviation, a mean, and so forth of the values that are included in each of the data fields (if the data are numerical)).

With regard to claim 6,
Butler teaches
the method of claim 1, comprising: 
comparing categories of neighboring data of the examined data with expected categories of neighboring data of the first data category to provide a neighbors score (Fig. 3D; [0108]: compare Date2's neighboring data Date1 with the expected field name of the neighboring data of Date of Expiration (Date2) based on the context that the date in Date2 is always after the date value of Date1 for each entry, and provide a score of 95 for Date2 being a Date of Expiration, wherein Date2 corresponds to “the examined data”, Date of Expiration corresponds to “the first data category”, Date1 corresponds to “neighboring data”, Date of Birth corresponds to “expected categories of neighboring data”, and the score of 95 corresponds to “a neighbors score”); and
determining a probability that the category of the examined data matches the first data category based on the statistics score and the neighbors score ([0108]; Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including both the statistics score and the neighbors score as discussed above).
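For illustration only, the contextual check described above (the date in Date2 always falling after the date in Date1) may be sketched as the fraction of entries that satisfy the ordering. How Butler arrives at the score of 95 is not stated in this action; the function name `later_than_neighbor_rate` and the sample rows are assumptions made for this sketch.

```python
from datetime import date

def later_than_neighbor_rate(rows, field, neighbor):
    """Fraction of rows in which `field` holds a later date than `neighbor`."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row[field] > row[neighbor])
    return hits / len(rows)

# Hypothetical table entries with two date fields.
rows = [
    {"Date1": date(1980, 5, 1), "Date2": date(2025, 1, 31)},
    {"Date1": date(1992, 11, 12), "Date2": date(2024, 6, 30)},
    {"Date1": date(1975, 2, 20), "Date2": date(2026, 9, 30)},
    {"Date1": date(2001, 7, 4), "Date2": date(1999, 12, 31)},
]
rate = later_than_neighbor_rate(rows, "Date2", "Date1")
```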

With regard to claim 7,
Butler teaches
the method of claim 1, comprising: 
calculating the rate of matches of the examined data to each rule of a plurality of rules, and comparing the resulting rates with known rates of matches of the first data category for each rule of the plurality of rules, to provide a set of rule match scores ([0064]: the testing module 106 thus receives the profile data from the profile data module 104 and performs a series of statistical-based functions to identify, classify, and test the field details against a set of known label types; [0068]: if a threshold percentage of the entries for the data field satisfy each of these patterns, the testing module 106 can conclude that the field holds credit card numbers, and associate the field name with the appropriate label and probability, wherein patterns correspond to “rules”, a threshold percentage of the entries for the data field satisfying each of these patterns indicates both “calculating the rate of matches of the examined data to each rule of a plurality of rules” and “comparing …with known rates of matches …” with a threshold percentage corresponding to “known rates of matches”. Fig. 3C; [0106]: the score of 0.80 that the pattern match analysis 410 provides for a Date of Birth label proposal 318 corresponds to “match scores”); and
determining a probability that the category of the examined data matches the first data category based on the statistics score and the rule match scores (Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including both the statistics score and the rule match scores as discussed above).
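For illustration only, calculating per-rule match rates and comparing them with known rates may be sketched as follows. The two regular-expression rules, the known rates, and the scoring measure (one minus the absolute difference of the rates) are assumptions made for this sketch; neither the claims nor the cited paragraphs of Butler prescribe them.

```python
import re

# Hypothetical rules; each must match the whole value.
RULES = {
    "sixteen_digits": re.compile(r"\d{16}"),
    "starts_with_4": re.compile(r"4\d*"),
}

def match_rates(values, rules):
    """Rate of values fully matching each rule."""
    if not values:
        return {name: 0.0 for name in rules}
    return {
        name: sum(1 for v in values if pattern.fullmatch(v)) / len(values)
        for name, pattern in rules.items()
    }

def rule_match_scores(rates, known_rates):
    """One score per rule; 1.0 means the observed rate equals the known rate."""
    return {name: 1.0 - abs(rates[name] - known_rates[name]) for name in known_rates}

values = ["4111111111111111", "4012888888881881", "5500000000000004", "n/a"]
rates = match_rates(values, RULES)
# Hypothetical known rates for the first data category.
known_rates = {"sixteen_digits": 1.0, "starts_with_4": 0.5}
scores = rule_match_scores(rates, known_rates)
```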

With regard to claim 8,
Butler teaches
the method of claim 1, comprising: 
comparing metadata associated with the examined data with known metadata associated with the of the first data category to provide a metadata score ([0040]: compare attributes of the fields of the source data with attributes of the label, wherein the attributes include statistical metadata describing values of a given field, a tag for a specific set of values for the field (e.g., a list of city names, zip codes, etc.), a specified data format (e.g., a date format), a relationship between or among fields of a data set, and so forth. Fig. 3C; [0106]; [0068]: provide a score of .80 by Pattern Match Analysis 410 after checking the first 4-6 digits for each entry against a table of issuer codes and the last number including a check digit defined by a Luhn test, wherein the score .80 corresponds to "a metadata score", and the first 4-6 digits representing issuer codes and the last number including a check digit defined by a Luhn test are all metadata when the data field may be a credit card number data field); and
determining a probability that the category of the examined data matches the first data category based on the statistics score and the metadata score (Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including both the statistics score and the metadata score as discussed above).
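For illustration only, the Luhn test referenced above is a standard check-digit algorithm and may be sketched as follows. The function names `passes_luhn` and `luhn_pass_rate` and the sample entries are assumptions made for this sketch; the issuer-code table lookup described in Butler is not shown.

```python
def passes_luhn(number):
    """Return True if the digits of `number` satisfy the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def luhn_pass_rate(entries):
    """Fraction of entries in a field that pass the Luhn check."""
    if not entries:
        return 0.0
    return sum(1 for e in entries if passes_luhn(e)) / len(entries)

entries = ["4111111111111111", "4111111111111112", "79927398713", "abc"]
pass_rate = luhn_pass_rate(entries)
```

A field in which a high fraction of entries pass the check would be consistent with, though not proof of, a credit card number field.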

With regard to claim 9,
Butler teaches
the method of claim 1, comprising: 
comparing values of the examined data with the values in a dictionary associated with the first data category to provide a dictionary score ([0070]: the testing module 106 can include a keyword search test 422 that searches for particular keywords against a limited set of common words in a specification of the reference database 116, wherein the keyword search indicates “comparing …” and a specification of the reference database 116 corresponds to “a dictionary”. Fig. 3C; [0106]; [0068]: provide a score of 0 by Keyword Search 422, wherein the score 0 corresponds to "a dictionary score"); and 
determining a probability that the category of the examined data matches the first data category based on the statistics score and the dictionary score (Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including both the statistics score and the dictionary score as discussed above).

With regard to claim 10,
Butler teaches
the method of claim 1, comprising: 
using a trained classifier to classify the examined data, wherein the classifier is trained to detect at least the first data category ([0076]: use the machine learning logic trained on the data set to classify new data of the data set, wherein the trained machine learning logic corresponds to “a trained classifier”. “detect at least the first data category” has been addressed above); and 
determining a probability that the category of the examined data matches the first data category based on the statistics score and the classification provided by the classifier (Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all factors are taken into consideration including the statistics score and the matching probability is derived based on the classifications provided by the trained classifier as discussed above).

With regard to claim 13,
	Butler teaches
a system for classifying examined data in a computerized database (Abstract: a system for classifying a field in source data into a classification label that identifies the field’s semantic meaning, wherein data in the field being classified corresponds to “examined data”. [0046]: data sources that store source data can include databases, i.e., “a computerized database”), the system comprising:
a memory ([0141]: memory); and 
a processor ([0140]-[0141]: processor) configured to:  
calculate statistics of the examined data ([0035]; [0051]: determine statistical attribute(s) of the data fields in received source data, and generate profile data including those statistical attributes, wherein the statistics about the data values can include a maximum value, a minimum value, a standard deviation, a mean, and so forth of the values that are included in each of the data fields (if the data are numerical) indicates “calculate statistics”); 
compare the statistics of the examined data with known statistics of a first data category to provide a statistics score (Fig. 2C; [0062]; [0040]: the testing module 106 compares the attributes of the data fields with attributes of candidate labels received from the data dictionary database 114 to determine which of those candidate labels is the most closely associated with the attributes of the data fields determined by the data profile, wherein attributes of the profile data, which include statistical attributes as discussed above, correspond to “the statistics of the examined data”, each candidate label corresponds to “a first data category”, and attributes associated with each candidate label correspond to “known statistics”. Fig. 3C; [0106]: the testing module 106 performs the pattern match analysis 410 and proposes a Date of Birth label proposal 318 with a score of 0.80, wherein a score of 0.80 is "a statistics score" provided based on the comparison discussed above); and
determine a probability that the category of the examined data matches the first data category based on the statistics score (Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on the counts for each proposed label, scores, and weights, wherein Category: Match 290 indicates "a probability that the category ... matches the first data category" and is derived based on the statistics score).

With regard to claim 14,
	Butler teaches
the system of claim 13, wherein the examined data is all of the same category, and wherein the examined data is all within the same column in the computerized database (Fig. 3A; [0098]-[0099]: classify each of the fields of the first and second tables 302a, 302b, wherein the data in each of the fields is all within the same table column and all of the same category).

With regard to claim 15,
Butler teaches
the system of claim 13, wherein the processor is configured to determine that the examined data is of the first category if the score is higher than a threshold ([0086]: select the match category when any dissenting labels were below a score threshold, wherein dissenting labels below a score threshold is equivalent to an agreeing or matching score higher than a threshold).

With regard to claim 16,
Butler teaches
the system of claim 13, wherein the processor is configured to: 
obtain a true classification of the examined data ([0060]: obtain a true classification if a user selects the existing label as presented by the results corroboration module 108); and 
if the true classification of the examined data equals the first data category, then adjust the known statistics of the first data category based on the statistics of the examined data ([0060]: re-classify the field by the classification module 105 and re-test the field by the testing module 106 to confirm that the label is accurate and potentially update the label attributes of that label in the data dictionary database 114, wherein updating the label attributes in the data dictionary database 114 includes “adjusting the known statistics” of the first data category as discussed above).

With regard to claim 17,
Butler teaches
the system of claim 13, wherein the calculated statistics are selected from the list consisting of: average, median, variance, minimum, maximum, standard deviation and correlation ([0051]: the statistics about the data values can include a maximum value, a minimum value, a standard deviation, a mean, and so forth of the values that are included in each of the data fields (if the data are numerical)).

With regard to claim 18,
Butler teaches
the system of claim 13, comprising: 
comparing categories of neighboring data of the examined data with expected categories of neighboring data of the first data category to provide a neighbors score (Fig. 3D; [0108]: compare Date2's neighboring data Date1 with the expected field name of the neighboring data of Date of Expiration (Date2) based on the context that the date in Date2 is always after the date value of Date1 for each entry, and provide a score of 95 for Date2 being a Date of Expiration, wherein Date2 corresponds to “the examined data”, Date of Expiration corresponds to “the first data category”, Date1 corresponds to “neighboring data”, Date of Birth corresponds to “expected categories of neighboring data”, and the score of 95 corresponds to “a neighbors score”); and
determining a probability that the category of the examined data matches the first data category based on the statistics score and the neighbors score ([0108]; Fig. 3C; Fig. 3E; [0115]-[0116]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including both the statistics score and the neighbors score as discussed above).

With regard to claim 19,
Butler teaches
the system of claim 18, comprising: 
calculating the rate of matches of the examined data to each rule of a plurality of rules, and comparing the resulting rates with known rates of matches of the first data category for each rule of the plurality of rules, to provide a set of rule match scores ([0064]: the testing module 106 thus receives the profile data from the profile data module 104 and performs a series of statistical-based functions to identify, classify, and test the field details against a set of known label types; [0068]: if a threshold percentage of the entries for the data field satisfy each of these patterns, the testing module 106 can conclude that the field holds credit card numbers, and associate the field name with the appropriate label and probability, wherein patterns correspond to “rules”, a threshold percentage of the entries for the data field satisfying each of these patterns indicates both “calculating the rate of matches of the examined data to each rule of a plurality of rules” and “comparing …with known rates of matches …” with a threshold percentage corresponding to “known rates of matches”. Fig. 3C; [0106]: the score of 0.80 that the pattern match analysis 410 provides for a Date of Birth label proposal 318 corresponds to “match scores”);
comparing metadata associated with the examined data with known metadata associated with the of the first data category to provide a metadata score ([0040]: compare attributes of the fields of the source data with attributes of the label, wherein the attributes include statistical metadata describing values of a given field, a tag for a specific set of values for the field (e.g., a list of city names, zip codes, etc.), a specified data format (e.g., a date format), a relationship between or among fields of a data set, and so forth. Fig. 3C; [0106]; [0068]: provide a score of .80 by Pattern Match Analysis 410 after checking the first 4-6 digits for each entry against a table of issuer codes and the last number including a check digit defined by a Luhn test, wherein the score .80 corresponds to "a metadata score", and the first 4-6 digits representing issuer codes and the last number including a check digit defined by a Luhn test are all metadata when the data field may be a credit card number data field);
comparing values of the examined data with the values in a dictionary associated with the first data category to provide a dictionary score ([0070]: the testing module 106 can include a keyword search test 422 that searches for particular keywords against a limited set of common words in a specification of the reference database 116, wherein the keyword search indicates “comparing …” and a specification of the reference database 116 corresponds to “a dictionary”. Fig. 3C; [0106]; [0068]: provide a score of 0 by Keyword Search 422, wherein the score 0 corresponds to "a dictionary score"); 
using a trained classifier to classify the examined data, wherein the classifier is trained to detect at least the first data category ([0076]: use the machine learning logic trained on the data set to classify new data of the data set, wherein the trained machine learning logic corresponds to “a trained classifier”. “detect at least the first data category” has been addressed above); and 
determining a probability that the category of the examined data matches the first data category based on the statistics score, the neighbors score, the rule match scores, the metadata score, the dictionary score, and the classification provided by the classifier (Fig. 3C; Fig. 3E; [0115]-[0116]; [0108]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including the statistics score, the neighbors score, the rule match scores, the metadata score, the dictionary score, and the classification provided by the classifier as discussed above).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 11, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Butler et al. (US 20200380212 A1), in view of Jaiswal (US 8688601 B2).

With regard to claim 11,
	As discussed in claim 1, Butler teaches all the limitations therein.
	Butler does not explicitly teach
the method of claim 1, comprising: 
obtaining a sample data of the first data category; 
calculating the known statistics of a first data category by calculating statistics of the sample data.
	Jaiswal teaches
the method of claim 1, comprising: 
obtaining a sample data of the first data category (Fig. 3, step 304; Col. 9, lines 16-21: obtain a training data set for each of the plurality of specific categories of sensitive information, wherein a training data set corresponds to “a sample data”, and each specific category of sensitive information corresponds to “the first data category”); 
calculating the known statistics of a first data category by calculating statistics of the sample data (Fig. 3, step 306; Col. 9, lines 48-60: extract a feature set from the training data set that includes statistically significant features within the training data set and use the feature set to build a machine learning-based classification model, wherein extracting statistically significant features from the training data set corresponds to “calculating statistics of the sample data” and using it to build a machine learning-based classification model indicates using it as the “known statistics of a first data category”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Butler to incorporate the teachings of Jaiswal to obtain a sample data of the first data category and calculate the known statistics of a first data category by calculating statistics of the sample data. Doing so would use known examples of sensitive data as a training data set to build a machine-learning model in an attempt to more accurately detect and protect unstructured sensitive data by identifying sensitive data that is similar to, but not exactly the same as, known examples of sensitive data as taught by Jaiswal (Col. 1, lines 38-42).

With regard to claim 12,
	Jaiswal teaches
a method for detecting potentially sensitive data (Abstract), the method comprising: 
for a sample of data (Fig. 3, step 304; Col. 9, lines 16-21: obtain a training data set for each of the plurality of specific categories of sensitive information, wherein a training data set corresponds to “a sample of data”):
for a category of sensitive data (Fig. 3, step 304; Col. 9, lines 16-21: each specific category of sensitive information corresponds to “a category of sensitive data”): 
calculating statistics of the sensitive data (Fig. 3, step 306; Col. 9, lines 48-60: extract a feature set from the training data set that includes statistically significant features within the training data set, wherein extracting statistically significant features indicates “calculating statistics” of the sensitive data); 
storing metadata associated with the sensitive data (Fig. 1; Col. 10, lines 35-45: store in database 120 machine learning-based classifiers 126 that include a map of support vectors representing boundary features that may be selected from the highest-ranked features in a feature set, wherein highest-ranked features extracted from sensitive data corresponds to “metadata”); 
	Jaiswal does not teach
for a sample of data:
obtaining classification of data in columns in a database to not sensitive data and to categories of sensitive data; 
for a category of sensitive data: 
calculating probability of matches of the sensitive data for each rule of a plurality of rules; and 
storing categories of neighbor fields of the sensitive data; 
for examined data: 
calculating probability of matches of the examined data for each rule of the plurality of rules and comparing with the probability of matches of the sensitive data for each rule of the plurality of rules to provide rule match scores; 
calculating statistics of the examined data and comparing with the statistics of the sensitive data to provide statistics score; 
comparing metadata associated with the examined data with metadata associated with the sensitive data to provide metadata score; 
comparing categories of neighbor fields of the examined data with categories of neighbor fields of the sensitive data to provide neighbors score; and 
rating the potential of the examined data to be sensitive data based on the rule match scores, statistics score, metadata score and neighbors score.
Butler teaches
obtaining classification of data in columns in a database to not sensitive data and to categories of sensitive data (Abstract: classify a field in source data into a classification label that identifies the field’s semantic meaning. [0046]: data sources that store source data can include databases, i.e., “a database”. Fig. 3A; [0098]-[0099]: each of the fields of the first and second tables 302a, 302b is in table “columns”. [0044]: the label index can indicate whether a particular field includes personally identifying information (PII), i.e., “sensitive data” or not); 
calculating probability of matches of the sensitive data for each rule of a plurality of rules ([0064]: the testing module 106 thus receives the profile data from the profile data module 104 and performs a series of statistical-based functions to identify, classify, and test the field details against a set of known label types; [0068]: if a threshold percentage of the entries for the data field satisfy each of these patterns, the testing module 106 can conclude that the field holds credit card numbers, and associate the field name with the appropriate label and probability, wherein patterns correspond to “rules” and a “probability” is calculated); and 
storing categories of neighbor fields of the sensitive data ([0040]; [0044]: each label of the data dictionary database 114 is associated with one or more attributes that can include statistical metadata describing a relationship between or among fields of a data set, e.g., whether they correlate to one another, whether there is a dependency, and so forth. Fig. 3D; [0108]: Date2 is assigned the label Date of Expiration based on its neighbor field Date1’s label being Date of Birth and the context that the date in Date2 is always after the date value of Date1 for each entry indicates “categories of neighbor fields” are stored); 
for examined data (Abstract: classify a field in source data into a classification label that identifies the field’s semantic meaning, wherein data in the field being classified corresponds to “examined data”): 
calculating probability of matches of the examined data for each rule of the plurality of rules and comparing with the probability of matches of the sensitive data for each rule of the plurality of rules to provide rule match scores ([0064]: the testing module 106 thus receives the profile data from the profile data module 104 and performs a series of statistical-based functions to identify, classify, and test the field details against a set of known label types; [0068]: if a threshold percentage of the entries for the data field satisfy each of these patterns, the testing module 106 can conclude that the field holds credit card numbers, and associate the field name with the appropriate label and probability, wherein patterns correspond to “rules”, a threshold percentage of the entries for the data field satisfying each of these patterns indicates both “calculating probability of matches of the examined data for each rule of the plurality of rules” and “comparing …with the probability of matches …” with a threshold percentage corresponding to “the probability of matches”. Fig. 3C; [0106]: a score of 0.80 the pattern match analysis 410 provides for a Date of Birth label proposal 318 corresponds to “rule match scores”); 
calculating statistics of the examined data ([0035]; [0051]: determine statistical attribute(s) of the data fields in received source data, and generate profile data including those statistical attributes, wherein the statistics about the data values can include a maximum value, a minimum value, a standard deviation, a mean, and so forth of the values that are included in each of the data fields (if the data are numerical) indicates “calculating statistics”) and comparing with the statistics of the sensitive data to provide statistics score (Fig. 2C; [0062]; [0040]: the testing module 106 compares the attributes of the data fields with attributes of candidate labels received from the data dictionary database 114 to determine which of those candidate labels is the most closely associated with the attributes of the data fields determined by the data profile. Fig. 3C; [0106]: the testing module 106 performs the pattern match analysis 410 and proposes a Date of Birth label proposal 318 with a score of 0.80, wherein a score of 0.80 is a “statistics score” provided based on the comparison discussed above); 
comparing metadata associated with the examined data with metadata associated with the sensitive data to provide metadata score ([0040]: compare attributes of the fields of the source data with attributes of the label, wherein the attributes include statistical metadata describing values of a given field, a tag for a specific set of values for the field (e.g., a list of city names, zip codes, etc.), a specified data format (e.g., a date format), a relationship between or among fields of a data set, and so forth. Fig. 3C; [0106]; [0068]: provide a score of 0.80 by Pattern Match Analysis 410 after checking the first 4-6 digits for each entry against a table of issuer codes and the last number including a check digit defined by a Luhn test, wherein the score of 0.80 corresponds to a “metadata score”, and the first 4-6 digits representing issuer codes and the last number including a check digit defined by a Luhn test are all metadata when the data field may be a credit card number data field); 
comparing categories of neighbor fields of the examined data with categories of neighbor fields of the sensitive data to provide neighbors score (Fig. 3D; [0108]: compare Date2's neighboring data Date1 with expected field name of neighboring date of Date of Expiration (Date2) based on the context that the date in Date2 is always after the date value of Date1 for each entry, and provide a 95 score for Date2 being a Date of Expiration, wherein Date2 corresponds to “the examined data”, Date of Expiration corresponds to “the first data category”, Date1 corresponds to “neighboring data”, Date of Birth corresponds to “expected categories of neighboring data”, and a 95 score corresponds to “a neighbors score”); and 
rating the potential of the examined data to be sensitive data based on the rule match scores, statistics score, metadata score and neighbors score (Fig. 3C; Fig. 3E; [0115]-[0116]; [0108]: the results corroboration module 108 assigns the proposed label to a category based on a function of the counts, scores, and weights, wherein all scores are taken into consideration including the rule match scores, the statistics score, the metadata score, and the neighbors score as discussed above).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jaiswal to incorporate the teachings of Butler to prepare sample data to train machine learning classifiers to perform a series of statistical checks on metadata and data content of a data set in order to discover, classify, and label data content of the data set. Doing so would rapidly and, in some cases, automatically provide labels for data sets. The information can be used by computing systems for various applications. For example, applications that can use the generated labels of data sets can include data quality enforcement, personal data anonymization, data masking, personally identifiable information (PII) reports, test data management, data set annotation, and so forth as taught by Butler ([0005]-[0006]).
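For illustration only, the per-rule match calculation and the combined rating recited in claim 12 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Butler, Jaiswal, or the applicant: the function names, the use of one minus the absolute difference as a rule match score, and the weighted average are all hypothetical choices. Only the Luhn check itself is a standard, well-known test.

```python
from typing import Callable, Dict, List


def luhn_ok(value: str) -> bool:
    """Standard Luhn check-digit test over the digits found in value."""
    digits = [int(c) for c in value if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def match_fraction(values: List[str], rule: Callable[[str], bool]) -> float:
    """Fraction of entries in a column that satisfy one rule."""
    if not values:
        return 0.0
    return sum(1 for v in values if rule(v)) / len(values)


def rule_match_score(examined: float, known: float) -> float:
    """Hypothetical score: closeness of the examined fraction to the known one."""
    return 1.0 - abs(examined - known)


def rate_column(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical rating: weighted average of the individual scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Example with made-up data: two of the four entries pass the Luhn rule.
column = ["4111111111111111", "4111111111111112", "79927398713", "hello"]
examined = match_fraction(column, luhn_ok)
scores = {
    "rule": rule_match_score(examined, known=0.75),
    "statistics": 0.6,   # placeholder values for the other three scores
    "metadata": 1.0,
    "neighbors": 0.4,
}
rating = rate_column(scores, {name: 1.0 for name in scores})
```

Equal weights are used here only to keep the example small; any weighting of the four scores would fit the same structure.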

With regard to claim 20,
	As discussed in claim 19, Butler teaches all the limitations therein.
	Butler does not explicitly teach
a sample data.
	Jaiswal teaches
the system of claim 19, comprising: 
obtaining a sample data of the first data category (Fig. 3, step 304; Col. 9, lines 16-21: obtain a training data set for each of the plurality of specific categories of sensitive information, wherein a training data set corresponds to “a sample data”, and each specific category of sensitive information corresponds to “the first data category”); 
calculating the known statistics of a first data category by calculating statistics of the sample data (Fig. 3, step 306; Col. 9, lines 48-60: extract a feature set from the training data set that includes statistically significant features within the training data set and use the feature set to build a machine learning-based classification model, wherein extracting statistically significant features from the training data set corresponds to “calculating statistics of the sample data” and using it as the “known statistics of a first data category”);
training the classifier using the sample data (Fig. 3, step 306; Col. 9, lines 35-47: use machine learning to train at least one machine learning-based classifier based on an analysis of the training data sets).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Butler to incorporate the teachings of Jaiswal to obtain a sample data of the first data category, calculate the known statistics of a first data category by calculating statistics of the sample data, and train the classifier using the sample data. Doing so would use known examples of sensitive data as a training data set to build a machine-learning model in an attempt to more accurately detect and protect unstructured sensitive data by identifying sensitive data that is similar to, but not exactly the same as, known examples of sensitive data, as taught by Jaiswal (Col. 1, lines 38-42).
	Jaiswal does not explicitly teach
finding the expected categories of neighboring data of the first data category by finding the categories of neighboring data of the sample data; 
calculating the known probability of matches of the first data category for each rule of the plurality of rules by calculating known probability of matches of the sample data for each rule of the plurality of rules; 
finding the known metadata associated with the first data category by detecting metadata associated with the sample data; 
building the dictionary based on values of data in the sample data; 
However, in view of the teachings of a sample data by Jaiswal as discussed above and the teachings by Butler ([0076]; [0093]) of continually training the machine learning logic on the examined data and applying it to new data of the data set, in which the examined data is like a constantly changing sample data, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Butler and Jaiswal to incorporate the teachings of Butler to apply the steps that were applied to examined data to the sample data:
finding the expected categories of neighboring data of the first data category by finding the categories of neighboring data of the sample data (Identify categories of each neighboring data field according to the discussion in the parent claims. [0040]; [0044]: associate each label of the data dictionary database 114 with one or more attributes that can include statistical metadata describing a relationship between or among fields of a data set, e.g., whether they correlate to one another, whether there is a dependency, and so forth, i.e., “neighboring data”); 
calculating the known probability of matches of the first data category for each rule of the plurality of rules by calculating known probability of matches of the sample data for each rule of the plurality of rules ([0064]: the testing module 106 thus receives the profile data from the profile data module 104 and performs a series of statistical-based functions to identify, classify, and test the field details against a set of known label types; [0068]: if a threshold percentage of the entries for the data field satisfy each of these patterns, the testing module 106 can conclude that the field holds credit card numbers, and associate the field name with the appropriate label and probability, wherein patterns correspond to “rules”, and a threshold percentage of the entries for the data field satisfying each of these patterns indicates “calculating known probability of matches”); 
finding the known metadata associated with the first data category by detecting metadata associated with the sample data ([0051]: the profile data module 104 determines (258) statistical attribute(s) of the data fields and generates (260) profile data including those statistical attributes, wherein statistical attributes correspond to “metadata”); 
building the dictionary based on values of data in the sample data ([0039]: the load data module 110 sends the label index to the reference database 116 for access by future iterations of the labeling process, wherein the reference database 116 corresponds to “the dictionary”); 
Doing so would prepare seed training data for the machine learning classifiers so that the classifiers can start the machine learning process based on reliable seed metadata that has been derived from a selected sample data representative of real-life scenarios and improve the accuracy of the classifiers.
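For illustration only, deriving the known values of a data category from a sample column, as the limitations above recite, can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of either reference or of the applicant: the function name, the profile keys, the choice of value length as the statistic, and the example rule are all hypothetical.

```python
import re
from statistics import mean, pstdev
from typing import Callable, Dict, List


def build_category_profile(
    sample_values: List[str],
    neighbor_categories: List[str],
    rules: Dict[str, Callable[[str], bool]],
) -> dict:
    """Derive hypothetical 'known' values for a category from sample data."""
    if not sample_values:
        raise ValueError("sample_values must not be empty")
    lengths = [len(v) for v in sample_values]
    return {
        # known probability of matches, one entry per rule
        "rule_fractions": {
            name: sum(1 for v in sample_values if rule(v)) / len(sample_values)
            for name, rule in rules.items()
        },
        # known statistics (value length is an assumed, simple statistic)
        "length_mean": mean(lengths),
        "length_stdev": pstdev(lengths),
        # expected categories of neighboring data
        "neighbor_categories": sorted(set(neighbor_categories)),
        # dictionary built from the values seen in the sample
        "dictionary": sorted(set(sample_values)),
    }


# Example with made-up data: a date-like column next to name and address columns.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")
profile = build_category_profile(
    ["1990-01-02", "1985-12-31", "n/a"],
    ["Name", "Name", "Address"],
    {"iso_date": lambda v: iso_date.fullmatch(v) is not None},
)
```

A profile of this kind could then serve as the stored reference against which examined data is compared.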

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAOQIN HU whose telephone number is (571) 272-1792.  The examiner can normally be reached on Monday-Friday 7:00am-3:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fred Ehichioya, can be reached on (571) 272-4034.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAOQIN HU/Examiner, Art Unit 2168