DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claims 1-20 are present in this application.  Claims 1-20 are pending in this office action.  


Information Disclosure Statement
The information disclosure statements (IDS) submitted on March 13, 2020; March 08, 2021; November 10, 2021; and April 05, 2022 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements have been considered by the examiner.

This Office Action is Non-Final.



Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that
form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless -
(a)(1) the claimed invention was patented, described in a printed publication, or in public use,
on sale or otherwise available to the public before the effective filing date of the claimed
invention.


Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Williamson et al. (US 20180232528 A1).

Regarding claims 1, 9 and 15, Williamson discloses a method, comprising: 
performing, by a computer system, a scan to identify data items in a database that correspond to one or more of a plurality of specified output classes (see Williamson paragraph [0020], system for automatically scanning for sensitive data in multiple data sources; see paragraph [0021], The illustrated system includes one or more input data sources 102A-N and a sensitive data scanner 104. The sensitive data scanner 104 includes a data pre-processor 106, a data protect module 112, a data classifier 108, a classifier refinement engine 110, and a data classification reporting module 114), the scan including: 
determining metadata for a portion of the database, wherein the metadata includes schema information (see Williamson paragraph [0050], The metadata analyzer 202 analyzes the metadata of a data portion in the data received from the data of the input data sources 102A-N to determine whether the data portion is sensitive data. The metadata may include, in the case of data pre-processed by the data pre-processor 106, the metadata labels in the common data structure. In the case where a data pre-processor 106 is not used, the metadata includes the metadata labels directly extracted from the input data sources 102A-N. This includes column labels, schema names, database names, tables names, XML tags, filenames, file headers, other tags, file metadata, and so on); 
generating a data profile for a set of data items stored in the portion of database, wherein the generating includes performing a character-based analysis of the set of data items (The data pre-processor 106 may determine the relationship between the input data sources 102A-N using various rules. Data from the input data sources 102A-N may have a relationship as defined by the data itself. This may be in the form of related columns in a database, data from a single source, spatial or temporal proximity of data to other data, and so on. These relationships may be indicated by the data pre-processor 106 in the common structure (e.g., by connecting nodes in a graph) when the data pre-processor 106 parses the input data sources 102A-N and places the data in the common structure…. for non-textual data, such as audio or visual data, the data pre-processor 106 may perform text extraction (e.g., via speech recognition, object recognition, optical character recognition) against the audio or visual data to extract text from this data to be placed in the common data structure. As another example, for binary data, the data pre-processor 106 may convert the binary data to text using various binary to text conversion protocols, such as converting binary values to ASCII…. This may be achieved using a matching algorithm that matches metadata labels having the exact same label or labels that are within a threshold degree of similarity (e.g., labels that match if non-alphanumeric characters are removed). The metadata label is the descriptor used in the data to describe the data, e.g., column labels, XML tags, etc. Each metadata label may describe a portion of data that share some common characteristic. For example, a column in a database may have a metadata label indicating the data in the column are credit card numbers. 
The data pre-processor 106 may combine such data by indicating a relationship between them in the common data structure, or placing them in a single unit (e.g., a single node) in the common data structure);
identifying whether the set of data items corresponds to one of the plurality of specified output classes (see Williamson paragraph [0076], The classifier accuracy tuner 302 tunes the accuracy of the base determination of whether a data portion is sensitive data for the components of the data classifier 108 that output a binary determination of whether data is sensitive data or not sensitive data. These components may be the metadata analyzer 202, the reference data matcher 204, the pattern matcher 206, the logical classifier 208, and the contextual analyzer 210) by utilizing a multi-class neural network classifier trained to perform the identifying using a plurality of features, including features extracted from the metadata and the data profile (see Williamson paragraph [0036], the data classifier 108 may also determine that data is sensitive using machine learning algorithms. The data classifier 108 trains a machine learning model, such as a multilayer perceptron or convolutional neural network, on data known to be sensitive data. Features may first be extracted from the data using an N-gram (e.g., a bigram) model and these features input into the machine learning model for training. After training, the machine learning model will be able to determine (with a confidence level) whether data is sensitive or not); and 
identifying, based on outputs of the multi-class neural network classifier, a particular output class of the plurality of specified output classes that corresponds to the set of data items (see Williamson paragraph [0063], The deep learning classifier 212 may first extract features from a data portion before feeding it into the machine learning model. This may involve extracting bigram tokens from the data portion. The extracted features are fed into the machine learning model, which outputs a prediction, e.g., a percentage value, indicating whether the data portion is likely to be sensitive data or not sensitive data. In the case of a percentage value, if the percentage exceeds a threshold, the deep learning classifier 212 may indicate that the data portion is sensitive data. Otherwise, the data is not sensitive data).
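For illustration only, the feature extraction and thresholding steps quoted from Williamson paragraph [0063] can be sketched as follows. This is a minimal sketch and is not taken from the reference: the use of character-level bigrams, the function names, the toy scoring function standing in for a trained model, and the 0.5 threshold are all assumptions.

```python
from collections import Counter


def extract_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams in a data portion (assumed tokenization)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def toy_score(bigrams: Counter) -> float:
    """Hypothetical stand-in for a trained model: fraction of all-digit bigrams."""
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    digit_bigrams = sum(n for bigram, n in bigrams.items() if bigram.isdigit())
    return digit_bigrams / total


def is_sensitive(score: float, threshold: float = 0.5) -> bool:
    """Binary decision: sensitive only if the score exceeds the threshold."""
    return score > threshold
```

In this sketch a data portion is reported as sensitive only when its score exceeds the threshold, mirroring the "if the percentage exceeds a threshold" language of the quoted passage.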
Regarding claims 2, 10 and 16, Williamson discloses wherein the scan further includes: applying each of a set of rules to the set of data items stored in the portion of the database, wherein each rule represents a regular expression corresponding to one of the plurality of specified output classes; and wherein the plurality of features includes features extracted from application of the set of rules (see Williamson paragraph [0031], The data classifier 108 determines, using computer-specific rules and algorithms, whether a data portion in the data received from the data pre-processor 106 is sensitive data, and may also determine the level of the sensitive data (i.e., how sensitive the data is)…a data portion is a smallest unit of data that can individually convey information (or is meaningful) without relying on other data. For example, a zip code may be a data portion, as it conveys information about an address. In practice, the data classifier 108, when searching through the data from the input data sources 102A-N for data portions, may not search specifically for these smallest units of data, as this type of search may require analysis of the actual data and thus may be resource intensive…In general, the data classifier 108 may determine if the data is sensitive based on pattern matching, logical rules, contextual matching, reference table matching, and machine learning).
Regarding claims 3, 11 and 17, Williamson discloses wherein the portion of the database is a table that includes a plurality of columns, wherein the set of data items is one of the plurality of columns, and wherein the determined metadata includes a table name, a column name, and a column data type (see Williamson paragraph [0028], the data pre-processor 106 combines data having the same metadata labels. This may be achieved using a matching algorithm that matches metadata labels having the exact same label or labels that are within a threshold degree of similarity (e.g., labels that match if non-alphanumeric characters are removed). The metadata label is the descriptor used in the data to describe the data, e.g., column labels, XML tags, etc. Each metadata label may describe a portion of data that share some common characteristic. For example, a column in a database may have a metadata label indicating the data in the column are credit card numbers).
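For illustration only, the label-matching rule quoted from Williamson paragraph [0028] (labels that match once non-alphanumeric characters are removed) can be sketched as follows. This is a minimal sketch, not the reference's implementation: the function names are hypothetical, and the case-insensitive comparison is an added assumption.

```python
import re


def normalize_label(label: str) -> str:
    """Lower-case a metadata label (assumption) and drop non-alphanumeric characters."""
    return re.sub(r"[^0-9a-z]", "", label.lower())


def labels_match(a: str, b: str) -> bool:
    """Two labels match if they are identical after normalization."""
    return normalize_label(a) == normalize_label(b)
```

Under these assumptions, "credit_card_num" and "Credit-Card Num" would be treated as the same metadata label, while "ssn" and "ccn" would not.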
Regarding claims 4, 12 and 18, Williamson discloses wherein each output class of the plurality of specified output classes has a predefined list of common column names and common table names for that output class, and wherein extracting the features from the metadata includes: generating, for each output class, a feature for each of the common column names and the common table names, wherein the generating is performed by applying a character-based neural network classifier to the predefined list of the common column names and the common table names, respectively, for that output class (see Williamson paragraph [0028], a column in a database may have a metadata label indicating the data in the column are credit card numbers. The data pre-processor 106 may combine such data by indicating a relationship between them in the common data structure, or placing them in a single unit (e.g., a single node) in the common data structure. In addition to metadata labels, the data pre-processor 106 may also perform the same action, i.e., indicate a relationship or combine into a single unit, against data that have matching contents (e.g., two cells in a database with the same ID number). Furthermore, the data pre-processor 106 may combine data associated with a single metadata label into a single unit, e.g., all data under a single structure, in the common data structure).
Regarding claims 5 and 19, Williamson discloses wherein each output class has a predefined list of acceptable data types, and wherein extracting the features from the metadata includes generating a feature for each output class that indicates whether a column of the database includes an acceptable data type for that output class (see Williamson paragraph [0044], The data classification reporting module 114 reports the results produced by the data classifier 108 indicating whether data portions in the input data sources 102A-N are sensitive or not. The data classification reporting module 114 may present the results in a user interface to the user. The user interface may present to the user the types of sensitive data that have been detected (e.g., names, social security numbers, etc.), the number of data portions with sensitive data that have been detected, the source of these sensitive data portions, the historical trends regarding the number of sensitive data portions that have been detected, and so on… see Williamson paragraph [0051], The metadata analyzer 202 may look for any metadata that indicates one of the above sensitive data types. The metadata may have a label that directly infers the sensitive data type, or may include abbreviations or acronyms that indicate the sensitive data type. For example, a social security number sensitive data type may have a metadata of “ssn” or “soc_sec_num,” and a credit card number sensitive data type may be indicated with the metadata of “ccn” or “credit_card_num,” etc. Additional formulations of metadata labels may be used as well. After determining that a data portion is associated with metadata that indicates a sensitive data type, the metadata analyzer 202 may indicate that the data portion is sensitive data).
Regarding claims 6, 13 and 20, Williamson discloses wherein to perform the character-based analysis, the scan further includes determining an alphanumeric composition of characters included in each data item of the set of data items (see Williamson paragraph [0054], The pattern matcher 206 matches data portions received in the data from the data pre-processor 106 with various patterns to determine whether the data portions are sensitive data. Data portions that generate a positive match with a pattern may be determined by the pattern matcher 206 to be sensitive data. The patterns may be in any type of format that can store patterns, such as regular expressions, formal grammars, rules, wildcards, image and video pattern recognition methods (in the case of non-text data portions), and so on. For example, a sensitive data type for a social security number may be matched by the regular expression “[0-9]{3}-[0-9]{2}-[0-9]{4}” indicating a pattern of three numbers, followed by a dash, followed by two numbers, followed by another dash, and followed by 4 numbers. Other sensitive data types may be matched similarly. For example, zip codes may be matched as sequences of 5 numbers. Emails may be matched by a string of characters (the local-part), followed by the “@” symbol, followed by an alphanumeric string of characters (the domain) that may include a period, and then ending with a period and sequence of characters that matches a top level domain (e.g., “.com,” “.mail”). Telephone numbers for the United States may be matched according to the standard format of the three digit area code, a delimiter, a three digit central office code, the same delimiter, and a four digit line number).
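For illustration only, the social security number pattern quoted from Williamson paragraph [0054] and a simple per-item character tally can be sketched as follows. This is a minimal sketch: the regular expression is the one quoted above, while the function names and the three-way alpha/digit/other tally are hypothetical and are not drawn from either the claims or the reference.

```python
import re

# Pattern quoted from Williamson paragraph [0054].
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def matches_ssn(value: str) -> bool:
    """True only if the whole value has the 3-2-4 digit layout with dashes."""
    return SSN_PATTERN.fullmatch(value) is not None


def composition(value: str) -> dict:
    """Tally letters, digits, and other characters in one data item (assumed scheme)."""
    counts = {"alpha": 0, "digit": 0, "other": 0}
    for ch in value:
        if ch.isalpha():
            counts["alpha"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1
    return counts
```

Under these assumptions, a value such as "123-45-6789" both matches the quoted pattern and yields a tally of nine digits and two other characters.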
Regarding claims 7 and 14, Williamson discloses wherein to perform the character-based analysis, the scan further includes determining, for the set of data items, a distribution of character compositions of the data items within the set (see Williamson paragraph [0053], The reference data source may include a list, database, table, or other data structure stored by the reference data matcher 204 that include lists of data that are likely to be sensitive data. The data in the reference data is data that is of a sensitive data type that does not necessarily have any shared patterns, but if matched, is highly likely to be sensitive data. For example, while social security numbers follow a distinct pattern (e.g., 3 digits, dash, 2 digits, dash, 4 digits), some sensitive data types, such as names of persons, do not follow any specific pattern or rule, and can be uniquely identified and are likely to be sensitive. Other examples of sensitive data types that may be stored in the reference data sources include usernames, email address domains, postal codes, other address components (e.g., country codes, common street names), product names, medical conditions or terms, and so on. The reference data matcher 204 may require an exact match between the data portion and the data indicated in the reference data source, or only a percentage match beyond a threshold degree (e.g., matching a percentage number of the data in the data portion to the data in the reference data source)).
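For illustration only, the percentage-match behavior of the reference data matcher quoted from Williamson paragraph [0053] can be sketched as follows. This is a minimal sketch, not the reference's implementation: the function names, the case-insensitive comparison, the 0.8 default threshold, and the use of "greater than or equal to" for the threshold test are all assumptions.

```python
def fraction_matched(values: list, reference: set) -> float:
    """Fraction of values found in the reference set (reference entries assumed lower-case)."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v.strip().lower() in reference)
    return hits / len(values)


def is_reference_match(values: list, reference: set, threshold: float = 0.8) -> bool:
    """Treat the data portion as matched when enough of its values hit the reference."""
    return fraction_matched(values, reference) >= threshold
```

Under these assumptions, a column in which three of four values appear in a reference list of names would match at a 0.7 threshold but not at the 0.8 default.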
Regarding claim 8, Williamson discloses wherein to perform the character-based analysis, the scan further includes: applying the distribution of character compositions to a similarity neural network classifier trained to identify similarities of the distribution to each of the plurality of specified output classes; applying the distribution of character compositions to a dissimilarity neural network classifier trained to identify dissimilarities of the distribution to each of the plurality of specified output classes; and wherein the plurality of features includes features extracted from application of the distribution of character compositions to the similarity and dissimilarity neural network classifiers (see Williamson paragraph [0054], The pattern matcher 206 matches data portions received in the data from the data pre-processor 106 with various patterns to determine whether the data portions are sensitive data. Data portions that generate a positive match with a pattern may be determined by the pattern matcher 206 to be sensitive data. The patterns may be in any type of format that can store patterns, such as regular expressions, formal grammars, rules, wildcards, image and video pattern recognition methods (in the case of non-text data portions), and so on. For example, a sensitive data type for a social security number may be matched by the regular expression “[0-9]{3}-[0-9]{2}-[0-9]{4}” indicating a pattern of three numbers, followed by a dash, followed by two numbers, followed by another dash, and followed by 4 numbers. Other sensitive data types may be matched similarly. For example, zip codes may be matched as sequences of 5 numbers. Emails may be matched by a string of characters (the local-part), followed by the “@” symbol, followed by an alphanumeric string of characters (the domain) that may include a period, and then ending with a period and sequence of characters that matches a top level domain (e.g., “.com,” “.mail”). 
Telephone numbers for the United States may be matched according to the standard format of the three digit area code, a delimiter, a three digit central office code, the same delimiter, and a four digit line number. Bank account numbers may be matched based on a standard length for account numbers and a set pattern limiting the types of account numbers and routing numbers that are available. Credit card numbers may be matched based on length and the fact that credit card numbers follow specific patterns (e.g., the initial issuer identification number is fixed to a set number of combinations). Dates of birth, driver's license numbers, and other sensitive data types may also follow various patterns and thus be able to be pattern matched by the pattern matcher 206).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DINKU GEBRESENBET whose telephone number is 571-270-1636.  The examiner can normally be reached between 8am-5pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ashish Thomas, can be reached at 571-272-0631. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the patent application information retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/DINKU W GEBRESENBET/Primary Examiner, Art Unit 2164