Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Applicant’s Remarks
Applicant’s remarks filed 02/04/2021 have been fully considered and are rendered moot in view of the After Final Claim Amendments filed 02/04/2021 and the Examiner’s Amendment authorized on February 17, 2021 as shown below.  All prior rejections have been withdrawn.

Examiner’s Amendment
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Mr. Martin Miller on February 17, 2021.
In addition to the After Final Claim Amendments filed 02/04/2021, the application is further amended as follows: 
--15.	(Currently Amended) A non-transitory computer readable medium embodying programming code that when executed by a processor causes the processor to perform functions, including functions to:
receive a dataset with a plurality of variable length character strings;
for each respective variable length character string in the plurality of variable length character strings:
compute a plurality of features of alphanumeric characters in the respective variable length character string;
capture attributes of the alphanumeric characters in the respective variable length character string based on the plurality of computed features in the respective variable length character string, wherein the captured attributes are a combination of features;
populate a data vector with the captured attributes, wherein the data vector has a predetermined length and includes one or more of the captured attributes of the respective variable length character string;
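training a machine learning algorithm using a training dataset;
based on the captured attributes in the data vector, assign a category to each respective data vector using the trained machine learning algorithm;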
based on the category assigned to each respective data vector by the trained machine learning algorithm, evaluate the dataset; and
in response to evaluating the dataset based on the category assigned to each respective data vector, determine whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is one of:
the dataset includes a percentage of data vectors assigned to an outlier category that is less than an outlier threshold percentage, or 
the dataset fails to exceed the number of data vectors assigned to the category that corresponds to the data source that provided the dataset.  

20.	(Canceled) 

21.	(Currently Amended) A method, comprising:
receiving, by a processor, a dataset with a plurality of variable length character strings;
training a machine learning algorithm using a training dataset;
for each respective variable length character string in the plurality of variable length character strings:
computing, by the processor, a plurality of features of alphanumeric characters of the respective variable length character string;
capturing, by the processor, attributes of the alphanumeric characters of the respective variable length character string based on the plurality of computed features;
populating, by the processor, a data vector with the captured attributes, wherein the data vector has a predetermined length and includes one or more of the captured attributes of the alphanumeric characters in the respective variable length character string;
assigning a category to each respective data vector using the trained machine learning algorithm;
based on the category assigned to each respective data vector by the trained machine learning algorithm, evaluating the dataset; 
in response to evaluating the dataset based on the category assigned to each respective data vector, determining whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is satisfied when the dataset fails to exceed the number of data vectors assigned to the category that corresponds to the data source that provided the dataset; 
in response to a determination that the dataset fails to satisfy the data quality metric, generating an alarm; and


25.	(Currently Amended) The method of claim 21, further comprising: 

identifying triplets of a character of the same category, two characters of the same category adjacent to characters of a different category as a captured attribute.

27.	(Currently Amended) The method of claim 21, wherein the data quality metric is also satisfied when a percentage of data vectors assigned to an outlier category is less than an outlier threshold percentage.

28.	(Canceled) 

29.	(Currently Amended) A system, comprising:
a data source, wherein the data source outputs a dataset related to a service and the dataset includes a plurality of variable length character strings and that are encoded according to an encoding standard;
at least one database coupled to the data source and operable to store the dataset output by the data source; and 
a data quality monitoring component coupled to the data source, wherein the data quality monitoring component includes a processor and programming code that when executed by the processor, the processor is operable to perform functions, including functions to:
receive the dataset with the plurality of variable length character strings;
for each respective variable length character string in the plurality of variable length character strings:
compute a plurality of features of the respective variable length character string;
capture attributes of the respective variable length character string based on the plurality of computed features;
populate a data vector with the captured attributes, wherein the data vector has a predetermined length and includes one or more of the captured attributes of the respective variable length character string;
training a machine learning algorithm using a training dataset;
based on the captured attributes in the data vector, assign a category to each respective data vector using the trained machine learning algorithm;
based on the category assigned to each respective data vector, evaluate the dataset; 
in response to evaluating the dataset based on the category assigned to each respective data vector, determine whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is satisfied when the data set meets a percentage of data vectors assigned to an outlier category that is less than an outlier threshold percentage; and
in response to a determination that the dataset fails to satisfy the data quality metric, generate an alarm.  

31.	(Currently Amended) The system of claim 29, wherein:
a category is at least one of an upper-case letter, a lower-case letter, a punctuation mark, a Roman letter, an Arabic letter, a number, a punctuation mark, a special character, a space, or an outlier.


34.	(Currently Amended) The system of claim 29, wherein the data quality metric is satisfied when the dataset also fails to exceed the number of data vectors assigned to the category that corresponds to the data source that provided the dataset.

36.	(New) The method of claim 21, further comprising: 
identifying a letter adjacent to a number, a number adjacent to a letter, a letter surrounded by characters of a different category as a captured attribute. 

37.	(New) The system of claim 29, wherein:
the computed features include at least one of a percentage of elements in the respective variable length character string are in a specific category, a number of edges between letters, a group of numbers in serial, or a group of letters in serial. --
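
By way of illustration only, the feature computation and fixed-length data vector recited in the claims above might be sketched as follows. This is a minimal sketch under stated assumptions and is not the applicant's implementation: the four-way category scheme, the function names `char_category` and `feature_vector`, the reading of an "edge" as a boundary between adjacent characters of different categories, and the vector layout are all hypothetical.

```python
from typing import List

# Hypothetical coarse categories; the claims list more (e.g. upper/lower case).
CATEGORIES = ("letter", "number", "space", "special")


def char_category(ch: str) -> str:
    """Map a single character to one of the coarse categories above."""
    if ch.isdigit():
        return "number"
    if ch.isalpha():
        return "letter"
    if ch.isspace():
        return "space"
    return "special"


def feature_vector(text: str) -> List[float]:
    """Return a vector of fixed length 7 for a string of any length."""
    cats = [char_category(c) for c in text]
    n = len(cats) or 1  # avoid division by zero for an empty string

    # Share of characters that fall in each category.
    shares = [cats.count(c) / n for c in CATEGORIES]

    # Edges: adjacent characters whose categories differ (an assumed reading).
    edges = sum(1 for a, b in zip(cats, cats[1:]) if a != b)

    # Longest serial run of letters and of numbers.
    longest = {"letter": 0, "number": 0}
    run, prev = 0, None
    for c in cats:
        run = run + 1 if c == prev else 1
        prev = c
        if c in longest:
            longest[c] = max(longest[c], run)

    return shares + [float(edges), float(longest["letter"]), float(longest["number"])]
```

Under these assumptions, strings of different lengths such as "AB12" and "K1A 0B1" both map to a seven-element vector, which is the property that lets a single classifier handle variable length input.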

Pertinent Art Cited
The following US Patent Applications and/or NPL references reveal the current state of the art:

Bergeron et al. (US 2020/0210442) teaches a non-transitory computer readable medium embodying programming code that when executed by a processor causes the processor to perform functions (storage medium 506 and processor 502 of computer system 500 of fig.5 and par.0066), including functions to: 

compute a plurality of features of alphanumeric characters in the respective variable length character string (generate features from the text windows of varying length, where features 216 of address components in the text windows 212 include numeric zip codes, two-letter state abbreviations, compass directions, post office boxes, etc., par.0034 and par.0035); 
capture attributes of the respective variable length character string based on the plurality of computed features in the respective variable length character string, wherein the captured attributes are a combination of features (lengths of valid addresses in a given locale to identify US mailing addresses, where tokens matching five-digit zip codes, par.0033; associating tokens within each text window with address labels such as house number, road, city, state, postal code, or country, par.0039); 
populate a data vector with the captured attributes, wherein the data vector has a predetermined length and includes one or more of the captured attributes of the respective variable length character string (hash indexes 214 include fixed length vector representations of text windows 212 that are based on hash values of words and/or tokens in text windows, par.0034, where apply a first hash function to each token in a text window to generate a hash value representing an index into a fixed-length vector representation of the text window, par.0034); 
training a machine learning algorithm using a training dataset (training the logistic regression model using positive examples containing real complete mailing addresses, par.0036-0037); 
based on the captured attributes in the data vector, assign a category to each respective data vector using the trained machine learning algorithm (a score from 0 to 1 representing the probability that a text window contains an address, based on a vector representation of the text window and/or binary features 216 associated with address components in the text window, par.0036; produce scores 232 representing likelihoods the corresponding text windows 212 contain addresses, par.0037).
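
For illustration, the fixed-length hashed representation attributed to Bergeron above (a hash of each token used as an index into a fixed-length vector) might be sketched as follows. This is a generic feature-hashing sketch, not code from the reference: the function name `hashed_vector`, the use of MD5, the vector size of 16, and the choice to count collisions rather than set a flag are all assumptions.

```python
import hashlib
from typing import List


def hashed_vector(tokens: List[str], size: int = 16) -> List[int]:
    """Hash each token to an index in a fixed-length count vector."""
    vec = [0] * size
    for tok in tokens:
        # A stable hash is used so that the same token always maps to the
        # same index across runs (Python's built-in hash() is salted).
        digest = hashlib.md5(tok.lower().encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vec[index] += 1
    return vec
```

Whatever the number of tokens in a text window, the result has `size` elements, so windows of varying length can be scored by the same model.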


Hamm (US 8,788,412) teaches a system and methods for processing and checking user input data for various errors or discrepancies, such as invalid address information (Hamm: col.3 lines 30-34), and further teaches generating an alarm (Hamm: alert users, col.3 line 43) and forwarding the generated alarm to a client with a report and a link to the dataset, wherein the report indicates how the dataset failed to satisfy the data quality metric (Hamm: link user to website to look up information regarding a zip code for a particular address in the event address information appears to be incorrect in the data from the user, col.9 lines 30-31 and col.10 lines 55-58).

Zinszer et al. (“Residential address errors in public health surveillance data: A description and analysis of the impact on geocoding”, Elsevier, Spatial and Spatio-temporal Epidemiology 1, pages 163-168, year 2010) teaches a non-transitory computer readable medium embodying programming code that when executed by a processor causes the processor to perform functions, including functions to: 
receive a dataset with a plurality of variable length character strings (primary residential addresses from public health file with variable length character strings, fig.1, page 164); 
for each respective variable length character string in the plurality of variable length character strings (for each address of plurality of residential addresses from public health file, fig.1, column 2, last paragraph of page 164): 
compute a plurality of features of the respective variable length character string (for each address, determine street name, street number, and postal code; column 2, last paragraph of page 164); 
capture attributes of the respective variable length character string based on the plurality of computed features (determining attributes where postal code in Canada contains alphanumeric with alternating letters and numbers, column 1, second paragraph of page 165); 

assign a category to each respective data vector using a machine learning algorithm (categories of “exact addresses”, “recoverable addresses”, and “unprocessable addresses” using classification algorithm, fig.2 and page 165); 
based on the category assigned to each respective data vector, evaluate the dataset (evaluate where addresses are “missing” or have “other errors”, fig.2 on page 165 and fig.3 on page 166); and 
in response to evaluating the dataset based on the category assigned to each respective data vector, determine whether the dataset satisfies a data quality metric (algorithm compares addresses in the public health dataset to a Postal Code Address Data (PCAD) file maintained by Canada Post, fig.1 and column 2, last paragraph of page 164 and page 166).
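
As an illustration of the attribute noted for Zinszer above (Canadian postal codes alternating letters and numbers), a pattern check might be sketched as follows. This is a hypothetical sketch, not the method of the reference: the function name `looks_like_canadian_postal_code` is invented here, and the check covers only the letter-digit alternation in the six-character form, not whether a code is actually issued by Canada Post.

```python
import re

# Letter-digit-letter, optional single space, digit-letter-digit.
_POSTAL_PATTERN = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")


def looks_like_canadian_postal_code(value: str) -> bool:
    """Return True if the whole string has the alternating letter/number form."""
    return _POSTAL_PATTERN.fullmatch(value.strip()) is not None
```

A string that fails this check would be a candidate for the "recoverable" or "unprocessable" categories described in the reference, although that mapping is an assumption.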

Sathyanarayana et al. (US 8,468,167) teaches an automatic data validation and correction system and methods, including functions to: identify one or more anomalies from a given data set using contextual information and validation rules, and automatically correct any identified anomalies or missing information; in response to a determination that the dataset fails to satisfy the data quality metric, generate an alarm; and forward the generated alarm to a client with a report indicating how the dataset failed to satisfy the data quality metric and a link to the dataset (par.0032).  Sathyanarayana further teaches a data quality metric that is one of: the dataset includes a percentage of data vectors assigned to an outlier category that is greater than an outlier threshold percentage, or the dataset fails to exceed a number of data vectors assigned to a specific category that corresponds to a data source that provided the dataset (par.0032 and par.0039).


Allowable Subject Matter
Claims 15-19, 21-25, 27, and 29-37 are allowed.
The primary reason for the allowance of claim 15 is that the prior art of record, taken alone or in combination, fails to disclose or render obvious the subject matter of:
“in response to evaluating the dataset based on the category assigned to each respective data vector, determine whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is one of: the dataset includes a percentage of data vectors assigned to an outlier category that is less than an outlier threshold percentage, or the dataset fails to exceed the number of data vectors assigned to the category that corresponds to the data source that provided the dataset”.
The primary reason for the allowance of claim 21 is that the prior art of record, taken alone or in combination, fails to disclose or render obvious the subject matter of:
“in response to evaluating the dataset based on the category assigned to each respective data vector, determining whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is satisfied when the dataset fails to exceed the number of data vectors assigned to the category that corresponds to the data source that provided the dataset”.
The primary reason for the allowance of claim 29 is that the prior art of record, taken alone or in combination, fails to disclose or render obvious the subject matter of:
“in response to evaluating the dataset based on the category assigned to each respective data vector, determine whether the dataset satisfies a data quality metric based on a number of data vectors in the dataset assigned to a category that corresponds to a data source that provided the dataset, wherein the data quality metric is satisfied when the data set meets a percentage of data vectors assigned to an outlier category that is less than an outlier threshold percentage”.
Claims 16-19, 22-25, 27, and 30-37 are allowed due to their dependency on claims 15, 21 and 29.
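
For illustration, the outlier-percentage form of the data quality metric quoted above might be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `"outlier"` label, and the treatment of an empty dataset as having zero percent outliers are hypothetical, and the alternative "fails to exceed" form of the metric is not modeled.

```python
from collections import Counter
from typing import Iterable


def outlier_percentage(categories: Iterable[str], outlier_label: str = "outlier") -> float:
    """Percentage of data vectors whose assigned category is the outlier label."""
    labels = list(categories)
    if not labels:
        return 0.0  # assumed convention for an empty dataset
    return 100.0 * Counter(labels)[outlier_label] / len(labels)


def satisfies_outlier_metric(categories: Iterable[str], threshold_pct: float) -> bool:
    """The metric is satisfied when the outlier share is strictly below the threshold."""
    return outlier_percentage(categories) < threshold_pct
```

In this sketch an alarm would be generated whenever `satisfies_outlier_metric` returns `False`. The comparison is strict because the quoted language requires the percentage to be "less than" the threshold.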
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
The additional prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 2019/0057306.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HIEN (CINDY) D KHUU whose telephone number is (571)272-8585.  The examiner can normally be reached on Monday-Friday 8am-4:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ken Lo, can be reached at 571-272-9774.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.
/HIEN D KHUU/
Primary Examiner, Art Unit 2116
February 18, 2021