DETAILED ACTION

The present application is being examined under the pre-AIA  first to invent provisions. 

Remarks
The amendments were received on 4/8/21.  Claims 57-105 are pending in the application.  Claims 1-56 have been cancelled and claims 57-105 have been added.  Applicants' arguments have been carefully and respectfully considered.
Claim(s) 57-65, 67-84, 86-94, and 96-104 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Gould (US 2005/0102325) and further in view of Zait et al. (US 6,957,225).
Claims 66, 85, 95, and 105 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Gould in view of Zait, and further in view of Brookler et al. (US 6879976).

Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.


Claim(s) 57-62, 64, 65, 67-74, 76-81, 83, 84, 86-91, 93, 94, 96-101, 103 and 104 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Gould (US 2005/0102325) and further in view of Zait et al. (US 6,957,225).

With respect to claim 57, Gould teaches a method for data characterization comprising: 
receiving multiple data sets from one or more data sources (pa 0098, an input data set 402 from potentially several types of data systems);
generating for each of the received multiple data sets a data profile including summary information representative of data characteristics for the corresponding data set (pa 0098, make census component conducts a “census” of the data set, creating a separate census record for each unique field/value pair and each census record includes a count of the number of occurrences of the unique field/value pair for that census record & pa 0126, The canonicalize component 616 takes in a flow of records and sends out a flow of census elements containing a field/value pair representing values for each field in an input record. (An input record with ten fields yields a flow of ten census elements.) Each value is converted into a canonical (i.e., according to a pre-determined format) human readable string representation. Also included in the census element are flags indicating whether the value is valid and whether the value is null);
applying multiple characterization procedures to at least some of the generated data profiles for the multiple data sets (pa 0147, the information in the census file can be used to perform the joint-field analysis between two fields in two to produce characterizations of relationships between fields of records corresponding to at least some of the multiple data sets (Gould, pa 0147, the result of the joint-field analysis includes information about potential relationships between the fields);
identifying a candidate set of potentially matching data sets from the multiple data sets (Gould, pa 0176, the results of the joint-field analysis including which of the three types of relationship potentially exists between various fields is loaded into the metadata store for presentation to the user); and 
deriving quality metrics for the candidate set of potentially matching data sets (Gould, Fig. 15b & pa 0177, A single extend component 1400 receives records from the set of prepared census data C 1526, along with join information 1528 specifying the specific fields in source C to be compared. Extended records flow into both ports of a census join component 1200 that generates records containing values, patterns of occurrence, and counts for occurrence charts for the fields being compared.).
Gould doesn't expressly discuss eliminating at least some data sets from the multiple data sets according to the characterizations of the relationships produced by applying the multiple characterization procedures to form the candidate set of potentially matching data sets.
Zait teaches eliminating at least some data sets from the multiple data sets according to the characterizations of the relationships produced by applying the multiple characterization procedures to form the candidate set of potentially matching data sets (Zait, Col. 8 Li. 52- Col. 9 Li. 32, Determination of significance of 
It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains to have modified Gould with the teachings of Zait because partition pruning based on column correlation reduces the number of records processed during query execution (Zait, Col. 6 Li. 9-13).
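
For illustration only, the "census" described in the cited passage of Gould (pa 0098), in which a separate record with an occurrence count is created for each unique field/value pair, can be sketched as follows. The function name make_census and the sample records are hypothetical and are not taken from Gould; this is a minimal sketch of the counting idea, not the reference's implementation.

```python
from collections import Counter


def make_census(records):
    """Count occurrences of each unique (field, value) pair.

    Hypothetical helper: `records` is assumed to be a list of dicts
    mapping field names to values.
    """
    census = Counter()
    for record in records:
        for field, value in record.items():
            census[(field, value)] += 1
    return census


# Made-up sample data set with two fields per record.
records = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": "CA"},
]
census = make_census(records)
```

Under these assumptions, each key of census corresponds to one census record and each value to its count, so a data set with many repeated values yields far fewer census records than input field values.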

With respect to claim 58, Gould in view of Zait teaches the method of claim 57, wherein eliminating at least some data sets from the multiple data sets includes applying successive elimination rules to the characterizations of relationships between fields of records corresponding to at least some of the multiple data sets (Gould, pa 0121, filter incoming records, pa 0172, filter census records, pa 0173, filter converts values in census records, pa 0196, filtering out less meaningful functional dependency relationships).

With respect to claim 59, Gould in view of Zait teaches the method of claim 57, wherein the summary information representative of data characteristics for the corresponding data set includes one or more of: patterns, counts, or distribution of data in the data set 

With respect to claim 60, Gould in view of Zait teaches the method of claim 57, wherein applying the multiple characterization procedures to produce the characterization of relationships between fields includes:
determining level of data overlap between a first data set of the multiple data sets and at least another data set of the multiple data sets (Gould, pa 0155, “relative value overlap” for each field, representing the percentage of distinct values occurring in one field that also occur in the other).
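
For illustration only, the "relative value overlap" quoted above (the percentage of distinct values occurring in one field that also occur in the other) can be sketched as follows. The function name and sample values are hypothetical; treating an empty field as 0% overlap is an assumption made here, not something stated in Gould.

```python
def relative_value_overlap(values_a, values_b):
    """Percentage of distinct values in field A that also occur in field B.

    Hypothetical helper; an empty field A is treated as 0% overlap
    (an assumption, not taken from the reference).
    """
    distinct_a = set(values_a)
    distinct_b = set(values_b)
    if not distinct_a:
        return 0.0
    return 100.0 * len(distinct_a & distinct_b) / len(distinct_a)


# Made-up field values: A has distinct values {1, 2, 3, 4}, B has {2, 3, 9}.
field_a = [1, 2, 2, 3, 4]
field_b = [2, 3, 9]
overlap_a = relative_value_overlap(field_a, field_b)  # 2 of 4 distinct values
overlap_b = relative_value_overlap(field_b, field_a)  # 2 of 3 distinct values
```

The measure is directional: it is computed once per field, which is consistent with the reference describing a relative value overlap "for each field".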

With respect to claim 61, Gould in view of Zait teaches the method of claim 60, further including:
selecting, based on the level of data overlap between the first data set and the at least other data set, two or more of the multiple data sets from the candidate set for which data quality metrics are derived (Gould, pa 0155, Some statistics based on these totals are used to determine whether a pair of fields has one of the three types of relationships mentioned above. The statistics include the percentages of total records in a field that have distinct or unique values, percentages of total records having a particular pattern of occurrence, and the "relative value overlap" for each field & pa 0156-0159, high relative value overlap indicates relationship).

With respect to claim 62, Gould in view of Zait teaches the method of claim 57, wherein applying the multiple characterization procedures includes:
determining, based on a first data profile for a first data set from the multiple data sets, a minimum number of distinct values required to be present within a second data set so that an overlap threshold, indicative that the second data set and the first data set substantially overlap, is exceeded; and determining, based on a second data profile for the second data set, whether the second data set includes at least the minimum number of distinct values (Gould, pa 0155, determining whether a pair of fields has one of the three types of relationships, [0156] foreign key relationship - a first one of the fields has a high relative value overlap (e.g., >99%) and the second field has a high percentage (e.g., >99%) of unique values. The second field is potentially a primary key and the first field is potentially a foreign key of the primary key. [0157] joins well relationship - at least one of the fields has a small percentage (e.g., <10%) of rejected records, and the percentage of individual joined records having a pattern of occurrence of NxN is small (e.g., <1%)).
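
For illustration only, the foreign key test quoted from Gould (pa 0156) can be sketched as a pair of threshold checks. The helper names are hypothetical, the 99% default merely mirrors the example figure in the quoted passage, and reading "percentage of unique values" as the share of records whose value occurs exactly once is an assumption.

```python
from collections import Counter


def percent_overlap(values_a, values_b):
    """Percentage of distinct values in A that also occur in B."""
    distinct_a = set(values_a)
    if not distinct_a:
        return 0.0
    return 100.0 * len(distinct_a & set(values_b)) / len(distinct_a)


def percent_unique(values):
    """Percentage of records whose value occurs exactly once (assumed reading)."""
    if not values:
        return 0.0
    counts = Counter(values)
    singles = sum(1 for count in counts.values() if count == 1)
    return 100.0 * singles / len(values)


def looks_like_foreign_key(first_field, second_field, threshold=99.0):
    """True if first_field is potentially a foreign key of second_field."""
    return (percent_overlap(first_field, second_field) > threshold
            and percent_unique(second_field) > threshold)


# Made-up data: order rows referencing customer ids.
order_customer_ids = [1, 1, 2, 3]
customer_ids = [1, 2, 3, 4]
forward = looks_like_foreign_key(order_customer_ids, customer_ids)
reverse = looks_like_foreign_key(customer_ids, order_customer_ids)
```

In this made-up example every distinct value of the first field appears in the second and the second field has no repeats, so only the forward direction satisfies both thresholds.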

With respect to claim 63, Gould in view of Zait teaches the method of claim 57, as discussed above.  Gould doesn't expressly discuss the teachings of claim 63.
Zait teaches wherein applying the multiple characterization procedures includes:
identifying, based on a first data profile for a first data set (Zait, Col. 8 Li. 20-28, In one embodiment, identification of correlated columns includes identifying a column based on the number of distinct values for the particular column. Further in the 
correlated columns.), one or more most common values in the first data set (Zait, Col. 6 Li. 1-8, the column order_date from the ORDERS table is correlated to the column shipment_date, which is the partitioning key for the SALES table, via the common column order_id from both the SALES and ORDERS tables. Correlation between columns, in this context, refers to a strong relationship between the values in one column and the corresponding values contained in another column.) such that a ratio of a number of records, excluding records for the one or more identified most common values, to a total number of records in the first data set is below an overlap threshold required to be exceeded for another data set to be determined to substantially overlap the first data set (Zait, Col. 9 Li. 19-32, if the following condition is met (e.g., returns "true"), then the correlation between columns C1 and C2 is considered significant. n/N«1.0. A relationship between n and N that is significantly less than one is considered a significant correlation because it indicates that the number of distinct combinations of values for the correlated columns is significantly less than the total number of rows produced by a join of the tables. That is, the columns are strongly correlated, evidenced by the existence of repetitious value combinations with respect to the columns. & Col. 11 Li. 55-59, correlation table (i.e., a join of tables T1 and T2 on columns C1 and C2, with duplicate records eliminated)); and
identifying one or more other data sets, from the multiple data sets, not including the one or more identified most common values in the first data set (Zait, Fig. 2, step 220 & Col. 11 Li. 53-67, ensuring correlation table size is less than a 
It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains to have modified Gould with the teachings of Zait because it ensures that the benefits of the correlation process outweigh the costs of creating and maintaining the correlation table (Zait, Col. 11 Li. 55-59).
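
For illustration only, the significance condition quoted from Zait (n/N«1.0, where n is the number of distinct combinations of values for the correlated columns and N is the total number of rows produced by the join) can be sketched as follows. The function names are hypothetical, and the numeric cutoff of 0.1 is an assumption, since the quoted passage only requires the ratio to be significantly less than one.

```python
def correlation_ratio(joined_rows):
    """Return n/N for (c1, c2) value pairs taken from the joined rows.

    n is the number of distinct (c1, c2) combinations and N is the
    total number of joined rows.
    """
    total = len(joined_rows)
    if total == 0:
        raise ValueError("no joined rows to evaluate")
    return len(set(joined_rows)) / total


def is_significant(joined_rows, cutoff=0.1):
    """Treat the correlation as significant when n/N is below the cutoff.

    The cutoff value is an assumed stand-in for "significantly less than one".
    """
    return correlation_ratio(joined_rows) < cutoff


# Made-up joined rows: 100 rows but only 2 distinct value combinations.
repetitive = [("2021-01", "2021-02")] * 50 + [("2021-02", "2021-03")] * 50
# Made-up joined rows in which every combination is distinct.
all_distinct = [(i, i) for i in range(10)]

ratio = correlation_ratio(repetitive)
```

Deduplicating the joined pairs, as set(joined_rows) does here, corresponds to the correlation table the reference describes as a join "with duplicate records eliminated".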

With respect to claim 64, Gould in view of Zait teaches the method of claim 57, wherein applying the multiple characterization procedures to at least some of the generated data profiles includes:
applying one or more characterization procedures implemented to avoid all-to-all pairwise comparisons between different profiles or different data sets (Gould, pa 0161, By comparing census records corresponding to the key fields in the join operation, with filter 1202 selecting "Field 1" (A1) and filter 1204 selecting "Field 1" (B1), the census join component 1200 potentially makes a much smaller number of comparisons than a join component 1100 that compares key fields of individual records from Table A and Table B.).

With respect to claim 65, Gould in view of Zait teaches the method of claim 57, further comprising:


With respect to claim 67, Gould in view of Zait teaches the method of claim 57, wherein the summary information of a first data profile for at least one field of records stored in the first data set includes a list of distinct values appearing in the field, and respective counts of numbers of records in which each distinctive value appears (Gould, Fig 8b).

With respect to claim 68, Gould in view of Zait teaches the method of claim 67, wherein the first data profile, from the data profiles, includes descriptive information describing one or more characteristics associated with the first data profile (Gould, Fig 8a).

With respect to claim 69, Gould in view of Zait teaches the method of claim 68, wherein the descriptive information for the first data profile includes issue information describing one or more potential issues associated with the data set associated with the first data profile (Gould, Fig 8a).

With respect to claim 70, Gould in view of Zait teaches the method of claim 69, wherein the one or more potential issues include presence of duplicate values in a field that is detected as a candidate primary key field (Gould, Fig 8a, duplicates).

With respect to claim 71, Gould in view of Zait teaches the method of claim 68, wherein the descriptive information describing the one or more characteristics associated with the first data profile includes population information describing a degree of population of a field in records stored in a first data source (Gould, Fig 8a).

With respect to claim 72, Gould in view of Zait teaches the method of claim 68, wherein the descriptive information describing the one or more characteristics associated with the first data profile includes uniqueness information describing a degree of uniqueness of values appearing in the field of records stored in a first data source (Gould, Fig 8b, detailed counts, distinct values percent).

With respect to claim 73, Gould in view of Zait teaches the method of claim 68, wherein the descriptive information describing the one or more characteristics associated with the first data profile includes pattern information describing one or more repeated patterns characterizing values appearing in a field of records stored in a first data source (Gould, Fig 8b, most common patterns).

With respect to claim 74, Gould in view of Zait teaches the method of claim 57, wherein applying multiple characterization procedures includes:

aggregating the two or more sets data profiles, to which the one or more rules were applied, to produce a third data profile; and storing the third data profile (Gould, Fig. 13 & pa 0165-0166, extended records are generated by concatenating a unique identifier for the pair of key fields that are being joined with the value in the census record).

With respect to claim 75, Gould in view of Zait teaches the method of claim 57, as discussed above.  Gould doesn't expressly discuss determining that a first data set, from the candidate set of potentially matching data sets from the multiple data sets, substantially overlaps a second data set; and determining that the first data set substantially overlaps a third data set in response to a determination that the second data set is sufficiently similar to the third data set.
Zait teaches determining that a first data set, from the candidate set of potentially matching data sets from the multiple data sets, substantially overlaps a second data set; and determining that the first data set substantially overlaps a third data set in response to a determination that the second data set is sufficiently similar to the third data set (Zait, Col. 7 Li. 36-43).
It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains to have modified 

With respect to claims 76-84, the limitations are essentially the same as claims 57-65 and are thus rejected for the same reasons.

With respect to claims 86-94, the limitations are essentially the same as claims 57-65 and are thus rejected for the same reasons.

With respect to claims 96-104, the limitations are essentially the same as claims 57-65 and are thus rejected for the same reasons.

Claims 66, 85, 95, and 105 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Gould in view of Zait, and further in view of Brookler et al. (US 6879976).

With respect to claim 66, Gould in view of Zait teaches the method of claim 65, as discussed above.  Gould in view of Zait doesn't expressly discuss the teachings of claim 66.
Brookler teaches wherein directly comparing the first set of distinct values in the first data profile to the second set of distinct values includes:
forming a vector intersection of a first bit vector, identifying which of a set of possible distinct values for all of the multiple data sets is included in the first set of 
It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains to have modified Gould in view of Zait with the teachings of Brookler because it results in faster processing and less memory usage while saving time by avoiding reconciling individual result sets (Brookler, Col. 2 Li. 65-Col. 3 Li. 14).
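
For illustration only, the general idea of intersecting two bit vectors, each marking which of a shared set of possible distinct values is present in a data set, can be sketched as follows. This is not Brookler's implementation; the helper name, the use of a Python integer as the bit vector, and the sample values are all assumptions.

```python
def to_bit_vector(values, position):
    """Set one bit per distinct value, at the index given by `position`.

    Hypothetical helper: `position` maps every possible distinct value
    to a bit index shared by all data sets.
    """
    bits = 0
    for value in set(values):
        bits |= 1 << position[value]
    return bits


# Made-up distinct values from two data profiles.
first_values = ["CA", "NY", "TX"]
second_values = ["NY", "TX", "WA"]

# Shared index over all possible distinct values across both data sets.
universe = sorted(set(first_values) | set(second_values))
position = {value: i for i, value in enumerate(universe)}

# A bitwise AND yields the vector intersection.
common_bits = to_bit_vector(first_values, position) & to_bit_vector(second_values, position)
common_values = [v for v in universe if common_bits & (1 << position[v])]
common_count = bin(common_bits).count("1")
```

Under these assumptions the comparison of two sets of distinct values reduces to a single bitwise operation once the shared index has been built.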
	
With respect to claims 85, 95, and 105, the limitations are essentially the same as claim 66 and are thus rejected for the same reasons.

Response to Arguments
Rejection of claims under 35 U.S.C. 103
Applicant seems to argue a newly amended limitation.  Applicant’s amendment has rendered the previous rejection moot.  Upon further consideration of the amendment, a new ground of rejection is made in view of Zait et al. (US 6,957,225).


Conclusion
THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRITTANY N ALLEN whose telephone number is (571)270-3566.  The examiner can normally be reached on M-F 9 am - 5:00 pm EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached on 571-272-4046.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/BRITTANY N ALLEN/           Primary Examiner, Art Unit 2169