Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


EXAMINER’S AMENDMENT
1.	An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
2.	Authorization for this examiner’s amendment was given in a telephone interview with Shrinath Malur (Registration Number: 34,663) for Applicant on February 1, 2022.
3.	The amendment filed 1/20/22 has been entered. The instant Examiner’s amendment is directed to said entered amendment.
4.	The application has been amended as follows:
IN THE CLAIMS
5.	Replace claims 1-25 with claims 1-25, as amended by the Examiner, set forth below:
1. (Previously Presented) A method for data entries deduplication, comprising:
indexing an input data set, wherein the input data set is in a tabular format and the indexing includes providing a unique Row identifier (RowID), wherein rows are the data entries;
standardizing the input data set into a predefined and unified format; 
segmenting the standardized input data set, wherein each segment includes a subset of the rows included in the input data set;

computing attribute similarity for each column across each pair of rows;
computing, for each pair of rows, row-to-row similarity as a weighted sum of attribute similarities;
clustering pairs of rows based on their row-to-row similarities;
determining clusters that are substantially related based on a cluster signature;
iteratively merging clusters that are substantially related;
providing an output data set including at least the clustered pairs of rows; and
wherein the output data set further includes the input data set, a cluster identification indicating the deduplicated group to which a corresponding row belongs, a cluster anchor information including the RowID, and a confidence score indicating a confidence or likelihood that the row belongs to the cluster.
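Illustrative sketch (not part of the claim language): the row-to-row similarity recited in claim 1 can be pictured as a weighted sum over per-column attribute similarities. The column names, weights, sample rows, and the `standardize` and `attribute_similarity` helpers below are hypothetical and chosen only for illustration; the claim does not prescribe any of them.

```python
from itertools import combinations


def standardize(value):
    """Hypothetical standardization step: trim whitespace and lowercase."""
    return str(value).strip().lower()


def attribute_similarity(a, b):
    """Placeholder comparator (assumption): 1.0 on exact match, else 0.0."""
    return 1.0 if a == b else 0.0


def row_similarity(row_a, row_b, weights):
    """Row-to-row similarity as a weighted sum of attribute similarities."""
    return sum(
        weight * attribute_similarity(row_a[col], row_b[col])
        for col, weight in weights.items()
    )


# Sample tabular input (made up for this sketch).
rows = [
    {"name": "Acme Corp ", "city": "Boston"},
    {"name": "acme corp", "city": "boston"},
    {"name": "Globex", "city": "Boston"},
]

# Indexing: give every row a unique RowID, then standardize each cell.
indexed = {
    row_id: {col: standardize(val) for col, val in row.items()}
    for row_id, row in enumerate(rows)
}

# Hypothetical column weights; claim 5 contemplates learning these instead.
weights = {"name": 0.75, "city": 0.25}

# Similarity for each pair of rows, keyed by (RowID, RowID).
pair_scores = {
    (i, j): row_similarity(indexed[i], indexed[j], weights)
    for i, j in combinations(indexed, 2)
}
```

Under these assumed weights, rows 0 and 1 agree on both columns after standardization, while row 2 agrees with them only on the city column.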

2.-3. (Canceled). 

4. (Original) The method of claim 1, wherein computing attribute similarity further comprises:
utilizing a comparator based on a type of an attribute to compute the attribute similarity, wherein the comparator is any one of: exact matching and fuzzy matching.
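Illustrative sketch (not part of the claim language): one way to picture the comparator selection of claim 4 is a lookup from attribute type to either an exact or a fuzzy comparator. The attribute type labels (`"id"`, `"text"`) and the use of Python's standard `difflib.SequenceMatcher` as the fuzzy comparator are assumptions made for this example; the claim does not name a particular fuzzy-matching technique.

```python
from difflib import SequenceMatcher


def exact_match(a, b):
    """Exact matching: 1.0 when the values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def fuzzy_match(a, b):
    """Fuzzy matching via difflib's similarity ratio in [0.0, 1.0]."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical mapping from attribute type to comparator.
COMPARATORS = {
    "id": exact_match,    # e.g. identifiers or codes
    "text": fuzzy_match,  # e.g. free-text names
}


def attribute_similarity(a, b, attr_type):
    """Select the comparator by attribute type and apply it."""
    return COMPARATORS[attr_type](a, b)
```

For example, `attribute_similarity("jon smith", "john smith", "text")` yields a score close to, but below, 1.0, whereas the same pair under the `"id"` type scores 0.0.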

5. (Original) The method of claim 4, wherein row-to-row similarity demonstrates pairs of rows are similar, and wherein the weights are determined based on a machine learning model.

6. (Original) The method of claim 1, wherein clustering the pairs of rows further comprises:
generating a graph including nodes and edges, wherein the nodes represent rows and edges represent the row-to-row similarities; and
applying a greedy algorithm on the graph to cluster rows, wherein each cluster includes at least two similar data entries.

7. (Original) The method of claim 6, wherein the clustering results in isolated rows, wherein each isolated row is individually clustered.
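Illustrative sketch (not part of the claim language): claims 6 and 7 recite a graph of rows and similarity edges, clustered with a greedy algorithm, with isolated rows clustered individually. The claims do not identify a specific greedy algorithm, so the version below is an assumption: it visits edges from strongest to weakest and joins the two endpoint clusters whenever the similarity meets a hypothetical `threshold`.

```python
def greedy_cluster(row_ids, edges, threshold=0.8):
    """Greedy clustering over a similarity graph.

    row_ids: iterable of RowIDs (graph nodes).
    edges:   {(row_a, row_b): similarity} (graph edges).
    Returns {row_id: cluster_id}; a row with no qualifying edge
    stays in its own single-row cluster.
    """
    cluster_of = {r: r for r in row_ids}   # every row starts isolated
    members = {r: {r} for r in row_ids}    # cluster_id -> set of rows

    # Greedy choice: always process the strongest remaining edge first.
    ranked = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), score in ranked:
        if score < threshold:
            break  # all remaining edges are weaker still
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same cluster
        # Fold the smaller cluster into the larger one.
        if len(members[ca]) < len(members[cb]):
            ca, cb = cb, ca
        for r in members[cb]:
            cluster_of[r] = ca
        members[ca] |= members.pop(cb)
    return cluster_of


# Made-up example: rows 0-2 are chained by strong edges, row 3 is isolated.
clusters = greedy_cluster(
    [0, 1, 2, 3],
    {(0, 1): 0.95, (1, 2): 0.85, (2, 3): 0.10},
)
```

Here rows 0, 1, and 2 end up sharing one cluster identifier, and row 3 keeps its own.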

8.-9. (Canceled). 

10. (Previously Presented) The method of claim 6, wherein the format of the output data set is any one of: a table and a graph.

11. (Original) The method of claim 10, wherein the input data set is sourced from a plurality of data sources.

12. (Currently Amended) A non-transitory computer readable medium having stored thereon instructions for causing a processing circuitry to perform a process for data entries deduplication, the process comprising:

standardizing the input data set into a predefined and unified format; 
segmenting the standardized input data set, wherein each segment includes a subset of the rows included in the input data set;
indexing each segment using a text search engine;
computing attribute similarity for each column across each pair of rows;
computing, for each pair of rows, row-to-row similarity as a weighted sum of attribute similarities;
clustering pairs of rows based on their row-to-row similarities;
determining clusters that are substantially related based on a cluster signature;
iteratively merging clusters that are substantially related;
providing an output data set including at least the clustered data entries and clustered data entries; and
wherein the output data set further includes the input data set, a cluster identification indicating the deduplicated group to which a corresponding row belongs, a cluster anchor information including the RowID, and a confidence score indicating a confidence or likelihood that the row belongs to the cluster.

13. (Previously Presented) A system for data entries deduplication, comprising:
a processing circuitry; and
a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to:

standardize the input data set into a predefined and unified format; 
segment the standardized input data set, wherein each segment includes a subset of the rows included in the input data set;
index each segment using a text search engine;
compute attribute similarity for each column across each pair of rows;
compute, for each pair of rows, row-to-row similarity as a weighted sum of attribute similarities;
cluster pairs of rows based on their row-to-row similarities;
determine clusters that are substantially related based on a cluster signature;
iteratively merge clusters that are substantially related;
provide an output data set including at least the clustered data entries and clustered data entries; and
wherein the output data set further includes the input data set, a cluster identification indicating the deduplicated group to which a corresponding row belongs, a cluster anchor information including the RowID, and a confidence score indicating a confidence or likelihood that the row belongs to the cluster.

14.-15. (Canceled). 

16. (Original) The system of claim 13, wherein the system is further configured to:


17. (Original) The system of claim 16, wherein row-to-row similarity demonstrates pairs of rows are similar, and wherein the weights are determined based on a machine learning model.

18. (Original) The system of claim 13, wherein the system is further configured to:
generate a graph including nodes and edges, wherein the nodes represent rows and edges represent the row-to-row similarities; and
apply a greedy algorithm on the graph to cluster rows, wherein each cluster includes at least two similar data entries.

19.-20. (Canceled). 

21. (Currently Amended) The system of claim 18, wherein the format of the output data set is any one of: a table and a graph.

22. (Original) The system of claim 21, wherein the input data set is sourced from a plurality of data sources. 



24. (Currently Amended) The non-transitory computer readable medium according to claim 12, wherein the process further comprises:
generating a graph including nodes and edges, wherein the nodes represent rows and edges represent the row-to-row similarities; and
applying a greedy algorithm on the graph to cluster rows, wherein each cluster includes at least two similar data entries,
wherein the cluster signature is a combination of a cluster anchor, a cluster ID, and a neighbor vector, where the cluster anchor is the row that has the most connections with other rows in a particular cluster, and where the neighbor vector includes rows that are part of the particular cluster in addition to rows that are “N” hops away, “N” being a non-negative integer, in the generated graph, from rows within the particular cluster.
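Illustrative sketch (not part of the claim language): the cluster signature of claim 24 combines a cluster anchor, a cluster ID, and a neighbor vector. The code below is a minimal reading under stated assumptions: "N hops away" is interpreted as every row reachable within N hops of any cluster member, ties for the anchor are broken by the smallest RowID, and the function and field names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (row_a, row_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def cluster_signature(cluster_id, members, adj, n_hops=1):
    """Return the anchor, cluster ID, and neighbor vector for one cluster."""
    members = set(members)

    # Anchor: the member with the most connections to other members.
    # Iterating in sorted order makes ties resolve to the smallest RowID.
    anchor = max(sorted(members), key=lambda r: len(adj[r] & members))

    # Neighbor vector: cluster members plus rows within n_hops of them
    # (assumed interpretation of "N hops away").
    reached = set(members)
    frontier = set(members)
    for _ in range(n_hops):
        frontier = {nb for r in frontier for nb in adj[r]} - reached
        reached |= frontier

    return {
        "cluster_id": cluster_id,
        "anchor": anchor,
        "neighbor_vector": sorted(reached),
    }


# Made-up chain graph 0-1-2-3-4 with a cluster made of rows 0, 1, and 2.
adj = build_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)])
signature = cluster_signature("c0", {0, 1, 2}, adj, n_hops=1)
```

In this example row 1 is the anchor because it connects to both other members, and with N = 1 the neighbor vector also picks up row 3. Two clusters whose neighbor vectors overlap heavily could then be treated as candidates for the iterative merging recited in claim 12, although the claims leave the exact relatedness test open.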

25. (Currently Amended)  The system of claim 18, wherein the cluster signature is a combination of a cluster anchor, a cluster ID, and a neighbor vector, where the 


Conclusion
6.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to SYLING YEN whose telephone number is (571)270-1306.  The examiner can normally be reached on 8am-4:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mark Featherstone can be reached at 571-270-3750.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SYLING YEN/
Primary Examiner, Art Unit 2166
February 1, 2022