DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Status of Claims
The following claims are pending in this office action: 1-7, 10-16, 19-21
The following claims are amended: 1-4, 10, 12-13, 15-16, 19
The following claims are new: 20-21
The following claims are cancelled: 8-9, 17-18
The following claims are rejected: 1-7, 10-16, 19-21
Response to Arguments
Applicant’s arguments and amendments filed on 04/20/21 to address the objections have been fully considered and are persuasive. The objections to the specification and claims 3, 10, 12, and 19 have been withdrawn.
Applicant’s arguments and amendments filed on 04/20/21 to address the 112(f) interpretation have been fully considered and are persuasive. The 112(f) interpretation of claim 1 has been withdrawn.
Applicant’s arguments and amendments filed on 04/20/21 to address the 112(b) rejection have been fully considered and are persuasive. The 112(b) rejection of claims 4-5 and 13-14 has been withdrawn.
Applicant’s arguments and amendments filed on 04/20/21 to address the 101 rejection have been fully considered and are persuasive. The 101 rejection has been withdrawn.
Applicant's arguments and amendments filed on 02/28/2021 to address the 35 U.S.C. 103 rejection have been fully considered but they are not persuasive. Applicant argues that the “minimum confidence factor” has been mischaracterized as the “threshold” and that the “classifiers are facilitated with learning based on a comparison between two categories of classification…” Applicant is arguing points that are not claimed. Additionally, Applicant argues that Russell does not disclose or suggest “facilitating learning”. Examiner disagrees with Applicant’s arguments. Para. [0028], [0093], and [0111] of Russell disclose that classifiers are modified/tuned/learned/retrained upon a determination that the number of label conflicts/false positives exceeds a threshold. This is an indication of a low level of confidence due to label conflicts/false positives, thus necessitating further training of the classifiers. The combination of Dirac, Leon, and Russell teaches what the Applicant is arguing. Therefore, Examiner respectfully asserts that the combination of the cited art sufficiently teaches the limitations recited in the amended claims.
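For illustration only, the threshold-based retraining that Examiner relies on can be sketched as follows. This is a minimal sketch and not Russell's actual implementation: the function names, the labels, and the threshold value are hypothetical and are chosen solely to show how an error count exceeding a defined threshold could trigger further training.

```python
def count_label_conflicts(predicted, expected):
    """Count records whose predicted label differs from the expected label."""
    return sum(1 for p, e in zip(predicted, expected) if p != e)


def needs_retraining(error_count, threshold):
    """Return True when the observed errors exceed the defined threshold."""
    return error_count > threshold


# Hypothetical labels for four records (assumed data, not from any reference).
predicted = ["duplicate", "duplicate", "non-duplicate", "duplicate"]
expected = ["duplicate", "non-duplicate", "non-duplicate", "non-duplicate"]

conflicts = count_label_conflicts(predicted, expected)
retrain = needs_retraining(conflicts, threshold=1)  # threshold value is assumed
```

In this sketch, a classifier whose conflicts exceed the threshold is treated as having low confidence and is flagged for retraining, consistent with the reading of Russell set forth above.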
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
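A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.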

	Claims 1-2, 6-7, 10-11, 15-16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. US 20150379430 A1 to Dirac, et al. (hereinafter, “Dirac”), in view of “Evaluating the effect of voting methods on ensemble-based classification” to Leon, et al. (hereinafter, “Leon”), and further in view of U.S. Pub. No. US 20150254555 A1 to Russell, et al. (hereinafter, “Russell”).
As per claim 1, Dirac teaches a method of managing data of an entity, the method comprising:
receiving, by a data management system comprising a processor, data associated with an entity from a data source, wherein the data comprises a current data and a reference data (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” and Fig. 76 discloses Processor 9010n (current data is DS2 and reference data is DS1))
predicting, by the data management system comprising the processor, a category of the current data to be one of, duplicate data and non-duplicate data, with respect to the reference data, [[using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually]] (Dirac, Para. [0349] discloses “In the depicted embodiment, a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set” and Fig. 76 discloses Processor 9010n)
generating, by the data management system comprising the processor, a confidence factor of the duplicate data category and the non-duplicate data category [[based on the prediction of each of the plurality of SML classifiers]] (Dirac, Para. [0349] discloses “In some embodiments the probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate”) 
determining, by the data management system comprising the processor, the current data to be one of, the duplicate data and the non-duplicate data based on the confidence factor (Dirac, Para. [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs.” Para [0351] further discloses “In at least some embodiments, if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold criterion (confidence factor), the system can determine whether or not a data record is a duplicate through a responsive action))
providing instructions to a system based on the determination of the current data to be one of the duplicate data and the non-duplicate data to manage redundant data (Dirac, Para [0351] discloses “one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g., clients may be sent warning messages indicating the possibility of duplicates, likely duplicates may be removed or deleted from the test data set 7004, a machine learning job that involves the use of the test data may be suspended, canceled or abandoned, and so on.” and Para [0365] discloses “in response to the identification of potential or likely duplicates within a data set, the MLS may suspend, abandon or cancel a machine learning job which involves the use of the data set or is otherwise associated with the data set.”)
Dirac fails to explicitly teach:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually
based on the prediction of each of the plurality of SML classifiers
However, Leon (Leon addresses the issue of using different ensemble voting methods in conjunction with a plurality of machine learning classifiers) teaches:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
based on the prediction of each of the plurality of SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the determination as to whether a data record is a duplicate or not through the use of predictions and a confidence factor as disclosed by Dirac to use the plurality of machine learning classifiers as disclosed by Leon. The combination would have been obvious because a person of ordinary skill in the art would be motivated to achieve “improvement of classification accuracy” through the use of ensemble methods (Leon, Introduction second paragraph).
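For illustration only, the combination discussed above can be sketched as a plurality of classifiers that each predict a category individually, with the votes aggregated into a confidence factor for each category. This is a minimal sketch under stated assumptions: the three rule-based functions are hypothetical stand-ins for trained SML classifiers, the sample strings are assumed data, and using the vote fraction as the confidence factor is one assumed aggregation, not the method of Dirac, Leon, or the claims.

```python
from collections import Counter

CATEGORIES = ("duplicate", "non-duplicate")


def confidence_factors(votes):
    """Return the fraction of classifier votes received by each category."""
    if not votes:
        raise ValueError("at least one classifier vote is required")
    counts = Counter(votes)
    return {category: counts[category] / len(votes) for category in CATEGORIES}


# Hypothetical stand-ins for trained classifiers; each predicts individually.
def exact_match(current, reference):
    return "duplicate" if current == reference else "non-duplicate"


def case_insensitive_match(current, reference):
    return "duplicate" if current.lower() == reference.lower() else "non-duplicate"


def same_length(current, reference):
    return "duplicate" if len(current) == len(reference) else "non-duplicate"


classifiers = [exact_match, case_insensitive_match, same_length]

# Each classifier votes on the current data with respect to the reference data.
votes = [clf("ACME Corp", "acme corp") for clf in classifiers]
factors = confidence_factors(votes)
```

Here two of the three stand-in classifiers vote "duplicate", so the duplicate category receives the larger confidence factor.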
Dirac fails to explicitly teach:
facilitating learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor
However, Russell (Russell addresses the issue of classification of data records) teaches:
facilitating learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor (Russell, Para. [0028] discloses “In at least one of the various embodiments, if the number of classification errors exceeds one or more defined thresholds, additional actions may be performed. In at least one of the various embodiments, one or more of the classifiers may be tuned and/or modified based on data corresponding to one or more observed classification errors” (threshold is minimum confidence factor) and Para. [0093] discloses “In some embodiments, the system may employ fully supervised training… Any number of supervised learning algorithms may be used to analyze the training data and produce a function that is stored as Model(s)” and Para. [0111] discloses “In other embodiment, the DLNN model may be arranged to re-train if a number of detected classification errors (e.g., false positive, label conflicts, or the like) exceeds a defined threshold.” and Fig. 6 steps 608 and 610 (Classifiers are modified/tuned/learned/retrained upon a determination that the number of label conflicts/false positives exceeds a threshold. This indicates a low level of confidence, such that the classifiers need to be subsequently retrained))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the data management system as disclosed by Dirac to retrain classifiers as disclosed by Russell. The combination would have been obvious because a person of ordinary skill in the art would be motivated to increase the 

As per claim 2, the combination of Dirac, Leon, and Russell as shown above teaches the method as claimed in claim 1, Dirac further teaches:
converting format of the data to a predefined format of [[the plurality of SML classifiers]] (Dirac, Para [0004] discloses “For many machine learning problems, transformations may have to be applied on various input data variables before the data can be used effectively to train models” (transforming (converting) data to a format that is acceptable to various SML classifiers))
Leon further teaches:
the plurality of SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Same motivation to combine Dirac and Leon as claim 1

As per claim 6, the combination of Dirac, Leon, and Russell as shown above teaches the method as claimed in claim 1, Dirac further teaches:
wherein the current data is determined to be duplicate data when the confidence factor of the duplicate data category is greater than the confidence factor of the non-duplicate data category (Dirac, Para. [0349] discloses a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set. The probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate. Para [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs” where Para [0351] discloses “if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold (confidence factor), the system can make a determination as to whether data is duplicate))

As per claim 7, the combination of Dirac, Leon, and Russell as shown above teaches the method as claimed in claim 1, Dirac further teaches:
wherein the current data is determined to be non-duplicate data when the confidence factor of the non-duplicate data category is greater than the confidence factor of the duplicate data category (Dirac, Para. [0349] discloses a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set. The probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate. Para [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs” where Para [0351] discloses “if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold (confidence factor), the system can make a determination as to whether data is non-duplicate))
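For illustration only, the comparison recited in claims 6 and 7 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the handling of equal confidence factors is an assumption, since neither the claims as quoted above nor the cited passages of Dirac address a tie.

```python
def determine_category(cf_duplicate, cf_non_duplicate):
    """Pick the category whose confidence factor is greater."""
    if cf_duplicate > cf_non_duplicate:
        return "duplicate"
    if cf_non_duplicate > cf_duplicate:
        return "non-duplicate"
    # Assumed behavior: a tie is left undetermined rather than forced.
    return "undetermined"


# Assumed confidence factors for two hypothetical records.
first = determine_category(0.8, 0.2)
second = determine_category(0.3, 0.7)
```

The sketch shows only the direct comparison of the two confidence factors; it does not model the threshold criterion of Dirac on which the rejection relies.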

As per claim 10, Dirac teaches a data management system for managing data of an entity, comprising:
a processor (Dirac, Para. [0372] discloses “FIG. 76 illustrates such a general-purpose computing device 9000. In the illustrated embodiment, computing device 9000 includes one or more processors 9010 coupled to a system memory”)
and a memory communicatively coupled to the processor, wherein the memory stores processors instructions, which, on execution, causes the processors to: (Dirac, Para. [0372] discloses “FIG. 76 illustrates such a general-purpose computing device 9000. In the illustrated embodiment, computing device 9000 includes one or more processors 9010 coupled to a system memory”, Para [0373] discloses “Processors 9010 may be any suitable processors capable of executing instructions” and Para [0374] discloses “System memory 9020 may be configured to store instructions and data accessible by processor(s)”)
receive data associated with an entity from a data source, wherein the data comprises a current data and a reference data (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” (current data is DS2 and reference data is DS1))
predict a category of the current data to be one of, duplicate data and non-duplicate data, with respect to the reference data, [[using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually]] (Dirac, Para. [0349] discloses “In the depicted embodiment, a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set” and Fig. 76 discloses Processor 9010n)
generate a confidence factor of the duplicate data category and the non-duplicate data category [[based on the prediction of each of the plurality of SML classifiers]] (Dirac, Para. [0349] discloses “In some embodiments the probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate”)
determine the current data to be one of, the duplicate data and the non-duplicate data based on the confidence factor (Dirac, Para. [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs.” Para [0351] further discloses “In at least some embodiments, if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold criterion (confidence factor), the system can determine whether or not a data record is a duplicate through a responsive action))
provide instructions to a system based on the determination of the current data to be one of the duplicate data and the non-duplicate data to manage redundant data (Dirac, Para [0351] discloses “one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g., clients may be sent warning messages indicating the possibility of duplicates, likely duplicates may be removed or deleted from the test data set 7004, a machine learning job that involves the use of the test data may be suspended, canceled or abandoned, and so on.” and Para [0365] discloses “in response to the identification of potential or likely duplicates within a data set, the MLS may suspend, abandon or cancel a machine learning job which involves the use of the data set or is otherwise associated with the data set.”)
Dirac fails to explicitly teach:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually
based on the prediction of each of the plurality of SML classifiers
However, Leon teaches:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
based on the prediction of each of the plurality of SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Same motivation to combine Dirac and Leon as claim 1
Dirac fails to explicitly teach:
facilitate learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor
However, Russell teaches:
facilitate learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor (Russell, Para. [0028] discloses “In at least one of the various embodiments, if the number of classification errors exceeds one or more defined thresholds, additional actions may be performed. In at least one of the various embodiments, one or more of the classifiers may be tuned and/or modified based on data corresponding to one or more observed classification errors” (threshold is minimum confidence factor) and Para. [0093] discloses “In some embodiments, the system may employ fully supervised training… Any number of supervised learning algorithms may be used to analyze the training data and produce a function that is stored as Model(s)” and Para. [0111] discloses “In other embodiment, the DLNN model may be arranged to re-train if a number of detected classification errors (e.g., false positive, label conflicts, or the like) exceeds a defined threshold.” and Fig. 6 steps 608 and 610 (Classifiers are modified/tuned/learned/retrained upon a determination that the number of label conflicts/false positives exceeds a threshold. This indicates a low level of confidence, such that the classifiers need to be subsequently retrained))
Same motivation to combine Dirac and Russell as claim 1

As per claim 11, the combination of Dirac, Leon, and Russell as shown above teaches the data management system as claimed in claim 10, Dirac further teaches:
wherein the processor converts format of the data to a predefined format of [[the plurality of SML classifiers]] (Dirac, Para [0004] discloses “For many machine learning problems, transformations may have to be applied on various input data variables before the data can be used effectively to train models” (transforming (converting) data to a format that is acceptable to various SML classifiers))
Leon further teaches:
the plurality of SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))


As per claim 15, the combination of Dirac, Leon, and Russell as shown above teaches the data management system as claimed in claim 10, Dirac further teaches:
wherein the processor is configured to determine the current data to be duplicate data, when the confidence factor of the duplicate data category is greater than the confidence factor of the non-duplicate data category (Dirac, Para. [0349] discloses a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set. The probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate. Para [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs” where Para [0351] discloses “if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold (confidence factor), the system can make a determination as to whether data is duplicate))

As per claim 16, the combination of Dirac, Leon, and Russell as shown above teaches the data management system as claimed in claim 10, Dirac further teaches:
(Dirac, Para. [0349] discloses a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set. The probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate. Para [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs” where Para [0351] discloses “if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold (confidence factor), the system can make a determination as to whether data is non-duplicate))

As per claim 19, Dirac teaches:
a non-transitory computer readable medium including instructions stored thereon that when processed by at least one processor causes a data management system to perform an operation comprising: (Dirac, Para [0377] discloses “A non-transitory computer-accessible storage medium … may be included in some embodiments of computing device 9000 as system memory” (as shown in claim 10, the system memory has instructions which a processor can execute))
receiving data associated with an entity from a data source, wherein the data comprises a current data and a reference data (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” and Fig. 76 discloses Processor 9010n (current data is DS2 and reference data is DS1))
predicting a category of the current data to be one of, duplicate data and non-duplicate data, with respect to the reference data, [[using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually]] (Dirac, Para. [0349] discloses “In the depicted embodiment, a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set” and Fig. 76 discloses Processor 9010n)
generating a confidence factor of the duplicate data category and the non-duplicate data category [[based on the prediction of each of the plurality of SML classifiers]] (Dirac, Para. [0349] discloses “In some embodiments the probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate”) 
determining the current data to be one of, the duplicate data and the non-duplicate data based on the confidence factor (Dirac, Para. [0350] discloses “The duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs.” Para [0351] further discloses “In at least some embodiments, if the duplication metric 7040 meets a threshold criterion…one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g…” (using the threshold criterion (confidence factor), the system can determine whether or not a data record is a duplicate through a responsive action))
providing instructions to a system based on the determination of the current data to be one of the duplicate data and the non-duplicate data to manage redundant data (Dirac, Para [0351] discloses “one or more duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments—e.g., clients may be sent warning messages indicating the possibility of duplicates, likely duplicates may be removed or deleted from the test data set 7004, a machine learning job that involves the use of the test data may be suspended, canceled or abandoned, and so on.” and Para [0365] discloses “in response to the identification of potential or likely duplicates within a data set, the MLS may suspend, abandon or cancel a machine learning job which involves the use of the data set or is otherwise associated with the data set.”)
Dirac fails to explicitly teach:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually
based on the prediction of each of the plurality of SML classifiers
However, Leon teaches:
using a plurality of Supervised Machine Learning (SML) classifiers, wherein each of the plurality of SML classifiers predicts the category of the data individually (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
based on the prediction of each of the plurality of SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Same motivation to combine Dirac and Leon as claim 1
Dirac fails to explicitly teach:
facilitating learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor
However, Russell teaches:
facilitating learning for one or more SML classifiers of the plurality of SML classifiers, associated with the category of the data having a minimum confidence factor (Russell, Para. [0028] discloses “In at least one of the various embodiments, if the number of classification errors exceeds one or more defined thresholds, additional actions may be performed. In at least one of the various embodiments, one or more of the classifiers may be tuned and/or modified based on data corresponding to one or more observed classification errors” (threshold is minimum confidence factor) and Para. [0093] discloses “In some embodiments, the system may employ fully supervised training… Any number of supervised learning algorithms may be used to analyze the training data and produce a function that is stored as Model(s)” and Para. [0111] discloses “In other embodiment, the DLNN model may be arranged to re-train if a number of detected classification errors (e.g., false positive, label conflicts, or the like) exceeds a defined threshold.” and Fig. 6 steps 608 and 610 (Classifiers are modified/tuned/learned/retrained upon a determination that the number of label conflicts/false positives exceeds a threshold. This indicates a low level of confidence, as the classifiers need to be subsequently retrained))
Same motivation to combine Dirac and Russell as claim 1

Claims 3-5 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Dirac, in view of Leon, further in view of Russell, and further in view of “Supervised Learning for Detection of Duplicates in Genomic Sequence” to Chen, et al. (hereinafter, “Chen”)
As per claim 3, the combination of Dirac, Leon, and Russell as shown above teaches the method as claimed in claim 1, Dirac further teaches:
wherein [[the plurality of SML classifiers are trained based on]] a plurality of master datasets associated with the entity [[analyzed by one or more data experts as duplicate and non-duplicate]] (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” (current data is DS2 and reference data is DS1))
Leon further teaches:
the plurality of SML classifiers are trained based on (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
The combination of Dirac and Leon fails to explicitly teach:
analyzed by one or more data experts as duplicate and non-duplicate
However, Chen (Chen addresses the issue of finding duplicates in genomic databases) teaches:
analyzed by one or more data experts as duplicate and non-duplicate (Chen, end of page 3 discloses “best way to understand duplicates is via expert curation. Human review— experts checking additional resources, and applying their experience and intuition—can best decide whether a pair is a duplicate, particularly for pairs whose identity cannot be easily determined automatically”)
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing of the claimed invention, to modify Dirac as modified to use the expert curation analysis as disclosed by Chen. The combination would have been obvious because a person of ordinary skill in the art would be motivated to have a more accurate set of predictions as the data has been thoroughly reviewed before being used to train models.

As per claim 4, the combination of Dirac, Leon, Russell and Chen as shown above teaches the method as claimed in claim 3, Dirac further teaches:
evaluating [[the plurality of trained SML classifiers]] based on one or more metrics and a exploratory visualization technique (Dirac, Para [0253] discloses “Any combination of a variety of prediction quality metrics may be identified by the MLS component for different types of machine learning problems”, and Para. [0314] discloses “Such an interactive graphical interface, which may for example be implemented via a collection of web sites or web pages (e.g., pages of a web-based MLS console), or via standalone graphical user interface (GUI) tools, may enable users of the MLS to browse or explore visualizations of results of various model executions…”)
Leon further teaches:
the plurality of trained SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing of the claimed invention, to modify Dirac as modified to use the plurality of SM 

As per claim 5, the combination of Dirac, Leon, Russell and Chen as shown above teaches the method as claimed in claim 4, Dirac further teaches:
wherein the one or more metrics comprises accuracy metrics, precision metrics, recall metrics and F1-score metric which is a combination of precision and recall metrics (Dirac, Para. [0253] discloses “Any combination of a variety of prediction quality metrics may be identified by the MLS component for different types of machine learning problems, such as an AUC (area under curve) metric, an accuracy metric, a recall metric, a sensitivity metric, a true positive rate, a specificity metric, a true negative rate, a precision metric, a false positive rate, a false negative rate, an F1 score, a coverage metric, an absolute percentage error metric, or a squared error metric”)

As per claim 12, the combination of Dirac, Leon, and Russell as shown above teaches the data management system as claimed in claim 10, Dirac further teaches:
wherein the processor is configured to [[train the plurality of SML classifiers based on]] a plurality of master datasets associated with the entity, [[analyzed by one or more data experts as duplicate and non-duplicate]] (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” (current data is DS2 and reference data is DS1))
Leon further teaches:
train the plurality of SML classifiers based on (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
The combination of Dirac and Leon fails to explicitly teach:
analyzed by one or more data experts as duplicate and non-duplicate
However, Chen teaches:
analyzed by one or more data experts as duplicate and non-duplicate (Chen, end of page 3 discloses “best way to understand duplicates is via expert curation. Human review— experts checking additional resources, and applying their experience and intuition—can best decide whether a pair is a duplicate, particularly for pairs whose identity cannot be easily determined automatically”)
Same motivation to combine Dirac, Leon, Russell, and Chen as claim 3

As per claim 13, the combination of Dirac, Leon, Russell and Chen as shown above teaches the data management system as claimed in claim 12, Dirac further teaches:
wherein the processor is configured to evaluate [[the plurality of trained SML classifiers]] based on one or more metrics and a data exploratory visualization technique (Dirac, Para [0253] discloses “Any combination of a variety of prediction quality metrics may be identified by the MLS component for different types of machine learning problems”, and Para. [0314] discloses “Such an interactive graphical interface, which may for example be implemented via a collection of web sites or web pages (e.g., pages of a web-based MLS console), or via standalone graphical user interface (GUI) tools, may enable users of the MLS to browse or explore visualizations of results of various model executions…”)
Leon further teaches:
the plurality of trained SML classifiers (Leon, Abstract discloses “Bagging is a popular method used to increase the accuracy of classification, by training a set of classifiers on slightly different datasets and aggregating their output by voting” and Introduction second paragraph discloses “Ensemble methods have been proven to be a good approach for the improvement of classification accuracy. Among the most popular methods, one can mention bagging…” (Ensemble method uses plurality of classifiers where each outputs a vote (prediction) regarding a category of the data. Ensemble methods are a class of supervised machine learning, so because bagging is a subclass of ensemble methods, bagging itself is a supervised machine learning method))
Same motivation to combine Dirac, Leon, Russell, and Chen as claim 4

As per claim 14, the combination of Dirac, Leon, Russell and Chen as shown above teaches the data management system as claimed in claim 13, Dirac further teaches:
wherein the one or more metrics comprises accuracy metrics, precision metrics, recall metrics and F1-score metric which is a combination of precision and recall metrics (Dirac, Para. [0253] discloses “Any combination of a variety of prediction quality metrics may be identified by the MLS component for different types of machine learning problems, such as an AUC (area under curve) metric, an accuracy metric, a recall metric, a sensitivity metric, a true positive rate, a specificity metric, a true negative rate, a precision metric, a false positive rate, a false negative rate, an F1 score, a coverage metric, an absolute percentage error metric, or a squared error metric”)

Claims 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Dirac, in view of Leon, further in view of Russell, and further in view of U.S. Pub. No. US 20130073533 A1 to Hickey, et al. (hereinafter, “Hickey”)
As per claim 20, the combination of Dirac, Leon, and Russell as shown above teaches the method of claim 1, Dirac further teaches:
a master dataset and a corresponding source dataset (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” (Master dataset being DS2 and source dataset being DS1))
The combination of Dirac, Leon, and Russell fails to explicitly teach:
determining a similarity score based on a comparison of information between [[a master dataset and a corresponding source dataset]], wherein the similarity score comprises a numerical value in a range between 0 and 1
However, Hickey (Hickey addresses the issue of determining a similarity between electronic data records) teaches:
determining a similarity score based on a comparison of information between [[a master dataset and a corresponding source dataset]], wherein the similarity score comprises a numerical value in a range between 0 and 1 (Hickey, Para. [0016] discloses “One embodiment of the present invention is a computational system for comparing the electronic-data representations of two projects to produce a similarity metric that expresses a computed similarity of the two projects. In one embodiment of the present invention, the computed similarity is a real value within the range [0, 1].”)
 Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing of the claimed invention, to modify Dirac as modified to determine similarity scores between data as disclosed by Hickey. The combination would have been obvious 

As per claim 21, the combination of Dirac, Leon, and Russell as shown above teaches the data management system of claim 10, Dirac further teaches:
a master dataset and a corresponding source dataset (Dirac, Para. [0091] discloses “the MLS (machine learning service) may include… storage devices that are used to store input data sets… and the network pathways used for transferring client input data and results” and Para. [0346] discloses “MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model)” (Master dataset being DS2 and source dataset being DS1))
The combination of Dirac, Leon, and Russell fails to explicitly teach:
wherein the processor is configured to determine a similarity score based on a comparison of information between [[a master dataset and a corresponding source dataset]], wherein the similarity score comprises a numerical value in a range between 0 and 1
However, Hickey teaches:
wherein the processor is configured to determine a similarity score based on a comparison of information between [[a master dataset and a corresponding source dataset]], wherein the similarity score comprises a numerical value in a range between 0 and 1 (Hickey, Para. [0016] discloses “One embodiment of the present invention is a computational system for comparing the electronic-data representations of two projects to produce a similarity metric that expresses a computed similarity of the two projects. In one embodiment of the present invention, the computed similarity is a real value within the range [0, 1].” and Fig. 9 discloses processors)
Same motivation to combine Dirac, Leon, Russell, and Hickey as claim 20
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAMZA RAZZAQ MUGHAL whose telephone number is 571-272-8833. The examiner can normally be reached on M-TR from 7:30 to 5:00.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ALEXEY SHMATOV, can be reached at telephone number 571-270-3428. The fax 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

/H.R.M./Examiner, Art Unit 2123

/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123