data deduplication, learning-based
Rule-based deduplication uses expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires rules that contain a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and their corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and limits its applicability to real-world data sets, and no adequate solution exists for this problem. We propose a novel technique for rule-based deduplication: we apply individual deduplication rules and combine the resulting match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and their corresponding thresholds.
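As a rough illustration of the idea in the abstract, the sketch below applies several rules independently to a pair of records and lets a learned model combine their match scores. Everything specific in it is an assumption made for illustration, not taken from the paper: the record fields (`name`, `zip`, `phone`), the three rules, the made-up labelled pairs, and the use of logistic regression as the fusion learner (the abstract does not say which rules or which learning algorithm the authors use).

```python
import math
from difflib import SequenceMatcher

# Hypothetical rules: each maps a pair of records to a match score in [0, 1].
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

def _digits(s):
    return "".join(ch for ch in s if ch.isdigit())

def phone_rule(a, b):
    return 1.0 if _digits(a["phone"]) == _digits(b["phone"]) else 0.0

RULES = [name_rule, zip_rule, phone_rule]

def rule_scores(a, b):
    """Apply every rule independently; no per-rule threshold is set by hand."""
    return [rule(a, b) for rule in RULES]

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(X, y, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent.

    Logistic regression is a stand-in learner chosen for this sketch.
    """
    w = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + bias) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias

def fused_score(model, a, b):
    """Combine the individual rule scores into one duplicate probability."""
    w, bias = model
    z = sum(wj * sj for wj, sj in zip(w, rule_scores(a, b))) + bias
    return _sigmoid(z)

def rec(name, zip_code, phone):
    return {"name": name, "zip": zip_code, "phone": phone}

# Made-up labelled pairs (1 = duplicate, 0 = distinct), for illustration only.
LABELLED = [
    (rec("John Smith", "84602", "801-555-0100"),
     rec("Jon Smith", "84602", "(801) 555-0100"), 1),
    (rec("Mary Jones", "84322", "435-555-0111"),
     rec("Mary K. Jones", "84322", "4355550111"), 1),
    (rec("Robert Brown", "84601", "801-555-0122"),
     rec("Bob Brown", "84601", "801-555-0122"), 1),
    (rec("John Smith", "84602", "801-555-0100"),
     rec("Mary Jones", "84322", "435-555-0111"), 0),
    (rec("John Smith", "84602", "801-555-0100"),
     rec("John Smyth", "84322", "801-555-0199"), 0),
    (rec("Ann Lee", "84602", "801-555-0133"),
     rec("Tom Ray", "84602", "801-555-0144"), 0),
]

X = [rule_scores(a, b) for a, b, _ in LABELLED]
y = [label for _, _, label in LABELLED]
MODEL = train_fusion(X, y)
```

Under these assumptions, a pair would be flagged as a duplicate when `fused_score(MODEL, a, b)` exceeds 0.5. The weight given to each rule and the decision boundary are learned from the labelled pairs, which is the sense in which fusion can reduce hand tuning of rules and thresholds.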
Original Publication Citation
Jared Dinerstein, Sabra Dinerstein, Parris K. Egbert, Stephen W. Clyde. "Learning-based Fusion for Data Deduplication", In Proceedings of The Seventh International Conference on Machine Learning and Applications, pp. 66-71, 2008. IEEE Computer Society.
BYU ScholarsArchive Citation
Dinerstein, Sabra; Egbert, Parris K.; Clyde, Stephen W.; and Dinerstein, Jared, "Learning-based Fusion for Data Deduplication" (2008). All Faculty Publications. 901.
Physical and Mathematical Sciences
© 2008 Institute of Electrical and Electronics Engineers