Keywords
data deduplication, learning-based
Abstract
Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
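The fusion idea in the abstract can be illustrated with a minimal sketch: several simple rules each emit a match score for a pair of records, and a learned model combines those scores in place of hand-tuned per-rule thresholds. Everything concrete here is an assumption made for illustration and is not taken from the paper: the record fields (`name`, `zip`, `phone`), the three rules, the toy labeled pairs, and the choice of logistic regression as the fusion learner.

```python
import math
from difflib import SequenceMatcher

# Hypothetical rules: each maps a pair of records to a match score in [0, 1].
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

def phone_rule(a, b):
    digits = lambda s: "".join(ch for ch in s if ch.isdigit())
    return 1.0 if digits(a["phone"]) == digits(b["phone"]) else 0.0

RULES = [name_rule, zip_rule, phone_rule]

def rule_scores(a, b):
    """Apply every rule independently and collect the raw match scores."""
    return [rule(a, b) for rule in RULES]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_fusion(score_vectors, labels, epochs=5000, lr=0.5):
    """Fit logistic-regression weights over rule scores (assumed fusion model)."""
    weights = [0.0] * len(score_vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(score_vectors, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def match_probability(model, a, b):
    """Fused estimate that records a and b refer to the same entity."""
    weights, bias = model
    return sigmoid(bias + sum(w * s for w, s in zip(weights, rule_scores(a, b))))

def rec(name, zip_code, phone):
    return {"name": name, "zip": zip_code, "phone": phone}

# Toy labeled pairs (made up): 1 = duplicate, 0 = distinct.
LABELED_PAIRS = [
    (rec("Jon Smith", "84602", "801-555-0100"),
     rec("John Smith", "84602", "(801) 555-0100"), 1),
    (rec("Acme Corp", "10001", "212-555-0199"),
     rec("ACME Corporation", "10001", "2125550199"), 1),
    (rec("Mary Jones", "30301", "404-555-0123"),
     rec("Mary Jones", "30301", "404-555-0188"), 1),
    (rec("Jon Smith", "84602", "801-555-0100"),
     rec("Ann Lee", "97201", "503-555-0142"), 0),
    (rec("Mary Jones", "30301", "404-555-0123"),
     rec("Mark Johnson", "30301", "404-555-0177"), 0),
    (rec("Acme Corp", "10001", "212-555-0199"),
     rec("Apex Corp", "60601", "312-555-0111"), 0),
]

vectors = [rule_scores(a, b) for a, b, _ in LABELED_PAIRS]
labels = [y for _, _, y in LABELED_PAIRS]
model = train_fusion(vectors, labels)

candidate = (rec("Jon Smith", "84602", "801-555-0100"),
             rec("Jon  Smith", "84602", "801.555.0100"))
print(f"fused match probability: {match_probability(model, *candidate):.3f}")
```

In this sketch the only decision threshold is applied to the fused probability; the relative weight of each rule is learned from labeled pairs rather than tuned by hand, which is the property the abstract attributes to the fused technique.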
Original Publication Citation
Jared Dinerstein, Sabra Dinerstein, Parris K. Egbert, Stephen W. Clyde. "Learning-based Fusion for Data Deduplication", In Proceedings of The Seventh International Conference on Machine Learning and Applications, pp. 66-71, 2008. IEEE Computer Society.
BYU ScholarsArchive Citation
Dinerstein, Sabra; Egbert, Parris K.; Clyde, Stephen W.; and Dinerstein, Jared, "Learning-based Fusion for Data Deduplication" (2008). Faculty Publications. 901.
https://scholarsarchive.byu.edu/facpub/901
Document Type
Peer-Reviewed Article
Publication Date
2008-12-11
Permanent URL
http://hdl.lib.byu.edu/1877/2402
Publisher
IEEE
Language
English
College
Physical and Mathematical Sciences
Department
Computer Science
Copyright Status
© 2008 Institute of Electrical and Electronics Engineers
Copyright Use Information
http://lib.byu.edu/about/copyright/