Faculty Publications

Learning-based Fusion for Data Deduplication

Keywords

data deduplication, learning-based

Abstract

Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
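The fusion idea in the abstract can be illustrated with a small sketch: each rule scores a pair of records on its own, and a learned model combines the scores into one match decision, so no per-rule threshold is tuned by hand. The abstract does not name the rules or the learner, so everything below is an assumption for illustration only: the three rules (`name_rule`, `zip_rule`, `phone_rule`), the record fields, the toy training pairs, and the use of plain logistic regression as the fusion model.

```python
import math
from difflib import SequenceMatcher

# Hypothetical rules: each returns a match score in [0, 1] for one clue.
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

def phone_rule(a, b):
    digits = lambda s: "".join(ch for ch in s if ch.isdigit())
    return 1.0 if digits(a["phone"]) == digits(b["phone"]) else 0.0

RULES = [name_rule, zip_rule, phone_rule]

def score_vector(a, b):
    """Apply every rule independently; no per-rule threshold is used."""
    return [rule(a, b) for rule in RULES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(X, y, epochs=2000, lr=0.5):
    """Fit logistic-regression weights over rule scores (assumed learner)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - t  # gradient of the log loss w.r.t. the linear score
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def is_duplicate(a, b, model, cutoff=0.5):
    """Fuse the rule scores with the learned weights into one decision."""
    w, bias = model
    x = score_vector(a, b)
    return sigmoid(bias + sum(wi * xi for wi, xi in zip(w, x))) >= cutoff

# Toy labeled pairs (made up for this sketch): 1 = duplicate, 0 = distinct.
r1 = {"name": "John Smith", "zip": "84602", "phone": "801-555-0100"}
r2 = {"name": "Jon Smith", "zip": "84602", "phone": "(801) 555-0100"}
r3 = {"name": "Jane Smith", "zip": "84602", "phone": "801-555-0199"}
r4 = {"name": "Maria Garcia", "zip": "10001", "phone": "212-555-0142"}
r5 = {"name": "Maria Garcia", "zip": "10001", "phone": "2125550142"}
pairs = [(r1, r2, 1), (r1, r3, 0), (r1, r4, 0), (r4, r5, 1), (r3, r4, 0)]

model = train_fusion([score_vector(a, b) for a, b, _ in pairs],
                     [label for _, _, label in pairs])
```

Calling `is_duplicate(record_a, record_b, model)` then classifies an unseen pair from its fused rule scores. In this sketch the learned weights play the role that hand-tuned rule thresholds play in a traditional rule-based system; a real deployment would need far more labeled pairs and a held-out evaluation.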

Original Publication Citation

Jared Dinerstein, Sabra Dinerstein, Parris K. Egbert, Stephen W. Clyde. "Learning-based Fusion for Data Deduplication", In Proceedings of The Seventh International Conference on Machine Learning and Applications, pp. 66-71, 2008. IEEE Computer Society.

BYU ScholarsArchive Citation

Dinerstein, Sabra; Egbert, Parris K.; Clyde, Stephen W.; and Dinerstein, Jared, "Learning-based Fusion for Data Deduplication" (2008). Faculty Publications. 901.
https://scholarsarchive.byu.edu/facpub/901

Document Type

Peer-Reviewed Article

Publication Date

2008-12-11

Permanent URL

http://hdl.lib.byu.edu/1877/2402

Publisher

IEEE

Language

English

College

Physical and Mathematical Sciences

Department

Computer Science

Copyright Status

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons
