Keywords
Latent Document Semantics, OCR
Abstract
Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation (LDA) have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality but, in the case of LDA, exhibit failure trends similar to those of models trained on unprocessed OCR output. To our knowledge, this study is the first of its kind.
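The low-frequency-word filtering mentioned in the abstract can be sketched as follows. This is a generic illustration of the preprocessing technique, not the paper's implementation; the function name and the count threshold are assumptions for the example.

```python
from collections import Counter

def filter_low_frequency_words(documents, min_count=2):
    """Drop words occurring fewer than min_count times in the whole
    collection. OCR noise tends to yield rare, garbled tokens, so this
    filter removes much of that noise before topic modeling."""
    counts = Counter(word for doc in documents for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in documents]

# Tokenized toy documents; "tqpic" and "oc4" mimic OCR character errors.
docs = [["topic", "model", "tqpic"],
        ["topic", "model", "ocr"],
        ["model", "topic", "oc4"]]
print(filter_low_frequency_words(docs))
# → [['topic', 'model'], ['topic', 'model'], ['model', 'topic']]
```

In practice the threshold is tuned to the collection size; too aggressive a filter also discards legitimate rare vocabulary.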
Original Publication Citation
Walker, D. D., Lund, W. B., and Ringger, E. K. (2010). Evaluating Models of Latent Document Semantics in the Presence of OCR Errors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 240-250). Cambridge, Mass.
Document Type
Peer-Reviewed Article
Publication Date
2010
Permanent URL
http://hdl.lib.byu.edu/1877/3561
Publisher
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Language
English
College
Harold B. Lee Library
Copyright Use Information
http://lib.byu.edu/about/copyright/