Faculty Publications

How Well Does Multiple OCR Error Correction Generalize?

Keywords

Historical Documents, Optical Character Recognition, OCR Error Correction, Ensemble Methods

Abstract

As the digitization of historical documents, such as newspapers, becomes more common, the need of the archive patron for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: 1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) errors on disparate data sets, including a new synthetic training set, 2. enhancing the correction algorithm with novel features, and 3. assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing the complexity of both the training process and the learned correction model.

Original Publication Citation

Lund, W., Ringger, E. K., & Walker, D. W. (2014) How Well Does Multiple OCR Error Correction Generalize? In Proceedings of The 20th Document Recognition and Retrieval (DRR 2014). San Francisco, Calif.

BYU ScholarsArchive Citation

Lund, William B.; Ringger, Eric K.; and Walker, Daniel D., "How Well Does Multiple OCR Error Correction Generalize?" (2014). Faculty Publications. 1647.
https://scholarsarchive.byu.edu/facpub/1647

Document Type

Peer-Reviewed Article

Publication Date

2014

Permanent URL

http://hdl.lib.byu.edu/1877/3559

Publisher

Society of Photo-Optical Instrumentation Engineers

Language

English

College

Harold B. Lee Library

Copyright Status

Copyright 2014 Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited. doi: 10.1117/12.2042502

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons, Library and Information Science Commons
