Keywords

synthetic document images, OCR, datasets, document degradation models, historical document processing

Abstract

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets with varying levels of noise, created from standard (English) text corpora using an existing document degradation model applied in a novel way. The datasets include OCR output from real OCR engines, including the commercial ABBYY FineReader engine and the open-source Tesseract engine. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiques. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the resulting model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.

Original Publication Citation

Walker, D. D., Lund, W. B., and Ringger, E. K. (2012) A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods. In Proceedings of Document Recognition and Retrieval XIX (DRR 2012), San Francisco, Calif. DOI: 10.1117/12.912203

Document Type

Peer-Reviewed Article

Publication Date

2012

Permanent URL

http://hdl.lib.byu.edu/1877/3560

Publisher

Society of Photo-Optical Instrumentation Engineers

Language

English

College

Harold B. Lee Library
