synthetic document images, OCR, datasets, document degradation models, historical document processing
Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Research into improving the performance of such methods often also requires further annotation of the training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive, so real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets with varying levels of noise, created from standard English text corpora using an existing document degradation model applied in a novel way. The datasets include the OCR output of real OCR engines, including the commercial ABBYY FineReader and the open-source Tesseract. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiques. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the resulting model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
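The abstract reports results as word error rates. As an illustration only, the following minimal Python sketch computes a word error rate in the conventional way: the word-level edit distance between a ground truth transcription and an OCR hypothesis, divided by the number of ground truth words. The function name `word_error_rate`, the whitespace tokenization, and the sample strings are assumptions made for this sketch; the record does not state how the paper tokenizes text or scores errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (case and punctuation are not normalized).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # aligning i reference words with no hypothesis words: i deletions
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word missing from hypothesis
                curr[j - 1] + 1,                  # extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical strings, not taken from any dataset named in the abstract.
    truth = "the quick brown fox"
    ocr_output = "the qu1ck brown"
    print(word_error_rate(truth, ocr_output))
```

Because insertions count as errors, this ratio can exceed 1.0 when the OCR output contains many spurious words.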
Original Publication Citation
Walker, D. D., Lund, W. B., and Ringger, E. K. (2012) A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods. In Proceedings of Document Recognition and Retrieval XIX (DRR 2012), San Francisco, Calif. DOI: 10.1117/12.912203
BYU ScholarsArchive Citation
Walker, Daniel; Lund, William; and Ringger, Eric, "A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods" (2012). Faculty Publications. 1649.
Society of Photo-Optical Instrumentation Engineers
Harold B. Lee Library
Copyright 2012 Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited. DOI: 10.1117/12.912203