Abstract

The usefulness of digitized documents is directly related to the quality of the extracted text. Optical Character Recognition (OCR) has reached a point where well-formatted and clean machine- printed documents are easily recognizable by current commercial OCR products; however, older or degraded machine-printed documents present problems to OCR engines resulting in word error rates (WER) that severely limit either automated or manual use of the extracted text. Major archives of historical machine-printed documents are being assembled around the globe, requiring an accurate transcription of the text for the automated creation of descriptive metadata, full-text searching, and information extraction. Given document images to be transcribed, ensemble recognition methods with multiple sources of evidence from the original document image and information sources external to the document have been shown in this and related work to improve output. This research introduces new methods of evidence extraction, feature engineering, and evidence combination to correct errors from state-of-the-art OCR engines. This work also investigates the success and failure of ensemble methods in the OCR error correction task, as well as the conditions under which these ensemble recognition methods reduce the Word Error Rate (WER), improving the quality of the OCR transcription, showing that the average document word error rate can be reduced below the WER of a state-of-the-art commercial OCR system by between 7.4% and 28.6% depending on the test corpus and methods. This research on OCR error correction contributes within the larger field of ensemble methods as follows. Four unique corpora for OCR error correction are introduced: The Eisenhower Communiqués, a collection of typewritten documents from 1944 to 1945; The Nineteenth Century Mormon Articles Newspaper Index from 1831 to 1900; and two synthetic corpora based on the Enron (2001) and the Reuters (1997) datasets. The Reverse Dijkstra Heuristic is introduced as a novel admissible heuristic for the A* exact alignment algorithm. The impact of the heuristic is a dramatic reduction in the number of nodes processed during text alignment as compared to the baseline method. From the aligned text, the method developed here creates a lattice of competing hypotheses for word tokens. In contrast to much of the work in this field, the word token lattice is created from a character alignment, preserving split and merged tokens within the hypothesis columns of the lattice. This alignment method more explicitly identifies competing word hypotheses which may otherwise have been split apart by a word alignment. Lastly, this research explores, in order of increasing contribution to word error rate reduction: voting among hypotheses, decision lists based on an in-domain training set, ensemble recognition methods with novel feature sets, multiple binarizations of the same document image, and training on synthetic document images.

Degree

PhD

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/

Date Submitted

2014-04-03

Document Type

Dissertation

Handle

http://hdl.lib.byu.edu/1877/etd6915

Keywords

historical document recognition, optical character recognition, OCR, OCR error correction, multiple sequence alignment, MSA, text alignment, progressive alignment, machine learning

Language

English

Share

COinS