Abstract
The use of digital images of handwritten historical documents has increased in recent years. This has been possible through the Internet, which allows users to access a vast collection of historical documents and makes historical and data research more attainable. However, the insurmountable number of images available in these digital libraries is cumbersome for a single user to read and process. Computers could help read these images through methods known as Optical Character Recognition (OCR), which have had significant success for printed materials but only limited success for handwritten ones. Most of these OCR methods work well only when the images have been preprocessed by getting rid of anything in the image that is not text. This preprocessing step is usually known as binarization. The binarization of images of historical documents that have been affected by degradation and that are of poor image quality is difficult and continues to be a focus of research in the field of image processing. We propose two novel approaches to attempt to solve this problem. One combines recursive Otsu thresholding and selective bilateral filtering to allow automatic binarization and segmentation of handwritten text images. The other adds background normalization and a post-processing step to the algorithm to make it more robust and to work even for images that present bleed-through artifacts. Our results show that these techniques help segment the text in historical documents better than traditional binarization techniques.
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
http://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Nina, Oliver, "Text Segmentation of Historical Degraded Handwritten Documents" (2010). Theses and Dissertations. 2585.
https://scholarsarchive.byu.edu/etd/2585
Date Submitted
2010-08-05
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd3901
Keywords
thresholding, binarization, thresholding of historical documents, text segmentation, binarization algorithm, binarization for OCR
Language
English