Abstract

Identifying the type of a scanned form greatly facilitates processing, including automated field segmentation and field recognition. Unlike most existing techniques, we focus on unsupervised type identification, where the set of form types is not known a priori, and on noisy collections that contain very similar document types. This work presents a novel algorithm, CONFIRM (Clustering Of Noisy Form Images using Robust Matching), which simultaneously discovers the types in a collection of forms and assigns each form to a type. CONFIRM matches typeset text and rule lines between forms to create domain-specific features, which we show outperform the Bag of Visual Words (BoVW) features employed by the current state of the art. To scale to large document collections, we use a bootstrap approach to clustering, in which only a small subset of the data is clustered directly and the rest of the data is assigned to the resulting clusters in linear time. We show that CONFIRM reduces average cluster impurity by 44% compared to the state of the art on five collections of historical forms that contain significant noise. We also show competitive performance on the relatively clean NIST tax form collection.
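
The bootstrap approach described in the abstract can be illustrated with a minimal sketch. It assumes each form has already been reduced to a numeric feature vector, and it substitutes a simple greedy "leader" clustering for the direct clustering step. The function names, the `threshold` parameter, and the distance measure are hypothetical choices made for illustration and are not taken from CONFIRM itself.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def cluster_subset(points, threshold):
    """Greedy leader clustering of the small subset.

    Assumption: this is a stand-in for whatever clustering is applied
    directly to the subset; it is not the method used by CONFIRM.
    """
    clusters = []
    for p in points:
        for members in clusters:
            # Join the first cluster whose centroid is close enough.
            if sq_dist(p, centroid(members)) <= threshold ** 2:
                members.append(p)
                break
        else:
            # No cluster was close enough: start a new one.
            clusters.append([p])
    return [centroid(members) for members in clusters]


def bootstrap_cluster(data, subset_size, threshold, seed=0):
    """Cluster a random subset, then assign every item to its nearest center."""
    rng = random.Random(seed)
    subset = rng.sample(data, min(subset_size, len(data)))
    centers = cluster_subset(subset, threshold)
    # One pass over the data: linear in len(data) for a fixed number of centers.
    labels = [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in data
    ]
    return centers, labels
```

In this sketch the expensive step touches only `subset_size` items, while the assignment step makes a single pass over the full collection, which is what keeps the overall cost linear in the collection size once the clusters are fixed.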

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/