Abstract
We all have access to large collections of digital text documents, which are useful only if we can make sense of them and distill the important information they contain. Good document clustering algorithms that automatically organize such information in meaningful ways can make a difference in how effectively we use it. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. Specifically, we use the EM algorithm and Gibbs sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement on two of the three data sets.
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
http://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Davis, Aaron Samuel, "Bisecting Document Clustering Using Model-Based Methods" (2009). Theses and Dissertations. 1938.
https://scholarsarchive.byu.edu/etd/1938
Date Submitted
2009-12-09
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd3332
Keywords
document clustering, text mining, model-based
Language
English