As large amounts of digital information become more accessible, the ability to find relevant information effectively grows in importance. Search engines have historically performed well at this task by relying primarily on lexical, word-based measures. Similarly, standard approaches to organizing and categorizing large amounts of text have relied on lexical, word-based measures for grouping and classification. Quite often, however, these processes take no account of semantics, that is, of word meanings. This is perhaps because meaningful similarity is naturally qualitative and thus difficult to incorporate into quantitative processes. In this thesis we formally present a method for computing a quantitative document-level semantic distance, designed to model the degree to which humans would associate two documents with respect to conceptual similarity. We show how this metric can be applied to document retrieval and clustering problems. We conclude that while our metric is not well suited for text indexing, it can improve document retrieval through result set re-ranking and query expansion. We also conclude that it can improve document clustering in distance-based clustering algorithms.
College and Department
Physical and Mathematical Sciences; Computer Science
BYU ScholarsArchive Citation
Davis, Nathan Scott, "An Analysis of Document Retrieval and Clustering Using an Effective Semantic Distance Measure" (2008). Theses and Dissertations. 1600.
Keywords
computational linguistics, semantics, computer science, document clustering, information retrieval