Abstract
Topic modeling is an effective tool for analyzing the thematic content of large collections of text. However, traditional probabilistic topic modeling is limited to a small number of topics (typically no more than a few hundred). We introduce fine-grained topic models, which have large numbers of nuanced and specific topics. We demonstrate that fine-grained topic models enable use cases not possible with current topic modeling techniques, including an automatic cross-referencing task in which short passages of text are linked to other topically related passages. We do so by leveraging anchor methods, a recent class of topic models based on non-negative matrix factorization in which each topic is anchored by a single word. We explore extensions of the anchor algorithm, including tandem anchors, which relax the restriction that each anchor be a single word. By doing so, we are able to produce anchor-based topic models with thousands of fine-grained topics. We also develop metrics for evaluating token-level topic assignments and use those metrics to improve the accuracy of fine-grained topic models.
Degree
PhD
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
http://lib.byu.edu/about/copyright
BYU ScholarsArchive Citation
Lund, Jeffrey A., "Fine-Grained Topic Models Using Anchor Words" (2018). Theses and Dissertations. 7559.
https://scholarsarchive.byu.edu/etd/7559
Date Submitted
2018-12-20
Document Type
Dissertation
Handle
http://hdl.lib.byu.edu/1877/etd10553
First Advisor
Kevin Seppi
Second Advisor
David Wingate
Third Advisor
William Barrett
Fourth Advisor
Michael Jones
Fifth Advisor
Dennis Ng
Keywords
Topic Modeling, Anchor Words, Cross-reference Generation
Language
English