Abstract

Probabilistic models of text are a useful tool for analyzing large collections of digital text. For example, Latent Dirichlet Allocation can quickly produce topical summaries of large collections of text documents. Many important use cases of such models involve human interaction during inference. For example, the Interactive Topic Model extends Latent Dirichlet Allocation to incorporate human expertise during inference in order to produce topics that are better suited to individual user needs. However, interactive use cases of probabilistic models of text introduce new constraints on inference: the inference procedure must be not only accurate, but also fast enough to facilitate human interaction. If inference is too slow, the human interaction suffers and the interactive aspect of the probabilistic model becomes less useful. Unfortunately, the most popular inference algorithms in use today either require strong approximations, which can degrade the quality of some models, or require time-consuming sampling. We explore the use of Iterated Conditional Modes, an algorithm that obtains locally optimal maximum a posteriori estimates, as an alternative to popular inference algorithms such as Gibbs sampling or mean field variational inference. The Iterated Conditional Modes algorithm is not only fast enough to facilitate human interaction, but can also produce better maximum a posteriori estimates than sampling. We demonstrate the superior performance of Iterated Conditional Modes on a wide variety of models. First, we apply a DP Mixture of Multinomials model to the problem of web search result clustering, and show that we not only outperform previous methods in clustering quality, but also achieve interactive runtimes when performing inference with Iterated Conditional Modes. We then apply Iterated Conditional Modes to the Interactive Topic Model.
Not only is Iterated Conditional Modes much faster than the previously published Gibbs sampler, but it also incorporates human feedback better during inference, as measured by accuracy on a classification task using the resulting topic model. Finally, we use Iterated Conditional Modes with MomResp, a model for aggregating multiple noisy crowdsourced annotations. Compared with Gibbs sampling, Iterated Conditional Modes is better able to recover ground truth labels from simulated noisy annotations, and runs orders of magnitude faster.
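
To illustrate the general idea behind Iterated Conditional Modes, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a finite mixture of multinomials (rather than the DP mixture named above), symmetric Dirichlet(2) priors, and a round-robin initialization; the function names `parameter_modes`, `doc_score`, `log_joint`, and `icm` are hypothetical. Each variable is set in turn to the mode of its conditional distribution, so the log joint probability never decreases and the procedure stops at a local maximum a posteriori estimate.

```python
import math
from collections import Counter


def parameter_modes(counts, z, num_clusters, vocab_size):
    """Conditional modes of the mixing weights and cluster-word distributions.

    Assumes symmetric Dirichlet(2) priors, so each mode is an add-one
    smoothed relative frequency.
    """
    cluster_sizes = [0] * num_clusters
    word_counts = [[0] * vocab_size for _ in range(num_clusters)]
    for doc_counts, k in zip(counts, z):
        cluster_sizes[k] += 1
        for w, n in doc_counts.items():
            word_counts[k][w] += n
    num_docs = len(counts)
    log_pi = [math.log((cluster_sizes[k] + 1) / (num_docs + num_clusters))
              for k in range(num_clusters)]
    log_theta = []
    for k in range(num_clusters):
        total = sum(word_counts[k]) + vocab_size
        log_theta.append([math.log((c + 1) / total) for c in word_counts[k]])
    return log_pi, log_theta


def doc_score(doc_counts, k, log_pi, log_theta):
    """Log of (mixing weight times multinomial likelihood), up to a constant."""
    return log_pi[k] + sum(n * log_theta[k][w] for w, n in doc_counts.items())


def log_joint(counts, z, log_pi, log_theta):
    """Unnormalized log joint: Dirichlet(2) log priors plus log likelihood."""
    prior = sum(log_pi) + sum(sum(row) for row in log_theta)
    likelihood = sum(doc_score(c, k, log_pi, log_theta)
                     for c, k in zip(counts, z))
    return prior + likelihood


def icm(docs, num_clusters, vocab_size, max_sweeps=50):
    """Cluster documents (lists of integer word ids) by coordinate-wise modes."""
    counts = [Counter(doc) for doc in docs]
    # Assumed deterministic initialization: round-robin cluster assignment.
    z = [d % num_clusters for d in range(len(docs))]
    trace = []
    for _ in range(max_sweeps):
        # Step 1: set the parameters to their conditional mode given z.
        log_pi, log_theta = parameter_modes(counts, z, num_clusters, vocab_size)
        # Step 2: set each assignment to its conditional mode given parameters.
        changed = False
        for d, doc_counts in enumerate(counts):
            best = max(range(num_clusters),
                       key=lambda k: doc_score(doc_counts, k, log_pi, log_theta))
            if best != z[d]:
                z[d] = best
                changed = True
        trace.append(log_joint(counts, z, log_pi, log_theta))
        if not changed:
            break  # No assignment moved: a local optimum has been reached.
    return z, trace


# Toy usage: three documents over words {0, 1} and three over words {2, 3}.
docs = [[0, 1, 0, 1]] * 3 + [[2, 3, 2, 3]] * 3
assignments, trace = icm(docs, num_clusters=2, vocab_size=4)
```

Unlike a Gibbs sampler, which draws each variable from its conditional distribution, this sketch always takes the argmax, so it needs no burn-in or sample averaging. The trade-off is that the result depends on the initialization and is only locally optimal.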

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/