Text analysis is a significant branch of natural language processing, and includes many different sub-fields such as topic modeling, document classification, and sentiment analysis. Unsurprisingly, those who do text analysis are concerned with the runtime of their algorithms. Some of these algorithms have runtimes that depend jointly on the size of the corpus being analyzed, as well as the size of that corpus's vocabulary. Trivially, a user may reduce the amount of data they feed into their model to speed it up, but we assume that users will be hesitant to do this as more data tends to lead to better model quality. On the other hand, when the runtime also depends on the vocabulary of the corpus, a user may instead modify the vocabulary to attain a faster runtime. Because elements of the vocabulary also add to model quality, this puts users into the position of needing to modify the corpus vocabulary in order to reduce the runtime of their algorithm while maintaining model quality. To this end, we look at the relationship between model quality and runtime for text analysis by looking at the effect that current techniques in vocabulary reduction have on algorithmic runtime and comparing that with their effect on model quality. Despite the fact that this is an important relationship to investigate, it appears little work has been done in this area. We find that most preprocessing methods do not have much of an effect on more modern algorithms, but proper rare word filtering gives the best results in the form of significant runtime reductions together with slight improvements in accuracy and a vocabulary size that scales efficiently as we increase the size of the data.



College and Department

Physical and Mathematical Sciences; Computer Science



Date Submitted


Document Type





document classification, text preprocessing, vocabulary reduction, nlp