Theses and Dissertations

Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets

Wilson Murray Fearn, Brigham Young University

Abstract

Text analysis is a significant branch of natural language processing, and includes many different sub-fields such as topic modeling, document classification, and sentiment analysis. Unsurprisingly, those who do text analysis are concerned with the runtime of their algorithms. Some of these algorithms have runtimes that depend jointly on the size of the corpus being analyzed, as well as the size of that corpus's vocabulary. Trivially, a user may reduce the amount of data they feed into their model to speed it up, but we assume that users will be hesitant to do this as more data tends to lead to better model quality. On the other hand, when the runtime also depends on the vocabulary of the corpus, a user may instead modify the vocabulary to attain a faster runtime. Because elements of the vocabulary also add to model quality, this puts users into the position of needing to modify the corpus vocabulary in order to reduce the runtime of their algorithm while maintaining model quality. To this end, we look at the relationship between model quality and runtime for text analysis by looking at the effect that current techniques in vocabulary reduction have on algorithmic runtime and comparing that with their effect on model quality. Despite the fact that this is an important relationship to investigate, it appears little work has been done in this area. We find that most preprocessing methods do not have much of an effect on more modern algorithms, but proper rare word filtering gives the best results in the form of significant runtime reductions together with slight improvements in accuracy and a vocabulary size that scales efficiently as we increase the size of the data.

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

https://lib.byu.edu/about/copyright/

BYU ScholarsArchive Citation

Fearn, Wilson Murray, "Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets" (2019). Theses and Dissertations. 9053.
https://scholarsarchive.byu.edu/etd/9053

Date Submitted

2019-12-05

Document Type

Thesis

Handle

http://hdl.lib.byu.edu/1877/etd11691

Keywords

document classification, text preprocessing, vocabulary reduction, nlp

Language

English

Included in

Physical Sciences and Mathematics Commons
