Keywords
Part-of-speech annotated corpus, Active learning, Maximum Entropy Markov Model tagger, Query by Uncertainty (QBU), Query by Committee (QBC)
Abstract
In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.
Original Publication Citation
Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi and Deryle Lonsdale (2007). Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation; Proceedings of the ACL 2007 Linguistic Annotation Workshop; Prague, Czech Republic; June 2007, Association for Computational Linguistics, pp. 101-108.
BYU ScholarsArchive Citation
Lonsdale, Deryle W.; Ringger, Eric K.; McClanahan, Peter J.; Haertel, Robbie A.; Busby, George; Carmen, Marc A.; Carroll, James; and Seppi, Kevin, "Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation" (2007). Faculty Publications. 6844.
https://scholarsarchive.byu.edu/facpub/6844
Document Type
Conference Paper
Publication Date
2007
Publisher
Association for Computational Linguistics
Language
English
College
Humanities
Department
Linguistics
Copyright Use Information
https://lib.byu.edu/about/copyright/