Keywords
corpus-dictionary linkage, lemmatization, machine assistance
Abstract
The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word senses. Such CDL resources are essential for many tasks, including assistance for language learners, linguistic research, philology, and translation. Lemmatization is a common approximation to automating corpus-dictionary linkage, where lemmas stand in for the headwords of an actual dictionary. In our machine-assisted CDL system design, data-driven lemmatization models provide machine assistance to human annotators performing the actual CDL task. This assistance takes the form of pre-annotations that reduce the cost of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for Classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with that of the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, an improvement of 0.86% over Morfette, but at the cost of a longer training time. Error analysis on the models provides guidance on how to apply these models in a machine assistance setting for corpus-dictionary linkage.
Original Publication Citation
Kevin Black, Eric Ringger, Paul Felt, Kevin Seppi, Kristian Heal, and Deryle Lonsdale. 2014. Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3798–3805, Reykjavik, Iceland. European Language Resources Association (ELRA).
BYU ScholarsArchive Citation
Lonsdale, Deryle W.; Black, Kevin; Ringger, Eric K.; Felt, Paul; Seppi, Kevin; and Heal, Kristian, "Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage" (2014). Faculty Publications. 6868.
https://scholarsarchive.byu.edu/facpub/6868
Document Type
Conference Paper
Publication Date
2014
Publisher
European Language Resources Association
Language
English
College
Humanities
Department
Linguistics
Copyright Use Information
https://lib.byu.edu/about/copyright/