corpus-dictionary linkage, lemmatization, machine assistance


The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word senses. Such CDL resources are essential for many tasks, including language learning, linguistic research, philology, and translation. Lemmatization is a common approximation to automated corpus-dictionary linkage, in which lemmas stand in for the headwords of an actual dictionary. In our machine-assisted CDL system design, data-driven lemmatization models assist human annotators performing the actual CDL task by supplying pre-annotations intended to reduce the cost of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with that of the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, an improvement of 0.86% over Morfette, but at the cost of a longer training time. Error analysis of the models provides guidance on how to apply them in a machine-assistance setting for corpus-dictionary linkage.
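As a rough illustration of the kind of comparison the abstract reports, the sketch below computes token-level lemmatization accuracy for two sets of predictions against gold lemmas. It is a minimal sketch under stated assumptions: the function name `lemma_accuracy`, the exact-match scoring rule, and the toy English data are all hypothetical and are not taken from the paper, which may use a different evaluation protocol.

```python
from typing import Sequence


def lemma_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of tokens whose predicted lemma equals the gold lemma.

    Assumption: exact string match per token; the paper's scoring may differ.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data (English stand-ins, not the paper's Syriac corpus).
gold = ["be", "run", "good", "mouse"]
model_a = ["be", "run", "good", "mouse"]      # all four lemmas correct
model_b = ["be", "running", "good", "mouse"]  # one lemma wrong

acc_a = lemma_accuracy(model_a, gold)
acc_b = lemma_accuracy(model_b, gold)

# Absolute difference between the two systems, in percentage points.
diff_points = (acc_a - acc_b) * 100
print(f"A: {acc_a:.2%}  B: {acc_b:.2%}  difference: {diff_points:.2f} points")
```

In a machine-assistance setting, per-token mismatches like the one in `model_b` are the cases a human annotator would need to correct, so accuracy gives a first estimate of how many pre-annotations can be accepted unchanged.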

Original Publication Citation

Kevin Black, Eric Ringger, Paul Felt, Kevin Seppi, Kristian Heal, and Deryle Lonsdale. 2014. Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3798–3805, Reykjavik, Iceland. European Language Resources Association (ELRA).

Document Type

Conference Paper

Publication Date

2014

Publisher

European Language Resources Association

University Standing at Time of Publication

Associate Professor

Included in

Linguistics Commons