Abstract

Most organizations use an increasing number of domain- or organization-specific words and phrases. A translation process, whether human or automated, must be able to use these multilingual terminology collections accurately and efficiently. However, comparatively little has been done to explore the use of vetted terminology as an input to machine translation (MT) for improved results. In fact, no single established process currently exists to integrate terminology into MT as a general practice, and in particular no established process exists for neural machine translation (NMT) to ensure that individual terms are translated consistently with an approved terminology collection. This thesis focuses on the use of tokenization as a method of injecting terminology and of evaluating terminology injection. I use the attention mechanism prevalent in state-of-the-art NMT systems to produce the desired results. Attention vectors play an important part in this method, correctly identifying semantic entities and aligning the tokens that represent them. The methods presented in this thesis use these attention vectors to align the source tokens of the sentence to be translated with the target tokens of the final translation output. Supplied terminology is then injected wherever these alignments correctly identify semantic entities. My methods demonstrate a significant improvement over state-of-the-art results for terminology injection in NMT.
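The core idea can be illustrated with a minimal sketch, under the assumption that each target token is aligned to the source token receiving its highest attention weight, and that an aligned source term found in the terminology collection is replaced by its approved target translation. All names, the toy data, and the argmax alignment rule below are hypothetical illustrations, not the thesis's exact procedure.

```python
import numpy as np

def align_and_inject(src_tokens, tgt_tokens, attention, term_dict):
    """Hypothetical sketch of attention-vector-based terminology injection.

    attention: matrix of shape (len(tgt_tokens), len(src_tokens)) holding the
        attention weights produced by a trained NMT model during decoding.
    term_dict: maps source-language terms to their approved target translations.
    """
    output = list(tgt_tokens)
    for t_idx, weights in enumerate(attention):
        s_idx = int(np.argmax(weights))          # source token this target token attends to most
        src_term = src_tokens[s_idx]
        if src_term in term_dict:
            output[t_idx] = term_dict[src_term]  # substitute the vetted terminology entry
    return output

# Toy usage (illustrative only): a diagonal attention matrix stands in for a real model's output.
src = ["the", "widget", "is", "broken"]
tgt = ["das", "Ger\u00e4t", "ist", "kaputt"]
attn = np.eye(len(tgt), len(src))
terms = {"widget": "Widget"}                     # one approved terminology entry
print(align_and_inject(src, tgt, attn, terms))   # ['das', 'Widget', 'ist', 'kaputt']
```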

Degree

MS

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/

Date Submitted

2018-12-01

Document Type

Thesis

Handle

http://hdl.lib.byu.edu/1877/etd10450

Keywords

attention-vector-based term injection, injection, machine translation, MT, neural machine translation, neural network, NMT, semantics, terminology

Language

English
