Abstract
Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain-overlap, and domains are defined to be synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence, out-of-domain, per the definition). However, many training datasets consist of topically diverse data, and not all data contained in a single dataset are useful for translations of a specific target task. In this study, we propose a new perspective on data quality and topical similarity to enhance SMT performance. Using our data adaptation approach called topic adaptation, we select topically suitable training data corresponding to test data in order to produce better translations. We propose three topic adaptation approaches for the SMT process and investigate the effectiveness in both idealized and realistic settings using large parallel corpora. We measure performance of SMT systems trained on topically similar data and their effectiveness based on BLEU, the widely-used objective SMT performance metric. We show that topic adaptation approaches outperform baseline systems (0.3 – 3 BLEU points) when data selection parameters are carefully determined.
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
http://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Matsushita, Hitokazu, "Data Selection using Topic Adaptation for Statistical Machine Translation" (2015). Theses and Dissertations. 5781.
https://scholarsarchive.byu.edu/etd/5781
Date Submitted
2015-11-01
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd8156
Keywords
topic adaptation, data selection, statistical machine translation
Language
english