Abstract
The morphology of Semitic languages such as Arabic and Hebrew is non-concatenative: words are formed by interleaving a root with a template. Human speakers rely on this structure, but it is hidden from neural models that use traditional tokenization algorithms. In this work, I present and evaluate Semitic Root Encoding (SRE), a tokenization method that represents Semitic words with distinct root and template stem tokens, and apply it to neural machine translation (NMT). I evaluate whether tokenization with SRE improves translation quality, whether it allows NMT models to hypothesize unseen word forms needed at inference, and how resistant the resulting NMT models are to generating dubious word stems.
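To make the idea of separate root and template tokens concrete, the following is a minimal illustrative sketch, not the thesis's implementation. It assumes Latin-transliterated words and a root that is already known; the function names `split_root_template` and `merge_root_template` and the `<1>`, `<2>`, `<3>` slot notation are hypothetical choices made for this example.

```python
def split_root_template(word: str, root: str) -> tuple[str, str]:
    """Split a word into (root, template).

    Each root consonant is matched left to right and replaced by a
    numbered slot. This greedy matching is an assumption of the sketch.
    """
    pieces = []
    matched = 0  # number of root consonants matched so far
    for ch in word:
        if matched < len(root) and ch == root[matched]:
            matched += 1
            pieces.append(f"<{matched}>")
        else:
            pieces.append(ch)
    if matched != len(root):
        raise ValueError(f"root {root!r} not found in order in {word!r}")
    return root, "".join(pieces)


def merge_root_template(root: str, template: str) -> str:
    """Fill the numbered slots of a template with the root consonants."""
    word = template
    for index, consonant in enumerate(root, start=1):
        word = word.replace(f"<{index}>", consonant)
    return word


# Transliterated Arabic examples (hypothetical input format).
root, template = split_root_template("maktuub", "ktb")
print(root, template)
# Reusing the template with a different root yields another word form.
print(merge_root_template("drs", template))
```

Because the root and the template are separate units, a template observed with one root can be recombined with another root, which is the kind of unseen word form the abstract refers to.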
Degree
MS
College and Department
Computational, Mathematical, and Physical Sciences; Computer Science
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Hatch, Brendan T., "Semitic Root Encoding: Tokenization Based on the Templatic Morphology of Semitic Languages in NMT" (2025). Theses and Dissertations. 10860.
https://scholarsarchive.byu.edu/etd/10860
Date Submitted
2025-06-10
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd13696
Keywords
tokenization, sub-word segmentation, neural machine translation, Semitic languages, Arabic, roots, templates, morphology
Language
English