Abstract

The morphology of Semitic languages such as Arabic and Hebrew is non-concatenative: words are built from roots and templates. This word structure, which human speakers rely on, is obscured from neural models that use traditional tokenization algorithms. In this work, I present and evaluate Semitic Root Encoding (SRE), a tokenization method that represents Semitic words with distinct root and template stem tokens, and apply it to neural machine translation (NMT). I evaluate whether SRE tokenization improves translation quality, whether it allows NMT models to hypothesize unseen word forms needed at inference, and how resistant the resulting NMT models are to generating dubious word stems.
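
To make the root-and-template idea concrete, the following is a minimal illustrative sketch and not the SRE implementation evaluated in the thesis. It assumes Latin-transliterated input, a root supplied by the caller, and a hypothetical `_` placeholder marking root slots in the template; the function names `split_root_template` and `merge_root_template` are likewise invented for illustration. Root consonants are matched greedily from left to right, which is a simplification that real morphological analysis would need to refine.

```python
from typing import Tuple

PLACEHOLDER = "_"  # hypothetical marker for a root-consonant slot


def split_root_template(word: str, root: str) -> Tuple[str, str]:
    """Split a transliterated word into (root, template).

    Each root consonant is matched left to right and replaced by
    PLACEHOLDER; all other characters stay in the template.
    """
    template = []
    matched = 0
    for ch in word:
        if matched < len(root) and ch == root[matched]:
            template.append(PLACEHOLDER)
            matched += 1
        else:
            template.append(ch)
    if matched != len(root):
        raise ValueError(f"root {root!r} not found in order in {word!r}")
    return root, "".join(template)


def merge_root_template(root: str, template: str) -> str:
    """Rebuild a word by filling template slots with root consonants."""
    out = []
    next_radical = 0
    for ch in template:
        if ch == PLACEHOLDER:
            if next_radical >= len(root):
                raise ValueError("template has more slots than the root has consonants")
            out.append(root[next_radical])
            next_radical += 1
        else:
            out.append(ch)
    if next_radical != len(root):
        raise ValueError("root has more consonants than the template has slots")
    return "".join(out)


if __name__ == "__main__":
    # Arabic "kataba" (he wrote) with root k-t-b
    print(split_root_template("kataba", "ktb"))
    # Reuse the template of "maktuub" with a different root, d-r-s
    _, template = split_root_template("maktuub", "ktb")
    print(merge_root_template("drs", template))
```

Under these assumptions, a template taken from one word can be recombined with a different root, which illustrates how separate root and template tokens could, in principle, let a model compose a word form it has not seen in training.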

Degree

MS

College and Department

Computational, Mathematical, and Physical Sciences; Computer Science

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2025-06-10

Document Type

Thesis

Handle

http://hdl.lib.byu.edu/1877/etd13696

Keywords

tokenization, sub-word segmentation, neural machine translation, Semitic languages, Arabic, roots, templates, morphology

Language

English
