Abstract
Automatic readability assessment aims to predict how difficult a text is for readers to understand. The concept of readability has been widely studied across languages because of its significant impact on reading comprehension, leading to the development of a variety of methods and approaches. In this study, I addressed two key challenges in readability assessment. First, I examined the performance of statistical classifiers and transformer-based models, including CAMeLBERT (Inoue et al., 2021), with a focus on their ability to generalize under domain, genre, and stylistic shifts. Second, I investigated the interpretability of transformer-based models on the readability task, as such models are often criticized for their black-box nature.
To address these challenges, I built a new corpus for automatic Arabic readability assessment consisting of 82,512 samples and expanded it with the DARES corpus (El-Haj et al., 2024), which includes 10,755 samples. I also proposed hybrid approaches that combine transformer-based representations with rich handcrafted linguistic features, as well as statistical models that integrate embeddings and linguistic features. I trained the models on a large, high-quality dataset constructed from Jordanian and Saudi curricula and evaluated them on a separate benchmark corpus, BAREC (Elmadani, Habash, & Taha-Thomure, 2025), across multiple domains. To improve interpretability, I employed probing experiments to analyze what linguistic information is captured by the CLS representation in BERT-based models.
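A minimal sketch of such a layer-wise probe is shown below, assuming texts and integer readability labels are already available; the CAMeLBERT checkpoint name, the toy sentences, and the logistic-regression probe are illustrative assumptions, not the thesis code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Assumed CAMeLBERT checkpoint; the thesis may use a different variant.
MODEL_NAME = "CAMeL-Lab/bert-base-arabic-camelbert-mix"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def cls_per_layer(texts):
    """Collect the [CLS] vector of every layer for each text."""
    per_layer = {}
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=512)
            hidden = model(**enc).hidden_states  # embeddings + one tensor per layer
            for i, h in enumerate(hidden):
                per_layer.setdefault(i, []).append(h[0, 0].numpy())  # [CLS] position
    return {i: np.stack(vecs) for i, vecs in per_layer.items()}

# Toy placeholder data; the real probe uses the corpus and its level labels.
texts = ["ذهب الولد إلى المدرسة.", "قرأ الطفل قصة قصيرة.",
         "يشكل التفاعل بين العوامل الاقتصادية والاجتماعية ظاهرة معقدة.",
         "تتطلب دراسة الظواهر اللغوية منهجية صارمة ومتعددة المستويات."]
labels = [0, 0, 1, 1]

for layer, X in cls_per_layer(texts).items():
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    # High probe accuracy at a layer suggests that layer encodes the probed property.
    print(f"layer {layer}: train accuracy {probe.score(X, labels):.2f}")
```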
In addition, I utilized the Captum framework to compute attribution scores for input tokens and align them with outputs from CAMeL Tools to identify the most influential linguistic features. Furthermore, I conducted an ablation study to assess the contribution of different features and model components in both in-domain and cross-domain settings.
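As an illustration of the attribution step, the sketch below applies Captum's LayerIntegratedGradients to a CAMeLBERT sequence classifier; the checkpoint name, the five readability levels, and the example sentence are assumptions, and in practice the per-token scores would be aggregated per word and joined with CAMeL Tools analyses.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; in practice this would be a fine-tuned readability model.
MODEL_NAME = "CAMeL-Lab/bert-base-arabic-camelbert-mix"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)
model.eval()

def forward_logits(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

# Attribute predictions back to the input embedding layer.
lig = LayerIntegratedGradients(forward_logits, model.bert.embeddings)

text = "ذهب الولد إلى المدرسة"
enc = tokenizer(text, return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

pred = forward_logits(enc["input_ids"], enc["attention_mask"]).argmax(dim=-1)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=pred.item(),
)
# One score per token: sum over the embedding dimension, then normalize.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / torch.norm(scores)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, s in zip(tokens, scores):
    print(f"{tok}\t{s.item():+.3f}")
```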
I found that transformer-based models perform well in in-domain settings but degrade significantly under cross-domain conditions. This suggests that these models rely heavily on domain-specific semantic and topical cues, which do not generalize across domains. In contrast, statistical classifiers demonstrate relatively stronger cross-domain generalization, likely because they rely more on stable linguistic features than on topic-dependent signals. Moreover, an XGBoost classifier trained on linguistic features outperforms the hybrid and transformer-based approaches in cross-domain evaluation.
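For context, the feature-based baseline can be as simple as the following sketch; the feature matrix, the five readability levels, and the hyperparameters are placeholders, not the tuned thesis configuration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 20))          # 200 texts x 20 handcrafted linguistic features (dummy)
y = rng.integers(0, 5, size=200)   # 5 readability levels (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    objective="multi:softprob", eval_metric="mlogloss")
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
# clf.feature_importances_ supports ablation-style analysis of individual features.
```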
The interpretability analysis further reveals that CAMeLBERT captures many of the handcrafted linguistic features implicitly across its layers: lower layers emphasize lexical cues, middle layers capture grammatical features, and higher layers focus on semantic and task-specific information. Attribution analysis with Captum shows that nominal groups are among the most important features for readability prediction, followed by adjectives and verbs. Among morphological features, definiteness (marked by al-, the Arabic definite article "ال"), singular number, and surface-form features are particularly influential.
Degree
MA
College and Department
Humanities; Linguistics
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Alzu'bi, Sarah, "Robust and Interpretable Cross-Domain Arabic Readability Prediction Using Hybrid Modeling: Evidence on the Limitations of Transformer-Based Classifiers" (2026). Theses and Dissertations. 11272.
https://scholarsarchive.byu.edu/etd/11272
Date Submitted
2026-04-22
Document Type
Thesis
Permanent Link
https://arks.lib.byu.edu/ark:/34234/q2ceb10361
Keywords
readability, automatic readability assessment, Arabic language, cross-domain evaluation, interpretability, CAMeLBERT, Captum
Language
English