Conventional language models estimate the probability that a word sequence within a chosen language will occur. By contrast, the purpose of our work is to estimate the probability that a word sequence belongs to the chosen language. The language of interest in our research is comprehensible, well-formed English. We explain how conventional language models assume what we refer to as a degree of generalization: the extent to which a model generalizes from a given sequence. We explain why such an assumption may hinder estimation of the probability that a sequence belongs to the language. We show that the probability that a word sequence belongs to a chosen language (represented by a given sequence) can be estimated without assuming a degree of generalization, and we introduce two methods for doing so: Minimal Number of Segments (MINS) and Segment Selection. We demonstrate that, in some cases, both MINS and Segment Selection distinguish sequences that belong from those that do not better than any other method we tested, including Good-Turing, interpolated modified Kneser-Ney, and the Sequence Memoizer.
College and Department
Physical and Mathematical Sciences; Computer Science
BYU ScholarsArchive Citation
Cook, Kevin Michael Brooks, "Probability of Belonging to a Language" (2013). All Theses and Dissertations. 4023.
Keywords
degree of generalization, language model, Minimal Number of Segments (MINS), probability of belonging, Segment Selection, word sequence