Abstract

Conventional language models estimate the probability that a word sequence within a chosen language will occur. By contrast, the purpose of our work is to estimate the probability that the word sequence belongs to the chosen language. The language of interest in our research is comprehensible well-formed English. We explain how conventional language models assume what we refer to as a degree of generalization, the extent to which a model generalizes from a given sequence. We explain why such an assumption may hinder estimation of the probability that a sequence belongs. We show that the probability that a word sequence belongs to a chosen language (represented by a given sequence) can be estimated by avoiding an assumed degree of generalization, and we introduce two methods for doing so: Minimal Number of Segments (MINS) and Segment Selection. We demonstrate that in some cases both MINS and Segment Selection perform better at distinguishing sequences that belong from those that do not than any other method we tested, including Good-Turing, interpolated modified Kneser-Ney, and the Sequence Memoizer.

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/