Given a sentence, a paraphrase generation system produces a sentence that says the same thing but usually in a different way. The paraphrase generation problem can be formulated in the machine translation paradigm; instead of translation of English to a foreign language, the system translates an English sentence (for example) to another English sentence. Quirk et al. (2004) demonstrated this approach to generate almost 90% acceptable paraphrases. However, most of the sentences had little variation from the original input sentence. Leveraging syntactic information, this thesis project presents an approach that successfully generated more varied paraphrase sentences than the approach of Quirk et al. while maintaining coverage of the proportion of acceptable paraphrases generated. The ParaMeTer system (Paraphrasing by MT) identifies syntactic chunks in paraphrase sentences and substitutes labels for those chunks. This enables the system to generalize movements that are more syntactically plausible, as syntactic chunks generally capture sets of words that can change order in the sentence without losing grammaticality. ParaMeTer then uses statistical phrase-based MT techniques to learn alignments for the words and chunk labels alike. The baseline system followed the same pattern as the Quirk et al. system - a statistical phrase-based MT system. Human judgments showed that the syntactic approach and baseline both achieve approximately the same ratio of fluent, acceptable paraphrase sentences per fluent sentences. These judgments also showed that the ParaMeTer system has more phrase rearrangement than the baseline system. Though the baseline has more within-phrase alteration, future modifications such as a chunk-only translation model should improve ParaMeTer's variation for phrase alteration as well.
College and Department
Physical and Mathematical Sciences; Computer Science
BYU ScholarsArchive Citation
Madsen, Rebecca Diane, "Generating Paraphrases with Greater Variation Using Syntactic Phrases" (2006). Theses and Dissertations. 1088.
paraphrase generation, paraphrase, sentential paraphrase, syntax, statistical machine translation, machine translation, natural language processing