Journal of Undergraduate Research

Keywords

memorized patterns, word order, unsupervised language learning, natural language

College

Physical and Mathematical Sciences

Department

Computer Science

Abstract

Despite the ever-increasing abilities of computers, natural language analysis remains a challenge. The intricacies of natural language are too numerous to enumerate by hand, which has motivated automated algorithms that learn how a language is used from large text corpora. Many current methods rely on complex statistical approaches involving multi-dimensional vectors and factor analysis, and are acknowledged to neglect valuable contextual information such as word order. This project continued the development of a novel, simple, unsupervised learning approach for determining word similarity based on context in recurring word sequences; previous work confirms that similar words are used in similar contexts. The algorithm reads raw, untagged text and records repeated word-use patterns along with their frequencies. These patterns preserve the context of each word by remembering where it appears in relation to the words surrounding it. By examining the contexts of a target word, other words or phrases with similar meaning can be found in similar contexts.
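
The general idea can be illustrated with a minimal Python sketch. This is not the project's actual algorithm, which the abstract does not specify in detail; the function names, the fixed pattern length of three words, the repetition threshold, and the scoring rule are all assumptions made for illustration. The sketch counts repeated word sequences in raw text, turns each position of a pattern into a "slot" context, and scores other words by how often they fill the same slots as the target word.

```python
from collections import Counter, defaultdict


def repeated_patterns(tokens, n=3, min_count=2):
    """Count every n-word sequence and keep those seen at least min_count times.

    The values of n and min_count are illustrative assumptions.
    """
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {pattern: c for pattern, c in counts.items() if c >= min_count}


def context_index(patterns):
    """Map each context (a pattern with one position blanked) to its filler words."""
    index = defaultdict(Counter)
    for pattern, count in patterns.items():
        for pos, word in enumerate(pattern):
            # Word order is preserved: the blank marks where the word occurred.
            context = pattern[:pos] + ("_",) + pattern[pos + 1:]
            index[context][word] += count
    return index


def similar_words(target, index):
    """Score words by the contexts they share with the target word (assumed scoring)."""
    scores = Counter()
    for fillers in index.values():
        if target in fillers:
            for word, count in fillers.items():
                if word != target:
                    scores[word] += min(count, fillers[target])
    return scores.most_common()


# Tiny hand-made corpus, used only to exercise the sketch.
text = (
    "the cat sat on the mat . the dog sat on the mat . "
    "the cat sat on the rug . the dog sat on the rug ."
)
index = context_index(repeated_patterns(text.split()))
print(similar_words("cat", index))
print(similar_words("mat", index))
```

On this toy corpus, "dog" shares the contexts "the _ sat" and "_ sat on" with "cat", so it is the only candidate returned for that target. A realistic corpus would call for variable-length patterns and a more careful scoring rule; both are left out here to keep the sketch short.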
