Abstract

This report addresses the lack of progress in the field of Second Language Attrition (L2A). A review of L2A history and literature shows this to be caused by a lack of appropriate data. Five criteria for appropriate data are suggested, and a corpus of L2A data (57,000 words, spoken Spanish) that meets the criteria is presented. The history of the corpus is explained in detail, including subject selection, instruments and methods of collection, and markup: XML was used to annotate the corpus with nineteen categories of speech errors, adapted from Nation's (2001) "Learning Vocabulary in Another Language." An example analysis showing how the corpus can be used for L2A research is provided, with step-by-step instructions on writing Perl scripts for data extraction and post-processing. Source code is included in the text. Complete beginners' tutorials on XML and Perl are included in the appendices. The report also introduces a website, developed specifically to host the corpus, where researchers may register, download the corpus, and share work they have done with it. All files used in the example project, as well as this report, are available for download at the website. Findings from the example analysis support Plateau Phases and the Regression Hypothesis, and suggest that the Threshold Hypothesis does not apply to marked forms. These findings illustrate the value of the corpus to the L2A research community.

Degree

College and Department

Humanities; Linguistics and English Language

Rights

http://lib.byu.edu/about/copyright/