Russian Language Journal


Russian language learning, learning patterns, corpus linguistics


Although learner corpus research has steadily grown into an independent branch of corpus linguistics, learner corpora cannot yet fully benefit from corpus analysis methods. This is due to several technical obstacles involving data collection, error annotation, and data processing. With respect to data collection, learner corpus research is at a disadvantage compared with general corpus linguistics because some learner corpora are still collected manually: optical character recognition (OCR) is not yet sophisticated enough to convert a student’s handwritten text into digitized text, which significantly slows the collection of learner corpora. Typed student texts present another problem: access to spellcheckers and other proofing tools obscures students’ real language skills. Annotation of learner corpora presents inherent difficulties as well: a learner corpus is a collection of productions in the learner’s language, also called interlanguage, which deviates from the codified standard language on several linguistic levels (morphological, syntactic, discursive), and processing software does not yet take these deviations into account. This constitutes one of the challenges of current learner corpus research (Granger et al. 2015). Finally, unlike computerized corpora of standard texts, unannotated learner corpora usually cannot be fully processed by quantitative analysis because they contain numerous erroneous forms, most of which cannot yet be recognized by the machine. However, it is possible to digitally analyze annotated data, and this opens new perspectives, particularly in the fields of foreign language acquisition and teaching.