Keywords

large-scale audio corpora, phonetic analysis of formant measurements, DARLA and FAVE processing, challenges in corpus building, time and effort in transcription

Abstract

Large-scale transcribed audio corpora, such as the Buckeye Corpus and the Santa Barbara Corpus, are publicly available.

How do these come to be? What’s the on-the-ground process of building such a corpus?

Here we discuss:

  • Methods for large-scale transcription
  • Early data & analysis resulting from transcription

Large-scale transcription:

  • Time to transcribe. Estimated: 10:1; Reality: 13:1
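
To make the gap between the two ratios concrete, here is a minimal sketch of the arithmetic. It assumes the ratios are read as hours of transcription work per hour of audio. The figure of 100 audio hours is a hypothetical input chosen only for illustration and is not taken from the presentation; the function name is likewise illustrative.

```python
def transcription_hours(audio_hours: float, ratio: float) -> float:
    """Return the hours of transcription work for `audio_hours` of audio,
    assuming `ratio` hours of work per hour of audio (a ratio of ratio:1)."""
    if audio_hours < 0 or ratio <= 0:
        raise ValueError("audio_hours must be >= 0 and ratio must be > 0")
    return audio_hours * ratio


ESTIMATED_RATIO = 10.0  # planned ratio (10:1), from the abstract
OBSERVED_RATIO = 13.0   # ratio actually observed (13:1), from the abstract
EXAMPLE_AUDIO_HOURS = 100.0  # hypothetical corpus size, for illustration only

estimated = transcription_hours(EXAMPLE_AUDIO_HOURS, ESTIMATED_RATIO)
observed = transcription_hours(EXAMPLE_AUDIO_HOURS, OBSERVED_RATIO)

# Relative overrun depends only on the two ratios, not on the corpus size.
overrun = (OBSERVED_RATIO - ESTIMATED_RATIO) / ESTIMATED_RATIO

print(f"Estimated effort: {estimated:.0f} hours")
print(f"Observed effort:  {observed:.0f} hours")
print(f"Overrun:          {overrun:.0%}")
```

Whatever the amount of audio, moving from 10:1 to 13:1 means 30% more transcription time than planned.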

Phonetic Analysis:

  • Comparison of formant measurements
  • In-house Praat script performed poorly
  • DARLA filtered out 53%
  • Too early to tell if FAVE modifications were better

Original Publication Citation

Rachel Olsen, Michael Olsen, Joseph A. Stanley & Margaret E. L. Renwick. “Transcribing the Digital Archive of Southern Speech: Methods and Preliminary Analysis.” 84th Meeting of the Southeastern Conference on Linguistics (SECOL84). Charleston, SC. March 8–11, 2017.

Document Type

Presentation

Publication Date

2017

Publisher

84th Meeting of the Southeastern Conference on Linguistics

Language

English

College

Humanities

Department

Linguistics

University Standing at Time of Publication

Assistant Professor

Included in

Linguistics Commons
