Vocal Emotion Corpus README
Written by Logan Kowallis
Last Modified: 6 July 2020

This document explains the directory structure and files in the Vocal Emotion Corpus, along with the reasons for decisions made when organizing it. All files included in the Vocal Emotion Corpus are licensed under a Creative Commons license: CC BY-NC-SA 4.0. More information can be found in the "LICENSE" file in the corpus. A simpler overview of the files in the corpus is given in the Data Set Description Summary in its listing on ScholarsArchive: https://scholarsarchive.byu.edu/

Data Collection

Details about the equipment, script, recording procedure, and other relevant data collection information are found in the dissertation document hosted in the Theses and Dissertations section of ScholarsArchive and in files found in the "/Documents/1. data collection/" directory.

Recording began in September 2019 in a recording studio on Brigham Young University campus. Partway into the study, however, issues arose. The worst of these was that a Mac OS X update broke driver compatibility with the audio mixer used in the recording studio. This meant that audio monitoring could not occur simultaneously with audio recording, so the research assistant present at each recording session could not give feedback to participants while recording. Because of this, all recordings were checked after the study was completed so that bad recordings could be identified and removed. In a few cases, subjects did not meet the strict criteria for inclusion in the dissertation analysis because of issues that went uncorrected in the absence of live audio monitoring. These recordings are still useful, however, and are included in the corpus; these subjects' data are in the "Additional_Subjects.zip" data directory. Any recordings in which a subject did not correctly read all of the words of the script in order, but did manage to stay in character and finish the reading, are kept in a separate folder, "Additional_Corpus_Files.zip".

Knowing that we would not reach the originally planned goal of 192 subjects because of project delays, I decided to set a new goal of 128 subjects. Some data had already been identified as unusable or sparse, so the final subjects scheduled were given subject numbers chosen to complete the counterbalancing of emotion order as well as possible. This is why the numbering jumps near the end and why some subject numbers are missing from the dataset. The subject numbers run from 001 to 142, but only 131 subjects are included in the corpus. See "Subject numbers and info.xlsx", found in "/Documents/1. data collection notes/", for more details on the order in which emotions were acted.

The remaining files in "/Documents/1. data collection notes/" are explained below. The file of responses from the demographics survey taken by each subject is called "Anonymous Demographics.xlsx"; note that this is the only file that links individual subject numbers to gender and other demographics. A clearer summary of responses is found in a table in the dissertation. The written script that was read for every recording is found, with fonts and formatting intact, in "new spoken script 5-3-2019.docx". The form for recording notes during a session is "Notes Excel Version2.xlsx".
The protocol for the experiment that was printed and available to research assistants during all recording sessions is called "Recording Booth Protocol newest 10-4-2019.docx". An additional document to help with common issues in the recording studio, "Recording Booth Troubleshooting Guide 10-4-2019.docx", is also provided.

Data Preprocessing

In the interest of learning, I have included scripts for data processing and labeling. I tried to use only free tools available in a Windows environment. However, the recordings were made using Adobe Audition, and the first step of processing, exporting individual readings of the script separately, was completed manually in Audition without a script. These steps are outlined in "Audition Export Protocol.docx", found in the "/Documents/2. data preparation notes/" directory. After that process was completed, further processing was done using Windows PowerShell, R, Praat, ffmpeg, and Windows Command Prompt. Additional software packages were used for analysis, but those are explained later in this document. Scripts are found in the "/Documents/scripts/" directory. Paths in the scripts point to local directories, so the scripts will not work on your computer without modifying the paths first.

The first step in processing was to run the Windows PowerShell script called "ALL SCRIPTS FOR LABELING.txt". This script is made up of smaller scripts with various functions, including standardizing filenames, adding three-character labels to filenames, and moving files from their original directories to their final destinations. More details can be found in the script's comments. Some helpful scripts created around the same time were written as regular Command Prompt batch files. They include:

"filename space remover.bat", which changes spaces to underscores in filenames;
"get filenames.bat", which saves a list of filenames following a naming pattern;
"save additional subjects filenames.bat", which is similar to the previous script but for the problematic subjects' files;
"moving additional subjects.bat", which is a basic script for moving files; and
"moving files batch.bat", which is similar to the previous script but moves other specific files.

All of these scripts are found in the "/Documents/scripts/" directory.

Labels follow the pattern CCC_CCC_###_Marker_###.wav. The first three characters are either "REG" or "VAR", indicating which instructions were used for the recording: either the "regular" instructions that are typical for vocal emotion studies or the experimental "variety" instructions that are meant to require a new way of expressing an emotion for each recording. The second label is for the emotion, with the possible labels being "ANG", "FEA", "HAP", "NEU", or "SAD" for anger, fear, happiness, neutral, and sadness. The last section always identifies the marker number, the original identifying value of the recording within a specific recording session, in ascending order from the start of the session, so larger values were recorded later than smaller values for that subject. This labeling format is used in all of the .wav filenames, as well as in the corresponding .TextGrid filenames from the linguistic analysis.
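For readers who want to work with the filenames programmatically, here is a minimal Python sketch (not one of the corpus scripts) that splits a filename into its labeled parts. The middle three-digit field is assumed here to be the subject number, since the README does not spell it out; confirm the field meanings against the files themselves.

    import re

    # Assumed pattern, following the naming convention described above:
    #   <INSTRUCTIONS>_<EMOTION>_<###>_Marker_<###>.wav
    # The middle three-digit field is assumed (not confirmed) to be the subject number.
    LABEL_RE = re.compile(
        r"^(?P<instructions>REG|VAR)_"
        r"(?P<emotion>ANG|FEA|HAP|NEU|SAD)_"
        r"(?P<subject>\d{3})_"
        r"Marker_(?P<marker>\d+)\.wav$"
    )

    def parse_label(filename):
        """Return the label fields of a corpus .wav filename as a dict, or None if it does not match."""
        match = LABEL_RE.match(filename)
        return match.groupdict() if match else None

    # Hypothetical example filename:
    # parse_label("REG_ANG_012_Marker_034.wav")
    # -> {'instructions': 'REG', 'emotion': 'ANG', 'subject': '012', 'marker': '034'}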
During this process of labeling, some files were identified as corrupted or truncated and were fixed. No evidence of corruption exists in the final .wav files. A document explaining some of my reasoning while writing the labeling code and identifying potentially corrupted files, "LOGAN RELABELING STEPS.docx", is found in the "/Documents/2. data preparation notes/" directory.

There are three sets of analyses that were part of the dissertation project: a feature extraction approach that relies on linguistic theory and tools, a neural network approach that relies on computer science theory and tools, and simple summary statistics calculated on the variety-instructions set of recordings.

Linguistic Analysis

The linguistic analysis required several steps: creating TextGrid files, populating the TextGrid files, extracting features, mean-centering the data on the neutral emotion means, and finally calculating statistics.

The Praat script called "TEXTGRID CREATION.txt" was run to create two blank tiers in a new TextGrid file for every recording. This script was not original, and since its original license could not be verified despite it being hosted on numerous university websites, a placeholder file with more information is present in its place. The Montreal Forced Aligner was used to populate the TextGrid files with start and end time markers for each word in the first tier and each phone in the second tier. The script used is called "forced alignment.bat". During this process it was discovered that there is a documented bug in the current version of the Montreal Forced Aligner: any configuration file given to the aligner is ignored and the default configuration values are always used. Because the configuration could not be tuned, I could not increase the recognition rates. The Montreal Forced Aligner has an algorithm for determining whether too many errors occur when recognizing the content of an audio file, and it will fail to produce results if this happens. I tabulated the files that failed to complete: 168 files, 132 of which were the targeted REG files. I decided to remove these files from the dissertation analysis even though nothing about them suggested an obvious reason for the failure. I did this to ensure that the set of recordings analyzed would be the same for each analysis and that features would be available for each recording at every level. The removed files are kept in their own folder, "Corpus_Files_Without_Forced_Alignment.zip". The TextGrid files from this analysis are included in "Full_Textgrids.zip", along with others that were accidentally processed alongside the intended TextGrid files and may be of use to someone. I spot-checked 10 completed TextGrid files from 10 different subjects and found only 5 errors at the word or phone level, each a minor timing error in which silence was partially included within the boundary of a word or phone. A forced aligner with adjustable settings could fix this issue easily. Any subject with fewer than 5 recordings per emotion (including neutral) was excluded from further analysis.

Using Praat, I calculated some basic information from the .wav files without the use of the .TextGrid files. These two variables were total duration and mean frequency (f0), created with my "EXTRACT ALL ACOUSTIC VARIABLES.txt" script. For the pause duration between the words "was" and "interesting", I relied on two scripts: a Command Prompt batch file for identifying the line numbers for these two words, and an R script that checks all of the .TextGrid files on those lines and outputs the corresponding timing information to a new file. Specifically, the R script finds the start of the word "interesting" (its xmin) and the end of the previous word, "was" (its xmax); the subtraction of the two values was then performed in Excel, and the resulting variable was added to the total duration and mean frequency variables. The batch file is called "batch script for pause duration between words.bat" and the R script is called "get pause duration between was and interesting.R".
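To make the pause calculation concrete, here is a rough Python sketch that does the same arithmetic directly on one long-format TextGrid (the corpus itself uses the batch file and R script named above). It assumes UTF-8 TextGrids with lowercase word labels; the filename in the example is hypothetical.

    import re

    # Matches one interval block in a long-format TextGrid:
    #   xmin = <start>, xmax = <end>, text = "<label>"
    INTERVAL_RE = re.compile(
        r'xmin\s*=\s*([\d.]+)\s*\n'
        r'\s*xmax\s*=\s*([\d.]+)\s*\n'
        r'\s*text\s*=\s*"([^"]*)"'
    )

    def pause_between(textgrid_path, first_word="was", second_word="interesting"):
        """Return second_word.xmin - first_word.xmax for the first such pair found."""
        with open(textgrid_path, encoding="utf-8") as handle:
            intervals = [(float(xmin), float(xmax), text.strip().lower())
                         for xmin, xmax, text in INTERVAL_RE.findall(handle.read())]
        for i, (xmin, _, text) in enumerate(intervals):
            if text == second_word:
                # Search backwards for the nearest preceding occurrence of first_word.
                for _, xmax_prev, text_prev in reversed(intervals[:i]):
                    if text_prev == first_word:
                        return xmin - xmax_prev
        raise ValueError(f'Could not find "{first_word}" followed by "{second_word}"')

    # Hypothetical example:
    # pause_between("REG_NEU_012_Marker_034.TextGrid")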
The resulting data were combined in an Excel file called "ACOUSTIC DATA.xlsx", which is found in the "/Documents/3. data analysis notes/" directory.

Before these three variables (total duration, mean frequency, and the pause between "was" and "interesting") could be used as the dependent variables in a one-way MANOVA with emotion as the independent variable, they had to be mean-centered on the neutral emotion's mean. The idea behind these simple difference scores is to use each subject's regular speaking voice as that subject's baseline for each variable: within each subject, the neutral mean is calculated for each variable and then subtracted from all of that subject's values. The neutral recordings were then excluded from the analysis. These difference-score versions of the variables are saved in a data file called "3 acoustic vars diff scores from neutral.csv", found in the "/Documents/3. data analysis notes/" directory. Variance from subject nested within emotion was used as the error term for emotion instead of the residual, unexplained variance. This better accounts for the repeated-measures structure of the data: multiple recordings belong to each subject and are therefore expected to be intercorrelated, which violates the usual regression assumption of independent observations unless the model is modified to account for it. Stata statistical software was used to calculate the MANOVA; the script is called "MANOVA 3 acoustic vars.do" and is found in the "/Documents/scripts/" directory.

Neural Network Analysis

For the neural network analysis, Google servers were used through Google Colab. The script was written in Python, and the neural network package used was PyTorch. First, the .wav data were silence-padded: silence (zero values) was appended to the end of each file until all files were the same length. This was accomplished by first using ffmpeg in a batch script to probe the length of each file and save the output. That information was then collected in Excel, which was used to generate the text of a second batch file containing one line of code per .wav file to pad it by the amount needed to reach the target length; running that batch file altered the .wav files. The former batch file is called "probe.bat" and the latter is called "padding.bat", both located in the "/Documents/scripts/" directory.

Next, the Python script took each silence-padded .wav file and compressed the audio data into a smaller, easier-to-use spectrogram representation using the torchaudio package. Once the spectrograms were created, a 50-50 randomized split into training and testing data was assigned. This randomization used Excel's RAND function to assign a value to every file, which was then sorted in ascending order. The spectrograms were separated into two folders, one for training the model and one for testing the model. The spectrograms of the training set were then loaded as input into a simple Recurrent Neural Network (RNN) design; a rough sketch of this pipeline is given below.
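As a rough illustration of the padding and spectrogram steps, here is a short Python/PyTorch sketch that does both in memory (the corpus itself padded the .wav files on disk with the ffmpeg batch files described above). The target length, sample rate, and mel-spectrogram settings are placeholders, not the values actually used in the dissertation.

    import torch
    import torchaudio

    # Placeholder target length: 30 seconds at 44.1 kHz (not the study's actual value).
    TARGET_SAMPLES = 44100 * 30

    def load_padded(path):
        """Load a .wav file and silence-pad (zero-pad) it to TARGET_SAMPLES samples."""
        waveform, sample_rate = torchaudio.load(path)   # waveform shape: (channels, samples)
        pad_amount = TARGET_SAMPLES - waveform.shape[-1]
        if pad_amount > 0:
            waveform = torch.nn.functional.pad(waveform, (0, pad_amount))  # append zeros
        return waveform

    # Compress the padded audio into a mel spectrogram, a smaller 2-D representation.
    to_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=44100, n_fft=1024, hop_length=512, n_mels=64  # placeholder settings
    )

    # Hypothetical example filename:
    # spec = to_spectrogram(load_padded("REG_ANG_012_Marker_034.wav"))
    # spec has shape (channels, n_mels, time_frames)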
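And here is a minimal sketch of a simple RNN classifier over such spectrograms, in the spirit of the design described above; the layer sizes and the choice to treat mel bins as features and time frames as the sequence are illustrative assumptions. The actual architecture is in "kowallisemotiondissertation2020neuralnetworks.py".

    import torch
    from torch import nn

    EMOTIONS = ["ANG", "FEA", "HAP", "NEU", "SAD"]

    class EmotionRNN(nn.Module):
        """Toy RNN classifier: spectrogram frames in, one of five emotion categories out."""
        def __init__(self, n_mels=64, hidden_size=128):
            super().__init__()
            self.rnn = nn.RNN(input_size=n_mels, hidden_size=hidden_size, batch_first=True)
            self.classifier = nn.Linear(hidden_size, len(EMOTIONS))

        def forward(self, spec):
            # spec: (batch, n_mels, time) -> treat time frames as the sequence dimension.
            sequence = spec.transpose(1, 2)       # (batch, time, n_mels)
            _, hidden = self.rnn(sequence)        # hidden: (1, batch, hidden_size)
            return self.classifier(hidden[-1])    # (batch, len(EMOTIONS)) logits

    # Example forward pass on a random batch of "spectrograms":
    # model = EmotionRNN()
    # logits = model(torch.randn(8, 64, 300))
    # predicted = [EMOTIONS[int(i)] for i in logits.argmax(dim=1)]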
After the model was trained, prediction accuracy was evaluated on the testing set, with the five emotions (anger, fear, happiness, neutral, and sadness) as the categories tested. The Python script for the entire neural network analysis is called "kowallisemotiondissertation2020neuralnetworks.py" and is found in "/Documents/scripts/".

Variety Recordings Statistics

Simple descriptive statistics were calculated using Stata and organized in Excel. The Stata script is called "varietyrecordings.do" and is found in the "/Documents/scripts/" directory. The statistics for variety recordings are organized in "varietyrecordings.xlsx" and the same statistics for regular recordings are organized in "regularrecordings.xlsx", with both files found in the "/Documents/3. data analysis notes/" directory.

Other

A script for zipping folders of the corpus is called "zip folders batch.bat", and a script for computing an MD5 checksum of each zipped folder using Microsoft's File Checksum Integrity Verifier (FCIV) is called "checksum batch.bat". Both scripts are found in the "/Documents/scripts/" directory.

Overview of Data Folders

Several zipped data folders contain audio files encoded as 44.1 kHz mono .wav. These folders are:

Additional_Corpus_Files.zip
Additional_Subjects.zip
Corpus_Files_Without_Forced_Alignment.zip
Regular_Instructions.zip
Variety_Instructions.zip

As mentioned above, various audio or script-reading issues are found in Additional_Corpus_Files.zip and Additional_Subjects.zip, but it is less clear whether there are problems with Corpus_Files_Without_Forced_Alignment.zip. The cleanest audio is found among the 120 subjects in Regular_Instructions.zip and Variety_Instructions.zip.

MD5 checksums for each .zip file are presented below:

Additional_Corpus_Files.zip                    846b631926f62cd56e5b8e68ca429789
Additional_Subjects.zip                        068eac3d48fd35010f78dd157dfe3802
Corpus_Files_Without_Forced_Alignment.zip      7dde39cf429b027972f561ca0477f292
Documents.zip                                  aa0d517fcfd19032634ca5f13d957132
Full_Textgrids.zip                             1a17ece37084231f7b537417b84ef62c
Regular_Instructions.zip                       442b33c2d26b991cd8bc3c98d3d2b1c9
Variety_Instructions.zip                       65056de99fbb31d9b885f3758b2821f4
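The checksums can be verified with any MD5 tool; for convenience, here is a small Python sketch (not part of the corpus scripts) that checks downloaded .zip files against the values above, assuming they sit in the current working directory.

    import hashlib

    # Expected MD5 values copied from the list above.
    EXPECTED = {
        "Additional_Corpus_Files.zip": "846b631926f62cd56e5b8e68ca429789",
        "Additional_Subjects.zip": "068eac3d48fd35010f78dd157dfe3802",
        "Corpus_Files_Without_Forced_Alignment.zip": "7dde39cf429b027972f561ca0477f292",
        "Documents.zip": "aa0d517fcfd19032634ca5f13d957132",
        "Full_Textgrids.zip": "1a17ece37084231f7b537417b84ef62c",
        "Regular_Instructions.zip": "442b33c2d26b991cd8bc3c98d3d2b1c9",
        "Variety_Instructions.zip": "65056de99fbb31d9b885f3758b2821f4",
    }

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 hex digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    for name, expected in EXPECTED.items():
        status = "OK" if md5_of(name) == expected else "MISMATCH"
        print(name, status)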