Abstract

We propose revisions to Connectionist Temporal Classification (CTC) loss and forced alignment to support longer input sequences. CTC loss is commonly used to train speech and handwriting recognition models when the labels are not aligned with the input, while forced alignment is a downstream task that finds the optimal path through the input for segmentation purposes. Current implementations consider every possible alignment between the input and the label, a set that grows exponentially with the input length. We show that when silence is removed and the speaking rate is consistent, the true alignment is a mostly straight line across time and transcription. We experiment with adding a Bayesian prior on phoneme length, scaled by speaking rate, to constrict the alignment space. We find significant improvements to forced alignment, reducing the time complexity to less than O(n√n log n) and the space complexity to O(n). However, we also find that the prior negatively affected training in our experiments.
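The core idea of constricting the alignment space can be sketched as a Viterbi forced alignment over the standard CTC state graph (labels interleaved with blanks), where only states near the diagonal are expanded at each frame. This is an illustrative sketch only, not the thesis implementation: the function name, the simple fixed-width `band` rule, and the toy inputs are all assumptions made for the example.

```python
import numpy as np

def ctc_viterbi_align(log_probs, labels, blank=0, band=None):
    """Viterbi forced alignment for CTC (illustrative sketch).

    log_probs: (T, V) array of per-frame log-probabilities.
    labels:    label indices without blanks.
    band:      if given, only states within `band` of the diagonal
               s ~= S * t / T are expanded, shrinking the search
               space from O(T*S) toward O(T*band).
    Returns the best-path token at each frame.
    """
    T = log_probs.shape[0]
    ext = [blank]                      # extended label string: blanks between labels
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)      # best log-score ending in state s at frame t
    bp = np.zeros((T, S), dtype=int)   # backpointers
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        lo, hi = 0, S
        if band is not None:           # restrict to a band around the diagonal
            center = S * t / T
            lo = max(0, int(center - band))
            hi = min(S, int(center + band) + 1)
        for s in range(lo, hi):
            cands = [(dp[t - 1, s], s)]                      # stay
            if s > 0:
                cands.append((dp[t - 1, s - 1], s - 1))      # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1, s - 2], s - 2))      # skip a blank
            best, arg = max(cands)
            dp[t, s] = best + log_probs[t, ext[s]]
            bp[t, s] = arg
    # backtrack from the better of the two valid final states
    s = S - 1 if S < 2 or dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t, s]
        path.append(s)
    path.reverse()
    return [ext[s] for s in path]
```

With a toy 4-frame input strongly favoring label 1 then label 2, the banded and unbanded searches return the same alignment, since the true path hugs the diagonal; the band only prunes states far from it.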

Degree

MS

College and Department

Computer Science; Computational, Mathematical, and Physical Sciences

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2025-12-09

Document Type

Thesis

Keywords

CTC, forced alignment, Viterbi, Hirschberg, speech recognition, torchaudio

Language

English
