Abstract

Facial motion, particularly lip movement, carries rich information about human speech that can be leveraged for a variety of computer vision and speech-processing tasks. In contrast to pixel-based methods, this work investigates the use of facial landmarks as a compact, privacy-preserving, and computationally efficient representation for lip-motion analysis. We evaluate landmark-based models across multiple tasks, including visual voice activity detection (VVAD), lip–audio synchronization, and visual speech recognition (VSR) for liveness detection. Using the LRS3-VVAD and LRS2 datasets, we demonstrate that landmark-only models can achieve performance comparable to pixel-based systems for VVAD and synchronization while significantly reducing parameter counts and inference cost. Preliminary results on VSR liveness detection suggest that landmarks encode sufficient cues for speech recognition, though further refinement and multimodal integration are needed to close the remaining performance gap. Additionally, because we use landmarks exclusively, we avoid storing images of individuals’ faces, preserving their privacy. These findings highlight the potential of facial landmarks as an interpretable and ethical foundation for visual and audiovisual speech modeling.

Degree

MS

College and Department

Ira A. Fulton College of Engineering; Electrical and Computer Engineering

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2026-03-05

Document Type

Thesis

Keywords

VVAD, VSR, Lip-motion, Facial landmarks

Language

English
