Abstract
Facial motion, particularly lip movement, carries rich information about human speech that can be leveraged for a variety of computer vision and speech-processing tasks. In contrast to pixel-based methods, this work investigates the use of facial landmarks as a compact, privacy-preserving, and computationally efficient representation for lip-motion analysis. We evaluate landmark-based models across multiple tasks, including visual voice activity detection (VVAD), lip–audio synchronization, and visual speech recognition (VSR) for liveness detection. Using the LRS3-VVAD and LRS2 datasets, we demonstrate that landmark-only models can achieve performance comparable to pixel-based systems for VVAD and synchronization, while significantly reducing parameter counts and inference cost. Preliminary results on VSR liveness detection suggest that landmarks encode sufficient cues for speech recognition, though further refinement and multimodal integration are needed to close the remaining performance gap. Additionally, because we use landmarks exclusively, we avoid storing images of individuals' faces, preserving their privacy. These findings highlight the potential of facial landmarks as an interpretable and ethical foundation for visual and audiovisual speech modeling.
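As a toy illustration of the kind of signal landmarks expose (not the thesis's actual models), the sketch below derives a crude VVAD decision from per-frame lip landmarks by thresholding the temporal variance of the vertical lip aperture. The landmark indices (13/14, following MediaPipe Face Mesh's inner-lip convention), window size, and threshold are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def lip_aperture(landmarks: np.ndarray,
                 upper_idx: int = 13, lower_idx: int = 14) -> np.ndarray:
    """Vertical lip opening per frame.

    landmarks: (T, N, 2) array of T frames, each with N (x, y) points.
    Indices 13/14 follow MediaPipe Face Mesh's inner-lip points, but any
    upper/lower lip pair works; treat them as placeholders.
    """
    upper = landmarks[:, upper_idx, :]
    lower = landmarks[:, lower_idx, :]
    return np.linalg.norm(upper - lower, axis=-1)

def vvad_from_landmarks(landmarks: np.ndarray,
                        window: int = 15,
                        threshold: float = 1e-4) -> np.ndarray:
    """Crude VVAD: flag frames whose local lip-aperture variance is high.

    Returns a boolean array of length T. `window` (frames) and `threshold`
    are illustrative values, not tuned ones.
    """
    aperture = lip_aperture(landmarks)
    T = len(aperture)
    speaking = np.zeros(T, dtype=bool)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        speaking[t] = aperture[lo:hi].var() > threshold
    return speaking

if __name__ == "__main__":
    # Synthetic demo: 100 frames of 468 normalized landmarks; the lower
    # lip oscillates only in the second half of the clip.
    lms = np.zeros((100, 468, 2))
    lms[:, 13, :] = [0.5, 0.45]  # upper inner lip (static)
    lms[:, 14, :] = [0.5, 0.50]  # lower inner lip (static baseline)
    t = np.arange(50)
    lms[50:, 14, 1] += 0.02 * np.sin(t / 2.0)  # "speech" motion
    print(vvad_from_landmarks(lms).astype(int))
```

Even this hand-built heuristic hints at why landmarks suffice for VVAD: each frame is reduced from a full image to a few hundred coordinates, and the discriminative cue is a simple geometric statistic rather than raw pixels.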
Degree
MS
College and Department
Ira A. Fulton College of Engineering; Electrical and Computer Engineering
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Wright, Kimi S., "Using Facial Landmarks for Lip Motion Interpretation Applications" (2026). Theses and Dissertations. 11161.
https://scholarsarchive.byu.edu/etd/11161
Date Submitted
2026-03-05
Document Type
Thesis
Permanent Link
https://arks.lib.byu.edu/ark:/34234/q23fad6815
Keywords
VVAD, VSR, Lip-motion, Facial landmarks
Language
English