Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation

Visual comparison of generated visemes

We propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions, ensuring smoother and perceptually consistent animations.

Abstract

Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with the ground truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. This highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation.

Interspeech 2025

What is Phonetic Context-aware Loss?

Traditional speech-driven 3D facial animation models are trained by minimizing the geometric error (e.g., MSE Loss) between the generated mesh and the ground truth on a frame-by-frame basis. This approach often fails to capture the continuous nature of facial motion, overlooking the phenomenon of 'coarticulation'—where the articulation of a sound is affected by its neighbors. This results in jittery and perceptually unnatural animations.
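For reference, a minimal sketch of this conventional frame-wise objective is shown below (PyTorch assumed; `pred` and `gt` are hypothetical tensors of predicted and ground-truth mesh vertices, not names from the paper):

```python
import torch

def frame_wise_mse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Conventional frame-wise reconstruction loss.

    pred, gt: (T, V, 3) tensors of predicted and ground-truth vertex positions
    over T frames and V mesh vertices. Every frame is weighted equally,
    regardless of how much the mouth is moving at that moment.
    """
    return ((pred - gt) ** 2).mean()
```

Because every frame contributes equally, frames dominated by articulatory transitions are not emphasized any more than static ones, which is the limitation our loss targets.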

Our Phonetic Context-aware Loss ($\mathcal{L}_{pc}$) is a novel objective function that addresses this limitation. It computes a "viseme coarticulation weight" that quantifies how much a viseme's articulation is influenced by its phonetic context. By applying this weight, the model learns to pay more attention to frames with significant articulatory transitions, resulting in smoother, more natural, and intelligible 3D facial animations.
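The sketch below illustrates the idea only; it does not reproduce the paper's exact definition of the coarticulation weight. As a stated assumption, the weight is approximated here by the normalized magnitude of ground-truth frame-to-frame vertex motion, so frames with larger articulatory transitions contribute more to the loss (PyTorch assumed, all tensor shapes hypothetical):

```python
import torch

def phonetic_context_aware_loss(pred: torch.Tensor,
                                gt: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """Illustrative weighted reconstruction loss.

    pred, gt: (T, V, 3) predicted and ground-truth vertex positions over T frames.
    The per-frame weight is a stand-in for the paper's viseme coarticulation weight.
    """
    # Per-frame motion magnitude of the ground truth; the first frame gets zero motion.
    motion = gt[1:] - gt[:-1]                                      # (T-1, V, 3)
    motion_mag = motion.norm(dim=-1).mean(dim=-1)                  # (T-1,)
    motion_mag = torch.cat([motion_mag.new_zeros(1), motion_mag])  # (T,)

    # Hypothetical weighting: a constant baseline plus extra emphasis on frames
    # with large articulatory transitions (normalized by the mean motion).
    weights = 1.0 + motion_mag / (motion_mag.mean() + eps)         # (T,)

    # Frame-wise squared error, re-weighted before averaging over time.
    per_frame_err = ((pred - gt) ** 2).mean(dim=(1, 2))            # (T,)
    return (weights * per_frame_err).mean()
```

The baseline term keeps static frames from being ignored entirely, while the motion-dependent term shifts emphasis toward transition frames; the actual weight used in the paper is derived from phonetic context rather than this simple motion heuristic.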

BibTeX

@inproceedings{kim2025learning,
  title={Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation},
  author={Hyung Kyu Kim and Hak Gu Kim},
  booktitle={Proc. INTERSPEECH},
  year={2025},
  organization={ISCA}
}