Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize a reconstruction loss by aligning each frame with the ground truth. However, this frame-wise approach often fails to capture the continuity of facial motion induced by coarticulation, leading to jittery and unnatural outputs. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. This highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation.
Traditional speech-driven 3D facial animation models are trained by minimizing the geometric error (e.g., MSE loss) between the generated mesh and the ground truth on a frame-by-frame basis. This approach often fails to capture the continuous nature of facial motion, overlooking the phenomenon of 'coarticulation', where the articulation of a sound is affected by its neighbors. As a result, it produces jittery and perceptually unnatural animations.
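For reference, the conventional frame-wise reconstruction loss can be sketched as follows; the mesh shape `(T, V, 3)` (T frames, V vertices, 3D coordinates) is an assumed convention for illustration:

```python
import numpy as np

def frame_wise_mse(pred, gt):
    """Conventional reconstruction loss: mean squared error between
    predicted and ground-truth vertex positions, treating every frame
    independently. pred, gt: arrays of shape (T, V, 3)."""
    return float(np.mean((pred - gt) ** 2))
```

Because every frame contributes equally, this objective carries no notion of how facial motion evolves across neighboring frames.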
Our phonetic context-aware loss is a novel objective function that addresses this limitation. To learn phonetic context-dependent visemes, we first compute a "viseme coarticulation weight" by quantitatively measuring how much the facial vertices move within a temporal window. By applying this weight to the conventional reconstruction loss, the model focuses more on regions where articulatory movement changes significantly due to coarticulation. Minimizing this new loss helps the model generate more natural 3D facial animations with continuous and smooth transitions.
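A minimal sketch of this idea in NumPy is shown below. It is an illustrative reimplementation, not the exact formulation from the paper: the windowed-displacement weight, the mean-normalization, and the function names are our assumptions.

```python
import numpy as np

def coarticulation_weights(gt, window=5):
    """Hypothetical viseme coarticulation weight: each frame is weighted
    by the average ground-truth vertex displacement within a temporal
    window centered on it. gt: (T, V, 3)."""
    T = gt.shape[0]
    disp = np.zeros(T)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T - 1, t + half)
        # frame-to-frame per-vertex displacement inside the window
        diffs = gt[lo + 1:hi + 1] - gt[lo:hi]
        disp[t] = np.sqrt((diffs ** 2).sum(-1)).mean() if hi > lo else 0.0
    # normalize so the weights average to 1, preserving the loss scale;
    # a static sequence falls back to uniform weights
    return disp / disp.mean() if disp.mean() > 0 else np.ones(T)

def phonetic_context_aware_loss(pred, gt, window=5):
    """Weighted reconstruction loss: frames with larger articulatory
    motion (e.g., viseme transitions) contribute more."""
    w = coarticulation_weights(gt, window)               # (T,)
    per_frame = np.mean((pred - gt) ** 2, axis=(1, 2))   # (T,)
    return float(np.mean(w * per_frame))
```

Under this sketch, frames in the middle of a viseme transition receive a larger weight than static frames, so errors during transitions are penalized more heavily.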
To verify temporal consistency, we visualize the pronunciation process of '/fa/'. The first row shows the ground truth (GT), the second a baseline model that does not consider phonetic context, and the third our method. While the baseline produces inaccurate and abrupt mouth shapes during the transition from '/f/' to '/a/', our model generates smooth, natural transitions that are visibly closer to the GT. The vertex displacement analysis on the right further shows that the mean and standard deviation of our model's output are closer to the GT, indicating higher temporal stability.
We conduct an ablation study on the window size hyperparameter, which determines how much phonetic context is considered. The graph plots the Face Vertex Error (FVE) and Lip Vertex Error (LVE) against varying window sizes. The dotted lines represent the conventional reconstruction loss (L_rec), while the solid lines represent our method (Ours). Our method yields lower error across all window sizes and achieves the best performance for both FVE and LVE at a window size of 5.
@inproceedings{kim2025learning,
title={Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation},
author={Hyung Kyu Kim and Hak Gu Kim},
booktitle={Proc. INTERSPEECH},
year={2025},
organization={ISCA}
}