Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker's speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker, which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style with audio input alone, maximizing usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (i.e., Memorizing), and the second stage performs personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance gains for personalized facial animation over state-of-the-art methods.
Previous studies on personalized 3D facial animation have practical limitations. (a) Methods based on one-hot encoding require speaker identity classes during inference, making it impossible for them to handle new, unseen speakers; this severely limits their practical application. (b) Methods based on a sequence of 3D facial meshes control for different speaking styles; while they can capture the styles of unseen speakers, they are impractical for real-world applications because they require this additional 3D mesh data as input during inference. (c) Our proposed MemoryTalker overcomes these limitations. It is designed to reflect a speaker's style using only the audio input. MemoryTalker does not require any prior information, such as speaker ID classes or reference facial meshes, at inference time, making it a more practical approach for real-world applications.
We propose MemoryTalker, a novel framework for personalized speech-driven 3D facial animation that can reflect an individual's speaking style using only audio input. Our method uses a two-stage training strategy: 1) Memorizing general facial motions and 2) Animating personalized motions using a stylized memory.
Stage 1: Storing and Recalling General Facial Motion (Memorizing) In the first stage, our goal is to create a Motion Memory that stores general facial motion features. To achieve this, we use the text representation extracted from an audio signal by an Automatic Speech Recognition (ASR) model as a query to access the memory. This allows the model to map the facial motions of various speakers for a single phoneme to a consistent representation, ensuring that it learns general and speaker-neutral facial movements (e.g., the common lip shape for the word "who"). The recalled general motion feature is then passed to a motion decoder to synthesize a base facial animation.
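The memory recall in Stage 1 can be illustrated as a soft-addressing read: per-frame ASR text features act as queries, and the general motion feature is an attention-weighted sum of learnable memory slots. The sketch below is a minimal NumPy illustration under assumed shapes and names (`recall_motion`, slot count, feature dimension are ours, not the paper's); the actual model is a trained neural network.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recall_motion(query, memory):
    """Recall general motion features by soft-addressing a motion memory.

    query:  (T, d) per-frame text representations from the ASR encoder
    memory: (S, d) S learnable motion-memory slots
    returns (T, d) recalled features: attention-weighted sums of slots
    """
    # similarity between each frame's query and every memory slot
    weights = softmax(query @ memory.T / np.sqrt(memory.shape[1]))  # (T, S)
    return weights @ memory

# toy example: 10 frames, 64-dim features, 32 memory slots (illustrative sizes)
rng = np.random.default_rng(0)
q = rng.standard_normal((10, 64))
M = rng.standard_normal((32, 64))
general_motion = recall_motion(q, M)  # passed on to the motion decoder
```

Because the query comes from speaker-neutral text (phoneme) content rather than raw audio, frames of different speakers saying the same phoneme address similar slots, which is what encourages the memory to store general, speaker-neutral motion.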
Stage 2: Generating Personalized Facial Motion (Animating) In the second stage, we synthesize a personalized facial animation that reflects the speaker's unique style. To do this, we first encode a speaking style feature from the input audio using a style encoder trained with a triplet loss, which helps distinguish between different speakers' styles. This style feature is then used to refine and update the motion memory from Stage 1, creating a Stylized Motion Memory. This stylized memory now contains personalized motion information. Finally, the personalized motion feature is recalled and used by the motion decoder to generate the final 3D facial animation that accurately reflects both the speech content and the individual's speaking style.
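Two pieces of Stage 2 lend themselves to a short sketch: the triplet loss that separates speaking-style embeddings, and the use of the style feature to modulate the motion memory. The code below is an illustrative NumPy sketch; `stylize_memory` uses a FiLM-style scale/shift modulation as a stand-in, and the function names and weight matrices (`W_scale`, `W_shift`) are our assumptions, not the paper's exact update rule.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on style embeddings: pull clips of the same speaker
    together, push clips of different speakers at least `margin` apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def stylize_memory(memory, style, W_scale, W_shift):
    """Illustrative memory stylization (assumed FiLM-like modulation):
    the style feature predicts a per-dimension scale and shift that
    refines every slot of the general motion memory.

    memory: (S, d) general motion-memory slots from Stage 1
    style:  (k,)   speaking-style feature from the style encoder
    """
    scale = style @ W_scale  # (d,)
    shift = style @ W_shift  # (d,)
    return memory * (1.0 + scale) + shift
```

Recalling from the stylized memory with the same text-feature queries as in Stage 1 then yields personalized rather than generic motion features for the decoder.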
This figure presents a qualitative comparison of our MemoryTalker against state-of-the-art speech-driven 3D facial animation methods on the VOCASET and BIWI datasets.
This analysis demonstrates the effectiveness of our proposed two-stage training strategy by visualizing the facial dynamics for the word "Stab". The model trained only with the 1st stage (top row, 'w/o 2-stage') produces animations with limited and less expressive lip motion. In contrast, after incorporating the 2nd stage for personalization (middle row, 'w/ 2-stage'), the model generates much more dynamic and accurate movements that closely match the reference. This visually confirms that our two-stage approach is crucial for capturing the subtle dynamics of realistic facial expressions.
This t-SNE visualization illustrates how our two-stage training effectively disentangles unique speaker styles. In the left plot (a), showing the feature distribution without the 2nd stage, the motion features from different speakers (represented by different colors) are heavily overlapped and poorly separated. This indicates the model is only learning generic, non-personalized motions. However, the right plot (b) shows the result after our full two-stage training. The features for each speaker form distinct, well-defined clusters. This proves that our model successfully learns to capture and separate personalized speaking styles from audio alone.
@inproceedings{MemoryTalker,
  author    = {Kim, Hyung Kyu and Lee, Sangmin and Kim, Hak Gu},
  title     = {MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {n-n+7}
}