ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

Abstract

Overview of ArtBoost: (a) large-scale synthetic articulatory augmentation from audio-aligned 3D facial mesh sequences, and (b) training the AAI model using the augmented articulatory supervision.

Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech–mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments on HPRC and USC-TIMIT show consistent improvements in Pearson correlation coefficient (PCC) and root mean square error (RMSE). Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech–mesh data provide an effective and scalable source of articulatory supervision for AAI.

Proposed Method

The ArtBoost pipeline: ASR-guided segmentation produces utterance-level clips, facial anchors are tracked to obtain articulatory trajectories, and the pseudo labels are used to pre-train the AAI model prior to EMA adaptation.

ArtBoost turns abundant speech–mesh recordings into pseudo articulatory supervision through three steps, expanding the effective training space without any additional sensor-based recordings.

1) ASR-guided Utterance Segmentation. Speech–mesh datasets consist of long, continuous video-level recordings, whereas EMA corpora are organized at the utterance level. We run automatic speech recognition (ASR) to obtain word-level timestamps and group consecutive words into utterance candidates, starting a new candidate whenever the inter-word silence exceeds a threshold or a maximum word count is reached. This produces synchronized utterance-level speech–mesh pairs.

2) Pseudo Articulatory Trajectory Extraction. For each utterance clip, we track visible facial anchors corresponding to three articulators: the upper lip (UL), lower lip (LL), and lower incisor (LI). Each anchor is the mean position of a predefined vertex region, which reduces mesh noise. Following the conventional EMA representation, we retain the protrusion (z) and mouth-opening (y) motion components, assemble a 12-channel target in which channels without a visible anchor are set to zero, and resample to the target articulatory frame rate.

3) Pre-training then Fine-tuning. We first pre-train the AAI model on the pseudo trajectories using a channel-masked loss that supervises only the visible UL/LL/LI channels. We then fine-tune on real EMA trajectories with full-channel supervision. This lets the model learn a strong prior of visible articulatory motion from large-scale pseudo data before refining it with complete ground-truth EMA signals.
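The three steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the silence threshold (0.5 s), the word cap (20), the 12-channel ordering in `ARTICULATORS`, the vertex indices in `ANCHOR_VERTICES`, linear interpolation for resampling, and the use of mean squared error as the masked loss are all hypothetical choices that the text does not specify.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float  # seconds


def segment_utterances(words, max_silence=0.5, max_words=20):
    """Step 1: split a word stream on long silences or a word-count cap."""
    utterances, current = [], []
    for word in words:
        too_long = len(current) >= max_words
        long_gap = bool(current) and word.start - current[-1].end > max_silence
        if current and (too_long or long_gap):
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances


# Assumed layout: six articulators x (protrusion, opening) = 12 channels.
# The three tongue-sensor names are placeholders for the non-visible slots.
ARTICULATORS = ["TR", "TB", "TT", "UL", "LL", "LI"]
# Hypothetical vertex regions; real indices depend on the mesh topology.
ANCHOR_VERTICES = {"UL": [0, 1], "LL": [2, 3], "LI": [4, 5]}


def pseudo_target(vertices):
    """Step 2: build a (T, 12) target and a (12,) visibility mask.

    vertices has shape (T, V, 3) with coordinates ordered (x, y, z).
    """
    n_frames = vertices.shape[0]
    target = np.zeros((n_frames, 2 * len(ARTICULATORS)))
    mask = np.zeros(2 * len(ARTICULATORS), dtype=bool)
    for i, name in enumerate(ARTICULATORS):
        region = ANCHOR_VERTICES.get(name)
        if region is None:
            continue  # no visible anchor: channels stay zero and unsupervised
        anchor = vertices[:, region, :].mean(axis=1)  # region mean, (T, 3)
        target[:, 2 * i] = anchor[:, 2]  # z: protrusion
        target[:, 2 * i + 1] = anchor[:, 1]  # y: mouth opening
        mask[2 * i : 2 * i + 2] = True
    return target, mask


def resample(target, src_fps, dst_fps):
    """Linearly interpolate each channel to the target frame rate."""
    n_src = target.shape[0]
    n_dst = int(round((n_src - 1) * dst_fps / src_fps)) + 1
    t_src = np.arange(n_src) / src_fps
    t_dst = np.arange(n_dst) / dst_fps
    columns = [np.interp(t_dst, t_src, target[:, c]) for c in range(target.shape[1])]
    return np.stack(columns, axis=1)


def masked_mse(pred, target, mask):
    """Step 3: mean squared error over the supervised channels only."""
    diff = pred[:, mask] - target[:, mask]
    return float(np.mean(diff**2))
```

In this sketch, pre-training would call `masked_mse` with the visibility mask returned by `pseudo_target`, so errors on the zero-filled channels never contribute to the loss. Fine-tuning on real EMA would pass an all-`True` mask, which reduces to ordinary full-channel supervision.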

Quantitative Results

Under leave-one-speaker-out (unseen-speaker) evaluation, ArtBoost augmentation consistently improves both the Pearson correlation coefficient (PCC ↑) and root mean square error (RMSE ↓) on two AAI architectures (SSL-AAI and SI-AAI) and two benchmarks (HPRC and USC-TIMIT). Gains are most pronounced on USC-TIMIT, where ground-truth EMA data are scarcest, indicating that ArtBoost is especially effective under limited supervision.
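For reference, the two metrics can be computed per channel as in the sketch below. The function names and the averaging convention (one score per held-out speaker, then a plain mean over speakers) are assumptions for illustration; the text only states that results are averaged over unseen speakers.

```python
import numpy as np


def pcc_per_channel(pred, target):
    """Pearson correlation between prediction and ground truth, per channel.

    pred and target have shape (T, C). A constant channel yields NaN because
    its variance is zero.
    """
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    denom = np.sqrt((p**2).sum(axis=0) * (t**2).sum(axis=0))
    return (p * t).sum(axis=0) / denom


def rmse_per_channel(pred, target):
    """Root mean square error over time, per channel."""
    return np.sqrt(((pred - target) ** 2).mean(axis=0))


def mean_over_speakers(per_speaker_scores):
    """Average one scalar score per held-out speaker (assumed convention)."""
    return float(np.mean(list(per_speaker_scores.values())))
```

PCC is invariant to the offset and positive scale of the prediction, whereas RMSE is not, which is one reason the two metrics are reported together.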

Model	Dataset	PCC (↑) w/o	PCC (↑) w/ ArtBoost	RMSE (↓) w/o	RMSE (↓) w/ ArtBoost
SSL-AAI	HPRC	0.678	0.698	0.736	0.717
SSL-AAI	USC-TIMIT	0.351	0.510	0.864	0.792
SI-AAI	HPRC	0.717	0.732	0.706	0.689
SI-AAI	USC-TIMIT	0.488	0.593	0.917	0.817

Overall results (mean over unseen speakers). Within each model/dataset pair, the better score is the one obtained with ArtBoost in every row.
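The size of the gains can be read off the table with simple arithmetic. The short script below copies the reported values and computes the absolute change per metric; the dictionary layout and the `gains` helper are illustrative only.

```python
# (without, with ArtBoost) pairs copied from the results table.
RESULTS = {
    ("SSL-AAI", "HPRC"): {"PCC": (0.678, 0.698), "RMSE": (0.736, 0.717)},
    ("SSL-AAI", "USC-TIMIT"): {"PCC": (0.351, 0.510), "RMSE": (0.864, 0.792)},
    ("SI-AAI", "HPRC"): {"PCC": (0.717, 0.732), "RMSE": (0.706, 0.689)},
    ("SI-AAI", "USC-TIMIT"): {"PCC": (0.488, 0.593), "RMSE": (0.917, 0.817)},
}


def gains(results):
    """Return the absolute change (with minus without) per metric."""
    return {
        key: {name: round(w - wo, 3) for name, (wo, w) in metrics.items()}
        for key, metrics in results.items()
    }


for (model, dataset), delta in gains(RESULTS).items():
    print(f"{model:8s} {dataset:10s} PCC {delta['PCC']:+.3f}  RMSE {delta['RMSE']:+.3f}")
```

On USC-TIMIT the PCC rises by 0.159 (SSL-AAI) and 0.105 (SI-AAI), compared with 0.020 and 0.015 on HPRC. The RMSE drops by 0.072 and 0.100 on USC-TIMIT, compared with 0.019 and 0.017 on HPRC.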

Articulator-wise PCC across EMA trajectories, comparing models with and without ArtBoost augmentation.

Articulator-wise PCC. Although pseudo trajectories are built only for the visible anchors (UL/LL/LI), ArtBoost improves prediction across multiple articulators, showing that the augmentation enhances the learned articulatory representation beyond the directly supervised channels.

Articulatory Trajectory Comparison

Qualitative comparison of predicted and ground-truth EMA trajectories on HPRC across protrusion and aperture channels for multiple articulators.

We compare predicted and ground-truth EMA trajectories on HPRC. Across the protrusion (“-X”) and aperture (“-Y”) channels for multiple articulators, the predicted trajectories follow the overall temporal trends of the ground truth, capturing peak movements and transition patterns. The model produces temporally coherent and physically plausible articulatory motion consistent with expected speech dynamics.

Pseudo Articulatory Trajectory Analysis

Visualization of pseudo trajectories (LI/UL/LL) extracted from speech–mesh data alongside the corresponding 3D facial mesh renderings.

This visualization aligns synchronized audio, the pseudo LI/UL/LL trajectory signals, and the 3D facial mesh frames, showing how pseudo articulatory trajectories are derived from speech–mesh data and how they correspond to visible facial motion. During bilabial closure, reduced lip opening appears in the Y-direction components, while protrusion-related motion appears in the X-direction. The temporal alignment between mesh deformation and trajectory variation confirms that the mesh-derived signals encode physically interpretable articulatory dynamics, supporting their use as training targets in the absence of direct EMA measurements.

BibTeX

@inproceedings{kim2026artboost,
  author    = {Kim, Hyung Kyu and Hwang, Byungchan and Kim, Hak Gu},
  title     = {ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion},
  booktitle = {Proc. Interspeech},
  year      = {2026}
}