SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation

We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame: identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across these four task families, which require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
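
To make the contrastive objective concrete, the sketch below shows one common way such a loss could be written: a symmetric InfoNCE over T time steps, where the vocal-motion token and the audio token at the same step form the positive pair and all other pairings act as negatives. The abstract does not state the exact loss form, so the InfoNCE formulation, the cosine similarity, the `temperature` value, and the function names `l2_normalize` and `info_nce` are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(visual_tokens, audio_tokens, temperature=0.07):
    """Symmetric InfoNCE loss over T time steps (illustrative assumption).

    visual_tokens[t] and audio_tokens[t] are the time-aligned (positive) pair;
    every pairing (t, s) with s != t is treated as a misaligned negative.
    """
    if not visual_tokens or len(visual_tokens) != len(audio_tokens):
        raise ValueError("need the same non-zero number of visual and audio tokens")

    v = [l2_normalize(x) for x in visual_tokens]
    a = [l2_normalize(x) for x in audio_tokens]
    steps = len(v)

    # logits[i][j] = cosine similarity between visual step i and audio step j,
    # divided by the temperature.
    logits = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(steps)]
        for i in range(steps)
    ]

    def cross_entropy_diag(rows):
        """Mean cross-entropy where the correct class of row i is column i."""
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    visual_to_audio = cross_entropy_diag(logits)
    audio_to_visual = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (visual_to_audio + audio_to_visual)


# Toy usage: three steps with orthogonal tokens.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted_audio = aligned_audio[1:] + aligned_audio[:1]  # audio offset by one step

aligned_loss = info_nce(visual, aligned_audio)
shifted_loss = info_nce(visual, shifted_audio)
```

In this toy setup the time-aligned streams produce a much smaller loss than the streams offset by one step, which is the property a synchronization-aware objective of this kind is meant to reward.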
