This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting the alignment strength according to the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a 53% relative reduction in Fréchet Distance (FD), a 29% relative reduction in Fréchet Audio Distance (FAD), and 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.
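As an illustration only, the sketch below shows one plausible form of TRA's timestep-adaptive alignment term. The abstract does not specify the weighting schedule or the alignment metric, so the linear weight over the flow timestep, the cosine-similarity alignment, and the function name `timestep_adaptive_alignment_loss` are all assumptions made for clarity, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def timestep_adaptive_alignment_loss(latent, target_repr, t):
    """Hypothetical sketch of a timestep-adaptive alignment term (TRA-style).

    latent      : intermediate latent representation, shape (B, D)
    target_repr : reference representation to align against, shape (B, D)
    t           : flow/diffusion timesteps in [0, 1], shape (B,)

    The weight w(t) is an assumed linear schedule (not from the paper):
    alignment is weak at high noise (t near 1) and strong near the
    clean end of the trajectory (t near 0).
    """
    w = 1.0 - t                                        # assumed schedule
    cos = F.cosine_similarity(latent, target_repr, dim=-1)
    return (w * (1.0 - cos)).mean()                    # weighted alignment penalty

# Usage: this term would be added to the flow-matching training objective.
B, D = 4, 128
latent = torch.randn(B, D)
target = torch.randn(B, D)
t = torch.rand(B)
loss = timestep_adaptive_alignment_loss(latent, target, t)
print(loss.item())
```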