Generative face video coding (GFVC) is vital for modern applications such as video conferencing, yet existing methods focus primarily on compressing visual motion while neglecting the significant bitrate contribution of the accompanying audio. Despite the well-established correlation between speech audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this gap, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. The framework extracts compact motion information from video and tokenizes audio features, then aligns the two through a unified audio-video diffusion process, enabling synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.
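The pipeline the abstract describes — motion extraction, audio tokenization, and fusion into a shared representation — can be sketched at a toy level as below. All function names, shapes, and the nearest-codebook quantizer are illustrative assumptions for exposition, not the paper's actual modules; the real framework uses learned encoders and a diffusion model for alignment.

```python
import numpy as np

# Hypothetical sketch of the AVCC stages named in the abstract.
# Real AVCC uses learned networks; these toy stand-ins only show the data flow.

def extract_motion(video_frames: np.ndarray) -> np.ndarray:
    # Toy "motion features": frame-to-frame differences (T-1, H, W).
    return np.diff(video_frames, axis=0)

def tokenize_audio(audio: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # Toy tokenizer: map each sample to its nearest codebook index.
    dists = np.abs(audio[:, None] - codebook[None, :])
    return dists.argmin(axis=1)

def joint_representation(motion: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    # Shared representation: concatenate pooled per-frame motion summaries
    # with the audio token sequence (stand-in for diffusion-based alignment).
    return np.concatenate([motion.mean(axis=(1, 2)), tokens.astype(float)])

video = np.random.rand(5, 4, 4)     # 5 toy frames of 4x4 "pixels"
audio = np.random.rand(8)           # 8 toy audio samples
codebook = np.linspace(0.0, 1.0, 16)  # 16-entry toy codebook

z = joint_representation(extract_motion(video),
                         tokenize_audio(audio, codebook))
print(z.shape)  # 4 motion summaries + 8 audio tokens -> (12,)
```

Both modalities would then be decoded from `z`; at extreme rates, one modality's share of `z` could drive reconstruction of the other, as the abstract claims.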