In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into a common space, establishing vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains one million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available on the project page https://casia-iva-group.github.io/projects/VALOR.
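To make the MGA objective concrete, the following is a minimal sketch, assuming an InfoNCE-style symmetric contrastive loss over pooled single-vector embeddings and a simple additive fusion for the audiovisual group; the temperature, pooling, and fusion strategy here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings x, y of shape (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mga_loss(v, a, t, temperature=0.07):
    """Multimodal Grouping Alignment (sketch): align text with the vision group,
    the audio group, and a fused audiovisual group in one shared space."""
    # Assumed fusion: sum of normalized vision and audio embeddings.
    av = F.normalize(v, dim=-1) + F.normalize(a, dim=-1)
    return (contrastive_loss(v, t, temperature) +
            contrastive_loss(a, t, temperature) +
            contrastive_loss(av, t, temperature)) / 3.0

# Toy usage with random per-sample embeddings for each modality.
B, D = 8, 256
v, a, t = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
print(mga_loss(v, a, t))
```

Averaging the three group-wise losses treats vision-language, audio-language, and audiovisual-language alignment as equally weighted objectives; MGC would analogously apply a captioning loss under each conditioning group.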