With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
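The abstract states that SimVLM is trained end-to-end with a single prefix language modeling (PrefixLM) objective. The snippet below is a minimal sketch of how such an objective can be set up, assuming a decoder-style model whose input sequence starts with a prefix (e.g. image patches plus leading text tokens) that is attended to bidirectionally, while the remaining tokens are predicted autoregressively. The function names, tensor shapes, and PyTorch framing here are illustrative assumptions, not the authors' implementation.

```python
# Minimal PrefixLM sketch (illustrative, not the SimVLM codebase).
import torch
import torch.nn.functional as F

def prefix_lm_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = position may be attended to.

    Positions inside the prefix attend to each other bidirectionally;
    positions after the prefix attend causally (and to the full prefix).
    """
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal lower triangle
    mask[:, :prefix_len] = True  # the prefix is fully visible to every position
    return mask

def prefix_lm_loss(logits: torch.Tensor, targets: torch.Tensor, prefix_len: int) -> torch.Tensor:
    """Cross-entropy computed only on the tokens that follow the prefix.

    logits:  (batch, seq_len, vocab_size) next-token predictions from the model.
    targets: (batch, seq_len) token ids of the full input sequence.
    """
    # Standard next-token shift, then drop positions that fall inside the prefix.
    pred = logits[:, prefix_len - 1 : -1, :]   # predictions for suffix tokens
    gold = targets[:, prefix_len:]             # the suffix tokens themselves
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), gold.reshape(-1))
```

In this sketch, no loss is applied to the prefix itself; only the text continuation contributes to training, which is what allows a single objective to cover both the visual context and the generative text modeling described above.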