We present a pre-training approach for vision-and-language transformer models based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which requires no additional supervision, and object-aware strategies for pre-training the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.