Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on such parallel data, which we term \emph{bimodal supervision}, and use models that are pretrained on each modality independently. We experiment with a high-performing vision-language model, and analyze the effect of bimodal supervision on three vision-language tasks. We find that on simpler tasks, such as VQAv2 and GQA, one can eliminate bimodal supervision completely, suffering only a minor loss in performance. Conversely, for NLVR2, which requires more complex reasoning, training without bimodal supervision leads to random performance. Nevertheless, using only 5\% of the bimodal data (142K images along with their captions), or leveraging weak supervision in the form of a list of machine-generated labels for each image, leads to only a moderate degradation compared to using 3M image-text pairs: 74\%$\rightarrow$$\sim$70\%. Our code is available at https://github.com/eladsegal/less-bimodal-sup.