Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on such parallel data, which we term \emph{bimodal supervision}, and use models that are pretrained on each modality independently. We experiment with a high-performing vision-language model, and analyze the effect of bimodal supervision on three vision-language tasks. We find that on simpler tasks, such as VQAv2 and GQA, one can eliminate bimodal supervision completely, suffering only a minor loss in performance. Conversely, for NLVR2, which requires more complex reasoning, training without bimodal supervision leads to random performance. Nevertheless, using only 5\% of the bimodal data (142K images along with their captions), or leveraging weak supervision in the form of a list of machine-generated labels for each image, leads to only a moderate degradation compared to using 3M image-text pairs: 74\%$\rightarrow$$\sim$70\%. Our code is available at https://github.com/eladsegal/less-bimodal-sup.