Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims to learn cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, achieve performance comparable to that of some VLP models trained with aligned pairs on various V-L downstream tasks. This, however, is not the case for cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the limited semantics of object tags. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model and does not require images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts prior W-VLP methods, e.g. U-VisualBERT (U-VB), across a variety of V-L tasks such as XMR and Visual Question Answering. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on the two popular datasets Flickr30K and MSCOCO, and it gains at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, on the other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate the greater generalization of the proposed W-VLP model with WFH.
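To make the core idea concrete, the sketch below illustrates one plausible way a text-to-visual feature hallucinator could be realized; the abstract does not specify the architecture, so the module name, the use of a learned visual-vocabulary embedding table, the attention-based mixing, and all dimensions here are assumptions for illustration only, not the authors' implementation.

```python
# A minimal sketch (not the authors' WFH implementation) of hallucinating
# visual features from text: token states from a text encoder are projected
# into the visual space and attend over an assumed "visual vocabulary" of
# prototype region features, producing hallucinated visual inputs that can
# be paired with the originally unpaired caption during pre-training.
import torch
import torch.nn as nn

class FeatureHallucinator(nn.Module):
    def __init__(self, text_dim=768, visual_dim=2048, vocab_size=1000):
        super().__init__()
        # Hypothetical visual vocabulary, e.g. cluster centroids of region features.
        self.visual_vocab = nn.Parameter(torch.randn(vocab_size, visual_dim))
        self.query_proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_states):             # (batch, seq_len, text_dim)
        queries = self.query_proj(text_states)   # (batch, seq_len, visual_dim)
        # Softmax attention over the vocabulary mixes prototypes per token.
        attn = torch.softmax(
            queries @ self.visual_vocab.t() / self.visual_vocab.size(-1) ** 0.5,
            dim=-1,
        )
        return attn @ self.visual_vocab          # hallucinated visual features

# Usage: the hallucinated features stand in for region features of a paired
# image, so the V-L transformer sees both modalities without paired data.
hallucinator = FeatureHallucinator()
text_states = torch.randn(2, 16, 768)            # dummy text encoder outputs
visual_inputs = hallucinator(text_states)         # (2, 16, 2048)
```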