Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the remaining text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled for masking. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pretraining the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low-resource settings. Further, our pretraining approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data.
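To make the first shortcoming concrete, the following is a minimal sketch of random token masking, not the paper's implementation. It assumes each token is masked independently with a fixed rate of 15% (the rate commonly used for BERT-style MLM; the abstract does not state the rate) and uses hypothetical helper names `prob_no_mask` and `mask_tokens`. Under these assumptions, the chance that a seven-token caption receives no mask at all is 0.85^7, or about 0.32, which is in line with the "a third of the sentences" observation.

```python
import random


def prob_no_mask(num_tokens: int, mask_rate: float = 0.15) -> float:
    """Probability that independent per-token masking selects no token.

    Assumes every token is masked independently with probability mask_rate.
    """
    return (1.0 - mask_rate) ** num_tokens


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace each token with "[MASK]" independently with probability mask_rate.

    The "[MASK]" string and the default rate are illustrative assumptions.
    """
    rng = rng or random.Random()
    return ["[MASK]" if rng.random() < mask_rate else tok for tok in tokens]


if __name__ == "__main__":
    caption = "a dog runs on the beach .".split()  # 7 tokens
    print(round(prob_no_mask(len(caption)), 2))
    print(mask_tokens(caption, rng=random.Random(0)))
```

Because the probability of an unmasked sentence decays exponentially with length, short captions are affected far more than the longer passages typical of text-only pretraining.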