Visual grounding is a promising path toward more robust and accurate Natural Language Processing (NLP) models. Many multimodal extensions of BERT (e.g., VideoBERT, LXMERT, VL-BERT) allow joint modeling of text and images, leading to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks (language modeling and classification), with the expectation that multimodal pretraining provides a grounding that improves text processing accuracy. We propose two types of strategies to this end. The first, referred to as {\it transferred grounding}, consists in applying multimodal models to text-only tasks, replacing the image input with a placeholder. The second, which we call {\it associative grounding}, harnesses image retrieval to match texts with related images during both pretraining and text-only downstream tasks. We draw further distinctions within both strategies and then compare them according to their impact on language modeling and commonsense-related downstream tasks, showing improvements over text-only baselines.