We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL aims to construct the detailed semantic connections (objects, attributes of objects, and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction, in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model joint representations that characterize the alignments of the detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. After fine-tuning, ERNIE-ViL achieves state-of-the-art performance on 5 vision-language downstream tasks. Furthermore, it ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
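To make the idea of node-level prediction concrete, below is a minimal Python sketch of how masking targets for the three Scene Graph Prediction tasks could be derived from a scene graph parsed out of a caption. The SceneGraph structure, the scene_graph_masking_targets function, and the 30% mask rate are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical, simplified scene-graph structure; in practice the graph
# would come from an off-the-shelf scene graph parser applied to the caption.
@dataclass
class SceneGraph:
    objects: List[str]                          # e.g. ["cat", "car"]
    attributes: List[Tuple[str, str]]           # (object, attribute), e.g. ("car", "red")
    relationships: List[Tuple[str, str, str]]   # (subject, predicate, object)

def scene_graph_masking_targets(tokens: List[str],
                                graph: SceneGraph,
                                mask_rate: float = 0.3) -> Dict[str, List[int]]:
    """Select caption token positions to mask for the Object, Attribute and
    Relationship Prediction pre-training tasks (selection scheme is assumed)."""
    def positions(words: set) -> List[int]:
        # Indices of caption tokens that correspond to scene-graph nodes.
        return [i for i, tok in enumerate(tokens) if tok in words]

    object_words = set(graph.objects)
    attribute_words = {attr for _, attr in graph.attributes}
    relation_words = {pred for _, pred, _ in graph.relationships}

    targets: Dict[str, List[int]] = {}
    for task, words in (("object", object_words),
                        ("attribute", attribute_words),
                        ("relationship", relation_words)):
        candidates = positions(words)
        if not candidates:
            targets[task] = []
            continue
        k = max(1, int(len(candidates) * mask_rate))
        targets[task] = sorted(random.sample(candidates, k))
    return targets

if __name__ == "__main__":
    caption = "the black cat sits on the red car".split()
    sg = SceneGraph(objects=["cat", "car"],
                    attributes=[("cat", "black"), ("car", "red")],
                    relationships=[("cat", "on", "car")])
    # Masked positions would then be predicted by the model from the image
    # and the remaining text, encouraging cross-modal semantic alignment.
    print(scene_graph_masking_targets(caption, sg))
```

The design intent, as described in the abstract, is that masking tokens tied to scene-graph nodes (rather than random tokens) forces the model to recover objects, attributes, and relationships from the paired image, yielding finer-grained cross-modal alignment.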