We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most vision-and-language downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset with three tasks: masked language modeling with visual clues, masked RoI classification with linguistic clues, and sentence-image relationship prediction. Extensive empirical analysis demonstrates that the pre-training procedure better aligns the visual-linguistic clues and benefits downstream tasks such as visual question answering, visual commonsense reasoning, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.
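To make the unified input concrete, the following is a minimal sketch (not the authors' code) of a VL-BERT-style encoder: each input element is either a word token or an RoI visual feature, and both are embedded into a shared space before being fed to a single Transformer. The class name, dimensions, and the linear vision projection are illustrative assumptions.

```python
# A hedged sketch of a joint visual-linguistic Transformer input, assuming
# pooled RoI features (e.g., 2048-d from a detector) and word token ids.
import torch
import torch.nn as nn

class VisualLinguisticEncoderSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, visual_feat_dim=2048,
                 num_layers=4, num_heads=12):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden)      # linguistic elements
        self.visual_proj = nn.Linear(visual_feat_dim, hidden)   # RoI features -> hidden
        self.segment_embed = nn.Embedding(2, hidden)             # 0 = text, 1 = image RoI
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, roi_features):
        # token_ids:    (batch, num_tokens)               word indices from the sentence
        # roi_features: (batch, num_rois, visual_feat_dim) pooled RoI features
        text = self.word_embed(token_ids)
        vis = self.visual_proj(roi_features)
        seg_text = self.segment_embed(torch.zeros_like(token_ids))
        seg_vis = self.segment_embed(
            torch.ones(roi_features.shape[:2], dtype=torch.long,
                       device=roi_features.device))
        # One unified sequence of linguistic and visual elements.
        x = torch.cat([text + seg_text, vis + seg_vis], dim=1)
        return self.encoder(x)

# Usage: 8 word tokens and 4 RoIs per example.
model = VisualLinguisticEncoderSketch()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 4, 2048))
print(out.shape)  # torch.Size([2, 12, 768])
```

The pre-training objectives then operate on this joint sequence: masked language modeling predicts masked words using the RoI elements as visual clues, and masked RoI classification predicts the category of masked regions using the words as linguistic clues.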