We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The design fits most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure better aligns visual and linguistic clues and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.
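To make the described input format concrete, the following is a minimal sketch, assuming a PyTorch-style implementation; all dimensions, module names, and the toy batch are illustrative assumptions and not the released code. It shows how sentence tokens and RoI features could be embedded into a common space and processed jointly by a single Transformer, with each input element being either a word or an RoI.

```python
# Minimal sketch (not the authors' implementation) of a joint word/RoI input
# sequence for a Transformer backbone. hidden_dim, visual_feat_dim, and all
# module names are illustrative assumptions.
import torch
import torch.nn as nn

hidden_dim, vocab_size, visual_feat_dim = 768, 30522, 2048

word_embed    = nn.Embedding(vocab_size, hidden_dim)    # embeds sentence tokens
visual_embed  = nn.Linear(visual_feat_dim, hidden_dim)  # projects RoI features (e.g. from a detector)
segment_embed = nn.Embedding(2, hidden_dim)             # 0 = linguistic element, 1 = visual element

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
    num_layers=12,
)

# toy batch: one sentence of 6 tokens paired with 4 RoIs from its image
token_ids = torch.randint(0, vocab_size, (1, 6))
roi_feats = torch.randn(1, 4, visual_feat_dim)

text_part   = word_embed(token_ids)   + segment_embed(torch.zeros(1, 6, dtype=torch.long))
visual_part = visual_embed(roi_feats) + segment_embed(torch.ones(1, 4, dtype=torch.long))

# every element of the joint sequence is either a word or an RoI;
# the Transformer attends over all of them together
joint_input = torch.cat([text_part, visual_part], dim=1)  # shape (1, 10, hidden_dim)
output = encoder(joint_input)                              # joint visual-linguistic representation
```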