VL-BERT: Pre-training of Generic Visual-Linguistic Representations

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region of interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure better aligns visual-linguistic clues and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension. Notably, VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.
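
To make the input format concrete, the following is a minimal sketch of how words and RoIs might be laid out as a single sequence for a Transformer. It is an illustration only, not the released implementation: the `InputElement` class, the `build_sequence` helper, the marker tokens (`[CLS]`, `[SEP]`, `[IMG]`, `[END]`), and the two-dimensional feature vectors are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class InputElement:
    """One position in the joint input sequence (hypothetical structure)."""
    kind: str                      # "word", "roi", or "special"
    token: str                     # word token, or a placeholder marker for an RoI
    visual: Optional[List[float]]  # RoI feature vector; None for non-visual elements


def build_sequence(words: Sequence[str],
                   roi_features: Sequence[Sequence[float]]) -> List[InputElement]:
    """Lay out sentence words and image RoIs as one input sequence.

    The marker tokens used here are assumptions made for this sketch.
    """
    seq = [InputElement("special", "[CLS]", None)]
    # Linguistic elements: one per word of the input sentence.
    seq += [InputElement("word", w, None) for w in words]
    seq.append(InputElement("special", "[SEP]", None))
    # Visual elements: one per region of interest of the input image.
    seq += [InputElement("roi", "[IMG]", list(f)) for f in roi_features]
    seq.append(InputElement("special", "[END]", None))
    return seq


words = ["a", "dog", "on", "grass"]
rois = [[0.1, 0.9], [0.4, 0.2]]  # two made-up RoI feature vectors
sequence = build_sequence(words, rois)
kinds = [e.kind for e in sequence]
```

In this sketch every position is tagged as a word, an RoI, or a marker, so a downstream embedding layer could choose which features to attach to each one. How the actual model combines token, visual, segment, and position embeddings is not specified in this abstract.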
