Pre-trained language models have recently contributed to significant advances in NLP tasks. More recently, multi-modal versions of BERT have been developed, relying on heavy pre-training over vast corpora of aligned textual and visual data, and primarily applied to classification tasks such as VQA. In this paper, we evaluate the visual capabilities of BERT out of the box, without pre-training on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, which allows us to assess the impact of each modality (since the input can be visual and/or textual). Moreover, the generative nature of the task requires an adaptation, since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation that can leverage either mono- or multi-modal representations. The results reported under different configurations indicate an innate capacity of BERT-gen to adapt to multi-modal data and to text generation, even with little data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state of the art on two established VQG datasets.