The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability as a training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and to connect what is recognized to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets paired with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on four visual question answering and two image captioning benchmarks.
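To make the OCR-aware objective concrete, the sketch below shows one plausible way to turn a caption plus OCR output into an encoder prompt and a decoder target: the model first transcribes the scene text, then connects it to the rest of the image content. The function name, prompt string, and target format are illustrative assumptions, not the paper's exact recipe.

```python
def build_ocr_aware_example(caption, ocr_tokens,
                            prompt="Generate the scene text and describe the image:"):
    """Hypothetical construction of one PreSTU-style pre-training pair.

    The encoder would see the image features (not modeled here) plus
    `prompt`; the decoder target first lists the OCR-recognized scene
    text, then the caption that grounds it in the image content.
    """
    scene_text = " ".join(ocr_tokens)  # OCR tokens from an off-the-shelf system
    target = f"{scene_text} | {caption}"  # "|" separator is an assumption
    return prompt, target

# Example: a street-scene image whose OCR system detected the word "STOP"
src, tgt = build_ocr_aware_example("a stop sign at an intersection", ["STOP"])
print(src)
print(tgt)
```

Under this framing, the transcription span supervises text recognition while the caption span supervises linking that text to the surrounding visual content.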