The ability to read and reason about text in an image is often lacking in vision-and-language (V&L) models. How can we learn V&L models that exhibit strong scene-text understanding (STU)? In this paper, we propose PreSTU, a simple pre-training recipe specifically designed for scene-text understanding. PreSTU combines a simple OCR-aware pre-training objective with a large-scale image-text dataset annotated with off-the-shelf OCR signals. We empirically demonstrate the superiority of this pre-training objective on TextVQA, TextCaps, ST-VQA, and VizWiz-VQA. We also study which factors affect STU performance, highlighting the importance of image resolution and dataset scale during pre-training.