Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets -- only a small portion of the text in each image participates in the annotated QA activities. This results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the rich text already available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which helps improve Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code will be made publicly available.
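To make the data-augmentation idea concrete, the following is a minimal, hypothetical Python sketch of the augmentation loop described above: a QA-generation model produces extra QA pairs from the under-used scene text of each image, and these are merged with the original human-labeled training set. All names here (generate_qa_pairs, build_augmented_training_set, the sample fields, and the qa_generator callable) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a TAG-style augmentation loop (not the authors' code).
from typing import List, Dict, Callable


def generate_qa_pairs(qa_generator: Callable,
                      image_features,
                      ocr_tokens: List[str]) -> List[Dict]:
    """Ask a multimodal QA-generation model for (question, answer) pairs
    grounded in the scene text (OCR tokens) of a single image."""
    return qa_generator(image_features, ocr_tokens)


def build_augmented_training_set(original_samples: List[Dict],
                                 qa_generator: Callable) -> List[Dict]:
    """Combine generated QA pairs with the initial human-labeled data."""
    augmented = list(original_samples)  # keep the original annotated QA pairs
    for sample in original_samples:
        # Generate additional QA pairs from the otherwise unused scene text.
        new_pairs = generate_qa_pairs(qa_generator,
                                      sample["image_features"],
                                      sample["ocr_tokens"])
        for pair in new_pairs:
            augmented.append({**sample,
                              "question": pair["question"],
                              "answer": pair["answer"]})
    return augmented  # the Text-VQA model is then trained on this enlarged set
```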