Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually impaired people. TextVQA is a VQA dataset geared towards this problem, where the questions require answering systems to read and reason about visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of 'edge features', that is, information about the relationship between each pair of objects. Some current TextVQA models address this problem but either use only categories of relations (rather than edge feature vectors) or do not use edge features within their Transformer architectures. In order to overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the M4C baseline model, improving accuracy by 0.65% on the validation set and 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability compared to M4C.
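For concreteness, the sketch below illustrates one way pairwise edge features can enter a Transformer attention layer, namely as a learned bias added to the node-node attention logits. The module name `EdgeAwareAttention`, the single-head setup, and the additive-bias formulation are illustrative assumptions for exposition, not the exact GRT design described in the paper.

```python
import torch
import torch.nn as nn


class EdgeAwareAttention(nn.Module):
    """Minimal sketch: graph attention where per-pair edge feature vectors
    modulate the attention logits (assumed formulation, not the paper's exact one)."""

    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(node_dim, node_dim)
        self.k_proj = nn.Linear(node_dim, node_dim)
        self.v_proj = nn.Linear(node_dim, node_dim)
        # Projects each edge feature vector to a scalar bias on the attention logit.
        self.edge_proj = nn.Linear(edge_dim, 1)
        self.scale = node_dim ** -0.5

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (N, node_dim) object/token features
        # edges: (N, N, edge_dim) one relation feature vector per object pair
        q, k, v = self.q_proj(nodes), self.k_proj(nodes), self.v_proj(nodes)
        logits = (q @ k.transpose(-2, -1)) * self.scale       # (N, N) node-node scores
        logits = logits + self.edge_proj(edges).squeeze(-1)   # add edge-derived bias
        attn = logits.softmax(dim=-1)
        return attn @ v                                        # (N, node_dim) updated node features
```

In this sketch, purely node-based attention corresponds to dropping the `edge_proj` term; keeping it lets spatial relationship features (e.g., relative position encodings between a visual object and a text token) directly influence which pairs of objects attend to each other.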