Visual Question Answering (VQA) has attracted considerable attention from both industry and academia. As a multi-modality task, it is challenging because it requires not only visual and textual understanding but also the ability to align cross-modality representations. Previous approaches rely extensively on entity-level alignments, such as correlations between visual regions and their semantic labels, or interactions between question words and object features. These attempts aim to improve cross-modality representations while ignoring the internal relations within each modality. Instead, we propose structured alignments, which operate on graph representations of visual and textual content and aim to capture the deep connections between the two modalities. Representing and integrating graphs for structured alignment is, however, nontrivial. In this work, we address this issue by first converting the entities of each modality into sequential nodes and an adjacency graph, and then incorporating both for structured alignment. As our experimental results demonstrate, such structured alignment improves reasoning performance. In addition, our model exhibits better interpretability for each generated answer. Without any pretraining, the proposed model outperforms state-of-the-art methods on the GQA dataset and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
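The core step described above, representing each modality as sequential node features plus an adjacency graph and then aligning across modalities, can be sketched as follows. This is a minimal illustration under assumptions of our own, not the paper's actual architecture: the module name `StructuredAlignment`, all dimensions, and the specific choice of adjacency-masked self-attention followed by cross-attention are hypothetical and chosen only for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredAlignment(nn.Module):
    """Hypothetical sketch of graph-based cross-modality alignment.

    Each modality arrives as a sequence of node features plus an
    adjacency matrix. Information is first propagated within each
    modality's graph, then a cross-attention step aligns the two
    node sets. All names and sizes here are illustrative.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.node_proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def graph_propagate(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # One round of adjacency-masked attention inside a modality graph:
        # attention scores are computed only between connected nodes.
        adj = adj + torch.eye(adj.size(-1), device=adj.device)  # self-loops
        h = self.node_proj(nodes)                     # (B, N, D)
        scores = torch.bmm(h, h.transpose(1, 2))      # (B, N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, h)                  # structure-aware nodes

    def forward(self, vis_nodes, vis_adj, txt_nodes, txt_adj):
        # Intra-modality structure first, then cross-modality alignment.
        v = self.graph_propagate(vis_nodes, vis_adj)
        t = self.graph_propagate(txt_nodes, txt_adj)
        aligned, attn = self.cross_attn(query=t, key=v, value=v)
        return aligned, attn

# Illustrative usage: 36 visual regions and 14 question tokens per example.
vis = torch.randn(2, 36, 512)
vadj = (torch.rand(2, 36, 36) > 0.8).float()
txt = torch.randn(2, 14, 512)
tadj = (torch.rand(2, 14, 14) > 0.5).float()
aligned, attn = StructuredAlignment()(vis, vadj, txt, tadj)
print(aligned.shape, attn.shape)  # (2, 14, 512), (2, 14, 36)
```

In such a sketch, the returned cross-attention weights map each question node onto visual nodes, which is one plausible route to the per-answer interpretability the abstract mentions.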