Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories combined with their relationships, or of simple question embeddings, is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose the use of text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be combined with visual features and question embeddings to obtain question-relevant answers. A joint-embedding multi-head attention network is also proposed to model the three different information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and reveal that it outperformed all of the competing methods in terms of both quantitative and qualitative results.
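To make the tri-modal co-attention concrete, the following is a minimal sketch assuming PyTorch; the feature dimensions, the mean-pooling fusion, and the layout in which each modality attends to the other two are illustrative assumptions, not the paper's exact joint-embedding multi-head attention architecture.

```python
# Minimal sketch (not the authors' implementation): co-attention over three
# modalities, visual features, question tokens, and generated expression tokens,
# built from standard multi-head attention. Assumes PyTorch.
import torch
import torch.nn as nn

class TriModalCoAttention(nn.Module):
    """Each modality attends to the concatenation of the other two,
    and the attended features are fused into a single joint embedding."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # One cross-attention block per target modality (hypothetical layout).
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, v, q, e):
        # v: (B, Nv, dim) region features, q: (B, Nq, dim) question tokens,
        # e: (B, Ne, dim) generated-expression tokens.
        qe = torch.cat([q, e], dim=1)
        ve = torch.cat([v, e], dim=1)
        vq = torch.cat([v, q], dim=1)
        ctx_v, _ = self.attn_v(v, qe, qe)  # vision attends to language
        ctx_q, _ = self.attn_q(q, ve, ve)  # question attends to vision + expressions
        ctx_e, _ = self.attn_e(e, vq, vq)  # expressions attend to vision + question
        # Pool each attended sequence and fuse into one joint embedding
        # that a downstream answer classifier could consume.
        joint = torch.cat([ctx_v.mean(1), ctx_q.mean(1), ctx_e.mean(1)], dim=-1)
        return self.fuse(joint)

# Usage with random tensors (batch of 2); sizes are arbitrary examples.
v = torch.randn(2, 36, 512)   # e.g., 36 detected regions
q = torch.randn(2, 14, 512)   # question tokens
e = torch.randn(2, 10, 512)   # generated expression tokens
print(TriModalCoAttention()(v, q, e).shape)  # torch.Size([2, 512])
```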