Visual Question Answering (VQA) requires the integration of feature maps with drastically different structures and a focus on the correct regions. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. Many previous works use complex models to extract feature representations but neglect to exploit high-level information summaries, such as question types, during learning. In this work, we propose Question Type-guided Attention (QTA). It utilizes the question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures and conduct extensive input ablation studies on the TDIUC dataset, showing that QTA systematically improves performance by more than 5% across multiple question type categories such as "Activity Recognition", "Utility" and "Counting". By adding QTA to the state-of-the-art model MCB, we achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task extension that predicts question types, generalizing QTA to applications that lack question type labels, with minimal performance loss.
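The sketch below illustrates the core idea of question type-guided weighting of two visual feature streams. It is a minimal, hypothetical illustration assuming PyTorch; the module name QTAGate, the feature dimensions, and the gating layer are placeholders for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class QTAGate(nn.Module):
    """Weights concatenated visual features (e.g., ResNet and Faster R-CNN streams)
    by a gate derived from the question type."""

    def __init__(self, n_question_types: int, feat_dim: int):
        super().__init__()
        # Map a one-hot question type to a per-channel gate over the visual features.
        self.gate = nn.Linear(n_question_types, feat_dim)

    def forward(self, q_type_onehot: torch.Tensor,
                resnet_feats: torch.Tensor, frcnn_feats: torch.Tensor) -> torch.Tensor:
        visual = torch.cat([resnet_feats, frcnn_feats], dim=-1)   # (B, feat_dim)
        weights = torch.sigmoid(self.gate(q_type_onehot))          # (B, feat_dim)
        return visual * weights                                    # gated visual features


# Toy usage with 12 question types and 2048-d features from each stream (illustrative sizes).
if __name__ == "__main__":
    gate = QTAGate(n_question_types=12, feat_dim=4096)
    q = torch.zeros(2, 12)
    q[:, 3] = 1.0                                                  # a batch of one question type
    out = gate(q, torch.randn(2, 2048), torch.randn(2, 2048))
    print(out.shape)                                               # torch.Size([2, 4096])
```

The gated features would then feed the rest of a VQA pipeline (e.g., fusion with the question embedding and an answer classifier); the multi-task extension mentioned above would additionally predict the question type when it is not given at test time.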