In Visual Question Answering (VQA), answers correlate closely with both the question meaning and the visual content. To selectively exploit image, question, and answer information, we therefore propose a novel trilinear interaction model that simultaneously learns high-level associations between these three inputs. In addition, to manage the complexity of this interaction, we introduce a multimodal tensor-based PARALIND decomposition that efficiently parameterizes the trilinear interaction between the three inputs. Moreover, knowledge distillation is applied for the first time to Free-form Open-ended VQA, not only to reduce the computational cost and memory requirements but also to transfer knowledge from the trilinear interaction model to a bilinear interaction model. Extensive experiments on the benchmark datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art single-model results on all three datasets.
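To make the decomposed trilinear interaction concrete, below is a minimal PyTorch sketch of one way such an interaction can be parameterized without ever materializing the full four-way tensor. It uses a simple CP-style low-rank factorization rather than the paper's exact PARALIND parameterization, and all module names, dimensions, and hyperparameters are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TrilinearFusion(nn.Module):
    """Minimal sketch of a decomposed trilinear interaction.

    The full interaction tensor T in R^{dv x dq x da x do} is never
    materialized; each input is instead projected into a shared
    h-dimensional space and fused with an elementwise product
    (a CP-style low-rank parameterization, not the paper's PARALIND).
    """

    def __init__(self, dv, dq, da, h, do):
        super().__init__()
        # bias=False keeps the module an exact low-rank trilinear form.
        self.proj_v = nn.Linear(dv, h, bias=False)  # image projection
        self.proj_q = nn.Linear(dq, h, bias=False)  # question projection
        self.proj_a = nn.Linear(da, h, bias=False)  # answer projection
        self.out = nn.Linear(h, do, bias=False)     # joint space -> output

    def forward(self, v, q, a):
        # The Hadamard product realizes the trilinear form
        #   z_o = sum_{i,j,k} T[i,j,k,o] * v_i * q_j * a_k
        # under the factorization
        #   T[i,j,k,o] = sum_h Wv[i,h] * Wq[j,h] * Wa[k,h] * Wo[h,o].
        joint = self.proj_v(v) * self.proj_q(q) * self.proj_a(a)
        return self.out(joint)

# Toy usage with pooled (vector) features per modality; sizes are made up.
fusion = TrilinearFusion(dv=2048, dq=1024, da=1024, h=512, do=3000)
v = torch.randn(8, 2048)   # pooled image features
q = torch.randn(8, 1024)   # pooled question features
a = torch.randn(8, 1024)   # pooled answer features
logits = fusion(v, q, a)   # shape (8, 3000)
```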
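The abstract does not specify the distillation objective; a common choice for transferring a teacher's predictions to a smaller student is the standard Hinton-style softened-KL loss sketched below, with the trilinear model as teacher and the bilinear model as student. The temperature, loss weighting, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation objective (illustrative).

    Combines cross-entropy on the ground-truth labels with a KL term
    that pulls the student (e.g., a bilinear model) toward the softened
    predictions of the teacher (e.g., a trilinear model).
    """
    # Soft targets from the teacher, softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales the KL term so its gradient magnitude stays
    # comparable to the cross-entropy term across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage: both logit tensors and the label range are hypothetical.
teacher_logits = torch.randn(8, 3000)
student_logits = torch.randn(8, 3000, requires_grad=True)
labels = torch.randint(0, 3000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```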