We propose a generalized class of multimodal fusion operators for the task of visual question answering (VQA). We identify generalizations of existing multimodal fusion operators based on the Hadamard product, and show that specific non-trivial instantiations of this generalized fusion operator achieve superior OpenEnded accuracy on the VQA task. In particular, we introduce Nonlinearity Ensembling, Feature Gating, and post-fusion neural network layers as fusion operator components, culminating in an absolute improvement of $1.1$ percentage points on the VQA 2.0 test-dev set over baseline fusion operators that use the same input features. We take these findings as evidence that our generalized class of fusion operators, when used as the search space in an architecture search over fusion operators, could lead to the discovery of even stronger task-specific operators.
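The core building block referenced above, Hadamard-product fusion of question and image features, can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact architecture: the projection nonlinearities, the sigmoid gating form, and all dimensions are illustrative assumptions.

```python
import numpy as np

def hadamard_fusion(q, v, Wq, Wv, Wg=None):
    """Fuse a question embedding q and an image embedding v.

    Both modalities are projected into a shared space and combined
    with the Hadamard (element-wise) product. If Wg is given, a
    sigmoid gate computed from the fused vector rescales each fused
    feature -- a hypothetical form of the "Feature Gating" component.
    """
    hq = np.tanh(Wq @ q)              # project question features
    hv = np.tanh(Wv @ v)              # project visual features
    z = hq * hv                       # Hadamard product fusion
    if Wg is not None:
        gate = 1.0 / (1.0 + np.exp(-(Wg @ z)))  # sigmoid gate in (0, 1)
        z = gate * z                  # feature gating
    return z

# Toy dimensions (illustrative): 300-d question, 2048-d image, 512-d fused space.
rng = np.random.default_rng(0)
d_q, d_v, d_z = 300, 2048, 512
q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
Wq = 0.01 * rng.standard_normal((d_z, d_q))
Wv = 0.01 * rng.standard_normal((d_z, d_v))
Wg = 0.01 * rng.standard_normal((d_z, d_z))

z = hadamard_fusion(q, v, Wq, Wv, Wg)
print(z.shape)  # (512,)
```

In a full VQA model, the fused vector would then pass through post-fusion layers and a classifier over candidate answers; the sketch stops at the fusion step itself.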