Transformer-based models achieve strong performance on Visual Question Answering (VQA). However, when evaluated on systematic generalization, i.e., handling novel combinations of known concepts, their performance degrades. Neural Module Networks (NMNs) are a promising approach to systematic generalization that composes modules, i.e., neural networks that each tackle a sub-task. Inspired by Transformers and NMNs, we propose the Transformer Module Network (TMN), a novel Transformer-based model for VQA that dynamically composes modules into a question-specific Transformer network. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, namely CLEVR-CoGenT, CLOSURE, and GQA-SGL, in some cases improving by more than 30% over standard Transformers.
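The core idea of dynamically composing modules into a question-specific network can be illustrated with a minimal sketch. This is not the authors' implementation; the module names, dimensions, and program format below are illustrative assumptions, with each module modeled as a small Transformer encoder layer:

```python
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    """A single sub-task module, sketched as one Transformer encoder layer."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):
        return self.block(x)

class TMNSketch(nn.Module):
    """Composes reusable modules into a question-specific stack."""
    def __init__(self, module_names, dim=64):
        super().__init__()
        # One reusable module per sub-task (names are hypothetical).
        self.modules_by_name = nn.ModuleDict(
            {name: TransformerModule(dim) for name in module_names})

    def forward(self, features, program):
        # `program` is the question-specific sequence of sub-tasks,
        # e.g. derived from parsing the question into reasoning steps.
        x = features
        for name in program:
            x = self.modules_by_name[name](x)
        return x

model = TMNSketch(["filter", "relate", "query"])
feats = torch.randn(2, 10, 64)           # batch of 2, 10 visual tokens
out = model(feats, ["filter", "query"])  # modules composed per question
print(out.shape)                         # torch.Size([2, 10, 64])
```

A different question would yield a different `program`, so the same trained modules are recombined at inference time; this recombination is what targets systematic generalization to novel concept combinations.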