Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or comparable to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce Transformer Module Networks (TMNs), a novel class of NMNs based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the composition of modules but also their specialization for each sub-task are key to this performance gain.