Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can be answered by decomposing them into modular sub-problems. The recently proposed Neural Module Network (NMN) applies this strategy to question answering, yet relies heavily on off-the-shelf layout parsers or additional expert policies for network architecture design rather than learning from the data. These strategies adapt poorly to the semantically complicated variance of the inputs, thereby hindering the representational capacity and generalizability of the model. To tackle this problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed SUPER, to better capture instance-specific vision-semantic characteristics and refine discriminative representations for prediction. Specifically, five powerful specialized modules and dynamic routers are tailored in each layer of the SUPER network, and a compact routing space is constructed so that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We demonstrate the effectiveness and generalization ability of the proposed SUPER scheme on five benchmark datasets, as well as its advantage in parameter efficiency. It is worth emphasizing that this work does not pursue state-of-the-art results in VQA; instead, we expect our model to provide a novel perspective on architecture learning and representation calibration for VQA.
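To make the per-layer routing idea concrete, below is a minimal PyTorch sketch of a layer that hosts several candidate modules and a dynamic router mixing their outputs per instance. This is an illustration under assumptions, not the exact SUPER design: the module bodies, the soft-routing rule, and all names (e.g., DynamicRoutingLayer) are hypothetical stand-ins for the paper's five specialized modules and routers.

```python
# Minimal sketch of dynamic module routing in one layer, assuming a
# soft-routing formulation. Module bodies and the router are illustrative,
# not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicRoutingLayer(nn.Module):
    def __init__(self, dim: int, num_modules: int = 5):
        super().__init__()
        # Placeholder candidate modules; SUPER tailors five distinct
        # specialized modules per layer.
        self.candidates = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_modules)
        )
        # Router predicts instance-specific mixing weights from the input.
        self.router = nn.Linear(dim, num_modules)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) fused vision-semantic representation.
        weights = F.softmax(self.router(x), dim=-1)                    # (batch, K)
        outputs = torch.stack([m(x) for m in self.candidates], dim=1)  # (batch, K, dim)
        # Instance-specific route: convex combination of module outputs.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

if __name__ == "__main__":
    layer = DynamicRoutingLayer(dim=512)
    x = torch.randn(4, 512)
    print(layer(x).shape)  # torch.Size([4, 512])
```

Stacking such layers yields a compact routing space in which each input traces its own path through the specialized modules, which is the adaptability the abstract contrasts with fixed, parser-determined NMN layouts.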