In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks.
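The final step described above, combining the three module scores with language-predicted weights, can be illustrated with a minimal sketch. The class name `ModularScore`, the dimension `lang_dim`, and the single linear layer that produces the module weights are assumptions for illustration only, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularScore(nn.Module):
    """Sketch of MAttNet-style score combination: each module (subject,
    location, relationship) is assumed to already produce a matching score
    for a candidate region; the overall score is their weighted sum, with
    weights predicted from the expression embedding."""

    def __init__(self, lang_dim=512):
        super().__init__()
        # Hypothetical language-based attention head: maps a pooled
        # expression embedding to three module weights.
        self.weight_fc = nn.Linear(lang_dim, 3)

    def forward(self, lang_emb, subj_score, loc_score, rel_score):
        # lang_emb: (batch, lang_dim) pooled expression representation
        # *_score: (batch,) per-module matching scores for a candidate region
        module_weights = F.softmax(self.weight_fc(lang_emb), dim=-1)      # (batch, 3)
        scores = torch.stack([subj_score, loc_score, rel_score], dim=-1)  # (batch, 3)
        # Dynamically weighted sum gives the overall region score.
        return (module_weights * scores).sum(dim=-1)
```

In this view, the module weights let the model emphasize whichever cue (appearance, location, or relationship) the expression actually mentions, rather than forcing every expression through a single holistic matching function.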