Solving grounded language tasks often requires reasoning about relationships between objects in the context of a given task. For example, to answer the question ``What color is the mug on the plate?'' we must check the color of the specific mug that satisfies the ``on'' relationship with respect to the plate. Recent work has proposed various methods capable of complex relational reasoning. However, most of their power lies in the inference structure, while the scene is represented with simple local appearance features. In this paper, we take an alternative approach and build contextualized representations for objects in a visual scene to support relational reasoning. We propose a general framework of Language-Conditioned Graph Networks (LCGN), where each node represents an object and is described by a context-aware representation built from related objects through iterative message passing conditioned on the textual input. For instance, conditioning on the ``on'' relationship to the plate, the object ``mug'' gathers messages from the object ``plate'' to update its representation to ``mug on the plate'', which can be easily consumed by a simple classifier for answer prediction. We experimentally show that our LCGN approach effectively supports relational reasoning and improves performance across several tasks and datasets.
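The idea of iterative message passing conditioned on text can be illustrated with a minimal sketch. It is not the paper's architecture: the attention form, the tanh update, the number of rounds, and every name below (`message_passing_step`, `contextualize`, `W_q`, `W_k`, `W_m`, `W_u`) are assumptions made for illustration, and random weights stand in for learned parameters.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def message_passing_step(nodes, text, W_q, W_k, W_m, W_u):
    """One round: each node attends to all nodes, with the text vector
    modulating the query so that edge weights depend on the language input."""
    queries = [[q * t for q, t in zip(matvec(W_q, x), text)] for x in nodes]
    keys = [matvec(W_k, x) for x in nodes]
    messages = [matvec(W_m, x) for x in nodes]
    updated = []
    for i, x in enumerate(nodes):
        # Text-conditioned edge weights from node i to every node j.
        scores = [sum(a * b for a, b in zip(queries[i], k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of incoming messages.
        incoming = [sum(w * m[d] for w, m in zip(weights, messages))
                    for d in range(len(x))]
        # Update from the node's own features concatenated with the messages.
        updated.append([math.tanh(u) for u in matvec(W_u, x + incoming)])
    return updated


def contextualize(nodes, text, params, rounds=3):
    """Run several rounds so context can propagate beyond direct neighbors."""
    for _ in range(rounds):
        nodes = message_passing_step(nodes, text, *params)
    return nodes


rng = random.Random(0)
dim = 4
params = (random_matrix(dim, dim, rng),      # W_q
          random_matrix(dim, dim, rng),      # W_k
          random_matrix(dim, dim, rng),      # W_m
          random_matrix(dim, 2 * dim, rng))  # W_u
objects = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
text = [rng.uniform(-1, 1) for _ in range(dim)]
context = contextualize(objects, text, params)
```

In this sketch, each row of `context` is a per-object vector that mixes in information from the other objects, weighted by how well they match the text; a simple classifier could then read these vectors directly.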