'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). Recently, there has been growing interest in the study of RAC with visual and linguistic inputs. Graphs are often used to represent the semantic structure of visual content (i.e., objects, their attributes, and the relationships among them), and are commonly referred to as scene-graphs. In this work, we propose a novel method that leverages the scene-graph representation of images to reason about the effects of actions described in natural language. We experiment with the existing CLEVR_HYP dataset (Sampat et al., 2021) and show that, compared to existing models, our proposed approach is effective in terms of performance, data efficiency, and generalization capability.
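To make the notion of a scene-graph and of an action's effect concrete, the following is a minimal illustrative sketch in Python, not the paper's actual implementation: it builds a toy CLEVR-style scene-graph (objects with attributes plus spatial relations) and applies a hypothetical hand-written "paint" action to it.

```python
# Minimal illustrative sketch (assumed toy example, not the CLEVR_HYP pipeline):
# nodes are objects with attributes; edges are spatial relations between objects.
scene_graph = {
    "objects": {
        "obj1": {"shape": "cube", "color": "red", "size": "large"},
        "obj2": {"shape": "sphere", "color": "blue", "size": "small"},
    },
    "relations": [("obj1", "left_of", "obj2")],
}

def apply_action(graph, action):
    """Apply a simple (verb, target, value) action to the scene-graph in place."""
    verb, target, value = action
    if verb == "paint":
        # The effect of painting an object is a change to its 'color' attribute.
        graph["objects"][target]["color"] = value
    return graph

# Hypothetical action in natural language: "paint the red cube green"
updated = apply_action(scene_graph, ("paint", "obj1", "green"))
print(updated["objects"]["obj1"])
# {'shape': 'cube', 'color': 'green', 'size': 'large'}
```

In this view, reasoning about the effects of an action amounts to predicting how the graph's node attributes and edges change, rather than re-rendering or re-perceiving the image.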