The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation in which entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. Moreover, these formalisms remain limited in the kinds of entities and relations they can capture. In this paper, we propose to address these shortcomings by leveraging a meaning representation widely used in natural language processing, the Abstract Meaning Representation (AMR). Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, focusing on higher-level semantic concepts inferred from the visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under a single representation. Through extensive experimentation and analysis, we demonstrate that we can repurpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.