Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset suggests three issues: 1) Ambiguity: even if inter-object relationships contain the same object (or predicate), they may not be visually or semantically similar; 2) Asymmetry: although relationships are inherently directional, this property was not well addressed in previous studies; and 3) Higher-order contexts: leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by this analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances (subject, object, and background) while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between all graph components, i.e., nodes and edges. We also introduce an Attract & Repel loss, which finely adjusts predicate embeddings. By design, our framework predicts the scene graph in a local-to-global manner, leveraging the possible complementarity between the two levels. To quantify how much LOGIN is aware of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We find that LOGIN distinguishes relational direction better than existing methods (in the BRC task) while showing state-of-the-art results on the Visual Genome benchmark (in the SGG task).
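
The idea of building direction awareness in by constraining the input order can be illustrated with a toy example. The sketch below is not the LOGIN architecture, which the abstract does not specify. It only contrasts a pair encoder that always places the subject first with a symmetric pooling baseline. All names, feature vectors, and weights (`ordered_pair_score`, `symmetric_pair_score`, `man`, `horse`, `W_ORDERED`, `W_SYMMETRIC`) are hypothetical.

```python
# Toy illustration only: hypothetical names and hand-picked numbers,
# not the layers or features used by LOGIN.

def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def ordered_pair_score(weights, subject, obj):
    """Order-constrained input: the subject always fills the first slots."""
    return linear(weights, subject + obj)  # list concatenation

def symmetric_pair_score(weights, subject, obj):
    """Order-invariant baseline: element-wise sum discards direction."""
    pooled = [s + o for s, o in zip(subject, obj)]
    return linear(weights, pooled)

# Hypothetical 4-dimensional features for two detected objects.
man = [1.0, 0.0, 2.0, 0.5]
horse = [0.0, 3.0, 1.0, 0.5]

# Hypothetical weights producing two predicate logits.
W_ORDERED = [[1, 0, 0, 0, 0, -1, 0, 0],
             [0, 1, 0, 0, 0, 0, 1, 0]]
W_SYMMETRIC = [[1, -1, 0, 0],
               [0, 0, 1, 1]]

forward = ordered_pair_score(W_ORDERED, man, horse)    # (man, ?, horse)
backward = ordered_pair_score(W_ORDERED, horse, man)   # (horse, ?, man)
sym_forward = symmetric_pair_score(W_SYMMETRIC, man, horse)
sym_backward = symmetric_pair_score(W_SYMMETRIC, horse, man)

print("ordered:  ", forward, backward)
print("symmetric:", sym_forward, sym_backward)
```

With the ordered encoder, swapping subject and object changes the logits, so the two directions of a pair can receive different predicates. The symmetric baseline returns identical logits for both directions, so it cannot separate them.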
