In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset suggests three points: 1) Ambiguity: even when inter-object relationships share the same object (or predicate), they may not be visually or semantically similar; 2) Asymmetry: although relationships are inherently directional, previous studies have not addressed this property well; and 3) Higher-order contexts: leveraging the identities of certain graph elements can help generate accurate scene graphs. Motivated by this analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances (subject, object, and background) while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between all graph components, namely nodes and edges. We also introduce the Attract & Repel loss, which finely adjusts predicate embeddings. By design, our framework predicts the scene graph in a local-to-global manner, leveraging their possible complementariness. To quantify how aware LOGIN is of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We find that LOGIN distinguishes relational direction more successfully than existing methods (in the BRC task) while achieving state-of-the-art results on the Visual Genome benchmark (in the SGG task).