To generate "accurate" scene graphs, almost all existing methods predict pairwise relationships in a deterministic manner. However, we argue that visual relationships are often semantically ambiguous. Specifically, inspired by linguistic knowledge, we classify the ambiguity into three types: Synonymy Ambiguity, Hyponymy Ambiguity, and Multi-view Ambiguity. This ambiguity naturally leads to the issue of \emph{implicit multi-label}, motivating the need for diverse predictions. In this work, we propose a novel plug-and-play Probabilistic Uncertainty Modeling (PUM) module. It models each union region as a Gaussian distribution, whose variance measures the uncertainty of the corresponding visual content. Compared with conventional deterministic methods, such uncertainty modeling introduces stochasticity into the feature representation, which naturally enables diverse predictions. As a byproduct, PUM also manages to cover more fine-grained relationships and thus alleviates the bias towards frequent relationships. Extensive experiments on the large-scale Visual Genome benchmark show that combining PUM with the newly proposed ResCAGCN achieves state-of-the-art performance, especially under the mean recall metric. Furthermore, we demonstrate the universal effectiveness of PUM by plugging it into several existing models and provide an insightful analysis of its ability to generate diverse yet plausible visual relationships.
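To make the core idea concrete, the sketch below illustrates one way such Gaussian feature modeling could be realized in PyTorch: a union-region feature is mapped to a mean and a log-variance, and a stochastic sample is drawn via the reparameterization trick. The module name, feature dimension, and head design here are illustrative assumptions, not the paper's exact implementation.

\begin{verbatim}
# Minimal sketch of Gaussian feature modeling for a union region.
# All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class PUMSketch(nn.Module):
    """Map a deterministic union-region feature to a Gaussian and sample from it."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mu_head = nn.Linear(feat_dim, feat_dim)      # predicts the mean
        self.logvar_head = nn.Linear(feat_dim, feat_dim)  # predicts log-variance (uncertainty)

    def forward(self, union_feat: torch.Tensor) -> torch.Tensor:
        mu = self.mu_head(union_feat)
        std = torch.exp(0.5 * self.logvar_head(union_feat))
        if self.training:
            # Reparameterization trick: stochastic features enable diverse,
            # yet plausible, relationship predictions across samples.
            return mu + torch.randn_like(std) * std
        # At inference, use the mean (or draw several samples for diversity).
        return mu


# Usage: z = PUMSketch(512)(union_feat)  # union_feat: [num_pairs, 512]
\end{verbatim}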