Scene graph generation is a structured prediction task: given an input image, it aims to explicitly model objects and their relationships by constructing a visually grounded scene graph. In the current literature, this task is universally solved with a mean-field variational Bayesian methodology based on message-passing neural networks. The classical, loose evidence lower bound is generally chosen as the variational inference objective, which can induce an oversimplified variational approximation and thus underestimate the underlying complex posterior. In this paper, we propose a novel doubly reparameterized importance weighted structure learning method, which employs a tighter importance weighted lower bound as the variational inference objective. The bound is computed from multiple samples drawn from a reparameterizable Gumbel-Softmax sampler, and the resulting constrained variational inference task is solved by a generic entropic mirror descent algorithm. The resulting doubly reparameterized gradient estimator reduces the variance of the corresponding derivatives, with a beneficial impact on learning. The proposed method achieves state-of-the-art performance on various popular scene graph generation benchmarks.
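To make the contrast between the two objectives concrete, the following is a minimal NumPy sketch, not the paper's implementation. It compares a Monte Carlo estimate of the classical evidence lower bound with an importance weighted lower bound built from several Gumbel-Softmax samples. Everything in it is an illustrative assumption: the three-class toy model (`prior`, `likelihood`), the uniform variational distribution `q_logits`, the temperature `tau`, and the function names. Hard class indices are taken as the argmax of each relaxed sample, which yields exact categorical draws. The doubly reparameterized gradient estimator and the entropic mirror descent solver are not shown.

```python
import numpy as np


def gumbel_softmax_sample(logits, tau, rng, size):
    """Draw relaxed one-hot samples with shape size + (num_classes,)."""
    # Keep u strictly inside (0, 1) so both logarithms stay finite.
    u = rng.uniform(1e-12, 1.0 - 1e-12, size=size + (logits.shape[0],))
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # stabilise the softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


def lower_bounds(log_joint, q_logits, num_samples, num_batches, tau=0.5, seed=0):
    """Return Monte Carlo estimates of (ELBO, importance weighted bound)."""
    rng = np.random.default_rng(seed)
    log_q = q_logits - np.log(np.exp(q_logits).sum())  # log-softmax of q
    relaxed = gumbel_softmax_sample(q_logits, tau, rng, (num_batches, num_samples))
    z = relaxed.argmax(axis=-1)  # hard indices, shape (num_batches, num_samples)
    log_w = log_joint[z] - log_q[z]  # log importance weights
    elbo = log_w.mean()  # single-sample bound, averaged over all draws
    # log of the mean weight within each batch (log-sum-exp), then averaged
    m = log_w.max(axis=1, keepdims=True)
    iw_bound = (m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))).mean()
    return elbo, iw_bound


# Hypothetical toy model: one observation x and a 3-class latent z.
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([0.1, 0.6, 0.3])   # p(x | z) for the observed x
log_joint = np.log(prior * likelihood)   # log p(x, z)
log_px = np.log((prior * likelihood).sum())  # exact log evidence
q_logits = np.zeros(3)                   # deliberately crude uniform q(z)

elbo, iw_bound = lower_bounds(log_joint, q_logits, num_samples=5, num_batches=20000)
print(f"ELBO estimate:             {elbo:.4f}")
print(f"importance weighted bound: {iw_bound:.4f}")
print(f"exact log p(x):            {log_px:.4f}")
```

With the same draws, the importance weighted estimate is never below the ELBO estimate (Jensen's inequality applied within each batch), and in this toy setting it lies closer to the exact log evidence, which is the sense in which the bound is tighter.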