现场图表预测的产生构成增强 (Generative Compositional Augmentations for Scene Graph Prediction)

from arxiv, ICCV 2021 camera ready. Added more baselines, combining GANs with Neural Motifs and t-sne visualizations. Code is available at https://github.com/bknyaz/sgg

Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions, e.g. <cup, on, table>. However, test images might contain zero- and few-shot compositions of objects and relationships, e.g. <cup, on, surfboard>. Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal, but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach indicating promising directions for future research.

翻译：在视觉和语言交汇处的许多应用中,用场景图的形式从图像中推断对象及其关系,对视觉和语言交汇处的许多应用非常有用。我们认为,由于长时间的尾尾细数据分布,这项任务中出现一个具有挑战性的组成概括问题。当前场景图生成模型在与最经常的构成(例如表上<cup,上,表>)相对应的分布中,有一小部分是经过培训的。然而,测试图像可能包含零和少量的物体和关系组成,例如“cup,上,冲浪板”。尽管每个对象类别和上游(例如“on”)在培训数据中频繁出现,但模型往往无法正确理解这种看不见或稀有的构成。为了改进,为了提高培训分布的多样性,自然要尝试增加培训分布的多样性。然而,在图形域中,这是非三角的。为此,我们提出了一个方法,通过对真实的方法来综合稀有但真实的景象图。我们然后提议和实验性地研究一个模型,以有条件的对准性对准的对准性对准性对准网络(GAN,但从我们未来的图表中,我们从远处对准的图像中得出了我们未来的图像的图像的图像的预测。让我们从未来的图像结果中,用一个直观测测测测测测出一个视觉的图像的图像的图像结果。