The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only a description of the underlying objects and their relations. While there has been significant work on designing deep neural networks that compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure. Project page at: https://composevisualrelations.github.io/.
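The factorized composition described above can be illustrated with a toy sketch. It is not the model from this work: the energies here are hand-written hinge penalties over 2D object coordinates rather than learned networks over images, and the names `left_of`, `above`, `compose`, and `minimize` are hypothetical. The sketch only shows the underlying identity that multiplying unnormalized densities exp(-E_i) is the same as summing their energies, so each relation can be specified separately and combined afterwards.

```python
import numpy as np

def left_of(a, b, margin=1.0):
    # Hypothetical relation energy: zero once object a is at least
    # `margin` to the left of object b (smaller x), positive otherwise.
    def energy(pos):
        return max(0.0, pos[a, 0] - pos[b, 0] + margin) ** 2
    return energy

def above(a, b, margin=1.0):
    # Hypothetical relation energy: zero once object a is at least
    # `margin` above object b (larger y), positive otherwise.
    def energy(pos):
        return max(0.0, pos[b, 1] - pos[a, 1] + margin) ** 2
    return energy

def compose(*energies):
    # Product of unnormalized densities exp(-E_i) = exp(-sum_i E_i),
    # so composing relations amounts to summing their energies.
    def total(pos):
        return sum(e(pos) for e in energies)
    return total

def numerical_grad(f, x, eps=1e-5):
    # Central finite differences; stands in for automatic differentiation.
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def minimize(energy, pos, steps=500, lr=0.1):
    # Plain gradient descent on the composed energy (noise-free, so this
    # finds a low-energy scene rather than drawing a sample).
    pos = pos.copy()
    for _ in range(steps):
        pos -= lr * numerical_grad(energy, pos)
    return pos

rng = np.random.default_rng(0)
init = rng.normal(size=(3, 2))  # three objects, (x, y) each

# "Generate": object 0 left of object 1, and object 1 above object 2.
scene_energy = compose(left_of(0, 1), above(1, 2))
scene = minimize(scene_energy, init)

# "Edit": add one more relation and continue from the existing scene.
edited_energy = compose(scene_energy, left_of(2, 0))
edited = minimize(edited_energy, scene)
```

In this toy setting, evaluating a single relation's energy on a given layout also gives a crude test of whether that relation holds (low energy) or not (high energy), which loosely mirrors the relational understanding mentioned above.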