We investigate the problem of automatically placing an object into a background image for image compositing. Given a background image and a segmented object, the goal is to train a model that predicts plausible placements (location and scale) of the object for compositing. The quality of the composite image depends heavily on the predicted location/scale. Existing works either generate candidate bounding boxes or apply sliding-window search using global representations of the background and object images, and thus fail to model local information in the background. However, local clues in background images are important for determining the compatibility of placing an object at a particular location/scale. In this paper, we propose to learn the correlation between object features and all local background features with a transformer module, so that detailed information is available for every possible location/scale configuration. A sparse contrastive loss is further proposed to train our model under sparse supervision. Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in a single network forward pass, which is over 10 times faster than the previous sliding-window method. It also supports interactive search when the user provides a predefined location or scale. The proposed method can be trained with explicit annotations or in a self-supervised manner using an off-the-shelf inpainting model, and it significantly outperforms state-of-the-art methods. A user study shows that the trained model generalizes well to real-world images with diverse, challenging scenes and object categories.
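To make the formulation concrete, the following is a minimal NumPy sketch of the two core ideas in the abstract: correlating one object feature with every local background feature across candidate scales to produce an S x H x W plausibility heatmap in one pass, and supervising it with a contrastive loss from a single annotated placement. All shapes, the additive scale embedding, and the dot-product correlation are illustrative assumptions; the paper's actual model uses a learned transformer module, not this simplification.

```python
import numpy as np

# Hypothetical sizes, for illustration only: feature dim, heatmap height/width, scales.
D, H, W, S = 8, 4, 4, 3

rng = np.random.default_rng(0)
bg_feats = rng.normal(size=(H, W, D))   # local background features (e.g. encoder output)
obj_feat = rng.normal(size=(D,))        # global feature of the segmented object
scale_embed = rng.normal(size=(S, D))   # assumed learned scale embeddings

# Condition the object feature on each candidate scale, then correlate it with
# every local background feature: one S x H x W heatmap in a single pass,
# instead of S*H*W sliding-window evaluations.
queries = obj_feat[None, :] + scale_embed               # (S, D)
heatmap = np.einsum("sd,hwd->shw", queries, bg_feats)   # (S, H, W)

# Softmax over all location/scale cells yields a placement distribution.
probs = np.exp(heatmap - heatmap.max())
probs /= probs.sum()

# With a single annotated (scale, y, x) positive, a contrastive loss treats all
# other cells as negatives (cross-entropy over all S*H*W combinations).
pos = (1, 2, 3)                          # hypothetical ground-truth placement
loss = -np.log(probs[pos])
```

Interactive search then amounts to slicing this heatmap, e.g. `probs[s]` for a user-fixed scale, or `probs[:, y, x]` for a user-fixed location.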