Person image generation is an intriguing yet challenging problem, and it becomes even more difficult under constrained situations. In this work, we propose a novel pipeline to generate and insert contextually relevant person images into an existing scene while preserving the global semantics. More specifically, we aim to insert a person such that the location, pose, and scale of the inserted person blend in with the existing persons in the scene. Our method uses three individual networks in a sequential pipeline. First, we predict the potential location and skeletal structure of the new person by conditioning a Wasserstein Generative Adversarial Network (WGAN) on the existing human skeletons present in the scene. Next, the predicted skeleton is refined through a shallow linear network to achieve higher structural accuracy in the generated image. Finally, the target image is generated from the refined skeleton using another generative network conditioned on a given image of the target person. In our experiments, we achieve high-resolution photo-realistic generation results while preserving the general context of the scene. We conclude the paper with multiple qualitative and quantitative benchmarks on the results.
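To make the data flow of the three sequential stages concrete, the following is a minimal sketch in plain Python. It is not the networks described in the abstract: each stage is a hand-written placeholder (joint-wise averaging, clamping, and a dictionary in place of a rendered image), and every name in it (`predict_skeleton`, `refine_skeleton`, `render_person`, `insert_person`, the normalized keypoint format, the random offset range) is an assumption introduced only for illustration.

```python
import random
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float]  # (x, y) in normalized [0, 1] image coordinates (assumed format)
Skeleton = List[Keypoint]


def predict_skeleton(scene: List[Skeleton], rng: random.Random) -> Skeleton:
    """Stage 1 placeholder: stands in for the WGAN conditioned on scene skeletons.

    Averages the existing skeletons joint by joint, then shifts the result by
    one random offset so the new person lands near the others.
    """
    n = len(scene)
    dx, dy = rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)
    return [
        (
            sum(s[j][0] for s in scene) / n + dx,
            sum(s[j][1] for s in scene) / n + dy,
        )
        for j in range(len(scene[0]))
    ]


def _clamp(value: float) -> float:
    """Restrict a coordinate to the normalized image range."""
    return min(1.0, max(0.0, value))


def refine_skeleton(skeleton: Skeleton) -> Skeleton:
    """Stage 2 placeholder: stands in for the shallow linear refinement network.

    Here it only clamps every joint into the image bounds.
    """
    return [(_clamp(x), _clamp(y)) for x, y in skeleton]


def render_person(skeleton: Skeleton, target_image: str) -> Dict[str, object]:
    """Stage 3 placeholder: stands in for the generator conditioned on an
    image of the target person. Returns a description, not pixels.
    """
    xs = [x for x, _ in skeleton]
    ys = [y for _, y in skeleton]
    return {
        "appearance": target_image,
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "skeleton": skeleton,
    }


def insert_person(scene: List[Skeleton], target_image: str, seed: int = 0) -> Dict[str, object]:
    """Run the three stages in sequence: predict, refine, render."""
    rng = random.Random(seed)
    coarse = predict_skeleton(scene, rng)
    refined = refine_skeleton(coarse)
    return render_person(refined, target_image)


# Two existing persons, three joints each (toy data, not from the paper).
scene = [
    [(0.20, 0.30), (0.20, 0.50), (0.20, 0.80)],
    [(0.60, 0.35), (0.60, 0.55), (0.60, 0.85)],
]
result = insert_person(scene, "target_person.png")
```

The point of the sketch is the ordering: the skeleton predicted from scene context is refined before any appearance information is used, and the exemplar image of the target person enters only at the final stage.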