The pose-guided person image generation task requires synthesizing photorealistic images of humans in arbitrary poses. Existing approaches use generative adversarial networks that either fail to maintain realistic textures or rely on dense correspondences that struggle to handle complex deformations and severe occlusions. In this work, we show how denoising diffusion models can be applied to high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution. Our proposed Person Image Diffusion Model (PIDM) decomposes the complex transfer problem into a series of simpler forward-backward denoising steps. This helps in learning plausible source-to-target transformation trajectories that yield faithful textures and undistorted appearance details. We introduce a 'texture diffusion module' based on cross-attention to accurately model the correspondences between the appearance and pose information available in the source and target images. Further, we propose 'disentangled classifier-free guidance' to ensure close resemblance between the conditional inputs and the synthesized output in terms of both pose and appearance. Our extensive results on two large-scale benchmarks and a user study demonstrate the photorealism of our approach under challenging scenarios. We also show how the generated images can benefit downstream tasks. Our code and models will be publicly released.
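To make the cross-attention idea behind the texture diffusion module concrete, the following is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's implementation: the class name, feature shapes, head count, and residual wiring are all hypothetical; here queries come from the denoising (noisy target) branch while keys and values come from source-texture features, which matches the high-level description above.

```python
import torch
import torch.nn as nn

class TextureCrossAttention(nn.Module):
    """Illustrative cross-attention from denoising features to source-texture
    features (a sketch of the 'texture diffusion module' idea, not PIDM's code)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, noisy_feats: torch.Tensor, texture_feats: torch.Tensor) -> torch.Tensor:
        # Queries: features of the noisy target being denoised.
        # Keys/values: appearance features extracted from the source image.
        q = self.norm(noisy_feats)
        attended, _ = self.attn(q, texture_feats, texture_feats)
        # Residual connection keeps the denoising pathway intact.
        return noisy_feats + attended

# Hypothetical usage with assumed token counts and channel width.
block = TextureCrossAttention(dim=256)
noisy = torch.randn(2, 64, 256)     # (batch, target tokens, dim)
texture = torch.randn(2, 128, 256)  # (batch, source tokens, dim)
out = block(noisy, texture)         # (2, 64, 256)
```

The residual form lets the denoiser fall back on its own features where the source provides no useful correspondence; the exact placement of such blocks across network scales is a design choice not fixed by this sketch.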
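Similarly, 'disentangled classifier-free guidance' can be sketched as standard classifier-free guidance extended to two conditions with separate scales. The sketch below is an assumption about the factorization (dropping appearance first, then pose); the exact combination and drop order used in PIDM may differ. The function name and arguments are hypothetical.

```python
import torch

def disentangled_cfg(eps_uncond: torch.Tensor,
                     eps_pose: torch.Tensor,
                     eps_full: torch.Tensor,
                     w_pose: float,
                     w_style: float) -> torch.Tensor:
    """Combine three noise predictions with separate guidance scales.

    eps_uncond: network output with both conditions dropped.
    eps_pose:   output conditioned on pose only (appearance dropped).
    eps_full:   output conditioned on both pose and source appearance.
    (Illustrative factorization; not necessarily PIDM's exact formula.)
    """
    return (eps_uncond
            + w_pose * (eps_pose - eps_uncond)
            + w_style * (eps_full - eps_pose))
```

Separate scales let pose fidelity and appearance fidelity be traded off independently at sampling time, which is the stated goal of disentangling the guidance signal.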