Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined when the target pose differs significantly from the input pose, and existing methods then resort to inpainting or style transfer to handle occlusions and preserve content. In this paper, we explore the use of multiple views to mitigate the missing-information problem and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose keypoints and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks: Multi-view Human Reposing and Mix&Match Human Image Generation. Additionally, we study the limitations of single-view editing and the scenarios in which multi-view input provides a better alternative.
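To make the fusion idea concrete, the following is a minimal PyTorch sketch of one plausible reading of the described architecture, not the authors' implementation: a shared encoder scores each source view at every pixel conditioned on the target pose, a softmax over views yields the explainable per-pixel appearance retrieval map, and the map blends per-view features before they would be merged with the single-view reposing encoder's latents. All module names, channel sizes, and the 18-channel keypoint-heatmap encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewFusion(nn.Module):
    """Sketch of a multi-view fusion block with a per-pixel retrieval map."""

    def __init__(self, in_ch=3 + 18, feat_ch=64):
        super().__init__()
        # Shared encoder over (RGB texture + pose-keypoint heatmaps) per view.
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # One logit per pixel per view, conditioned on the target pose.
        self.score = nn.Conv2d(feat_ch + 18, 1, 1)

    def forward(self, views, poses, target_pose):
        # views: (B, V, 3, H, W); poses: (B, V, 18, H, W);
        # target_pose: (B, 18, H, W)
        B, V = views.shape[:2]
        feats, logits = [], []
        for v in range(V):
            f = self.encode(torch.cat([views[:, v], poses[:, v]], dim=1))
            feats.append(f)
            logits.append(self.score(torch.cat([f, target_pose], dim=1)))
        feats = torch.stack(feats, dim=1)           # (B, V, C, H, W)
        logits = torch.stack(logits, dim=1)         # (B, V, 1, H, W)
        # Softmax over the view axis: at each pixel, a distribution over
        # source views, i.e., the per-pixel appearance retrieval map.
        retrieval_map = F.softmax(logits, dim=1)
        fused = (retrieval_map * feats).sum(dim=1)  # (B, C, H, W)
        return fused, retrieval_map
```

Because the retrieval map is an explicit distribution over source views at every pixel, it can be visualized directly to inspect which view the model copies appearance from, which is one way the per-pixel map could be "explainable" in the sense the abstract describes.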