Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined when the target pose differs significantly from the input pose, and existing methods then resort to inpainting or style transfer to handle occlusions and preserve content. In this paper, we explore the use of multiple views to minimize the problem of missing information and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes pose keypoints and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks: Multi-view Human Reposing and Mix&Match Human Image Generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a better alternative.
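To make the described fusion step concrete, below is a minimal PyTorch sketch of one plausible reading of the multi-view fusion network: a shared encoder processes each source view together with its pose keypoints and the target pose, a retrieval head produces per-view logits, and a softmax across views yields the per-pixel appearance retrieval map used to blend the view features. All module names, channel sizes, and the fusion rule are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Hypothetical sketch of per-pixel retrieval-map fusion across N source views.
# Shapes, layer choices, and names (MultiViewFusion, retrieval_head) are assumptions.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    """Fuses appearance features from N source views via a per-pixel
    retrieval map conditioned on source and target pose keypoints."""

    def __init__(self, in_ch: int = 3, kp_ch: int = 18, feat_ch: int = 64):
        super().__init__()
        # Shared encoder over (image, source-pose, target-pose) per view.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch + 2 * kp_ch, feat_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # One retrieval logit per view at every pixel.
        self.retrieval_head = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, images, src_poses, tgt_pose):
        # images: (B, N, 3, H, W); src_poses: (B, N, kp_ch, H, W)
        # tgt_pose: (B, kp_ch, H, W), broadcast to every source view.
        B, N = images.shape[:2]
        tgt = tgt_pose.unsqueeze(1).expand(-1, N, -1, -1, -1)
        x = torch.cat([images, src_poses, tgt], dim=2)   # (B, N, C, H, W)
        feats = self.encoder(x.flatten(0, 1))            # (B*N, F, H, W)
        logits = self.retrieval_head(feats)              # (B*N, 1, H, W)
        feats = feats.view(B, N, -1, *feats.shape[-2:])
        logits = logits.view(B, N, 1, *logits.shape[-2:])
        # Per-pixel appearance retrieval map: softmax over the N views.
        retrieval_map = logits.softmax(dim=1)            # (B, N, 1, H, W)
        fused = (retrieval_map * feats).sum(dim=1)       # (B, F, H, W)
        # The map itself is returned so each pixel's source view is inspectable.
        return fused, retrieval_map.squeeze(2)


if __name__ == "__main__":
    fusion = MultiViewFusion()
    imgs = torch.randn(2, 3, 3, 64, 64)       # 3 source views
    src_kp = torch.randn(2, 3, 18, 64, 64)
    tgt_kp = torch.randn(2, 18, 64, 64)
    fused, rmap = fusion(imgs, src_kp, tgt_kp)
    print(fused.shape, rmap.shape)             # (2, 64, 64, 64) (2, 3, 64, 64)
```

Because the retrieval map is a softmax over source views at every pixel, it directly exposes which view each output pixel draws its appearance from, which is one way the "explainable" property claimed above could be realized.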