Clothes style transfer for person video generation is a challenging task due to drastic variations in intra-person appearance and video scenarios. To tackle this problem, most recent approaches adopt AdaIN-based architectures to extract clothes and scenario features for generation. However, these approaches lack fine-grained details and are prone to distorting the original person. To further improve generation performance, we propose a novel framework with disentangled multi-branch encoders and a shared decoder. Moreover, to pursue strong spatio-temporal consistency in the generated video, an inner-frame discriminator is carefully designed that takes cross-frame differences as input. In addition, the proposed framework possesses the property of scenario adaptation. Extensive experiments on the TEDXPeople benchmark demonstrate the superiority of our method over state-of-the-art approaches in terms of image quality and video coherence.
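For illustration only, the sketch below shows one plausible way a discriminator could consume cross-frame differences to judge temporal consistency. It is a minimal PyTorch-style assumption for exposition, not the authors' implementation; the module name, layer widths, and tensor layout are all hypothetical.

```python
# Hedged sketch (not the paper's code): a PatchGAN-style discriminator whose
# input is the difference between consecutive frames, so it scores the realism
# of temporal changes rather than individual frames.
import torch
import torch.nn as nn

class CrossFrameDiffDiscriminator(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # per-patch realism scores
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        diffs = frames[:, 1:] - frames[:, :-1]          # cross-frame differences
        b, t, c, h, w = diffs.shape
        return self.net(diffs.reshape(b * t, c, h, w))  # score each difference map
```

Under this assumed setup, the difference maps from generated clips would be pushed toward the statistics of difference maps from real clips, penalizing flicker and abrupt appearance changes between adjacent frames.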