Human motion transfer aims to transfer the motion of a dynamic target person to a static source person for motion synthesis. Accurate matching between the source person and the target motion, across both large and subtle motion changes, is vital for improving the quality of the transferred motion. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perception to capture large and subtle motion matching, respectively. It consists of two ViT encoders that extract features from the inputs (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person feature as Key and Value, and compute cross-attention maps to perform global feature matching. We further introduce a convolutional layer after the global cross-attention computation to improve local perception. This matching process is implemented in both the warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss so that the warping and generation branches co-supervise each other, yielding better motion representations. Experiments show that Human MotionFormer sets a new state of the art both qualitatively and quantitatively. Project page: \url{https://github.com/KumapowerLIU/Human-MotionFormer}
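To make the decoder block concrete, below is a minimal PyTorch sketch of one cross-attention block as the abstract describes it: Query comes from the target motion feature, Key and Value come from the source person feature, and a convolution follows the attention to add local perception. This is not the authors' implementation; the class name `MotionFormerDecoderBlock`, the feature dimensions, the residual connections, and the normalization scheme are all illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of one decoder block: global
# cross-attention with Query from the target motion feature and Key/Value
# from the source person feature, followed by a convolution for local
# perception. All names, dimensions, and normalization choices are assumed.
import torch
import torch.nn as nn


class MotionFormerDecoderBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Global matching: Query from target motion, Key/Value from source person.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local perception: a 3x3 convolution applied after the attention
        # output is folded back to its spatial layout.
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, motion_feat: torch.Tensor, source_feat: torch.Tensor,
                h: int, w: int) -> torch.Tensor:
        # motion_feat, source_feat: (B, H*W, C) token sequences from the ViT encoders.
        q = self.norm_q(motion_feat)
        kv = self.norm_kv(source_feat)
        attn_out, _ = self.cross_attn(q, kv, kv)            # global feature matching
        x = motion_feat + attn_out                          # residual connection (assumed)
        b, n, c = x.shape
        x_spatial = x.transpose(1, 2).reshape(b, c, h, w)   # tokens -> feature map
        x_spatial = x_spatial + self.local_conv(x_spatial)  # local refinement (assumed residual)
        return x_spatial.reshape(b, c, n).transpose(1, 2)   # back to tokens


# Usage: match 32x32 feature maps from the two encoders.
block = MotionFormerDecoderBlock(dim=256, num_heads=8)
motion = torch.randn(1, 32 * 32, 256)    # target motion tokens
source = torch.randn(1, 32 * 32, 256)    # source person tokens
out = block(motion, source, h=32, w=32)  # (1, 1024, 256)
```

The convolution after attention compensates for the attention layer's lack of spatial locality, which is the "local perception" the abstract refers to.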