Image animation aims to animate a source image using motion learned from a driving video. Current state-of-the-art methods typically use convolutional neural networks (CNNs) to predict motion information, such as motion keypoints and corresponding local transformations. However, these CNN-based methods do not explicitly model the interactions between motions; as a result, important underlying motion relationships may be neglected, which can lead to noticeable artifacts in the generated animation video. To this end, we propose a new method, the motion transformer, which is the first attempt to build a motion estimator based on a vision transformer. More specifically, we introduce two types of tokens in our proposed method: i) image tokens formed from patch features and their corresponding position encodings; and ii) motion tokens encoded with motion information. Both types of tokens are fed into vision transformers to promote the underlying interactions between them through multi-head self-attention blocks. Through this process, the motion information can be better learned, boosting model performance. The final embedded motion tokens are then used to predict the corresponding motion keypoints and local transformations. Extensive experiments on benchmark datasets show that our proposed method achieves promising results compared with state-of-the-art baselines. Our source code will be made publicly available.
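To make the token design concrete, below is a minimal PyTorch sketch of the described motion estimator: learnable motion tokens are concatenated with position-encoded image patch tokens, the joint sequence is processed by multi-head self-attention blocks, and the embedded motion tokens are decoded into keypoints and local transformations. All module names, dimensions, and design details here (e.g., `MotionTransformerSketch`, ten keypoints, 2x2 local affine matrices) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MotionTransformerSketch(nn.Module):
    """Illustrative motion estimator: image + motion tokens -> keypoints."""

    def __init__(self, image_size=256, patch_size=16, in_channels=3,
                 embed_dim=128, num_keypoints=10, num_layers=6, num_heads=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # i) image tokens: patch features plus a learned position encoding
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # ii) motion tokens: one learnable token per predicted keypoint
        self.motion_tokens = nn.Parameter(torch.zeros(1, num_keypoints, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # heads decoding each embedded motion token into a keypoint (x, y) and
        # a local 2x2 transformation (assumed affine, flattened to 4 values)
        self.to_keypoint = nn.Linear(embed_dim, 2)
        self.to_transform = nn.Linear(embed_dim, 4)

    def forward(self, image):
        b = image.shape[0]
        # (B, C, H, W) -> (B, N, D) patch tokens with position encoding added
        image_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)
        image_tokens = image_tokens + self.pos_embed
        motion_tokens = self.motion_tokens.expand(b, -1, -1)
        # joint multi-head self-attention lets motion tokens interact with
        # each other and with the image tokens
        tokens = self.encoder(torch.cat([motion_tokens, image_tokens], dim=1))
        motion_out = tokens[:, :motion_tokens.shape[1]]
        keypoints = torch.tanh(self.to_keypoint(motion_out))     # in [-1, 1]
        transforms = self.to_transform(motion_out).view(b, -1, 2, 2)
        return keypoints, transforms


# usage: a 256x256 frame yields 10 keypoints and 10 local transformations
model = MotionTransformerSketch()
kp, T = model(torch.randn(2, 3, 256, 256))  # kp: (2, 10, 2), T: (2, 10, 2, 2)
```

In an animation pipeline of this kind, such an estimator would be applied to both the source image and each driving frame, and the resulting per-frame keypoints and local transformations would drive the warping of the source image.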