Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, major challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework that focuses on generating high-fidelity, long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we integrate a time-aware position shift fusion mechanism into the DiT backbone by modifying its input format; we refer to this mechanism as the Position Shift Adaptive Module, and it enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in both high-fidelity and long-duration human image animation.
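The abstract names the time-aware position shift fusion mechanism (Position Shift Adaptive Module) without detailing how it yields arbitrary-length videos. The sketch below is a minimal illustration of the general position-shift idea only, not the paper's implementation: at each denoising step the fixed-length window grid is shifted by a step-dependent offset, so segment boundaries fall on different frames every iteration and are blended away over the course of denoising. The function name position_shift_denoise, the stand-in denoise_step callable, and the window/shift values are illustrative assumptions.

```python
import torch

def position_shift_denoise(latents, denoise_step, num_steps, window=16, shift=4):
    """Denoise a long latent sequence by sliding a fixed-length window grid
    whose starting offset changes at every denoising step, so segment seams
    land on different frames each iteration and are smoothed out over time."""
    _, T, _ = latents.shape
    for step in range(num_steps):
        offset = (step * shift) % window  # time-aware position shift
        # Shift the whole window grid so every frame is still covered;
        # boundary segments may be shorter than the training window.
        starts = range(offset - window if offset else 0, T, window)
        for s in starts:
            lo, hi = max(s, 0), min(s + window, T)
            if hi > lo:
                # One denoising pass on this segment (stand-in for a DiT call).
                latents[:, lo:hi] = denoise_step(latents[:, lo:hi], step)
    return latents


if __name__ == "__main__":
    x = torch.randn(1, 120, 64)            # 120 latent frames, 64 channels
    mock = lambda seg, t: seg * 0.9        # placeholder for a real DiT pass
    y = position_shift_denoise(x, mock, num_steps=8)
    print(y.shape)                         # torch.Size([1, 120, 64])
```

Because the shifted grids of successive steps overlap different frame ranges, no explicit overlap-and-blend pass is needed at inference time, which is what allows the sequence length T to exceed the window the backbone was trained on.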