Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.
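The core idea sketched above can be illustrated with a minimal toy example. The paper does not publish implementation details here, so everything below is a hypothetical sketch: `project_vertices`, `splat_to_grid`, and `fuse` are illustrative stand-ins for (1) mapping SMPL-X-derived 3D geometry into a 2D latent conditioning grid and (2) fusing that grid with 2D pose embeddings, not the authors' actual encoder or fusion network.

```python
import numpy as np

def project_vertices(verts_3d, f=1.0):
    """Pinhole projection of 3D mesh vertices (N, 3) to normalized 2D coords
    (N, 2). Hypothetical stand-in for the geometry-to-2D step of an
    SMPL-X encoder; the real model learns this mapping."""
    z = np.clip(verts_3d[:, 2], 1e-6, None)  # avoid division by zero
    return f * verts_3d[:, :2] / z[:, None]

def splat_to_grid(pts_2d, feats, H=16, W=16):
    """Scatter per-vertex feature vectors onto an H x W grid via nearest-cell
    splatting, producing a 2D conditioning map from 3D-derived features."""
    grid = np.zeros((H, W, feats.shape[1]))
    # map normalized coords in [-1, 1] to integer pixel indices
    ix = np.clip(((pts_2d[:, 0] + 1) / 2 * (W - 1)).astype(int), 0, W - 1)
    iy = np.clip(((pts_2d[:, 1] + 1) / 2 * (H - 1)).astype(int), 0, H - 1)
    for x, y, fv in zip(ix, iy, feats):
        grid[y, x] += fv  # accumulate features landing in the same cell
    return grid

def fuse(grid_3d, pose_2d, w):
    """Toy fusion: concatenate the 3D-derived grid with a 2D pose embedding
    along channels, then mix with a learned linear map w (C1+C2, Cout).
    The actual fusion network is a learned module, not a single matmul."""
    stacked = np.concatenate([grid_3d, pose_2d], axis=-1)  # (H, W, C1+C2)
    return stacked @ w                                     # (H, W, Cout)
```

The sketch only conveys the data flow: 3D structure is flattened into the same 2D spatial layout as the pose embedding so the two cues can be combined per location before conditioning the diffusion model.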