Deriving sophisticated 3D motions from sparse keyframes is a particularly challenging problem because the result must be continuous and exceptionally precise at the skeletal level. Action features can often be derived accurately from the full series of keyframes, so leveraging global context with transformers has been a promising data-driven embedding approach. However, existing methods often take as input intermediate frames that are interpolated from the keyframes with basic interpolation methods to ensure continuity, which results in a trivial local minimum during training. In this paper, we propose a novel framework that formulates latent motion manifolds with keyframe-based constraints and takes the continuous nature of intermediate token representations into account. Specifically, our proposed framework consists of two stages for identifying a latent motion subspace, namely a keyframe encoding stage and an intermediate token generation stage, followed by a motion synthesis stage that extrapolates and composes motion data from the manifolds. In extensive experiments on the LaFAN1 and CMU Mocap datasets, our proposed method demonstrates both superior interpolation accuracy and high visual similarity to ground-truth motions.
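To make the "basic interpolation" baseline mentioned above concrete, the following is a minimal sketch of linear interpolation between sparse keyframes. It is not the proposed method and not taken from the paper: the function name `lerp_keyframes`, the dictionary layout, and the use of flat pose vectors are all assumptions made for illustration. Practical systems typically interpolate joint rotations differently (for example with spherical interpolation of quaternions), which this sketch does not attempt.

```python
from bisect import bisect_right


def lerp_keyframes(keyframes, num_frames):
    """Fill every frame in [0, num_frames) by linear interpolation.

    keyframes: dict mapping frame index -> pose vector (list of floats).
    Hypothetical layout: the first and last frames must both be keyframes.
    """
    times = sorted(keyframes)
    if times[0] != 0 or times[-1] != num_frames - 1:
        raise ValueError("first and last frames must be keyframes")

    frames = []
    for t in range(num_frames):
        # Index of the last keyframe at or before frame t.
        i = bisect_right(times, t) - 1
        t0 = times[i]
        if t0 == t:
            # Frame t is itself a keyframe: copy it unchanged.
            frames.append(list(keyframes[t0]))
            continue
        t1 = times[i + 1]
        w = (t - t0) / (t1 - t0)  # blend weight in (0, 1)
        p0, p1 = keyframes[t0], keyframes[t1]
        frames.append([(1 - w) * a + w * b for a, b in zip(p0, p1)])
    return frames


# Two keyframes, five frames in total (toy 2-D "pose" vectors).
motion = lerp_keyframes({0: [0.0, 0.0], 4: [4.0, 8.0]}, 5)
print(motion[2])
```

Intermediate frames produced this way already lie close to the keyframes on either side, which is one plausible reading of why feeding them to a model as input can encourage the trivial solution the abstract describes.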