Today's Mixed Reality head-mounted displays track the user's head pose in world space as well as the user's hands for interaction in both Augmented Reality and Virtual Reality scenarios. While this is adequate to support user input, it unfortunately limits users' virtual representations to just their upper bodies. Current systems thus resort to floating avatars, whose limitation is particularly evident in collaborative settings. To estimate full-body poses from the sparse input sources, prior work has incorporated additional trackers and sensors at the pelvis or lower body, which increases setup complexity and limits practical application in mobile settings. In this paper, we present AvatarPoser, the first learning-based method that predicts full-body poses in world coordinates using only motion input from the user's head and hands. Our method builds on a Transformer encoder to extract deep features from the input signals and decouples global motion from the learned local joint orientations to guide pose estimation. To obtain accurate full-body motions that resemble motion capture animations, we refine the arm joints' positions using an optimization routine with inverse kinematics to match the original tracking input. In our evaluation on large motion capture datasets (AMASS), AvatarPoser achieves new state-of-the-art results. At the same time, our method's inference speed supports real-time operation, making it a practical interface for holistic avatar control and representation in Metaverse applications.
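To make the high-level description above concrete, the following is a minimal PyTorch sketch of a Transformer-encoder pose estimator that maps windowed head and hand tracking signals to local joint orientations and a decoupled global (root) motion. All names (e.g., SparsePoseEstimator), feature dimensions, the 22-joint output, and the windowed-input layout are illustrative assumptions, not the paper's released implementation; the inverse-kinematics refinement of the arm joints is omitted here.

```python
# Hypothetical sketch of a Transformer-encoder pose estimator for sparse
# head/hand input (PyTorch). Dimensions and module names are assumptions.
import torch
import torch.nn as nn


class SparsePoseEstimator(nn.Module):
    """Maps per-frame head/hand tracking signals to body-joint rotations."""

    def __init__(self, num_devices=3, feat_per_device=18,
                 num_joints=22, d_model=256, num_layers=3):
        super().__init__()
        in_dim = num_devices * feat_per_device  # e.g. 6D rotation + position + velocities
        self.embed = nn.Linear(in_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Two heads: local joint rotations (6D per joint) and global root motion.
        self.joint_head = nn.Linear(d_model, num_joints * 6)
        self.global_head = nn.Linear(d_model, 6)

    def forward(self, x):
        # x: (batch, time, num_devices * feat_per_device) window of tracking input
        h = self.encoder(self.embed(x))
        h_last = h[:, -1]                         # pose of the most recent frame
        local_rot6d = self.joint_head(h_last)     # learned local joint orientations
        global_rot6d = self.global_head(h_last)   # decoupled global (root) motion
        return local_rot6d, global_rot6d


if __name__ == "__main__":
    model = SparsePoseEstimator()
    window = torch.randn(1, 40, 3 * 18)  # one 40-frame window of head/hand signals
    local_pose, global_pose = model(window)
    print(local_pose.shape, global_pose.shape)  # torch.Size([1, 132]) torch.Size([1, 6])
```

In a full system of this kind, the predicted local rotations would be applied to a body model and the arm chain would then be refined with an inverse-kinematics optimization so the end effectors match the original head and hand tracking input.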