The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve acceptable generation quality. However, most existing sign language datasets are video-based, provide only automatically reconstructed 2D human poses (i.e., keypoints), and lack accurate 3D information. Moreover, the current state of the art in automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur, resulting in poor reconstruction quality. To address this, we introduce DexAvatar, a novel framework that reconstructs biomechanically accurate, fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, improving body and hand pose estimation by 35.11% over the state of the art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.
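The abstract does not spell out DexAvatar's internals, but the general recipe it alludes to, fitting 3D pose to 2D video evidence while a learned prior keeps the result plausible under occlusion and blur, can be sketched. Below is a minimal, hypothetical illustration of that idea, not the authors' implementation: `PosePrior`, `project`, `fit_frame`, and all constants are placeholders assumed for this sketch.

```python
# Hedged sketch of prior-guided 3D pose fitting from 2D keypoints.
# NOT DexAvatar's actual code; every name here is a hypothetical stand-in.
import torch
import torch.nn as nn

NUM_JOINTS = 21   # e.g. one hand; a full body+hands model has many more
LATENT_DIM = 32

class PosePrior(nn.Module):
    """Stand-in for a learned hand/body prior: decodes a latent code into
    3D joint positions. In practice this would be, e.g., a VAE or flow
    prior trained on motion-capture data."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_JOINTS * 3),
        )

    def forward(self, z):
        return self.decoder(z).view(-1, NUM_JOINTS, 3)

def project(joints_3d, focal=1000.0):
    """Perspective projection of 3D joints onto the image plane."""
    return focal * joints_3d[..., :2] / (joints_3d[..., 2:3] + 1e-8)

def fit_frame(keypoints_2d, conf, prior, steps=200, lr=0.05, w_prior=1e-3):
    """Optimize a latent pose code so projected 3D joints match detected
    2D keypoints; the Gaussian penalty on z keeps the pose on the prior's
    manifold, which is what suppresses occlusion/blur artifacts."""
    z = torch.zeros(1, LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Offset pushes the (untrained, toy) pose in front of the camera.
        joints = prior(z) + torch.tensor([0.0, 0.0, 2.0])
        reproj = (conf * (project(joints) - keypoints_2d).pow(2).sum(-1)).mean()
        loss = reproj + w_prior * z.pow(2).sum()
        loss.backward()
        opt.step()
    return prior(z).detach()

if __name__ == "__main__":
    prior = PosePrior()                        # pretrained weights would load here
    kps = torch.randn(1, NUM_JOINTS, 2) * 50   # fake 2D detections for the demo
    conf = torch.ones(1, NUM_JOINTS)           # per-keypoint detector confidences
    print(fit_frame(kps, conf, prior).shape)   # -> torch.Size([1, 21, 3])
```

The per-keypoint confidence weighting is the standard way such pipelines downweight exactly the frames the abstract mentions as failure cases (self-occluded or motion-blurred detections), letting the prior dominate where the 2D evidence is unreliable.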