Forecasting how human hands move in egocentric views is critical for applications such as augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate possible future hand waypoints, yet they still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present Uni-Hand, a universal hand motion forecasting framework that supports multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities through vision-language fusion, global context incorporation, and task-aware text embedding injection to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion model is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast specific joint waypoints of the wrist or fingers in addition to the widely studied hand center points. Furthermore, Uni-Hand can additionally predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work in the literature to incorporate downstream-task evaluation, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. Experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation on multiple downstream tasks also demonstrates impressive human-robot policy transfer for robotic manipulation and effective feature enhancement for action anticipation/recognition.
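To make the dual-branch design concrete, the sketch below illustrates one way such a denoiser could be wired up in PyTorch: two branches denoise the hand and head waypoint sequences while attending to each other and to a fused multimodal context, with a target indicator selecting the forecast joint. This is a minimal, hypothetical illustration; all module names, dimensions, and the cross-attention layout are our own assumptions and do not reflect the paper's actual implementation.

```python
# Hypothetical sketch: dual-branch diffusion denoiser for joint hand/head
# waypoint forecasting, conditioned on a fused vision-language context and a
# target indicator (hand center / wrist / finger joint). Not the authors' code.
import torch
import torch.nn as nn


class DualBranchDenoiser(nn.Module):
    def __init__(self, d_model=256, waypoint_dim=3, n_targets=3):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        # Target indicator selects which joint trajectory to forecast.
        self.target_embed = nn.Embedding(n_targets, d_model)
        self.hand_in = nn.Linear(waypoint_dim, d_model)
        self.head_in = nn.Linear(waypoint_dim, d_model)
        # Cross-attention lets each branch attend to the other branch and the
        # shared context, modeling hand-head motion synergy in egocentric view.
        self.hand_cross = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.head_cross = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.hand_out = nn.Linear(d_model, waypoint_dim)
        self.head_out = nn.Linear(d_model, waypoint_dim)

    def forward(self, noisy_hand, noisy_head, t, context, target_id):
        # noisy_hand / noisy_head: (B, T, waypoint_dim) noisy future waypoints
        # t: (B, 1) diffusion timestep; context: (B, L, d_model) fused
        # vision-language features; target_id: (B,) joint selector.
        cond = self.time_embed(t) + self.target_embed(target_id)   # (B, D)
        h_hand = self.hand_in(noisy_hand) + cond.unsqueeze(1)
        h_head = self.head_in(noisy_head) + cond.unsqueeze(1)
        kv_for_hand = torch.cat([h_head, context], dim=1)
        kv_for_head = torch.cat([h_hand, context], dim=1)
        h_hand, _ = self.hand_cross(h_hand, kv_for_hand, kv_for_hand)
        h_head, _ = self.head_cross(h_head, kv_for_head, kv_for_head)
        # Predict per-branch noise (epsilon), as in standard DDPM training.
        return self.hand_out(h_hand), self.head_out(h_head)


if __name__ == "__main__":
    model = DualBranchDenoiser()
    B, T, L = 2, 12, 8
    eps_hand, eps_head = model(
        torch.randn(B, T, 3), torch.randn(B, T, 3),
        torch.rand(B, 1), torch.randn(B, L, 256),
        torch.randint(0, 3, (B,)))
    print(eps_hand.shape, eps_head.shape)  # (2, 12, 3) for each branch
```

Under this sketch, the 2D versus 3D prediction setting would only change `waypoint_dim`, and the target indicator embedding is what allows a single model to switch between hand-center, wrist, and finger-joint forecasting.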