We propose a novel learned deep prior of body motion for 3D hand shape synthesis and estimation in the domain of conversational gestures. Our model builds upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings. We formulate the learning of this prior as a prediction task: given body motion alone, predict 3D hand shape over time. Trained on 3D pose estimates obtained from a large-scale dataset of internet videos, our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input. We demonstrate the efficacy of our method on hand gesture synthesis from body motion input, and as a strong body prior for single-view image-based 3D hand pose estimation. Our method outperforms previous state-of-the-art approaches and generalizes beyond the monologue-based training data to multi-person conversations. Video results are available at http://people.eecs.berkeley.edu/~evonne_ng/projects/body2hands/.
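To make the body-to-hand formulation concrete, below is a minimal sketch of the prediction task as a sequence-to-sequence regression, assuming a PyTorch temporal-convolutional model. The module name `Body2HandsNet`, the use of per-joint arm-rotation features as input, and MANO-style hand pose parameters as output are illustrative assumptions, not the paper's exact architecture or losses.

```python
# Hypothetical sketch of the body-to-hand prediction task: given a
# sequence of arm-motion features, regress a sequence of hand pose
# parameters. All architecture details here are assumptions.
import torch
import torch.nn as nn

class Body2HandsNet(nn.Module):
    """Maps arm motion (B, T, arm_dim) to hand pose params (B, T, hand_dim)."""
    def __init__(self, arm_dim: int = 36, hand_dim: int = 96, width: int = 256):
        super().__init__()
        # 1D temporal convolutions so each predicted hand frame sees a
        # window of surrounding body motion, not a single pose.
        self.net = nn.Sequential(
            nn.Conv1d(arm_dim, width, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(width, width, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(width, hand_dim, kernel_size=5, padding=2),
        )

    def forward(self, arm_motion: torch.Tensor) -> torch.Tensor:
        # (B, T, C) -> (B, C, T) for Conv1d, then back to (B, T, C).
        x = arm_motion.transpose(1, 2)
        return self.net(x).transpose(1, 2)

# Training-step sketch: supervise with hand pose pseudo-ground-truth
# recovered from internet videos by an off-the-shelf 3D pose estimator.
model = Body2HandsNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
arm_seq = torch.randn(8, 64, 36)   # batch of 64-frame arm-motion clips
hand_gt = torch.randn(8, 64, 96)   # pseudo-GT hand pose parameters
loss = nn.functional.l1_loss(model(arm_seq), hand_gt)
opt.zero_grad()
loss.backward()
opt.step()
```

At test time the same model takes only the speaker's arm motion and emits a hand pose sequence, which is how the prior can serve both gesture synthesis and as a constraint for image-based hand pose estimation.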