Human motion prediction, which plays a key role in computer vision, generally requires a past motion sequence as input. However, in real applications, a complete and correct past motion sequence can be too expensive to obtain. In this paper, we propose a novel approach to predict future human motions from a much weaker condition, i.e., a single image, using mixture density network (MDN) modeling. In contrast to most existing deep human motion prediction approaches, the multimodal nature of the MDN enables the generation of diverse future motion hypotheses, which compensates well for the strong stochastic ambiguity arising from the single-image input and the inherent uncertainty of human motion. In designing the loss function, we further introduce an energy-based prior over the learnable parameters of the MDN to maintain motion coherence and improve prediction accuracy. Our trained model directly takes an image as input and generates multiple plausible motions that satisfy the given condition. Extensive experiments on two standard benchmark datasets demonstrate the effectiveness of our method in terms of prediction diversity and accuracy.
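The core modeling idea, an MDN head that maps a single image feature to a multimodal distribution over future motions, can be summarized in a short sketch. The following is an illustrative PyTorch reconstruction under assumed dimensions (`feat_dim`, `num_modes`, `motion_dim` are hypothetical placeholders), not the authors' implementation; the energy-based prior over the MDN parameters described above is omitted here.

```python
# Minimal MDN-head sketch for single-image motion prediction (assumptions only,
# not the paper's released code). An image backbone is assumed to produce a
# feature vector of size feat_dim; the head outputs a K-component diagonal
# Gaussian mixture over a flattened future motion sequence of size motion_dim.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    def __init__(self, feat_dim=512, num_modes=5, motion_dim=25 * 48):
        super().__init__()
        self.num_modes = num_modes
        self.motion_dim = motion_dim
        self.pi = nn.Linear(feat_dim, num_modes)                      # mixture weights
        self.mu = nn.Linear(feat_dim, num_modes * motion_dim)         # per-mode mean motion
        self.log_sigma = nn.Linear(feat_dim, num_modes * motion_dim)  # per-mode log std

    def forward(self, feat):
        b = feat.size(0)
        pi = F.softmax(self.pi(feat), dim=-1)                         # (B, K)
        mu = self.mu(feat).view(b, self.num_modes, self.motion_dim)   # (B, K, D)
        sigma = self.log_sigma(feat).view(b, self.num_modes, self.motion_dim).exp()
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, target):
    """Standard MDN loss: negative log-likelihood of the ground-truth motion
    under the predicted mixture of diagonal Gaussians."""
    target = target.unsqueeze(1)                                      # (B, 1, D) broadcasts to (B, K, D)
    log_comp = -0.5 * (((target - mu) / sigma) ** 2
                       + 2 * sigma.log()
                       + math.log(2 * math.pi)).sum(dim=-1)           # (B, K) per-mode log density
    return -torch.logsumexp(pi.log() + log_comp, dim=-1).mean()
```

At inference, diverse hypotheses fall out naturally: each mixture component's mean `mu[:, k]` (or a sample drawn from component k) is one plausible future motion, which is what makes the MDN formulation suited to the strong ambiguity of a single-image condition.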