Human motion prediction, which plays a key role in computer vision, generally requires a past motion sequence as input. However, in real applications, a complete and correct past motion sequence can be too expensive to obtain. In this paper, we propose a novel approach to predicting future human motions from a much weaker condition, i.e., a single image, using mixture density network (MDN) modeling. Contrary to most existing deep human motion prediction approaches, the multimodal nature of the MDN enables the generation of diverse future motion hypotheses, which compensates well for the strong stochastic ambiguity introduced by the single-image input and the inherent uncertainty of human motion. In designing the loss function, we further introduce an energy-based formulation that flexibly imposes prior losses over the learnable parameters of the MDN, maintaining motion coherence and improving prediction accuracy through customized energy functions. Our trained model directly takes an image as input and generates multiple plausible motions that satisfy the given condition. Extensive experiments on two standard benchmark datasets demonstrate the effectiveness of our method in terms of both prediction diversity and accuracy.
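To make the MDN idea concrete, the following is a minimal sketch (not the paper's actual architecture) of how a mixture density head can map an image feature to mixture parameters and then draw multiple diverse motion hypotheses. All dimensions, weight matrices, and function names here are illustrative assumptions; a real model would learn these weights and predict full pose sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): feature size D,
# K mixture components, pose dimension P.
D, K, P = 16, 5, 8

# Randomly initialized stand-ins for a trained MDN output layer.
W_pi = rng.normal(size=(D, K))
W_mu = rng.normal(size=(D, K * P))
W_sigma = rng.normal(size=(D, K))

def mdn_head(feat):
    """Map an image feature vector to mixture parameters (pi, mu, sigma)."""
    logits = feat @ W_pi
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                      # mixture weights, sum to 1
    mu = (feat @ W_mu).reshape(K, P)    # one mean pose per component
    sigma = np.exp(feat @ W_sigma)      # positive per-component std-devs
    return pi, mu, sigma

def sample_hypotheses(feat, n):
    """Draw n diverse motion hypotheses from the predicted mixture."""
    pi, mu, sigma = mdn_head(feat)
    comps = rng.choice(K, size=n, p=pi)           # pick components
    noise = rng.normal(size=(n, P))
    return mu[comps] + sigma[comps, None] * noise  # Gaussian samples

feat = rng.normal(size=D)          # stand-in for an encoded input image
hyps = sample_hypotheses(feat, n=10)
print(hyps.shape)                  # ten plausible future poses
```

Because each sample may come from a different mixture component, the hypotheses can cover distinct motion modes rather than collapsing to a single average prediction, which is the property the abstract highlights.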