We propose the task of forecasting characteristic 3D poses: given a monocular video observation of a person, predict a future 3D pose of that person in a likely action-defining, characteristic pose - for instance, from observing a person reaching for a banana, predict the pose of that person eating the banana. Prior work on human motion prediction estimates future poses at fixed time intervals. Although easy to define, this frame-by-frame formulation confounds the temporal and intentional aspects of human action. Instead, taking inspiration from goal-directed behavior, we define a semantically meaningful pose prediction task that decouples the predicted pose from time. To predict characteristic poses, we propose a probabilistic approach that first models the multi-modality in the distribution of likely characteristic poses, then samples future pose hypotheses from the predicted distribution in an autoregressive fashion to model dependencies between joints, and finally optimizes the resulting pose under bone-length and joint-angle constraints. To evaluate our method, we construct a dataset of manually annotated characteristic 3D poses. Our experiments on this dataset suggest that our probabilistic approach outperforms state-of-the-art methods by 22% on average.
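The pipeline sketched in the abstract - sample a pose joint by joint from a predicted multimodal distribution, then enforce skeleton constraints - can be illustrated with a toy example. Everything below is a hypothetical stand-in: `toy_mixture` replaces the learned conditional distribution, the four-joint chain skeleton and bone lengths are invented for illustration, and the constraint step is a simple per-bone projection rather than the paper's actual optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 4  # toy chain skeleton (hypothetical, not the paper's skeleton)
N_MODES = 3   # mixture components per joint

def sample_pose_autoregressive(predict_mixture, n_joints=N_JOINTS):
    """Sample one pose hypothesis joint by joint.

    `predict_mixture` maps the joints sampled so far to mixture
    parameters (weights, means, stds) for the next joint, modeling
    dependencies between joints autoregressively.
    """
    joints = []
    for _ in range(n_joints):
        weights, means, stds = predict_mixture(joints)
        k = rng.choice(len(weights), p=weights)       # pick one mode
        joints.append(rng.normal(means[k], stds[k]))  # sample a 3D position
    return np.stack(joints)

def toy_mixture(prev_joints):
    """Placeholder conditional: modes spread along x, small noise."""
    j = len(prev_joints)
    weights = np.full(N_MODES, 1.0 / N_MODES)
    means = np.stack([np.array([m, float(j), 0.0]) for m in range(N_MODES)])
    stds = np.full((N_MODES, 3), 0.05)
    return weights, means, stds

def enforce_bone_lengths(pose, parents, lengths):
    """Project each joint onto a sphere of the target bone length
    around its parent; the root joint (parent -1) stays fixed."""
    out = pose.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue
        bone = out[j] - out[p]
        norm = np.linalg.norm(bone)
        if norm > 1e-8:
            out[j] = out[p] + bone * (lengths[j] / norm)
    return out

pose = sample_pose_autoregressive(toy_mixture)
parents = [-1, 0, 1, 2]         # simple kinematic chain
lengths = [0.0, 1.0, 1.0, 1.0]  # target bone length per joint
refined = enforce_bone_lengths(pose, parents, lengths)
```

Sampling joints sequentially (rather than independently) lets each joint's distribution condition on the joints already placed, which keeps hypotheses internally consistent; the projection step then restores exact bone lengths that sampling alone cannot guarantee.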