We propose to model longer-term future human behavior by jointly predicting action labels and 3D characteristic poses (3D poses representative of the associated actions). While previous work has considered action and 3D pose forecasting separately, we observe that the nature of the two tasks is coupled, and thus we predict them together. Starting from an input 2D video observation, we jointly predict a future sequence of actions along with 3D poses characterizing these actions. Since coupled action labels and 3D pose annotations are difficult and expensive to acquire for videos of complex action sequences, we train our approach with action labels and 2D pose supervision from two existing action video datasets, in tandem with an adversarial loss that encourages likely 3D predicted poses. Our experiments demonstrate the complementary nature of joint action and characteristic 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and outperforms alternative approaches to forecast actions and characteristic 3D poses.
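The core idea of joint prediction can be illustrated with a minimal sketch. This is not the paper's architecture; the latent dimension, action vocabulary size, 17-joint skeleton, and the linear decoder heads are all illustrative assumptions. It only shows the structural coupling: one shared latent from the observed video is decoded into both a distribution over future action labels and a characteristic 3D pose.

```python
import numpy as np

# Hypothetical sketch, NOT the paper's model: a shared latent summarizing the
# observed 2D video is decoded jointly by two heads, one for action labels and
# one for a characteristic 3D pose, so the two predictions share information.

rng = np.random.default_rng(0)

NUM_ACTIONS = 5    # assumed size of the action vocabulary
NUM_JOINTS = 17    # assumed skeleton with 17 joints
LATENT_DIM = 32    # assumed latent size

# Toy "encoder" output standing in for the video representation.
latent = rng.standard_normal(LATENT_DIM)

# Two decoder heads sharing the same latent (random toy weights).
W_action = rng.standard_normal((NUM_ACTIONS, LATENT_DIM)) * 0.1
W_pose = rng.standard_normal((NUM_JOINTS * 3, LATENT_DIM)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Joint outputs: a future-action distribution and a 3D pose characterizing it.
action_probs = softmax(W_action @ latent)             # shape (NUM_ACTIONS,)
char_pose = (W_pose @ latent).reshape(NUM_JOINTS, 3)  # shape (NUM_JOINTS, 3)

print(action_probs.shape, char_pose.shape)
```

In the full approach, the pose head's output would additionally be scored by an adversarial discriminator that encourages plausible 3D poses, compensating for the lack of 3D annotations.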