Stochastic human motion prediction aims to forecast multiple plausible future motions given a single observed pose sequence. Most previous works focus on designing elaborate losses to improve accuracy, while diversity is typically obtained by randomly sampling a set of latent variables from the latent prior, which are then decoded into possible motions. This joint training of sampling and decoding, however, suffers from posterior collapse: the learned latent variables tend to be ignored by a strong decoder, which limits diversity. Instead, inspired by the diffusion process in nonequilibrium thermodynamics, we propose MotionDiff, a diffusion probabilistic model that treats the kinematics of human joints as heated particles, which diffuse from their original states to a noise distribution. This process offers a natural way to obtain "whitened" latents without any trainable parameters, and human motion prediction can be regarded as the reverse diffusion process that converts the noise distribution into realistic future motions conditioned on the observed sequence. Specifically, MotionDiff consists of two parts: a spatial-temporal transformer-based diffusion network that generates diverse yet plausible motions, and a graph convolutional network that further refines the outputs. Experimental results on two datasets demonstrate that our model achieves competitive performance in terms of both accuracy and diversity.
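To illustrate how a forward diffusion process can "whiten" a motion sequence without any trainable parameters, the following is a minimal sketch of the standard closed-form noising step used in denoising diffusion probabilistic models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. It is not the MotionDiff implementation: the linear noise schedule, the number of steps, the tensor shape (frames × joints × 3), and the function names `make_alpha_bar` and `diffuse` are all assumptions chosen for illustration.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule.

    The schedule and its endpoints are assumed values, not taken from the paper.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form; no learned parameters involved."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return xt, noise


rng = np.random.default_rng(0)

# Hypothetical future motion: 25 frames, 17 joints, 3D coordinates.
x0 = rng.standard_normal((25, 17, 3))

alpha_bar = make_alpha_bar()
x_early, _ = diffuse(x0, 0, alpha_bar, rng)                     # almost the clean motion
x_late, _ = diffuse(x0, len(alpha_bar) - 1, alpha_bar, rng)     # almost pure Gaussian noise
```

At the first step nearly all of the original signal is retained, while at the last step the retained fraction is close to zero, so the sample is approximately standard Gaussian noise. A reverse model, conditioned on the observed sequence, would then be trained to undo this corruption step by step.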