In this work, we investigate a simple and classic conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and the Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) yields high-quality discrete representations. For the GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT outperforms competitive approaches, including recent diffusion-based methods. For example, on HumanML3D, currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms the 0.630 of MotionDiffuse. Additionally, we conduct analyses on HumanML3D and observe that dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
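To make the corruption strategy concrete, the following is a minimal sketch (not the authors' released code) of the idea described above: during GPT training, ground-truth motion-code indices are randomly replaced with other codebook indices, so the model learns to continue from imperfect prefixes like those it produces at inference time. The function name, `corrupt_prob`, and `codebook_size` values are illustrative assumptions.

```python
import torch

def corrupt_code_indices(indices: torch.Tensor,
                         codebook_size: int,
                         corrupt_prob: float = 0.5) -> torch.Tensor:
    """Randomly replace a fraction of VQ-VAE code indices with random codes.

    indices: (batch, seq_len) LongTensor of ground-truth code indices.
    """
    # Boolean mask marking which positions to corrupt.
    mask = torch.rand_like(indices, dtype=torch.float) < corrupt_prob
    # Uniformly sampled replacement codes from the same codebook.
    random_codes = torch.randint_like(indices, low=0, high=codebook_size)
    return torch.where(mask, random_codes, indices)

# Usage sketch: corrupted indices serve as the GPT inputs, while the loss is
# still computed against the original (uncorrupted) target indices.
codes = torch.randint(0, 512, (4, 50))  # dummy batch of code sequences
inputs = corrupt_code_indices(codes, codebook_size=512, corrupt_prob=0.5)
```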