Generating motion aligned with text descriptions has attracted increasing attention recently. However, open-vocabulary human motion generation remains largely unexplored and suffers from a lack of diverse labeled data. Fortunately, recent studies of large multi-modal foundation models (e.g., CLIP) have demonstrated superior performance on few/zero-shot image-text alignment, greatly reducing the need for manually labeled data. In this paper, we take advantage of CLIP for open-vocabulary 3D human motion generation in a zero-shot manner. Specifically, our model is composed of two stages, i.e., text2pose and pose2motion. For text2pose, to address the difficulty of optimization under direct supervision from CLIP, we propose to carve the versatile CLIP model into a slimmer but more specific model for aligning 3D poses and texts, via a novel pipeline distillation strategy. Optimizing with the distilled 3D pose-text model, we concretize the text-pose knowledge of CLIP into a text2pose generator effectively and efficiently. As for pose2motion, drawing inspiration from advanced language models, we pretrain a transformer-based motion model, which compensates for CLIP's lack of motion dynamics. After that, by formulating the generated poses from the text2pose stage as prompts, the motion model can generate motions conditioned on those poses in a controllable and flexible manner. Our method is validated against strong baselines and obtains substantial improvements. The code will be released here.
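To make the two-stage pipeline concrete, the sketch below illustrates the inference flow described above: a text embedding is mapped to a 3D pose (text2pose), and the pose is then used as a prompt for a transformer-based motion model (pose2motion). All module names, dimensions (e.g., a 72-dim SMPL-style pose, 512-dim CLIP-like text feature), and the placeholder architectures are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of the two-stage text2pose -> pose2motion pipeline.
# Placeholder architectures; the real models and training are not shown.
import torch
import torch.nn as nn

class Text2Pose(nn.Module):
    """Stage 1: maps a text embedding (e.g., from CLIP's text encoder)
    to a 3D pose; in the paper this is trained against a distilled
    pose-text alignment model."""
    def __init__(self, text_dim=512, pose_dim=72):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

class Pose2Motion(nn.Module):
    """Stage 2: a transformer that expands a pose prompt into a motion
    sequence (B, T, pose_dim); stands in for the pretrained motion model."""
    def __init__(self, pose_dim=72, seq_len=60):
        super().__init__()
        self.seq_len = seq_len
        layer = nn.TransformerEncoderLayer(
            d_model=pose_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, pose_prompt):
        # Tile the prompt pose over time, then let the transformer refine
        # the sequence into a motion.
        seq = pose_prompt.unsqueeze(1).repeat(1, self.seq_len, 1)
        return self.encoder(seq)

# Usage: text feature -> keyframe pose -> motion sequence.
text_emb = torch.randn(1, 512)        # stand-in for a CLIP text feature
pose = Text2Pose()(text_emb)          # (1, 72) generated pose
motion = Pose2Motion()(pose)          # (1, 60, 72) generated motion
print(pose.shape, motion.shape)
```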