We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and learning mesh style attributes. Prior work fails to generate plausible results when the artist-designed mesh content does not conform to the text from the beginning. Instead, we build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor first suggests a human motion that conforms to the prompt in a coarse-to-fine manner. Then, we propose a synthesize-through-optimization method that adds geometric detail and texture to the recommended mesh sequence, disentangled from the pose of each frame. This allows the style attributes to conform to the prompt in a temporally consistent and pose-agnostic manner. The decoupled neural optimization also enables spatio-temporal view augmentation from multi-frame human motion. We further propose mask-weighted embedding attention, which stabilizes the optimization process by rejecting distracting renders that contain few foreground pixels. We demonstrate that CLIP-Actor produces a plausible, human-recognizable stylized 3D human mesh in motion, with detailed geometry and texture, from a natural language prompt.
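The abstract does not state how the mask-weighted embedding attention is computed. As a rough illustration of the idea only, the sketch below weights each render's image embedding by its fraction of foreground pixels and drops renders below a threshold before pooling. The function names, the `min_ratio` threshold, and the choice of a simple weighted average are assumptions made for this example, not the formulation used by CLIP-Actor.

```python
import numpy as np


def foreground_ratio(masks: np.ndarray) -> np.ndarray:
    """Fraction of foreground pixels per render; masks has shape (N, H, W)."""
    flat = masks.reshape(masks.shape[0], -1).astype(np.float64)
    return flat.mean(axis=1)


def mask_weighted_embedding(embeddings: np.ndarray,
                            masks: np.ndarray,
                            min_ratio: float = 0.1):
    """Pool N render embeddings (N, D) into one unit-norm vector.

    Assumption: a render's weight is its foreground ratio, and renders
    whose ratio falls below `min_ratio` (a hypothetical threshold) get
    weight zero, so nearly empty views do not influence the result.
    """
    ratios = foreground_ratio(masks)
    weights = np.where(ratios >= min_ratio, ratios, 0.0)
    if weights.sum() == 0.0:
        # Every render was rejected: fall back to a uniform average.
        weights = np.ones_like(ratios)
    weights = weights / weights.sum()
    pooled = weights @ embeddings
    norm = max(np.linalg.norm(pooled), 1e-12)
    return pooled / norm, weights


def text_image_similarity(pooled: np.ndarray, text_embedding: np.ndarray) -> float:
    """Cosine similarity between the pooled image embedding and a text embedding."""
    text_unit = text_embedding / max(np.linalg.norm(text_embedding), 1e-12)
    return float(pooled @ text_unit)
```

In an optimization loop, the negative of `text_image_similarity` could serve as a loss term; the embeddings here stand in for the outputs of an image and text encoder, which this sketch does not include.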