We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal, to-the-point information. By carefully crafting prompts for LLMs, we generate richer, fine-grained descriptions of each action. We show that using these detailed descriptions instead of the original action phrases leads to better alignment of the text and motion spaces. The approach is generic and compatible with both stochastic (e.g., VAE-based) and deterministic (e.g., MotionCLIP) text-to-motion models; it also allows multiple text descriptions to be utilized per action. Our experiments show (i) noticeable qualitative and quantitative improvement in the quality of synthesized motions, (ii) benefits of utilizing multiple LLM-generated descriptions, (iii) suitability of the prompt function, and (iv) zero-shot generation capabilities of the proposed approach. Project page: https://actiongpt.github.io
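As a minimal, hypothetical sketch of the prompt-expansion step described above: the helper names (`prompt_function`, `expand_action`) and the prompt wording are illustrative assumptions, not the paper's exact implementation, and `llm` stands for any text-completion callable.

```python
from typing import Callable, List

def prompt_function(action_phrase: str) -> str:
    """Wrap a terse action phrase in an instruction that asks the LLM for a
    detailed, body-level description of the movement (illustrative wording)."""
    return (
        f"Describe in detail the body movements of a person performing "
        f"the action '{action_phrase}'."
    )

def expand_action(action_phrase: str,
                  llm: Callable[[str], str],
                  num_descriptions: int = 4) -> List[str]:
    # Sample several LLM-generated descriptions for one action phrase;
    # the abstract notes benefits from utilizing multiple such descriptions.
    prompt = prompt_function(action_phrase)
    return [llm(prompt) for _ in range(num_descriptions)]

# Usage with any LLM wrapper that maps a prompt string to generated text:
# descriptions = expand_action("a person walks forward", my_llm, num_descriptions=4)
```

These descriptions would then replace the original action phrase as input to the downstream text-to-motion model.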