The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods learned a direct correspondence between co-speech gesture representations and produced motions, yielding gestures that appear natural at first glance but that human raters often judge unconvincing. We present an approach that pre-trains partial gesture sequences using a generative adversarial network combined with a quantization pipeline. The resulting codebook vectors serve as both input and output of our framework, forming the basis for the generation and reconstruction of gestures. By learning a mapping within the latent space rather than mapping directly to a vector representation, the framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to validate our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
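To make the quantization step concrete, the following is a minimal sketch of a VQ-VAE-style codebook lookup, the general technique behind the codebook vectors described above. The class name, codebook size, dimensions, and the straight-through gradient trick are illustrative assumptions, not the implementation used in this work.

```python
# A minimal sketch of vector quantization with a learned codebook
# (VQ-VAE-style nearest-neighbour lookup). Names and dimensions are
# illustrative assumptions, not this paper's actual implementation.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        # Learned codebook: each row is one discrete gesture token.
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, frames, code_dim) continuous encoder output.
        b, t, d = z.shape
        # Distance from every frame embedding to every codebook vector.
        dists = torch.cdist(z.reshape(b * t, d), self.codebook.weight)
        indices = dists.argmin(dim=-1).view(b, t)   # discrete code indices
        quantized = self.codebook(indices)          # (batch, frames, code_dim)
        # Straight-through estimator: copy gradients past the
        # non-differentiable argmin so the encoder still trains.
        return z + (quantized - z).detach()

# Usage: quantize a batch of 2 sequences of 30 frames each.
vq = VectorQuantizer()
codes = vq(torch.randn(2, 30, 64))
```

Generating in this discrete latent space, rather than regressing raw joint vectors, is what lets the framework snap outputs onto learned gesture fragments instead of producing averaged, artifact-prone motion.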