A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology for film, games, virtual social spaces, and interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training datasets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech, in interaction with other speakers, and in the environment; performing gesture evaluation; and integrating gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.
