Speech-driven gesture synthesis is a field of growing interest in virtual human creation. However, a critical challenge is the inherently intricate one-to-many mapping between speech and gestures. Previous studies have explored generative models and achieved significant progress; nevertheless, most synthesized gestures remain noticeably less natural than human motion. This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models. The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module. The encoder extracts the temporal context of the speech input and historical gestures. The diffusion module learns a parameterized Markov chain that gradually converts a simple distribution into a complex one and generates gestures conditioned on the accompanying speech. Objective and subjective evaluations confirm that, compared with baselines, our approach produces natural and diverse gesticulation, demonstrating the benefits of diffusion-based models for speech-driven gesture synthesis.
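To make the two-stage design described above concrete, the sketch below pairs a small autoregressive encoder (an LSTM over speech features and past gestures, used here purely as an illustrative stand-in) with a conditional denoising loop that reverses a linear-beta DDPM noise schedule. The module names, layer sizes, the 45-dimensional gesture vector, and the noise schedule are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): an autoregressive temporal
# encoder over speech + past gestures, and a conditional DDPM reverse chain
# that denoises Gaussian noise into a gesture frame. All names, dimensions,
# and the beta schedule below are assumptions.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Summarizes speech features and previously generated gestures into a context vector."""
    def __init__(self, speech_dim=26, gesture_dim=45, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(speech_dim + gesture_dim, hidden_dim, batch_first=True)

    def forward(self, speech, past_gestures):
        # speech: (B, T, speech_dim), past_gestures: (B, T, gesture_dim)
        x = torch.cat([speech, past_gestures], dim=-1)
        _, (h, _) = self.lstm(x)
        return h[-1]  # (B, hidden_dim) temporal context

class DenoisingNet(nn.Module):
    """Predicts the noise added to a gesture frame, given timestep and context."""
    def __init__(self, gesture_dim=45, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(gesture_dim + hidden_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, gesture_dim),
        )

    def forward(self, x_t, t, context):
        t_emb = t.float().unsqueeze(-1) / 1000.0  # crude scalar timestep embedding
        return self.net(torch.cat([x_t, t_emb, context], dim=-1))

@torch.no_grad()
def sample_gesture(encoder, denoiser, speech, past_gestures, n_steps=1000):
    """Reverse the parameterized Markov chain: start from Gaussian noise and
    iteratively denoise into a gesture frame conditioned on the speech context."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    context = encoder(speech, past_gestures)
    x = torch.randn(speech.size(0), past_gestures.size(-1))  # start from pure noise
    for t in reversed(range(n_steps)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long)
        eps = denoiser(x, t_batch, context)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # (B, gesture_dim) denoised gesture frame
```

A call such as `sample_gesture(TemporalEncoder(), DenoisingNet(), torch.randn(1, 30, 26), torch.randn(1, 30, 45))` would produce one gesture frame; in an autoregressive rollout the sampled frame would be appended to `past_gestures` before generating the next one.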