Co-speech gesture generation aims to synthesize a gesture sequence that not only looks realistic but also matches the input speech audio. Our method generates movements of the complete upper body, including the arms, hands, and head. Although recent data-driven methods have achieved great success, challenges remain, such as limited variety, poor fidelity, and a lack of objective metrics. Motivated by the fact that speech cannot fully determine gesture, we design a method that learns a set of gesture template vectors to model the latent conditions, which relieves this ambiguity. In our method, the template vector determines the general appearance of a generated gesture sequence, while the speech audio drives subtle movements of the body; both are indispensable for synthesizing a realistic gesture sequence. Because an objective metric for gesture-speech synchronization is intractable, we adopt lip-sync error as a proxy metric to tune and evaluate the synchronization ability of our model. Extensive experiments demonstrate the superiority of our method in both objective and subjective evaluations of fidelity and synchronization.
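To make the conditioning scheme concrete, the following is a minimal sketch (not the authors' implementation) of the idea described above: a set of learned gesture template vectors is combined with per-frame audio features to drive a pose decoder. All module names, dimensions, and the choice of a GRU decoder are illustrative assumptions.

```python
# Minimal sketch of template-conditioned gesture generation.
# Hypothetical names and dimensions; the real model may differ.
import torch
import torch.nn as nn

class TemplateConditionedGestureGenerator(nn.Module):
    def __init__(self, num_templates=64, template_dim=128,
                 audio_dim=80, pose_dim=96, hidden_dim=256):
        super().__init__()
        # Learned template vectors: each fixes the general appearance
        # of a generated gesture sequence.
        self.templates = nn.Embedding(num_templates, template_dim)
        # Recurrent decoder: audio features drive subtle per-frame motion.
        self.decoder = nn.GRU(audio_dim + template_dim, hidden_dim,
                              batch_first=True)
        self.to_pose = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats, template_ids):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        # template_ids: (batch,) index of the chosen template vector
        b, t, _ = audio_feats.shape
        tmpl = self.templates(template_ids)           # (batch, template_dim)
        tmpl = tmpl.unsqueeze(1).expand(b, t, -1)     # broadcast over frames
        x = torch.cat([audio_feats, tmpl], dim=-1)
        h, _ = self.decoder(x)
        return self.to_pose(h)                        # (batch, frames, pose_dim)

# Usage: the same audio with different templates yields different but
# equally plausible gesture sequences.
model = TemplateConditionedGestureGenerator()
audio = torch.randn(2, 120, 80)                # 2 clips, 120 frames each
poses = model(audio, torch.tensor([3, 17]))
print(poses.shape)                             # torch.Size([2, 120, 96])
```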