In real life, people communicate using both speech and non-verbal signals such as gestures, facial expressions, or body pose. Non-verbal signals affect the meaning of a spoken utterance in many ways, and their absence impoverishes communication. Yet when users are represented as avatars, it is difficult to carry non-verbal signals into the virtual world along with the speech without specialized motion-capture hardware. In this paper, we propose a novel, data-driven technique for generating gestures directly from speech. Our approach applies Generative Adversarial Neural Networks (GANs) to model the correlation, rather than causation, between speech and gestures. This approach approximates neuroscience findings on how non-verbal communication and speech are correlated. We create a large dataset of speech and corresponding gestures in a 3D human pose format, from which our model learns the speaker-specific correlation. We evaluate the proposed technique in a user study inspired by the Turing test. For the study, we animate the generated gestures on a virtual character. We find that users cannot distinguish between the generated and the recorded gestures. Moreover, users are able to identify our synthesized gestures as related or not related to a given utterance.