During speech, people spontaneously gesticulate, and these gestures play a key role in conveying information. Similarly, realistic co-speech gestures are crucial for natural and smooth interaction with social agents. Current end-to-end co-speech gesture generation systems use a single modality to represent speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"); they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep learning-based model takes both acoustic and semantic representations of speech as input and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page: https://svito-zar.github.io/gesticulator.
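To illustrate the multimodal input/output interface described above, here is a minimal sketch (not the authors' actual architecture) of a model that fuses acoustic and semantic speech features and regresses per-frame joint-angle rotations. The feature dimensions, layer sizes, and class name `MultimodalGestureModel` are illustrative assumptions.

```python
# Minimal sketch of a multimodal gesture generator: encode each speech
# modality, fuse by concatenation, and decode a joint-angle sequence.
import torch
import torch.nn as nn

class MultimodalGestureModel(nn.Module):
    def __init__(self, audio_dim=26, text_dim=768, hidden_dim=256, n_joints=15):
        super().__init__()
        # Separate encoders for acoustic and semantic features (dims are assumptions).
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Recurrent decoder over the fused feature sequence.
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        # Output: 3 rotation angles per joint for every frame.
        self.out = nn.Linear(hidden_dim, n_joints * 3)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim); text_feats: (batch, frames, text_dim)
        fused = torch.cat([self.audio_enc(audio_feats), self.text_enc(text_feats)], dim=-1)
        hidden, _ = self.decoder(fused)
        return self.out(hidden)  # (batch, frames, n_joints * 3) joint-angle rotations

# Usage example with random tensors standing in for aligned speech features.
model = MultimodalGestureModel()
audio = torch.randn(2, 100, 26)    # e.g., MFCC-like acoustic features
text = torch.randn(2, 100, 768)    # e.g., word embeddings aligned to frames
gestures = model(audio, text)
print(gestures.shape)              # torch.Size([2, 100, 45])
```

The fused sequence of joint-angle rotations produced this way could then be retargeted to a virtual agent or humanoid robot skeleton, as the abstract describes.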