Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, whereas rhythm-related gesture properties (phase) can be predicted better from audio features than from text. These results are encouraging, as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
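To make the setup concrete, the following is a minimal sketch (not the authors' code) of the kind of model the abstract describes: a small sequence classifier that predicts a gesture property, such as gesture category, from time-aligned FastText word embeddings. The architecture, dimensions, and class count are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: predict a gesture property (e.g. category) from
# time-aligned FastText embeddings. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class GesturePropertyClassifier(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=128, n_classes=5):
        super().__init__()
        # GRU encodes the sequence of word embeddings over a gesture segment.
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        # x: (batch, time, embed_dim) time-aligned FastText vectors
        _, h = self.encoder(x)
        return self.head(h[-1])  # logits over gesture property classes

model = GesturePropertyClassifier()
dummy = torch.randn(8, 20, 300)   # 8 segments, 20 time steps, 300-d embeddings
logits = model(dummy)
print(logits.shape)               # torch.Size([8, 5])
```

An analogous model with prosodic audio features (e.g. pitch and energy contours) in place of the word embeddings would correspond to the audio-only condition compared in the experiments.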