Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned BERT embeddings) alone, but not from prosodic audio features, whereas rhythm-related gesture properties (phase) can be predicted from audio, from text (with word-level timing information), or from both. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
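To make the prediction setup concrete, the sketch below shows one way a gesture-property classifier over time-aligned BERT embeddings could be structured. This is a minimal illustration under stated assumptions, not the authors' actual model: the architecture, layer sizes, frame rate, and the class name `GesturePropertyClassifier` are all hypothetical choices made for clarity.

```python
# Minimal illustrative sketch (not the authors' model): predicting a
# gesture property (e.g. gesture category) for a speech window from
# word-aligned BERT embeddings. All sizes and names are assumptions.
import torch
import torch.nn as nn


class GesturePropertyClassifier(nn.Module):
    """Maps a window of word-aligned BERT embeddings to a gesture-property label."""

    def __init__(self, bert_dim=768, hidden_dim=256, num_classes=5):
        super().__init__()
        self.encoder = nn.GRU(bert_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, frames, bert_dim) -- BERT embeddings repeated per frame
        # according to word-level timing, as the abstract describes for text features.
        _, h = self.encoder(x)               # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)  # concatenate both directions
        return self.head(h)                  # logits over property classes


# Usage example: a batch of 8 two-second windows at 20 fps (40 frames).
model = GesturePropertyClassifier()
logits = model(torch.randn(8, 40, 768))
print(logits.shape)  # torch.Size([8, 5])
```

An audio-based variant would follow the same pattern, swapping the per-frame BERT embeddings for prosodic audio features of the corresponding window.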