This paper presents a novel framework for speech-driven gesture generation, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network through both objective and subjective evaluations. We also analyse the importance of smoothing the produced motion. Our results indicate that the proposed method improves on our baseline in terms of objective measures: for example, it better captures the motion dynamics and better matches the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference between the two studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-generation method.
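To make the post-processing concrete, the sketch below illustrates hip-centering and smoothing applied to a generated motion sequence. It is a minimal illustration, not the paper's exact implementation: the (frames, joints, 3) array layout, the hip-joint index, and the choice of a Savitzky-Golay filter are assumptions made for this example.

import numpy as np
from scipy.signal import savgol_filter

def postprocess_motion(motion, hip_joint=0, window=9, polyorder=3):
    """Hip-center and smooth a generated motion sequence.

    motion: array of shape (n_frames, n_joints, 3) holding 3D joint
    coordinates; hip_joint is the index of the reference joint
    (assumed to be 0 here).
    """
    # Hip-centering: express every joint relative to the hip position in
    # each frame, removing global translation from the generated motion.
    centered = motion - motion[:, hip_joint:hip_joint + 1, :]
    # Smoothing: a Savitzky-Golay filter along the time axis removes
    # frame-to-frame jitter while preserving the overall motion dynamics.
    return savgol_filter(centered, window_length=window,
                         polyorder=polyorder, axis=0)

# Example usage: post-process a 100-frame sequence of 20 joints.
motion = np.random.randn(100, 20, 3)
clean = postprocess_motion(motion)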