Synthesizing natural head motion to accompany speech is necessary for an embodied conversational agent to provide a rich interactive experience. Most prior work assesses the quality of generated head motion by comparing it against a single ground truth using an objective metric. Yet many plausible head motion sequences can accompany a given speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, although the generative model provides more diverse head motions, the motions it produces vary in perceptual quality. Finally, we show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work: investigating better objective metrics that correlate with human perception of quality.