Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies in a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose instead to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models, modified from state-of-the-art neural speech-synthesis engines, to achieve this goal. We evaluate the models in three carefully designed user studies: two that evaluate the synthesized speech and gesture in isolation, and a combined study that evaluates the models as they will be used in real-world applications, with speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against in all three tests. The model achieves this with faster synthesis and a greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem. Videos and code are available on our project page at https://swatsw.github.io/isg_icmi21/