Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video at https://simonalexanderson.github.io/IVA2020 .
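To make the two-stage design described above concrete, the following is a minimal, hypothetical Python sketch of such a pipeline: a text-to-speech stage turns text into audio, and an audio-driven motion model turns that audio into time-aligned full-body poses. The names used here (synthesise_speech, synthesise_gestures, Pose) are illustrative placeholders, not the system or code from the paper; the stubs only return dummy data so the control flow can be run end to end.

```python
# Hypothetical sketch of a text -> speech -> gesture pipeline.
# All function and class names are placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Pose:
    """One frame of full-body motion: joint rotations for a fixed skeleton."""
    joint_rotations: np.ndarray  # shape (num_joints, 3)


def synthesise_speech(text: str, sample_rate: int = 22050) -> np.ndarray:
    """Stand-in for a spontaneous-style TTS model trained on unscripted audio.

    Returns a mono waveform; here simply silence of a rough, word-count-based length.
    """
    duration_s = max(1.0, 0.4 * len(text.split()))  # crude words-to-seconds estimate
    return np.zeros(int(duration_s * sample_rate), dtype=np.float32)


def synthesise_gestures(waveform: np.ndarray,
                        sample_rate: int = 22050,
                        fps: int = 20,
                        num_joints: int = 45) -> List[Pose]:
    """Stand-in for a probabilistic audio-driven gesture model.

    Returns one pose per output frame, time-aligned with the input speech audio.
    """
    num_frames = int(len(waveform) / sample_rate * fps)
    return [Pose(np.zeros((num_joints, 3), dtype=np.float32))
            for _ in range(num_frames)]


def text_to_speech_and_gesture(text: str) -> Tuple[np.ndarray, List[Pose]]:
    """Chain the two stages so speech and gesture are generated from the same text."""
    audio = synthesise_speech(text)
    poses = synthesise_gestures(audio)
    return audio, poses


if __name__ == "__main__":
    audio, poses = text_to_speech_and_gesture("well, I guess we could try that")
    print(f"audio samples: {len(audio)}, motion frames: {len(poses)}")
```

The key design point this sketch reflects is that the gesture stage consumes the synthesised audio rather than the text directly, so the generated motion stays time-aligned with the generated speech.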