This brief literature review studies the problem of audiovisual speech synthesis, i.e., generating an animated talking head from input text. Due to the high complexity of this problem, we approach it as the composition of two subproblems: Text-to-Speech (TTS) synthesis and voice-driven talking-head animation. For TTS, we present models that map text to intermediate acoustic representations, e.g., mel-spectrograms, as well as models that generate voice signals conditioned on these intermediate representations, i.e., vocoders. For the talking-head animation problem, we categorize approaches based on whether they produce human faces or anthropomorphic figures, and we also discuss the importance of the choice of facial model in the latter case. Throughout the review, we briefly describe the most important work in audiovisual speech synthesis, highlighting the advantages and disadvantages of the various approaches.