We propose a novel, robust, and efficient Speech-to-Animation (S2A) approach for synchronized facial animation generation in human-computer interaction. Compared with conventional approaches, the proposed approach utilizes phonetic posteriorgrams (PPGs) of spoken phonemes as input to ensure cross-language and cross-speaker ability, and introduces the corresponding prosody features (i.e., pitch and energy) to further enhance the expressiveness of the generated animation. A mixture-of-experts (MoE)-based Transformer is employed to better model contextual information while providing significant gains in computational efficiency. Experiments demonstrate the effectiveness of the proposed approach in both objective and subjective evaluations, with a 17x inference speedup over the state-of-the-art approach.
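The abstract only names the building blocks, not their exact wiring. The following is a minimal sketch of the general idea, assuming top-1 expert routing, illustrative feature dimensions, and hypothetical class names (`MoEFeedForward`, `S2AEncoderLayer`) that are not taken from the paper: frame-level PPGs are concatenated with pitch and energy, and a Transformer layer replaces its dense feed-forward network with a sparse mixture of experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Mixture-of-experts feed-forward: a router picks the top-1 expert per frame."""

    def __init__(self, d_model=256, d_ff=1024, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (batch, frames, d_model)
        gate = F.softmax(self.router(x), dim=-1)         # routing weights per frame
        weight, idx = gate.max(dim=-1)                   # top-1 expert per frame
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                              # frames routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])              # only this subset is computed
        return out * weight.unsqueeze(-1)                # scale by gate weight


class S2AEncoderLayer(nn.Module):
    """Transformer layer whose feed-forward sub-layer is the MoE block above."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEFeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])        # self-attention over context
        return self.norm2(x + self.moe(x))               # sparse MoE feed-forward


# Inputs: frame-level PPGs concatenated with prosody (pitch, energy).
ppg = torch.randn(2, 100, 254)                 # hypothetical PPG dimension
prosody = torch.randn(2, 100, 2)               # pitch + energy per frame
features = torch.cat([ppg, prosody], dim=-1)   # (2, 100, 256) encoder input
hidden = S2AEncoderLayer()(features)           # frame-synchronized representations
```

Since each frame activates only one of the four expert networks, the feed-forward cost per frame stays roughly that of a single dense FFN while total capacity grows, which is consistent with the efficiency motivation stated above.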