This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g., identity, emotion, and hesitation). It is also robust to background noise and can handle audio recorded in a variety of situations (e.g., multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate facial animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still suffer from inaccurate lip-sync, limited expressivity, loss of person-specific information, and poor generalizability. We effectively employ the self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and the speaker identity enables the network to distinguish even the subtlest facial motions. We carried out extensive objective and subjective evaluations in comparison to the ground truth and state-of-the-art work. A perceptual user study demonstrates that our approach produces animations judged more realistic than the state of the art 78% of the time. In addition, our method is four times faster, as it eliminates the use of complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code via a GitHub repository link.
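To make the speech-encoding step concrete, the sketch below (not the authors' code) shows how frame-level features can be extracted from a pretrained HuBERT model with the Hugging Face Transformers library; the checkpoint name, pooling, and conditioning scheme are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: extracting HuBERT speech features that could condition
# a 3D facial animation decoder. Assumes a publicly available checkpoint;
# the paper's actual model and preprocessing may differ.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

CHECKPOINT = "facebook/hubert-base-ls960"  # assumed checkpoint, not confirmed by the paper

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
hubert = HubertModel.from_pretrained(CHECKPOINT)
hubert.eval()

# One second of dummy 16 kHz mono audio standing in for a real recording.
waveform = torch.randn(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (batch, frames, hidden_dim), e.g. (1, 49, 768)
    speech_features = hubert(inputs.input_values).last_hidden_state

# Hypothetical conditioning: broadcast a binary emotion flag and a one-hot
# speaker identity across frames and concatenate with the speech features,
# mirroring the kind of guidance the abstract describes.
emotion = torch.ones(1, speech_features.size(1), 1)          # 1 = expressive, 0 = neutral
speaker_id = torch.zeros(1, speech_features.size(1), 8)      # assumed 8 training subjects
speaker_id[..., 0] = 1.0
decoder_input = torch.cat([speech_features, emotion, speaker_id], dim=-1)
print(decoder_input.shape)
```

Because HuBERT is trained self-supervised on raw audio, these features carry prosodic and paralinguistic information alongside phonetic content, which is consistent with the abstract's claim of capturing non-lexical cues without a large lexicon.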