Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input and renders a talking face video that is synchronized with the speech and expresses the conditioned emotion. Objective evaluation of image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a human emotion recognition pilot study using generated videos in which the emotions of the audio and visual modalities are mismatched. Results show that on this task, humans respond significantly more strongly to the visual modality than to the audio modality.