While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of the synthesized videos, they pay less attention to lip motion jitters, which greatly undermine the realism of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct a systematic analysis of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and the output video, and we improve motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters in the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions applied to the input data of the neural renderer during training, which simulate the distortion seen at inference and reduce the mismatch; 3) an audio-fused transformer generator that models dependency among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric, the Motion Stability Index (MSI), which quantifies motion jitters as the reciprocal of the acceleration variance. Extensive experimental results show the superiority of our method on motion-stable talking face video generation, with better quality than previous systems.
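To make the smoothing and measurement ideas concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes the 3D face representation is available as a (T, D) array of per-frame coefficients or landmark coordinates. The smoother uses a fixed-width Gaussian kernel rather than the adaptive module described above, and the metric assumes that acceleration is the second-order finite difference over frames and that the per-dimension reciprocals of its variance are averaged. The names `gaussian_smooth`, `motion_stability_index`, `sigma`, and `eps` are hypothetical.

```python
import numpy as np


def gaussian_smooth(seq, sigma=2.0):
    """Smooth a (T, D) sequence along time with a fixed-width Gaussian kernel.

    Assumption: a plain, non-adaptive filter stands in for the adaptive module.
    """
    seq = np.asarray(seq, dtype=float)
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so constant signals are preserved
    # Edge-pad in time so the output has the same number of frames as the input.
    padded = np.pad(seq, ((radius, radius), (0, 0)), mode="edge")
    columns = [
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(seq.shape[1])
    ]
    return np.stack(columns, axis=1)


def motion_stability_index(seq, eps=1e-8):
    """Reciprocal of the variance of frame-to-frame acceleration.

    Assumption: acceleration is the second-order finite difference, and the
    per-dimension reciprocals are averaged into a single score.
    """
    seq = np.asarray(seq, dtype=float)
    accel = seq[2:] - 2.0 * seq[1:-1] + seq[:-2]  # shape (T - 2, D)
    variance = accel.var(axis=0)                  # shape (D,)
    return float(np.mean(1.0 / (variance + eps)))


# Usage: a slow synthetic trajectory with added jitter, before and after smoothing.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 200)
clean = np.stack([np.sin(t), np.cos(t), 0.5 * np.sin(2.0 * t)], axis=1)
jittery = clean + 0.05 * rng.standard_normal(clean.shape)
smoothed = gaussian_smooth(jittery, sigma=2.0)

print("MSI (jittery): ", motion_stability_index(jittery))
print("MSI (smoothed):", motion_stability_index(smoothed))
```

Under these assumptions a higher score indicates steadier motion, so smoothing a jittery trajectory should raise it. A fixed `sigma` also blurs fast, legitimate lip movements, which is presumably why an adaptive kernel is preferable in practice.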