Lip sync is a fundamental audio-visual task. However, existing lip sync methods are not robust to the enormous diversity of videos taken in the wild, much of which stems from compound distracting factors that degrade their performance. To address this issue, this paper proposes a data standardization pipeline that produces standardized expressive images, preserving the lip motion information of the input while reducing the effects of compound distracting factors. Building on recent advances in 3D face reconstruction, we first create a model that consistently disentangles expressions, with lip motion information embedded. Then, to reduce the effects of compound distracting factors on the synthesized images, we synthesize images using only the expressions from the input, intentionally setting all other attributes to predefined values independent of the input. With the synthesized images, existing lip sync methods improve in data efficiency and robustness, and achieve competitive performance on the active speaker detection task.
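The standardization step can be sketched as follows, assuming a 3DMM-style reconstruction whose coefficients factor into identity, texture, expression, pose, and lighting; the function names (`reconstruct_coeffs`, `render_face`), coefficient dimensions, and predefined values are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

# Hypothetical predefined (input-independent) attribute values: a canonical
# identity and texture, a frontal pose, and uniform lighting shared by all outputs.
CANONICAL_IDENTITY = np.zeros(80)   # assumed identity-coefficient dimension
CANONICAL_TEXTURE  = np.zeros(80)   # assumed texture-coefficient dimension
FRONTAL_POSE       = np.zeros(6)    # rotation + translation; all zeros = frontal
UNIFORM_LIGHTING   = np.zeros(27)   # e.g. spherical-harmonics lighting coefficients

def standardize_frame(frame, reconstruct_coeffs, render_face):
    """Produce a standardized expressive image from one input video frame.

    `reconstruct_coeffs` and `render_face` stand in for a 3D face
    reconstruction model and a face renderer, respectively.
    """
    coeffs = reconstruct_coeffs(frame)     # disentangled 3DMM coefficients
    expression = coeffs["expression"]      # keep only the lip-motion-bearing part
    # All other attributes are fixed to predefined values independent of the
    # input, suppressing compound distracting factors (identity, texture,
    # pose, lighting) in the synthesized image.
    return render_face(
        identity=CANONICAL_IDENTITY,
        texture=CANONICAL_TEXTURE,
        expression=expression,
        pose=FRONTAL_POSE,
        lighting=UNIFORM_LIGHTING,
    )
```

Under this reading, the standardized frames would simply replace the original face crops as input to an off-the-shelf lip sync or active speaker detection model, with no change to the downstream method itself.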