In this paper we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording, without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, obtaining a word error rate of 45% as measured by an off-the-shelf lip-reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.
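To make the core idea concrete, the sketch below illustrates one way a denoising diffusion model can be conditioned on audio spectral features, in the spirit described above. It is a minimal, hypothetical example: the layer sizes, the FiLM-style conditioning, the cosine noise schedule, and names such as `AudioConditionedDenoiser` are assumptions for illustration, not the architecture or training setup used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioConditionedDenoiser(nn.Module):
    """Hypothetical denoiser: predicts the noise added to a face frame,
    conditioned on an audio feature vector (e.g. a mel-spectrogram window)
    and the diffusion timestep, via FiLM-style modulation."""

    def __init__(self, frame_channels=3, audio_dim=80, hidden=64):
        super().__init__()
        # Joint embedding of the audio window and the scalar timestep.
        self.cond = nn.Sequential(
            nn.Linear(audio_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, 2 * hidden)
        )
        self.encode = nn.Conv2d(frame_channels, hidden, 3, padding=1)
        self.decode = nn.Conv2d(hidden, frame_channels, 3, padding=1)

    def forward(self, noisy_frame, audio_feat, t):
        # noisy_frame: (B, C, H, W); audio_feat: (B, audio_dim); t: (B, 1) in [0, 1]
        scale, shift = self.cond(torch.cat([audio_feat, t], dim=-1)).chunk(2, dim=-1)
        h = F.silu(self.encode(noisy_frame))
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.decode(h)  # predicted noise, same shape as the frame


def diffusion_training_step(model, frame, audio_feat, num_steps=1000):
    """One DDPM-style training step: corrupt the clean frame with Gaussian
    noise at a random timestep and regress that noise, given the audio."""
    b = frame.shape[0]
    t = torch.randint(1, num_steps + 1, (b, 1)).float() / num_steps
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2        # simple cosine schedule
    noise = torch.randn_like(frame)
    a = alpha_bar[:, :, None, None]
    noisy = a.sqrt() * frame + (1 - a).sqrt() * noise
    pred = model(noisy, audio_feat, t)
    return F.mse_loss(pred, noise)


# Toy usage: random tensors stand in for a video frame and an audio feature window.
model = AudioConditionedDenoiser()
loss = diffusion_training_step(model, torch.randn(2, 3, 64, 64), torch.randn(2, 80))
```

At sampling time, the same conditioning signal would be supplied at every reverse-diffusion step, so that the generated mouth region follows the driving audio rather than the original video.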