The task of video-to-speech aims to translate a silent video of lip movements into its corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable as it allows us to i) leverage datasets with multiple speakers or few samples per speaker; and ii) control speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it to the multi-speaker scenario: we augment the network with an additional speaker-related input, through which we feed either a discrete identity or a speaker embedding. Interestingly, we observe that the visual encoder of the network is capable of learning the speaker identity from the lip region of the face alone. To better disentangle the two inputs -- linguistic content and speaker identity -- we add adversarial losses that dispel the identity from the video embeddings. To the best of our knowledge, the proposed method is the first to go beyond the state-of-the-art by providing important functionalities such as i) control of the target voice and ii) speech synthesis for unseen identities, while still maintaining the intelligibility of the spoken output.
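The abstract does not specify how the adversarial identity losses are realized. One common way to implement such a loss is a speaker classifier attached to the video embeddings through a gradient reversal layer, so that minimizing the classification loss trains the classifier while the reversed gradients push the visual encoder to discard identity cues. The sketch below illustrates this idea only under that assumption; the module names (`GradReverse`, `SpeakerAdversary`) and dimensions are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the visual encoder.
        return -ctx.lambd * grad_output, None

class SpeakerAdversary(nn.Module):
    """Hypothetical adversarial branch: a speaker classifier on reversed video embeddings.

    Training it with cross-entropy makes the classifier predict speaker identity,
    while the reversed gradients encourage the encoder to remove identity information.
    """
    def __init__(self, embed_dim: int, num_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, video_embedding: torch.Tensor) -> torch.Tensor:
        reversed_embedding = GradReverse.apply(video_embedding, self.lambd)
        return self.classifier(reversed_embedding)

# Hypothetical usage: video_emb comes from the visual encoder,
# speaker_id is the ground-truth identity label for the clip.
# adversary = SpeakerAdversary(embed_dim=512, num_speakers=33)
# loss_adv = nn.functional.cross_entropy(adversary(video_emb), speaker_id)
```

In this sketch the adversarial term would simply be added to the main reconstruction objective; the paper's actual formulation of the identity-dispelling losses may differ.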