Speaker extraction seeks to single out the voice of a target speaker from the interfering voices in a speech mixture. Typically, an auxiliary reference of the target speaker is used to form voluntary attention. Either a pre-recorded utterance or synchronized lip movements in a video clip can serve as this auxiliary reference. The use of visual cues is not only feasible but also effective thanks to their robustness to acoustic noise, and it is becoming increasingly popular. However, it is difficult to guarantee that such a parallel visual cue is always available in real-world applications, where visual occlusion or intermittent communication can occur. In this paper, we study audio-visual speaker extraction algorithms with intermittent visual cues. We propose a joint speaker extraction and visual embedding inpainting framework to exploit the mutual benefits of the two tasks. To encourage interaction between the two tasks, they are performed alternately in an interlaced structure and optimized jointly. We also propose two types of visual inpainting losses and evaluate our method with two popularly used types of visual embeddings. Experimental results show that our method outperforms the baseline in terms of signal quality, perceptual quality, and intelligibility.