Humans can listen to a target speaker even in challenging acoustic conditions with noise, reverberation, and interfering speakers, a phenomenon known as the cocktail-party effect. For decades, researchers have worked toward matching this human listening ability. One critical issue is handling interfering speakers: because the target and non-target speech signals share similar characteristics, they are difficult to discriminate. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers, with or without noise and reverberation, using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which the speaker's voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and draws on several areas of signal processing, including audio, visual, and array processing, as well as deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the major approaches, emphasizing the similarities among frameworks and discussing potential future directions.