Real-time single-channel speech separation aims to unmix an audio stream captured from a single microphone, containing multiple people talking at once, environmental noise, and reverberation, into multiple de-reverberated and noise-free speech tracks, each containing only one talker. While large state-of-the-art DNNs can achieve excellent separation of anechoic speech mixtures, the main challenge is to create compact, causal models that can separate reverberant mixtures at inference time. In this paper, we explore low-complexity, resource-efficient, causal DNN architectures for real-time separation of two or more simultaneous speakers. A cascade of three neural network modules is trained to sequentially perform noise suppression, separation, and de-reverberation. For comparison, a larger end-to-end model is trained to output two anechoic speech signals directly from noisy reverberant speech mixtures. We propose an efficient single-decoder architecture with subtractive separation for real-time recursive separation of two or more speakers. Evaluation on real monophonic recordings of speech mixtures, using separation measures such as SI-SDR, perceptual measures such as DNS-MOS, and a newly proposed channel separation metric, shows that these compact causal models separate speech mixtures with low latency and perform on par with large offline state-of-the-art models such as SepFormer.