Blind Speech Separation and Dereverberation using Neural Beamforming

In this paper, we present the Blind Speech Separation and Dereverberation (BSSD) network, which performs simultaneous speaker separation, dereverberation, and speaker identification in a single neural network. Speaker separation is guided by a set of predefined spatial cues. Dereverberation is performed with neural beamforming, and speaker identification is aided by embedding vectors and triplet mining. We introduce a frequency-domain model that uses complex-valued neural networks, and a time-domain variant that performs beamforming in latent space. Further, we propose a block-online mode to process longer audio recordings, as they occur in meeting scenarios. We evaluate our system in terms of Scale-Independent Signal-to-Distortion Ratio (SI-SDR), Word Error Rate (WER), and Equal Error Rate (EER).
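To make the first evaluation metric concrete, the following is a minimal sketch of an SI-SDR computation. It assumes the commonly used definition, in which the estimate is projected onto the reference and the energy of that projection is compared with the energy of the residual. The abstract does not state the exact formulation used in the evaluation, so the function name `si_sdr`, the mean removal step, and the `eps` stabilizer are illustrative assumptions.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Return the SI-SDR in dB of `estimate` with respect to `reference`.

    Assumed definition (not taken from the paper):
        s_target = (<est, ref> / ||ref||^2) * ref
        e        = est - s_target
        SI-SDR   = 10 * log10(||s_target||^2 / ||e||^2)
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    n = len(reference)
    # Remove the mean of both signals (a common convention, assumed here).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-reference part and a residual.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy example: a sinusoid as reference, plus a small interfering tone.
reference = [math.sin(0.1 * i) for i in range(1000)]
interference = [0.1 * math.cos(0.37 * i) for i in range(1000)]
estimate = [r + v for r, v in zip(reference, interference)]

score = si_sdr(estimate, reference)
# Rescaling the estimate leaves the score essentially unchanged.
score_scaled = si_sdr([3.0 * x for x in estimate], reference)
```

Because the reference is rescaled to match the estimate before the residual is computed, multiplying the estimate by a constant gain does not change the score. This is the property the "scale-independent" name refers to.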
