In this paper, we present the Blind Speech Separation and Dereverberation (BSSD) network, which performs simultaneous speaker separation, dereverberation and speaker identification in a single neural network. Speaker separation is guided by a set of predefined spatial cues. Dereverberation is performed using neural beamforming, and speaker identification is aided by embedding vectors and triplet mining. We introduce a frequency-domain model which uses complex-valued neural networks, and a time-domain variant which performs beamforming in latent space. Further, we propose a block-online mode to process longer audio recordings, such as those that occur in meeting scenarios. We evaluate our system in terms of Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Word Error Rate (WER) and Equal Error Rate (EER).
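To make the SI-SDR metric concrete, the following is a minimal sketch of how it is commonly computed for a single-channel estimate and its reference. It assumes the widely used definition in which the estimate is projected onto the reference before the energy ratio is taken. The function name `si_sdr`, the `eps` stabilizer and the mean-removal step are illustrative assumptions and are not taken from the paper.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so that a DC offset does not affect the score
    # (an assumption; some implementations skip this step).
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)
    target = [alpha * b for b in r]

    # Everything in the estimate that is not the scaled target counts as distortion.
    distortion = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))
```

Because the target is rescaled by the projection, multiplying the estimate by any non-zero gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from a plain SDR.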