We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNNs) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIRs) generated for a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
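The sketch below illustrates, under stated assumptions, the two ingredients named in the abstract: (1) multi-microphone complex spectral mapping, where the stacked RI components of all microphones' STFTs are mapped by a DNN to the target's RI components at a reference microphone, and (2) a covariance-based MVDR beamformer driven by that DNN estimate. The toy network, the tensor shapes, and the mask-based covariance computation are illustrative assumptions, not the authors' exact recipe.

```python
import torch

def stft_ri(wav, n_fft=512, hop=128):
    """wav: (mics, samples) real -> (RI features (2*mics, frames, freq), complex spec (mics, frames, freq))."""
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    spec = spec.transpose(1, 2)                          # (mics, frames, freq)
    return torch.cat([spec.real, spec.imag], dim=0), spec

class ComplexSpectralMapper(torch.nn.Module):
    """Toy stand-in for the separation DNN: maps multi-mic RI features to the
    RI components of one target speaker at the reference microphone."""
    def __init__(self, n_mics, n_freq):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * n_mics * n_freq, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, 2 * n_freq))            # real + imag at ref mic
    def forward(self, ri):                               # ri: (2*mics, frames, freq)
        frames = ri.shape[1]
        out = self.net(ri.permute(1, 0, 2).reshape(frames, -1))
        real, imag = out.chunk(2, dim=-1)
        return torch.complex(real, imag)                 # (frames, freq)

def spatial_cov(spec):
    """spec: (mics, frames, freq) -> per-frequency spatial covariance (freq, mics, mics)."""
    x = spec.permute(2, 0, 1)                            # (freq, mics, frames)
    return torch.einsum('fmt,fnt->fmn', x, x.conj()) / x.shape[-1]

def mvdr_from_estimate(mix_spec, tgt_est, ref_mic=0, eps=1e-6):
    """One common covariance-based MVDR formulation, w = (Phi_n^-1 Phi_s u) / tr(Phi_n^-1 Phi_s),
    with statistics derived here (an assumption of this sketch) from a mask built out of the
    single-channel DNN estimate tgt_est (frames, freq) and the mixture mix_spec (mics, frames, freq)."""
    mask = (tgt_est.abs() / (mix_spec[ref_mic].abs() + eps)).clamp(0, 1)
    phi_s = spatial_cov(mix_spec * mask)                 # target covariance
    phi_n = spatial_cov(mix_spec * (1 - mask))           # noise/interference covariance
    n_mics = mix_spec.shape[0]
    num = torch.linalg.solve(phi_n + eps * torch.eye(n_mics), phi_s)   # Phi_n^-1 Phi_s
    w = num[..., ref_mic] / (torch.einsum('fmm->f', num)[:, None] + eps)  # (freq, mics)
    return torch.einsum('fm,mtf->tf', w.conj(), mix_spec)              # beamformed (frames, freq)
```

For a 7-channel array, one would call `ri, spec = stft_ri(wav)`, pass `ri` through `ComplexSpectralMapper(7, 257)`, and feed the resulting complex estimate to `mvdr_from_estimate(spec, est)`; the MVDR output can then be post-filtered, as the abstract describes, by a second network of the same form.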