Recently, many deep-learning-based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker features, face images, or directional information. In this paper, we propose MIMO-DBnet, an end-to-end beamforming network for direction-guided speech separation that takes only the mixture signal as input. Specifically, we design a multi-channel-input, multiple-output architecture that predicts direction-of-arrival-based embeddings and beamforming weights for each source. The precisely estimated directional embeddings give the neural beamformer effective spatial discrimination guidance to offset the effect of phase wrapping, allowing more accurate reconstruction of the two sources' speech signals. Experiments show that the proposed MIMO-DBnet not only improves on the baseline systems across the board, but also maintains its performance in high frequency bands where phase wrapping occurs.
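The abstract does not spell out why phase wrapping hurts spatial cues, so the following is a minimal illustrative sketch rather than anything taken from MIMO-DBnet itself. It assumes a two-microphone, far-field model; the microphone spacing, sound speed, and all function names are hypothetical choices made for the example. It shows that the inter-channel phase difference measured in a frequency bin equals the true value only below roughly c / (2d), and wraps into (-pi, pi] above it.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature
MIC_SPACING = 0.10      # m, hypothetical spacing (not from the paper)


def true_ipd(freq_hz, doa_rad, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Unwrapped inter-channel phase difference for a far-field source."""
    delay = spacing * np.cos(doa_rad) / c  # time difference of arrival in seconds
    return 2.0 * np.pi * freq_hz * delay


def observed_ipd(freq_hz, doa_rad, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Phase difference as it would be measured: wrapped into (-pi, pi]."""
    return np.angle(np.exp(1j * true_ipd(freq_hz, doa_rad, spacing, c)))


def aliasing_frequency(spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Frequency above which wrapping can occur for some arrival direction."""
    return c / (2.0 * spacing)


if __name__ == "__main__":
    doa = 0.0  # endfire direction, where the inter-microphone delay is largest
    for f in (1000.0, 4000.0):
        print(f, true_ipd(f, doa), observed_ipd(f, doa))
    print("wrapping possible above", aliasing_frequency(), "Hz")
```

Under these assumptions, the measured phase difference no longer maps to a unique direction above the aliasing frequency, which is the ambiguity that the abstract's directional embeddings are said to help offset.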