Multi-channel speech separation that uses the speaker's directional information has demonstrated significant gains over blind speech separation. However, it has two limitations. First, performance degrades substantially when the directions of arrival of two sound sources are close. Second, the result relies heavily on precise estimation of the speaker's direction. To overcome these issues, this paper proposes 3D features and an associated 3D neural beamformer for multi-channel speech separation. The proposed approach extends previous work in two important ways. First, the traditional 1D directional beam patterns are generalized to 3D. This enables the model to extract speech from any target region in 3D space, so speakers with similar directions but different elevations or distances become separable. Second, to handle uncertainty in the speaker's location, the previously proposed spatial feature is extended to a new 3D region feature. The proposed 3D region feature and 3D neural beamformer are evaluated in an in-car scenario. Experimental results demonstrate that the combination of the 3D feature and the 3D beamformer achieves performance comparable to that of a separation model given the ground-truth speaker location as input.
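The abstract gives no formulas, so the following is only an illustrative sketch of what a location-based spatial feature and a region-level extension could look like; it is not the paper's definition. It assumes free-field propagation, known microphone coordinates, and a region feature formed by averaging a point feature over sampled 3D points. All function names and the averaging rule are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant


def point_phase_differences(point, mic_positions, freqs, ref=0):
    """Expected inter-channel phase differences for a source at a 3D point.

    point: (3,), mic_positions: (M, 3), freqs: (F,) in Hz.
    Returns an (M, F) array of phases relative to microphone `ref`.
    """
    dists = np.linalg.norm(mic_positions - point, axis=1)   # (M,)
    delays = (dists - dists[ref]) / SPEED_OF_SOUND          # (M,)
    return -2.0 * np.pi * delays[:, None] * freqs[None, :]


def point_feature(stft, point, mic_positions, freqs, ref=0):
    """Cosine similarity between observed and expected phase differences.

    stft: complex (M, F, T). Returns (F, T); values near 1 mean the
    time-frequency bin is consistent with a source at `point`.
    """
    observed = np.angle(stft * np.conj(stft[ref]))          # (M, F, T)
    expected = point_phase_differences(point, mic_positions, freqs, ref)
    return np.cos(observed - expected[:, :, None]).mean(axis=0)


def region_feature(stft, region_points, mic_positions, freqs, ref=0):
    """Hypothetical region feature: mean point feature over sampled points."""
    feats = [point_feature(stft, p, mic_positions, freqs, ref)
             for p in region_points]
    return np.mean(feats, axis=0)
```

Because the expected phases depend on the distance from each microphone to the point, two points at the same azimuth but different elevation or range produce different features, which is the intuition behind moving from direction-only cues to 3D ones. Averaging over a region trades peak sharpness for tolerance to location error.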