Permutation-invariant training (PIT) is a dominant approach for addressing the permutation ambiguity problem in talker-independent speaker separation. Leveraging spatial information afforded by microphone arrays, we propose a new training approach to resolving permutation ambiguities for multi-channel speaker separation. The proposed approach, named location-based training (LBT), assigns speakers on the basis of their spatial locations. This training strategy is easy to apply, and organizes speakers according to their positions in physical space. Specifically, this study investigates azimuth angles and source distances for location-based training. Evaluation results on separating two- and three-speaker mixtures show that azimuth-based training consistently outperforms PIT, and distance-based training further improves the separation performance when speaker azimuths are close. Furthermore, we dynamically select azimuth-based or distance-based training by estimating the azimuths of separated speakers, which further improves separation performance. LBT has a linear training complexity with respect to the number of speakers, as opposed to the factorial complexity of PIT. We further demonstrate the effectiveness of LBT for the separation of four and five concurrent speakers.
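The contrast between the two training criteria can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pit_loss`, `lbt_loss`), the mean-squared-error objective, and the ascending-azimuth ordering are all assumptions made for illustration. PIT evaluates every output-to-speaker assignment, so it scores C! candidates for C speakers, whereas azimuth-based LBT fixes the assignment in advance by sorting speakers by azimuth and then computes one loss term per speaker.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, targets):
    """PIT (illustrative): score every assignment and keep the lowest loss."""
    n = len(targets)
    return min(
        sum(mse(estimates[i], targets[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))  # C! candidate assignments
    )

def lbt_loss(estimates, targets, azimuths):
    """Azimuth-based LBT (illustrative): output i is paired with the
    speaker that has the i-th smallest azimuth (assumed ordering)."""
    order = sorted(range(len(targets)), key=lambda k: azimuths[k])
    return sum(mse(estimates[i], targets[k]) for i, k in enumerate(order)) / len(targets)

# Toy two-speaker example with made-up values.
targets = [[1.0, 1.0], [0.0, 0.0]]
azimuths = [120.0, 30.0]                 # degrees; speaker 1 has the smaller azimuth
estimates = [[0.0, 0.0], [1.0, 1.0]]     # output 0 carries speaker 1, output 1 carries speaker 0

print(pit_loss(estimates, targets))            # best assignment found by search
print(lbt_loss(estimates, targets, azimuths))  # assignment fixed by azimuth order
```

In this toy case both criteria agree because the outputs already follow azimuth order. If the two outputs were swapped, PIT would still find a zero-loss assignment, while the LBT loss would be nonzero and would push the network toward the location-based output order.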