In this paper, we present the speaker diarization system of team DKU_DukeECE for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). As highly overlapped speech exists in the dataset, we employ an x-vector-based target-speaker voice activity detection (TS-VAD) to find the overlap between speakers. For the single-channel scenario, we separately train a model for each of the 8 channels and fuse the results. We also employ cross-channel self-attention to further improve performance, where the non-linear spatial correlations between different channels are learned and fused. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28% and achieves a DER of 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, which ranks 1st in the M2MeT challenge.