This paper describes the system developed by the XMUSPEECH team for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). For the speaker diarization task, we propose a multi-channel speaker diarization system that obtains the speakers' spatial information through Direction of Arrival (DOA) estimation. A speaker-spatial embedding is generated from an x-vector and an s-vector derived from Filter-and-Sum Beamforming (FSB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named Discriminative Multi-stream Neural Network (DMSNet), which consists of an Attention Filter-and-Sum Block (AFSB) and a Conformer encoder. We explore DMSNet to address the overlapped speech problem on multi-channel audio. Compared with an LSTM-based overlapped speech detection (OSD) module, we achieve a decrease of 10.1% in Detection Error Rate (DetER). With the DMSNet-based OSD module, the diarization error rate (DER) of the clustering-based diarization system decreases significantly, from 13.44% to 7.63%. Our best fusion system achieves a DER of 7.09% on the evaluation set and 9.80% on the test set.
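To make the reported figures easier to interpret, the following is a minimal sketch of how a DER is commonly computed and how the drop from 13.44% to 7.63% translates into a relative reduction. It assumes the standard definition of DER (missed speech, false alarm, and speaker confusion time divided by total reference speech time); the function names and the example durations are hypothetical and are not taken from the paper, and details of the challenge scoring such as collars are not modelled.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER under the standard definition (an assumption here):
    incorrectly scored time divided by total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(before, after):
    """Reduction of an error rate expressed as a fraction of its starting value."""
    return (before - after) / before


# Hypothetical durations in seconds, chosen only for illustration.
der = diarization_error_rate(
    missed=30.0, false_alarm=12.5, confusion=20.0, total_speech=1000.0
)
print(f"DER = {der:.2%}")

# DER values quoted in the abstract: 13.44% without and 7.63% with the DMSNet-based OSD module.
rel = relative_reduction(13.44, 7.63)
print(f"relative DER reduction = {rel:.1%}")
```

Under these assumptions, the absolute drop of 5.81 percentage points corresponds to a relative DER reduction of roughly 43%.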