We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component of the speaker diarization system we submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under highly reverberant and noisy conditions. First, for data preparation and augmentation when training TS-VAD models, we use speech data containing both real meetings and simulated indoor conversations. Second, to refine the results of TS-VAD-based decoding, we apply a series of post-processing steps that improve the VAD output and thereby reduce the diarization error rate (DER). Tested on the ALIMEETING corpus, the newly released Mandarin meeting dataset used in M2MeT, our proposed system achieves a relative DER reduction of up to 66.55%/60.59% on the Eval/Test sets compared with classical clustering-based diarization.
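To make the reported metric concrete, the following minimal sketch shows how a DER and a relative DER reduction are commonly computed. It assumes the standard definition of DER as missed speech plus false alarm plus speaker confusion, divided by total reference speech time. The function names and all durations are hypothetical and chosen only for illustration; they are not results from our system.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER in percent: summed error time over total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline_der, system_der):
    """Relative DER reduction in percent, measured against the baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical durations in seconds (illustrative only, not from the paper).
baseline = diarization_error_rate(40.0, 10.0, 30.0, 400.0)
system = diarization_error_rate(20.0, 10.0, 10.0, 400.0)
print(baseline, system, relative_reduction(baseline, system))
```

A relative reduction is expressed as a fraction of the baseline DER, so it is not the same as the absolute difference in percentage points between the two systems.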