This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system trained with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE) dereverberation, beamforming, speech separation, and speech enhancement, to process the training, validation, and test sets; based on the experimental results, we selected only WPE and beamforming as our front-end methods. Second, we invested heavily in data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought a large performance improvement. Finally, to make full use of the complementary strengths of different model architectures, we trained both a standard Conformer-based joint CTC/attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of the Conformer, and fused their results. Compared with the official baseline system, our system achieved an absolute character error rate (CER) reduction of 12.22% on the validation set and 12.11% on the test set.
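As a minimal illustration of the serialized output training (SOT) approach mentioned above, the sketch below shows how overlapping speakers' transcripts are typically serialized into a single target sequence in first-in-first-out order by start time. The function name, the input format, and the `<sc>` speaker-change token name are assumptions for illustration, not details taken from the paper.

```python
# Sketch of SOT label construction: concatenate each speaker's transcript
# in FIFO order of utterance start time, separated by a speaker-change
# token (here assumed to be "<sc>", a common SOT convention).
def build_sot_label(utterances):
    """utterances: list of (start_time_seconds, transcript) pairs.

    Returns one serialized target string for multi-speaker ASR training.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return " <sc> ".join(text for _, text in ordered)

# Example: two overlapping speakers; the earlier-starting one comes first.
label = build_sot_label([(1.2, "how are you"), (0.5, "hello there")])
# "hello there <sc> how are you"
```

With labels serialized this way, a single attention-based decoder can transcribe overlapped speech without a separate output branch per speaker.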