Pre-trained models based on self-supervised learning, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and automatic speech recognition (ASR) with a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. Speaker diarisation requires the tasks of voice activity detection (VAD) and speaker classification (SC), while connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early, middle, and late layer of W2V2 respectively, which coincides with the order of the pipeline: segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the Augmented Multi-party Interaction (AMI) dataset showed that allocating VAD, SC, and ASR to progressively later W2V2 layers for TMT not only saves computational cost but also reduces diarisation error rates (DERs). Joint fine-tuning of VAD, SC, and ASR yielded 16%/17% relative DER reductions with manual/automatic segmentation respectively, along with consistent reductions in speaker-attributed word error rate, compared with the baseline of separately fine-tuned models.
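The layer allocation described above can be pictured as three task heads reading from different depths of one shared encoder. Below is a minimal PyTorch sketch assuming the HuggingFace Wav2Vec2Model; the layer indices (4 and 8), the head designs, and the class counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TandemMultitaskW2V2(nn.Module):
    """Three task heads reading from different depths of a shared W2V2 encoder:
    VAD from an early layer, SC from a middle layer, ASR (CTC) from the last layer.
    Layer indices and output sizes below are illustrative assumptions."""

    def __init__(self, vad_layer=4, sc_layer=8, num_speakers=4, vocab_size=32):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.vad_layer, self.sc_layer = vad_layer, sc_layer
        self.vad_head = nn.Linear(hidden, 2)            # per-frame speech / non-speech
        self.sc_head = nn.Linear(hidden, num_speakers)  # per-frame speaker posteriors
        self.asr_head = nn.Linear(hidden, vocab_size)   # per-frame CTC logits

    def forward(self, input_values):
        out = self.encoder(input_values, output_hidden_states=True)
        # hidden_states: feature-encoder output + one entry per transformer layer
        hs = out.hidden_states
        vad_logits = self.vad_head(hs[self.vad_layer])      # early layer -> VAD
        sc_logits = self.sc_head(hs[self.sc_layer])         # middle layer -> SC
        asr_logits = self.asr_head(out.last_hidden_state)   # final layer -> ASR/CTC
        return vad_logits, sc_logits, asr_logits

model = TandemMultitaskW2V2()
wav = torch.randn(1, 16000)   # one second of 16 kHz audio
vad, sc, asr = model(wav)     # each: (batch, frames, classes)
```

Because the early heads reuse activations that the ASR branch computes anyway, the three tasks share a single forward pass, which is where the computational saving comes from. Under this assumed setup, joint fine-tuning would minimise a weighted combination of per-frame losses for VAD and SC with the CTC loss for ASR; the specific weighting is an assumption, not taken from the paper.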