Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded by co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech data required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest that the proposed adversarial data augmentation approach consistently outperformed the baseline speed-perturbation and non-VAE GAN augmentation methods when training hybrid TDNN and end-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.
Title: Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition
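The abstract describes a model that jointly encodes, generates, and discriminates synthesized impaired speech. The loss composition of such a VAE-GAN can be sketched as below. This is a minimal toy illustration, not the paper's architecture: the linear encoder/decoder/discriminator, the 80-dimensional feature frames, and the 16-dimensional latent space are all assumptions for demonstration, standing in for trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, W, b):
    return x @ W + b

# Hypothetical toy dimensions: 80-dim acoustic feature frames, 16-dim latent code.
D_FEAT, D_LAT = 80, 16

# Randomly initialized toy parameters (stand-ins for trained networks).
W_mu  = rng.normal(0, 0.01, (D_FEAT, D_LAT)); b_mu  = np.zeros(D_LAT)
W_lv  = rng.normal(0, 0.01, (D_FEAT, D_LAT)); b_lv  = np.zeros(D_LAT)
W_dec = rng.normal(0, 0.01, (D_LAT, D_FEAT)); b_dec = np.zeros(D_FEAT)
W_dis = rng.normal(0, 0.01, (D_FEAT, 1));     b_dis = np.zeros(1)

def encode(x):
    # VAE encoder: mean and log-variance of the approximate latent posterior.
    return linear(x, W_mu, b_mu), linear(x, W_lv, b_lv)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps latent sampling differentiable during training.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    # Generator/decoder: maps a latent code to a synthetic feature frame.
    return linear(z, W_dec, b_dec)

def discriminate(x):
    # Discriminator: sigmoid score that a frame is genuine impaired speech.
    return 1.0 / (1.0 + np.exp(-linear(x, W_dis, b_dis)))

def vae_gan_losses(x_real):
    mu, logvar = encode(x_real)
    x_fake = decode(reparameterize(mu, logvar))
    # VAE terms: reconstruction error plus KL divergence to a unit Gaussian prior.
    recon = np.mean((x_real - x_fake) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    # GAN terms: discriminator separates real from synthesized frames,
    # while the generator tries to make synthesized frames score as real.
    d_real, d_fake = discriminate(x_real), discriminate(x_fake)
    adv_d = -np.mean(np.log(d_real + 1e-8) + np.log(1.0 - d_fake + 1e-8))
    adv_g = -np.mean(np.log(d_fake + 1e-8))
    return recon, kl, adv_d, adv_g

batch = rng.normal(size=(4, D_FEAT))  # 4 toy feature frames
recon, kl, adv_d, adv_g = vae_gan_losses(batch)
```

In training, the encoder and decoder would minimize the reconstruction, KL, and generator-adversarial terms, while the discriminator minimizes its own adversarial term; the competing objectives are what let the model "simultaneously learn to encode, generate and discriminate" synthesized speech.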