Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest that the proposed adversarial data augmentation approach consistently outperforms baseline speed perturbation and non-VAE GAN augmentation methods when training hybrid TDNN and end-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.
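To make the VAE-GAN idea concrete, the following is a minimal PyTorch sketch of a frame-level model with two separate latent spaces, one for dysarthric speaker characteristics and one for phoneme context, trained with reconstruction, KL, and adversarial losses. All layer sizes, loss weights, and module names here are illustrative assumptions; the paper's actual architecture and its fusion of Wav2vec 2.0 embeddings are not specified in this sketch.

```python
# Hypothetical, minimal VAE-GAN sketch for disordered speech augmentation.
# Dimensions, weights, and the training step below are assumptions for
# illustration only, not the paper's implementation.
import torch
import torch.nn as nn

FEAT_DIM = 80            # assumed per-frame acoustic feature dimension
Z_SPK, Z_PHN = 64, 64    # separate latents: speaker/dysarthria and phoneme context

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU())
        # Two heads: one latent for dysarthric speaker characteristics,
        # one for phoneme-context representations.
        self.mu_spk, self.logvar_spk = nn.Linear(256, Z_SPK), nn.Linear(256, Z_SPK)
        self.mu_phn, self.logvar_phn = nn.Linear(256, Z_PHN), nn.Linear(256, Z_PHN)

    def forward(self, x):
        h = self.net(x)
        return (self.mu_spk(h), self.logvar_spk(h),
                self.mu_phn(h), self.logvar_phn(h))

def reparameterize(mu, logvar):
    # Standard VAE reparameterization trick: z = mu + sigma * eps.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class Decoder(nn.Module):
    """Reconstructs frames from both latents; also acts as the GAN generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_SPK + Z_PHN, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM))

    def forward(self, z_spk, z_phn):
        return self.net(torch.cat([z_spk, z_phn], dim=-1))

class Discriminator(nn.Module):
    """Scores real vs. synthesized impaired-speech frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x)  # raw logits; pair with BCEWithLogitsLoss

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)), summed over latent dimensions.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

# One illustrative generator-side training step (loss weights are made up).
enc, dec, disc = Encoder(), Decoder(), Discriminator()
x = torch.randn(8, FEAT_DIM)  # stand-in batch of real impaired-speech frames
mu_s, lv_s, mu_p, lv_p = enc(x)
x_hat = dec(reparameterize(mu_s, lv_s), reparameterize(mu_p, lv_p))
bce = nn.BCEWithLogitsLoss()
recon = nn.functional.mse_loss(x_hat, x)
kl = (kl_divergence(mu_s, lv_s) + kl_divergence(mu_p, lv_p)).mean()
adv = bce(disc(x_hat), torch.ones(8, 1))  # generator tries to fool discriminator
g_loss = recon + 0.1 * kl + 0.01 * adv
```

The key design point mirrored here is the factorized latent space: keeping dysarthric speaker characteristics and phoneme context in separate latents allows, in principle, recombining a target speaker's latent with varied phonetic content to synthesize personalized augmentation data.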