Automatic recognition of disordered speech remains a highly challenging task. The underlying neuro-motor conditions, often compounded by co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. Data augmentation therefore plays a vital role in current disordered speech recognition systems. In contrast to existing data augmentation techniques, which modify only the speaking rate or the overall shape of the spectral contour, the approach proposed here uses deep convolutional generative adversarial networks (DCGAN) to model fine-grained spectro-temporal differences between disordered and normal speech, and modifies normal speech spectra during augmentation so that they are closer to disordered speech. Experiments on the UASpeech corpus suggest that the proposed adversarial data augmentation approach consistently outperforms baseline augmentation methods based on tempo or speed perturbation on a state-of-the-art hybrid DNN system. An overall word error rate (WER) reduction of up to 3.05% absolute (9.7% relative) was obtained over the baseline system using no data augmentation. The final learning hidden unit contribution (LHUC) speaker-adapted system using the best adversarial augmentation approach gives an overall WER of 25.89% on the UASpeech test set of 16 dysarthric speakers.
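The abstract reports WER gains in both absolute and relative terms. As a minimal sketch of how the two relate, the snippet below computes WER as a word-level edit distance and then derives a relative reduction from two WER values. The function names `wer` and `relative_reduction` are hypothetical, and the numbers in the usage lines are illustrative only; they are not results from the paper, whose scoring pipeline is not described here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[len(hyp)] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: the absolute reduction divided by the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
one_deletion = wer("a b c d", "a c d")        # one of four reference words is missing
toy_relative = relative_reduction(0.40, 0.36)  # a 0.04 absolute drop from 0.40
```

Under this definition, an absolute reduction is a difference between two WER values, while a relative reduction scales that difference by the baseline WER, which is why the same improvement can be quoted as two different percentages.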