Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech, including accent, age, and gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. Speaker adaptation techniques therefore play a vital role in current speech recognition systems. Disordered speech differs from normal speech at the spectro-temporal level in ways that systematically manifest as articulatory imprecision, decreased volume and clarity, slower speaking rates, and increased dysfluencies. Motivated by these differences, novel spectro-temporal subspace basis embedding deep features, derived via singular value decomposition (SVD) of the speech spectrum, are proposed to facilitate both accurate speech intelligibility assessment and auxiliary-feature-based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest that the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation, with up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), both with and without data augmentation. Learning hidden unit contributions (LHUC) based speaker adaptation was further applied. The final speaker-adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.
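The abstract does not spell out the feature extraction pipeline, but the core idea of deriving spectral and temporal subspace bases from an SVD of the speech spectrum can be illustrated with a minimal NumPy sketch. The function name `spectro_temporal_basis`, the 80-band log-mel input, and the number of retained bases `k` are illustrative assumptions; in the proposed systems these bases would further be fed to a neural network to produce the embedding features, which is not shown here.

```python
import numpy as np

def spectro_temporal_basis(spectrogram: np.ndarray, k: int = 10):
    """Decompose a (freq x time) log-mel spectrogram with SVD and
    return the top-k spectral and temporal subspace bases.

    The left singular vectors U[:, :k] span the spectral subspace;
    the right singular vectors Vt[:k, :] span the temporal subspace.
    """
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :k]    # shape (n_mels, k)
    temporal_basis = Vt[:k, :]   # shape (k, n_frames)
    return spectral_basis, temporal_basis, s[:k]

# Example: a random stand-in for an 80-band log-mel spectrogram of 300 frames.
spec = np.random.randn(80, 300)
S, T, sv = spectro_temporal_basis(spec, k=5)
print(S.shape, T.shape, sv.shape)  # (80, 5) (5, 300) (5,)
```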
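LHUC speaker adaptation re-weights each hidden unit's output with a per-speaker learned scaling parameter. Below is a minimal PyTorch sketch of such a layer, assuming the standard LHUC formulation in which the scale is 2·sigmoid(α); the class name `LHUCLayer` and its interface are illustrative, not the authors' implementation. During adaptation only the speaker-dependent `alpha` parameters are updated while the base network stays frozen.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Learnable per-speaker scaling of hidden activations (LHUC)."""

    def __init__(self, hidden_dim: int, num_speakers: int):
        super().__init__()
        # One scaling vector per speaker, initialised so 2*sigmoid(0) = 1,
        # i.e. adaptation starts from the unadapted network.
        self.alpha = nn.Parameter(torch.zeros(num_speakers, hidden_dim))

    def forward(self, h: torch.Tensor, speaker_id: int) -> torch.Tensor:
        # Scale in (0, 2) re-weights each hidden unit's contribution.
        scale = 2.0 * torch.sigmoid(self.alpha[speaker_id])
        return h * scale
```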