Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty of collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel audio, visual and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training, before the model is cross-domain and cross-lingually adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora, and the English TORGO dysarthric speech data, to produce UTI-based articulatory features. Experiments conducted on the three tasks suggest that systems incorporating the generated articulatory features consistently outperform the baseline hybrid TDNN and Conformer based end-to-end systems constructed using acoustic features only, with statistically significant word error rate or character error rate reductions of up to 2.64%, 1.92% and 1.21% absolute (8.17%, 7.89% and 13.28% relative) after data augmentation and speaker adaptation are applied.
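To make the pipeline concrete, the following is a minimal sketch of an A2A inversion network whose bottleneck output serves as the generated articulatory features, later concatenated with acoustic features for the ASR front end. The dimensions, layer choices, the concatenation-based fusion, and all names (A2AInversion, fuse_features) are illustrative assumptions for exposition only; they are not the paper's exact architecture, pre-training schedule, or adaptation procedure.

```python
# Illustrative sketch only: a simple A2A inversion model pre-trained to regress
# UTI frames from acoustic frames, whose bottleneck is reused as articulatory
# features and concatenated with acoustics for ASR. Dimensions and layers are
# assumed, not taken from the paper.
import torch
import torch.nn as nn


class A2AInversion(nn.Module):
    """Maps acoustic frames to compact articulatory features supervised by UTI targets."""

    def __init__(self, acoustic_dim=80, hidden_dim=256, artic_dim=16, uti_dim=4096):
        super().__init__()
        # Shared encoder: acoustic frames -> bottleneck articulatory features.
        self.encoder = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, artic_dim),
        )
        # Decoder used only during A2A pre-training to regress UTI frames.
        self.decoder = nn.Linear(artic_dim, uti_dim)

    def forward(self, acoustics):
        artic = self.encoder(acoustics)          # (batch, frames, artic_dim)
        uti_pred = self.decoder(artic)           # (batch, frames, uti_dim)
        return artic, uti_pred


def fuse_features(acoustics, a2a_model):
    """Concatenate acoustic frames with generated articulatory features for the ASR system."""
    with torch.no_grad():
        artic, _ = a2a_model(acoustics)
    return torch.cat([acoustics, artic], dim=-1)


if __name__ == "__main__":
    model = A2AInversion()
    # Pre-training step on parallel audio/UTI data (e.g. TaL): MSE against UTI frames.
    acoustics = torch.randn(2, 100, 80)          # dummy batch: 2 utterances, 100 frames
    uti_target = torch.randn(2, 100, 4096)
    _, uti_pred = model(acoustics)
    loss = nn.functional.mse_loss(uti_pred, uti_target)
    loss.backward()
    # At recognition time, only the encoder's articulatory features are fused with acoustics.
    fused = fuse_features(acoustics, model)
    print(fused.shape)                           # torch.Size([2, 100, 96])
```

In this sketch, cross-domain and cross-lingual adaptation would correspond to continuing to update the encoder on the target elderly or dysarthric data before freezing it for feature extraction; the fusion shown here is a plain concatenation chosen purely for simplicity.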