Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data-intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data-efficient speaker-dependent (SD) parameter representations is used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperform the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets, respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.
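As a rough illustration of the confidence score-based data selection idea, the following minimal sketch keeps only those first-pass hypotheses of a speaker whose confidence score clears a threshold, so that only the less erroneous subset would be used as adaptation supervision. It is not the authors' implementation: the `Utterance` fields, the `select_adaptation_data` helper, the example scores, and the threshold of 0.8 are all hypothetical, and the confidence scores are assumed to come from some separate estimation module.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str    # first-pass SI decoding output, used as a pseudo-label
    confidence: float  # utterance-level confidence in [0, 1] (assumed given)


def select_adaptation_data(
    utterances: List[Utterance], threshold: float = 0.8
) -> List[Utterance]:
    """Keep utterances whose confidence is at least `threshold`.

    The threshold value is a hypothetical choice for illustration only.
    """
    return [u for u in utterances if u.confidence >= threshold]


# Toy speaker-level data with made-up confidence scores.
speaker_data = [
    Utterance("utt1", "yeah i think so", 0.93),
    Utterance("utt2", "we went to the the store", 0.41),
    Utterance("utt3", "that sounds right", 0.85),
]

selected = select_adaptation_data(speaker_data, threshold=0.8)
selected_ids = [u.utt_id for u in selected]
```

Raising the threshold leaves less adaptation data per speaker, which is the data sparsity effect that the abstract says is addressed by Bayesian modelling of the SD parameter uncertainty.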