A key challenge for automatic speech recognition (ASR) systems is modeling speaker-level variability. In this paper, compact speaker-dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test-time unsupervised speaker adaptation for state-of-the-art Conformer-based end-to-end ASR systems. Sensitivity to supervision errors during adaptation is reduced by using confidence scores to select the more "trustworthy" subset of speaker-specific data. A confidence estimation module smooths the over-confident Conformer decoder output probabilities before they serve as confidence scores. The increased data sparsity caused by speaker-level data selection is addressed using Bayesian estimation of the LHUC parameters. Experiments on the 300-hour Switchboard corpus suggest that the proposed LHUC-SAT Conformer with confidence score based test-time unsupervised adaptation outperforms the baseline speaker-independent and i-vector adapted Conformer systems, with absolute word error rate (WER) reductions of up to 1.0%, 1.0%, and 1.2% (9.0%, 7.9%, and 8.9% relative) on the NIST Hub5'00, RT02, and RT03 evaluation sets, respectively. Consistent performance improvements were retained after external Transformer and LSTM language models were used for rescoring.
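As a rough illustration of two ideas named in the abstract, the sketch below shows (a) LHUC-style scaling of hidden activations by a speaker-dependent vector and (b) confidence-based selection of adaptation data. It is a minimal sketch under stated assumptions, not the paper's implementation: the amplitude function `2 * sigmoid(r)` is a commonly used LHUC parameterization that the abstract itself does not specify, and the function names, the example hypotheses, and the threshold of 0.8 are hypothetical. The Bayesian estimation of LHUC parameters and the confidence estimation module are not covered.

```python
import math

def lhuc_scale(hidden, r):
    """Scale each hidden activation by a speaker-dependent amplitude.

    Assumption: amplitude = 2 * sigmoid(r), a common LHUC choice;
    the abstract does not state which parameterization is used.
    """
    return [h * 2.0 / (1.0 + math.exp(-r_i)) for h, r_i in zip(hidden, r)]

def select_confident(utterances, threshold=0.8):
    """Keep (hypothesis, confidence) pairs with confidence >= threshold.

    The threshold value is hypothetical and chosen only for illustration.
    """
    return [u for u in utterances if u[1] >= threshold]

# With r = 0 every amplitude is 1, so the layer output is unchanged
# (this corresponds to the speaker-independent starting point).
hidden = [0.5, -1.0, 2.0]
r = [0.0, 0.0, 0.0]
scaled = lhuc_scale(hidden, r)

# Hypothetical first-pass hypotheses with confidence scores.
hyps = [("hello world", 0.95), ("uh the cat", 0.40), ("good morning", 0.82)]
selected = select_confident(hyps)
```

In this toy setup only the two higher-confidence hypotheses would be passed on as supervision for estimating the speaker's `r` vector, which is the sense in which selection shrinks the available adaptation data.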