Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve similar adaptation gains compared to full model fine-tuning while updating only a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
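The abstract does not include an implementation, but the adapter idea it describes is concrete enough to sketch. Below is a minimal PyTorch illustration of a bottleneck residual adapter of the kind commonly inserted into encoder layers, assuming the standard down-projection / nonlinearity / up-projection design with a residual connection and near-identity initialization. All names (`ResidualAdapter`, `d_model`, `bottleneck`) and the dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Illustrative bottleneck adapter inserted after an encoder sub-layer.

    Projects the hidden state down to a small bottleneck, applies a
    nonlinearity, projects back up, and adds the result to the input
    (residual connection). Only these few parameters would be trained
    per speaker; the rest of the ASR model stays frozen.
    """

    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Near-identity initialization: with a zeroed up-projection the
        # adapter is a no-op at the start of adaptation, so the frozen
        # model's behavior is initially unchanged.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.layer_norm(x))))


if __name__ == "__main__":
    adapter = ResidualAdapter(d_model=512, bottleneck=16)
    x = torch.randn(2, 50, 512)  # (batch, frames, features) -- hypothetical shapes
    assert torch.allclose(adapter(x), x)  # identity at init (up-projection is zeroed)
    n = sum(p.numel() for p in adapter.parameters())
    print(f"{n} trainable parameters per adapter")
```

With `d_model=512` and a bottleneck of 16, each adapter adds roughly 18K parameters, so a handful of adapters across the encoder stack of a model with ~100M parameters stays well under the 0.5% fraction the abstract reports; the exact dimensions here are assumptions for illustration.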