Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

Accent variability poses a major challenge for automatic speech recognition~(ASR) modeling. Adaptation systems based on one-hot accent vectors are commonly used, but they require prior knowledge of the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge and yields only limited improvements. In this work, we tackle these problems with a novel layer-wise adaptation structure injected into the encoder of an end-to-end~(E2E) ASR model. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through a linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that, compared to the baseline, the proposed adaptation structure yields 12\% and 10\% relative word error rate~(WER) reductions on the AESRC2020 accent dataset and the Librispeech dataset, respectively.
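The transformation of an acoustic feature through a linear combination of accent bases can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the class name `AccentAdapterSketch`, the use of a softmax to obtain interpolation weights from an accent embedding, the choice of plain linear maps as bases, and the residual addition to the input feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AccentAdapterSketch:
    """Hypothetical adapter layer: mixes K accent bases per frame.

    Assumptions (not taken from the paper): each basis is a linear map,
    mixing weights come from a softmax over a projected accent embedding,
    and the mixed output is added to the input as a residual.
    """

    def __init__(self, feat_dim, accent_dim, num_bases, seed=0):
        rng = np.random.default_rng(seed)
        # Projects an accent embedding to one logit per basis.
        self.w_alpha = rng.normal(scale=0.1, size=(accent_dim, num_bases))
        # K linear bases, each mapping feat_dim -> feat_dim.
        self.bases = rng.normal(scale=0.1, size=(num_bases, feat_dim, feat_dim))

    def weights(self, accent_emb):
        """Interpolation weights of shape (T, K); each row sums to 1."""
        return softmax(accent_emb @ self.w_alpha, axis=-1)

    def __call__(self, feats, accent_emb):
        """feats: (T, D) acoustic features; accent_emb: (T, A) accent embeddings."""
        alpha = self.weights(accent_emb)                          # (T, K)
        per_basis = np.einsum("td,kde->tke", feats, self.bases)   # (T, K, D)
        shift = np.einsum("tk,tke->te", alpha, per_basis)         # (T, D)
        return feats + shift                                      # accent-related feature


if __name__ == "__main__":
    adapter = AccentAdapterSketch(feat_dim=8, accent_dim=4, num_bases=3)
    feats = np.random.default_rng(1).normal(size=(5, 8))
    accent_emb = np.random.default_rng(2).normal(size=(5, 4))
    print(adapter(feats, accent_emb).shape)
```

Because the weights are computed from a continuous accent embedding rather than a one-hot label, a sketch like this can in principle place an unseen accent somewhere between the learned bases, which is the motivation the abstract gives for the accent-space formulation.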
