We analyze the impact of speaker adaptation in end-to-end architectures based on transformers and wav2vec 2.0 under different noise conditions. We demonstrate that the established method of concatenating speaker vectors to the acoustic features and supplying them as an auxiliary model input remains a viable way to increase the robustness of end-to-end architectures. By including speaker embeddings obtained from x-vector and ECAPA-TDNN models, we achieve relative word error rate improvements of up to 9.6% on LibriSpeech and up to 14.5% on Switchboard. The effect on transformer-based architectures is approximately inversely proportional to the signal-to-noise ratio (SNR) and is strongest in heavily noisy environments ($SNR=0$). In systems based on wav2vec 2.0, the most substantial benefit of speaker adaptation is achieved under moderate noise conditions ($SNR\geq18$). We also find that x-vectors tend to yield larger improvements than ECAPA-TDNN embeddings.
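The concatenation scheme described above can be illustrated with a minimal sketch. It assumes one fixed-size speaker embedding per utterance that is repeated for every frame and appended along the feature axis. The function names, the toy dimensions, and the use of plain Python lists are assumptions made for illustration and are not taken from the paper; a real system would perform the same operation on tensors. The second helper uses the conventional definition of relative word error rate improvement, which is also assumed here.

```python
from typing import List, Sequence


def append_speaker_vector(
    features: Sequence[Sequence[float]],
    speaker_vector: Sequence[float],
) -> List[List[float]]:
    """Append the same utterance-level speaker vector to every frame.

    features:       T frames, each with D acoustic feature values.
    speaker_vector: one embedding of size E (e.g. an x-vector).
    Returns T frames of size D + E.
    """
    spk = list(speaker_vector)
    return [list(frame) + spk for frame in features]


def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER improvement in percent (assumed conventional definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical toy input: 3 frames of 2 features, 3-dimensional speaker vector.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker = [1.0, 2.0, 3.0]
augmented = append_speaker_vector(frames, speaker)
```

Each augmented frame then carries both the frame-level acoustics and the utterance-level speaker information, so the model receives the speaker vector as an auxiliary input without any change to its output layer.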