The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper, we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech, using each as an additional prediction target for the other. By doing this, we enforce the network to make consistent predictions for the original and noisy speech, thus allowing it to learn contextualized representations with noise robustness. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2.9--4.9% relative word error rate (WER) reduction on the synthesized noisy LibriSpeech data without deterioration on the original data, and 5.7% on the CHiME-4 real 1-channel noisy data compared to a data augmentation baseline, even with a strong language model for decoding. Our results on CHiME-4 can match or even surpass those with well-designed speech enhancement components.
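To make the switched-target idea concrete, below is a minimal PyTorch-style sketch of the objective, assuming a wav2vec 2.0-like contrastive loss (cosine similarity against one positive and K distractors with a temperature). The function names, tensor shapes, temperature value, and equal weighting of the four terms are illustrative assumptions, not the exact implementation in the paper, which also involves masking, negative sampling, and an auxiliary diversity loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, negatives, temperature=0.1):
    """wav2vec 2.0-style contrastive loss (illustrative).

    context:   (B, T, D) context vectors c_t from the Transformer
    targets:   (B, T, D) positive quantized latents q_t
    negatives: (B, T, K, D) K distractor quantized latents per step
    """
    # Stack positive and negatives into candidates: (B, T, 1+K, D)
    candidates = torch.cat([targets.unsqueeze(2), negatives], dim=2)
    # Cosine similarity between c_t and each candidate: (B, T, 1+K)
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # The positive candidate is always at index 0
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

def wav2vec_switch_loss(c_orig, q_orig, neg_orig, c_noisy, q_noisy, neg_noisy):
    """Switched contrastive objective (a sketch, equal weights assumed).

    Besides each view predicting its own quantized targets, the targets are
    swapped across the original-noisy pair, so that consistent predictions
    are enforced for both views.
    """
    loss = contrastive_loss(c_orig, q_orig, neg_orig)        # original -> original targets
    loss += contrastive_loss(c_noisy, q_noisy, neg_noisy)    # noisy -> noisy targets
    loss += contrastive_loss(c_orig, q_noisy, neg_noisy)     # switched: original -> noisy targets
    loss += contrastive_loss(c_noisy, q_orig, neg_orig)      # switched: noisy -> original targets
    return loss
```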