Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations for automatic speech recognition (ASR). It has been shown to be robust to domain shift, but its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 experimentally. We observe that wav2vec2.0 pre-trained on noisy data learns good representations and thus improves ASR performance on the noisy test set, but at the cost of degraded performance on the clean test set. To avoid this trade-off, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and its corresponding clean version are fed into the same feature encoder, with the clean speech providing training targets for the model. Experimental results show that the proposed method not only improves ASR performance on the noisy test set, surpassing the original wav2vec2.0, but also incurs only a slight performance decrease on the clean test set. In addition, we demonstrate the effectiveness of the proposed method under different types of noise.
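The core idea above can be sketched as a single training step: the noisy waveform and its clean counterpart pass through the *same* feature encoder, and the clean-path features serve as targets for the noisy path. The sketch below is a minimal, hypothetical illustration of that idea only; the toy `encoder`, the frame shapes, and the squared-error objective are assumptions for clarity, not the paper's actual convolutional encoder or contrastive loss.

```python
import math
import random

def encoder(frames, weights):
    # Toy shared feature encoder: one linear layer + tanh per frame.
    # Stands in for the wav2vec2.0 convolutional encoder (assumption).
    return [[math.tanh(sum(w * x for w, x in zip(row, f))) for row in weights]
            for f in frames]

def training_step(noisy_frames, clean_frames, weights):
    # Both views go through the SAME encoder (shared weights), so the
    # clean speech provides the targets for the noisy speech.
    z_noisy = encoder(noisy_frames, weights)
    z_clean = encoder(clean_frames, weights)  # targets; detached in practice
    # Mean squared distance between noisy features and clean targets
    # (a stand-in for the real pre-training objective).
    loss = sum((a - b) ** 2
               for zn, zc in zip(z_noisy, z_clean)
               for a, b in zip(zn, zc)) / (len(z_noisy) * len(z_noisy[0]))
    return loss

random.seed(0)
clean = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]   # 5 frames, dim 4
noisy = [[x + random.gauss(0, 0.1) for x in f] for f in clean]       # additive noise
W = [[random.gauss(0, 0.5) for _ in range(4)] for _ in range(3)]     # 3 output dims
print(training_step(noisy, clean, W))
```

Minimizing this loss pushes the encoder's representation of noisy speech toward its representation of the clean version, which is why performance on clean input need not degrade: the clean path anchors the feature space.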