Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize more unlabeled data to improve the performance of automatic speech recognition (ASR). However, it remains unclear how combining the two pre-training tasks and constructing different negative samples for contrastive learning affect robustness. In this paper, we propose a noise-robust data2vec for self-supervised speech representation learning by jointly optimizing the contrastive learning and regression tasks in the pre-training stage. Furthermore, we present two improvements to facilitate contrastive learning. More specifically, we first propose constructing patch-based non-semantic negative samples to boost the noise robustness of the pre-trained model, which is achieved by dividing the features into patches of different sizes that serve as negative samples. Second, by analyzing the distribution of positive and negative samples, we propose removing the easily distinguishable negative samples to improve the discriminative capacity of the pre-trained model. Experimental results on the CHiME-4 dataset show that our method improves the performance of the pre-trained model in noisy scenarios. We find that joint training of the contrastive learning and regression tasks can mitigate model collapse to some extent compared with training the regression task alone.
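To make the joint objective concrete, the following is a minimal PyTorch-style sketch of how a regression loss could be combined with a contrastive loss that uses patch-based negatives and discards easy negatives. It is an illustration under stated assumptions, not the paper's implementation: the function name joint_ssl_loss, the weighting factor lambda_ctr, the patch sizes, and the easy-negative threshold are all hypothetical.

import torch
import torch.nn.functional as F

def joint_ssl_loss(student_feats, teacher_feats,
                   lambda_ctr=1.0, patch_sizes=(2, 4), easy_threshold=-0.5):
    """Combine a data2vec-style regression loss with a contrastive loss
    that uses patch-based negatives and drops easy negatives.

    student_feats / teacher_feats: (T, D) frame representations of the
    student and the EMA teacher at the masked positions.
    lambda_ctr, patch_sizes and easy_threshold are illustrative names,
    not hyper-parameters taken from the paper.
    """
    T, D = teacher_feats.shape

    # Regression task: regress student outputs onto the teacher targets.
    reg_loss = F.mse_loss(student_feats, teacher_feats)

    # Patch-based non-semantic negatives: average teacher frames inside
    # patches of different sizes and use the patch vectors as negatives.
    negatives = []
    for p in patch_sizes:
        usable = (T // p) * p
        patches = teacher_feats[:usable].reshape(-1, p, D).mean(dim=1)
        negatives.append(patches)
    negatives = torch.cat(negatives, dim=0)                          # (N, D)

    # Cosine similarity of each student frame to its aligned teacher frame
    # (positive) and to every patch negative.
    pos_sim = F.cosine_similarity(student_feats, teacher_feats, dim=-1)    # (T,)
    neg_sim = F.cosine_similarity(student_feats.unsqueeze(1),
                                  negatives.unsqueeze(0), dim=-1)          # (T, N)

    # Remove easily distinguishable negatives: candidates whose similarity
    # to the anchor is already very low contribute little and are masked out.
    easy = neg_sim < easy_threshold
    neg_sim = neg_sim.masked_fill(easy, float('-inf'))

    # InfoNCE-style classification of the positive against the kept negatives.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)       # (T, 1 + N)
    targets = torch.zeros(T, dtype=torch.long, device=logits.device)
    ctr_loss = F.cross_entropy(logits, targets)

    return reg_loss + lambda_ctr * ctr_loss

# Example with random features standing in for masked-frame representations.
student = torch.randn(50, 256)
teacher = torch.randn(50, 256)
print(joint_ssl_loss(student, teacher))

The key design choice sketched here is that the regression term anchors the student to the teacher targets while the contrastive term, computed only over the retained hard negatives, supplies the discriminative signal that helps counteract collapse.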