While Self-Supervised Learning has helped leverage the scale of the available unlabeled data, the learning paradigms themselves are continually being improved. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The cross-contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets of LibriSpeech, respectively, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data. We make all our code publicly available on GitHub.
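To make the cross-contrastive idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes a wav2vec 2.0-style contrastive term in which other time steps of the same utterance act as negatives, and it omits the clustering module that down-weights negatives similar to the positive. All names (`contrastive_loss`, `cross_contrastive_loss`, `c_orig`, `q_aug`, `alpha`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, temperature=0.1):
    """Contrastive loss over one utterance: for each time step, the positive
    is the quantized vector at the same step; all other steps are negatives.
    (Sketch only; real wav2vec 2.0 samples a fixed number of distractors.)"""
    context = F.normalize(context, dim=-1)       # (T, D) encoder outputs
    quantized = F.normalize(quantized, dim=-1)   # (T, D) quantizer outputs
    logits = context @ quantized.t() / temperature   # (T, T) cosine similarities
    targets = torch.arange(context.size(0))          # positive = matching time step
    return F.cross_entropy(logits, targets)

def cross_contrastive_loss(c_orig, q_orig, c_aug, q_aug, alpha=0.5):
    """Loss on the original view plus the two cross terms described above:
    encoder output of the original vs. quantized augmentation, and
    encoder output of the augmentation vs. quantized original."""
    l_same = contrastive_loss(c_orig, q_orig)
    l_cross = contrastive_loss(c_orig, q_aug) + contrastive_loss(c_aug, q_orig)
    return l_same + alpha * l_cross

# Usage with dummy features: T time steps of D-dimensional representations.
T, D = 50, 256
c_orig, q_orig = torch.randn(T, D), torch.randn(T, D)
c_aug, q_aug = torch.randn(T, D), torch.randn(T, D)
print(cross_contrastive_loss(c_orig, q_orig, c_aug, q_aug).item())
```

The cross terms tie the representations of the clean and augmented views together, which is the intended source of the added robustness; the relative weighting (`alpha` here) is a hypothetical knob standing in for whatever weighting the actual method uses.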