Self-supervised speech recognition models require considerable labeled training data to learn high-fidelity representations for Automatic Speech Recognition (ASR), which is computationally demanding and time-consuming. We consider the task of identifying an optimal subset of data for efficient fine-tuning of self-supervised speech models for ASR. We discover that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection when fine-tuning self-supervised ASR models. We then present the COWERAGE algorithm for representative subset selection in self-supervised ASR. COWERAGE is based on our finding that ensuring coverage of examples across training Word Error Rate (WER) values in the early training epochs leads to better generalization performance. Extensive experiments with the wav2vec 2.0 and HuBERT models on the TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE and its transferability across models, with up to 17% relative WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that covering training instances in terms of WER values ensures the inclusion of phonemically diverse examples, leading to better test accuracy in self-supervised speech recognition models.
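To make the coverage idea concrete, the following is a minimal sketch of WER-based coverage sampling, assuming per-utterance training WER values from an early fine-tuning epoch are already available. The equal-width bucketing scheme and the names `select_coverage_subset`, `num_buckets`, and `budget` are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Sketch: select a fine-tuning subset whose early-epoch training WERs
# cover the full range from easy to hard examples.
# NOTE: bucketing scheme and function/parameter names are assumptions,
# not the COWERAGE paper's exact procedure.
import random
from typing import List, Tuple


def select_coverage_subset(
    examples_with_wer: List[Tuple[str, float]],  # (utterance id, early-epoch training WER)
    budget: int,                                  # number of examples to keep
    num_buckets: int = 10,
    seed: int = 0,
) -> List[str]:
    """Sample a subset whose early-epoch WER values span the observed WER range."""
    rng = random.Random(seed)
    wers = [w for _, w in examples_with_wer]
    lo, hi = min(wers), max(wers)
    width = (hi - lo) / num_buckets or 1.0  # guard against all-equal WERs

    # Group utterances into equal-width WER buckets.
    buckets: List[List[str]] = [[] for _ in range(num_buckets)]
    for utt_id, wer in examples_with_wer:
        idx = min(int((wer - lo) / width), num_buckets - 1)
        buckets[idx].append(utt_id)

    # Draw roughly the same number of utterances from every non-empty bucket,
    # so easy, medium, and hard examples are all represented.
    non_empty = [b for b in buckets if b]
    per_bucket = max(budget // len(non_empty), 1)
    selected: List[str] = []
    for bucket in non_empty:
        selected.extend(rng.sample(bucket, min(per_bucket, len(bucket))))

    # Top up with random unselected utterances if rounding left us under budget.
    if len(selected) < budget:
        chosen = set(selected)
        leftover = [u for u, _ in examples_with_wer if u not in chosen]
        selected.extend(rng.sample(leftover, min(budget - len(selected), len(leftover))))
    return selected[:budget]
```

In this sketch, coverage is enforced by stratifying on early-epoch WER rather than ranking examples by a single difficulty score, which mirrors the abstract's claim that spanning the WER range (and hence phonemic diversity) matters more than picking only the "most informative" examples.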