Self-supervised learning (SSL) can leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when only a small amount of transcribed speech is available. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates several unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget. We study the impact of speaker diversity, gender bias, and topic diversity on downstream ASR performance. We also devise two novel techniques for unsupervised data selection: pre-training-loss-based data selection and the perplexity of byte-pair-encoded clustered units (PBPE), and we show how these techniques compare to pure random data selection. Finally, we analyze the correlations among the inherent characteristics of the selected fine-tuning subsets, as well as how these characteristics correlate with the resulting word error rate (WER). We demonstrate the importance of token diversity, speaker diversity, and topic diversity in achieving the best performance in terms of WER.
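The abstract only names the PBPE technique, so as a rough illustration, below is a minimal Python sketch of how perplexity over byte-pair-encoded pseudo-label units could drive selection. Everything in it is an assumption rather than the paper's method: the toy pair-merge BPE loop, the add-one-smoothed unigram model standing in for a real language model, and the hypothetical `corpus` of cluster-ID strings; the actual pipeline details (HuBERT k-means units, BPE vocabulary size, LM choice, and whether low- or high-perplexity utterances are kept) are not specified in this section.

```python
# Sketch of PBPE-style data selection, assuming each utterance has already been
# mapped to a string of discrete cluster units (e.g. HuBERT k-means IDs).
import math
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with a merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # concatenate the two unit labels
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe_merges(sequences, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent unit pair."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def encode(seq, merges):
    """Apply the learned merges, in order, to a raw unit sequence."""
    seq = list(seq)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

def perplexity(tokens, unigram, total):
    """Unigram-LM perplexity with add-one smoothing: exp(mean negative log-prob)."""
    vocab = len(unigram)
    nll = sum(-math.log((unigram[t] + 1) / (total + vocab + 1)) for t in tokens)
    return math.exp(nll / max(len(tokens), 1))

# Hypothetical data: one pseudo-label string of cluster IDs per utterance.
corpus = {"utt1": "aabbaabbcc", "utt2": "abcabcabc", "utt3": "aaaaaaaabb"}
merges = learn_bpe_merges(corpus.values(), num_merges=4)
encoded = {uid: encode(s, merges) for uid, s in corpus.items()}
unigram = Counter(t for toks in encoded.values() for t in toks)
total = sum(unigram.values())

# Rank utterances by PBPE; a transcription budget would then keep one end of
# this ranking (which end is a design choice this sketch does not settle).
ranked = sorted(corpus, key=lambda uid: perplexity(encoded[uid], unigram, total))
print(ranked)
```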