Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). Despite these advances, to the best of our knowledge, there is no analysis of how the composition of the labeled and unlabeled datasets used in these methods affects the results. In this work we analyse the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL). We perform a systematic analysis on both labeled and unlabeled data by varying the number of speakers while keeping the number of hours fixed, and vice versa. Our findings suggest that SSL requires a large amount of unlabeled data to produce high-accuracy results, while ST requires a sufficient number of speakers in the labeled data, especially in the low-resource setting. In this manner, these two approaches improve supervised learning in different regimes of data composition.