Many automatic speech recognition (ASR) datasets include a single predefined test set consisting of one or more speakers whose speech never appears in the training set. This "hold-speaker(s)-out" data partitioning strategy, however, may not be ideal for datasets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies greatly depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to the WER of any individual random split; (3) WER is likewise generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive of variability, regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.
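The two partitioning strategies the abstract compares, along with the WER metric, can be sketched minimally as follows. This is an illustrative sketch, not the study's actual pipeline: the toy utterance records, speaker labels, and function names are all hypothetical, and WER is computed here as word-level Levenshtein distance normalized by reference length.

```python
import random

# Hypothetical utterance records (speaker_id, utterance_id); real
# low-resource ASR corpora similarly have only a handful of speakers.
utterances = [(spk, f"{spk}_utt{i}") for spk in "ABCD" for i in range(5)]

def hold_speaker_out(data, test_speaker):
    """Hold-speaker-out split: every utterance by one speaker goes to test,
    so train and test speaker sets are disjoint."""
    train = [u for u in data if u[0] != test_speaker]
    test = [u for u in data if u[0] == test_speaker]
    return train, test

def random_split(data, test_fraction=0.25, seed=0):
    """Random split: test utterances drawn uniformly at the utterance level,
    so the same speakers generally appear in both train and test."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(r)][len(h)] / len(r)
```

Averaging WER over each choice of held-out speaker (four splits here) versus over several seeds of `random_split` mirrors the comparison the study performs, at the level of the split logic only.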