Selecting data that matches the application scenario is important for automatic speech recognition (ASR) training, but the degree to which a training corpus matches the target scenario is difficult to measure. This study proposes an unsupervised target-aware data selection method based on speech corpora divergence (SCD), which measures the similarity between two speech corpora. We first use the self-supervised HuBERT model to discretize the speech corpora into label sequences and calculate N-gram probability distributions. We then compute the Kullback-Leibler divergence between the N-gram distributions as the SCD. Finally, we choose the subset with the minimum SCD to the target corpus for annotation and training. Compared to previous data selection methods, SCD-based selection can capture more acoustic detail and guarantees the diversity of the selected set. We evaluate our method on different accents from Common Voice. Experiments show that the proposed SCD data selection achieves a 14.8% relative improvement over random selection, comparable or even superior to supervised selection.
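The core computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the speech has already been discretized into sequences of HuBERT cluster IDs, and it assumes the KL direction is KL(target || candidate) with add-alpha smoothing over the shared N-gram support; the paper does not pin down these details here.

```python
from collections import Counter
from math import log

def ngram_dist(sequences, n=3, vocab=None, alpha=1e-6):
    """Add-alpha-smoothed N-gram probability distribution over discrete
    label sequences (e.g. HuBERT cluster IDs). `vocab` fixes the support
    so two corpora can be compared on the same event set."""
    counts = Counter()
    for seq in sequences:
        counts.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    keys = vocab if vocab is not None else set(counts)
    total = sum(counts[k] for k in keys) + alpha * len(keys)
    return {k: (counts[k] + alpha) / total for k in keys}

def scd(candidate_seqs, target_seqs, n=3):
    """Speech corpora divergence sketch: KL(target || candidate) computed
    over the union of both corpora's N-gram supports. Lower SCD means the
    candidate subset is closer to the target corpus."""
    support = set()
    for seqs in (candidate_seqs, target_seqs):
        for seq in seqs:
            support.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    p = ngram_dist(target_seqs, n, vocab=support)     # target distribution
    q = ngram_dist(candidate_seqs, n, vocab=support)  # candidate distribution
    return sum(p[k] * log(p[k] / q[k]) for k in support)
```

Selection would then score each candidate subset with `scd(subset, target)` and keep the one with the minimum value for annotation and training.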