We study the task of personalizing ASR models to a target non-native speaker/accent under a transcription budget on the total duration of utterances selected from a large unlabelled corpus. We propose a subset selection approach based on the recently proposed submodular mutual information (SMI) functions: the target is specified through a few target utterances, and we model the relationship between this target set and the selected subset using SMI functions to identify a diverse set of utterances that match the target speaker/accent. We apply this method at both the speaker and the accent level, and personalize the model by fine-tuning it on the utterances selected and transcribed from the unlabelled corpus. Our method consistently identifies utterances from the target speaker/accent using speech features alone. We show that targeted subset selection improves over random sampling by as much as 2% to 5% (absolute), depending on the speaker and accent, and is 2x to 4x more label-efficient than random sampling. We also compare against a skyline that selects specifically from the target, and our method generally outperforms this oracle in its selections.
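The targeted selection described above can be sketched with one common SMI instantiation, a facility-location-style mutual information between the few target (query) utterances and the unlabelled candidates, maximized greedily under a duration budget. This is a minimal illustration, not the paper's exact implementation: the similarity matrix, the specific SMI function, and all names here are assumptions for the sketch (similarities would typically come from speech features such as speaker or accent embeddings).

```python
import numpy as np

def select_targeted_subset(S, durations, budget):
    """Greedily maximize a facility-location-style submodular mutual
    information, I(A; Q) = sum_i max_{j in A} S[i, j] + sum_{j in A} max_i S[i, j],
    where S[i, j] is the similarity between target (query) utterance i and
    unlabelled candidate utterance j, subject to a total-duration budget.

    S: (n_queries, n_candidates) similarity matrix (assumed nonnegative)
    durations: per-candidate duration in seconds
    budget: transcription budget on total selected duration
    """
    n_queries, n_cands = S.shape
    cover = np.zeros(n_queries)      # current best similarity per query
    remaining = set(range(n_cands))
    selected, used = [], 0.0
    while remaining:
        best_j, best_gain = None, 0.0
        for j in sorted(remaining):  # deterministic tie-breaking
            if used + durations[j] > budget:
                continue
            # Marginal gain: improvement in query coverage (first term)
            # plus the candidate's own best match to any query (second
            # term, which is additive over selected items).
            gain = np.maximum(S[:, j] - cover, 0.0).sum() + S[:, j].max()
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:           # nothing fits in the remaining budget
            break
        selected.append(best_j)
        used += durations[best_j]
        cover = np.maximum(cover, S[:, best_j])
        remaining.remove(best_j)
    return selected
```

With a tight budget the greedy picks the candidate most similar to the target queries first, which is the "targeted" behaviour the abstract describes; diversity enters because, once a query is well covered, further near-duplicates of it contribute little marginal gain.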