Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of the training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a $120$-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation of the audio modality using M2DS2 and of the language modality using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.