The pre-trained multilingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, performance degrades significantly when the languages are not very distinct from each other, as in the case of dialects. Low-resource dialect classification remains a challenging problem. We present a new data augmentation method that leverages the model training dynamics of individual data points to improve sampling for latent mixup. The method works well in low-resource settings where generalization is paramount. Our datamaps-based mixup technique, which we call Map-Mix, improves weighted F1 scores by 2% over the random mixup baseline and yields a significantly better-calibrated model. The code for our method is open-sourced at https://github.com/skit-ai/Map-Mix.
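To make the core idea concrete, below is a minimal PyTorch sketch of datamaps-guided latent mixup. It is an illustration under stated assumptions, not the authors' implementation (which lives in the linked repository): the helper names `datamap_stats` and `map_mix` are hypothetical, and the sampling score here (favoring low-confidence examples) is one plausible choice; the paper's actual partner-sampling strategy may differ.

```python
import torch

def datamap_stats(epoch_probs):
    # epoch_probs: (num_epochs, num_examples) tensor of gold-label
    # probabilities recorded across training epochs, as in dataset
    # cartography ("datamaps").
    confidence = epoch_probs.mean(dim=0)   # mean gold-label probability
    variability = epoch_probs.std(dim=0)   # spread across epochs
    return confidence, variability

def map_mix(latents, labels, confidence, alpha=0.4):
    # latents: (batch, dim) encoder outputs (e.g., pooled XLSR features)
    # labels:  (batch, num_classes) one-hot targets
    # confidence: (batch,) datamap confidence for the batch examples
    # Hypothetical sampling score: prefer harder (low-confidence) partners
    # instead of sampling mixup pairs uniformly at random.
    weights = 1.0 - confidence
    idx = torch.multinomial(weights, latents.size(0), replacement=True)
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_x = lam * latents + (1 - lam) * latents[idx]
    mixed_y = lam * labels + (1 - lam) * labels[idx]
    return mixed_x, mixed_y
```

The only difference from standard latent mixup is the partner distribution: uniform random pairing is replaced by sampling weighted with training-dynamics statistics, which is what distinguishes Map-Mix from the random mixup baseline reported above.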