Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, in both zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.