Bangla is spoken in a rich variety of regional dialects that contribute to the cultural diversity of the Bangla-speaking community. Although translation from Bangla to English, English to Bangla, and Banglish to Bangla has been studied extensively, translating Bangla regional dialects into standard Bangla remains a noticeable gap. To fill this gap, we build a dataset of 32,500 sentences covering Bangla, Banglish, and English and representing five regional Bangla dialects. Our aim is to translate these regional dialects into standard Bangla and to identify their region of origin accurately. For these two tasks we propose two novel models: DialectBanglaT5, which translates regional dialects into standard Bangla, and DialectBanglaBERT, which identifies a dialect's region of origin. DialectBanglaT5 performs strongly across all dialects, with its best results on the Mymensingh dialect: the highest BLEU score (71.93) and METEOR score (0.8503), and the lowest WER (0.1470) and CER (0.0791). It also achieves strong ROUGE scores across all dialects, indicating both accuracy and fluency in capturing dialectal nuances. DialectBanglaBERT achieves an overall region classification accuracy of 89.02%, with notable F1-scores of 0.9241 for Chittagong and 0.8736 for Mymensingh, confirming its effectiveness in handling regional linguistic variation. This is the first large-scale investigation focused on Bangla regional dialect translation and region detection. Our proposed models highlight the potential of dialect-specific modeling and set a new benchmark for future research in low-resource and dialect-rich language settings.
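For readers unfamiliar with the WER and CER figures reported above, the following is a minimal sketch of how the two error rates are commonly defined: the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length, counted in words for WER and in characters for CER. It is not taken from the paper. The function names `edit_distance`, `wer`, and `cer` are illustrative, and the sketch assumes whitespace tokenization and counts characters as Unicode code points, which for Bangla script may differ from a count of grapheme clusters.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, assuming whitespace tokenization (an assumption of this sketch)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (an assumption of this sketch)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions lower is better and 0 means an exact match, so a WER of 0.1470 corresponds to roughly 15 word-level edits per 100 reference words. Both rates can exceed 1 when the hypothesis is much longer than the reference.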