Bangla -- ranked the 6th most widely spoken language in the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered a low-resource language in the natural language processing (NLP) community. Despite three decades of research, Bangla NLP (BNLP) continues to lag behind, mainly due to the scarcity of resources and the challenges that come with it. There is sparse work across different areas of BNLP; however, a thorough survey reporting previous work and recent advances has yet to be conducted. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we then benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual and multilingual models of varying sizes. We report results on both individual and consolidated datasets and provide data splits for future research. In total, we reviewed 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational cost. We hope that this comprehensive survey will motivate the community to build on and further advance research on Bangla NLP.