This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings, a fundamental task in Natural Language Processing (NLP). Despite being the 7th most-spoken language in the world, Bangla is a low-resource language on which popular NLP models often perform poorly. Developing a reliable evaluation test set for Bangla word embeddings is crucial for benchmarking and for guiding future research. We provide a Mikolov-style word analogy evaluation set specifically for Bangla, comprising 16,678 samples, as well as a translated and curated version of the Mikolov dataset containing 10,594 samples for cross-lingual research. Our experiments with several state-of-the-art embedding models reveal that Bangla has its own unique characteristics, and that current Bangla embeddings still struggle to achieve high accuracy on both datasets. We suggest that future research focus on training models with larger corpora and on accounting for the unique morphological characteristics of Bangla. This study represents a first step towards building reliable NLP systems for the Bangla language.
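The Mikolov-style analogy task mentioned above scores an embedding by the vector-offset method: for a question "a is to b as c is to ?", the model predicts the vocabulary word whose vector is closest (by cosine similarity) to b − a + c. A minimal sketch of this scoring procedure, using made-up toy vectors rather than real Bangla embeddings:

```python
import numpy as np

# Toy embedding table for illustration only; a real evaluation would
# load trained Bangla vectors (e.g. fastText). These values are made up.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
    "queen": np.array([0.2, 0.8, 0.8]),
    "boy":   np.array([0.7, 0.2, 0.2]),
}

def solve_analogy(a, b, c, emb):
    """Return the word maximizing cos(v, b - a + c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # standard practice: exclude the question words
        sim = float(target @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# "man is to king as woman is to ?" -> "queen"
print(solve_analogy("man", "king", "woman", emb))
```

Dataset accuracy is then simply the fraction of analogy questions for which the predicted word matches the gold answer; morphologically rich analogies (e.g. Bangla inflection patterns) are typically where such offsets break down.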