This work presents BanglaNLG, a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language in the web domain. We aggregate three challenging conditional text generation tasks under the BanglaNLG benchmark. Then, using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming mT5 (base) by up to 5.4%. We are making the BanglaT5 language model and a leaderboard publicly available in the hope of advancing future research and evaluation on Bangla NLG. The resources can be found at https://github.com/csebuetnlp/BanglaNLG.