This study focuses on generating Persian named entity recognition datasets by applying machine translation to English datasets. The generated datasets were evaluated with one monolingual and one multilingual transformer model. Notably, the CoNLL 2003 dataset achieved the highest F1 score, 85.11%, whereas the WNUT 2017 dataset yielded the lowest, 40.02%. These results highlight the potential of machine translation for creating high-quality named entity recognition datasets for low-resource languages such as Persian. The study also compares performance on the generated datasets with that of English named entity recognition systems and provides insights into the effectiveness of machine translation for this task. Additionally, the approach could be used to augment data in low-resource languages or to create noisy data that makes named entity recognition systems more robust.
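The F1 scores reported above are not defined in the abstract. As a minimal sketch, assuming they are entity-level micro-averaged F1 over BIO-tagged sequences (the usual convention for CoNLL-style evaluation, but an assumption here), the metric could be computed as follows. The function names `extract_spans` and `entity_f1` are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    Assumption: an I- tag that does not continue the current entity
    is treated leniently as the start of a new entity.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact span matches (type and boundaries)."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```

Under this definition a predicted entity counts as correct only if both its type and its exact boundaries match the gold annotation, which is one reason noisy, user-generated text such as WNUT 2017 tends to score far lower than newswire data.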