Exploring the Potential of Machine Translation for Generating Named Entity Datasets: A Case Study between Persian and English

This study focuses on generating Persian named entity datasets by applying machine translation to English datasets. The generated datasets were evaluated with one monolingual and one multilingual transformer model. Notably, the CoNLL 2003 dataset achieved the highest F1 score, 85.11%, while the WNUT 2017 dataset yielded the lowest, 40.02%. These results highlight the potential of machine translation for creating high-quality named entity recognition datasets for low-resource languages such as Persian. The study compares the performance obtained on the generated datasets with that of English named entity recognition systems and provides insight into the effectiveness of machine translation for this task. The approach could also be used to augment data in low-resource languages, or to create noisy data that makes named entity recognition systems more robust.

