Named Entity Recognition (NER) is a fundamental NLP task with a wide range of practical applications. The performance of state-of-the-art NER methods depends on high-quality manually annotated datasets, which still do not exist for some languages. In this work we aim to remedy this situation for Slovak by introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models and comparing it to the existing silver-standard Slovak NER dataset. We also conduct few-shot experiments and show that training on a silver-standard dataset yields better results. To enable future work based on Slovak NER, we release the dataset, the code, and the trained models publicly under permissive licensing terms at https://github.com/NaiveNeuron/WikiGoldSK.