Most Named Entity Recognition (NER) models operate under the assumption that training datasets are fully labelled. While this assumption holds for established datasets such as CoNLL 2003 and OntoNotes, obtaining complete annotation is not always feasible. Such situations arise, for instance, when entities are annotated selectively to reduce cost. This work presents an approach to fine-tuning BERT on such partially labelled datasets using self-supervision and label preprocessing. Our approach outperforms the previous LSTM-based label preprocessing baseline, significantly improving performance on poorly labelled datasets. We demonstrate that, following our approach, fine-tuning RoBERTa on the CoNLL 2003 dataset with only 10% of the entities labelled is enough to reach the performance of the baseline trained on the same dataset with 50% of the entities labelled.
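To make the partial-annotation setting concrete, the sketch below illustrates one standard way to handle it; this is an illustration under assumed conventions (the "roberta-base" checkpoint, a toy label set, and the PyTorch/Hugging Face -100 ignore-index convention), not the paper's exact pipeline. The idea is that tokens whose annotation status is unknown cannot safely be treated as "O", so they are excluded from the token-classification loss during fine-tuning.

```python
# Minimal sketch: fine-tune a token-classification model on partially labelled
# data by masking unannotated words out of the loss (assumption: standard
# Hugging Face API; -100 is the ignore index of PyTorch's cross entropy).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "roberta-base"          # assumed checkpoint; any BERT-family model works
LABELS = ["O", "B-PER", "I-PER"]     # toy label set for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

words = ["John", "visited", "Paris"]
# Partial annotation: "John" was labelled, "Paris" was never reviewed (None).
word_labels = [LABELS.index("B-PER"), LABELS.index("O"), None]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
labels = []
for word_id in enc.word_ids(batch_index=0):
    if word_id is None or word_labels[word_id] is None:
        labels.append(-100)          # special tokens and unannotated words: no loss
    else:
        labels.append(word_labels[word_id])

outputs = model(**enc, labels=torch.tensor([labels]))
print(outputs.loss)                  # loss is computed only on trusted labels
```

A self-supervised variant of this setup could additionally fill the masked positions with model predictions (pseudo-labels) in later training rounds instead of ignoring them outright.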