Identifying named entities, such as persons, locations or organizations, in documents can highlight key information for readers. Training Named Entity Recognition (NER) models requires an annotated data set, and producing one can be a time-consuming, labour-intensive task. Nevertheless, publicly available NER data sets exist for general English. Recently there has been interest in developing NER for legal text. However, prior work and the experimental results reported here indicate that performance degrades significantly when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus and testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
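For readers unfamiliar with how an F1-score is typically computed for NER, the following is a minimal sketch of entity-level (exact span match) F1 over BIO-tagged sequences. It is an illustration only: the abstract does not state the exact evaluation protocol, and the function names `bio_to_spans` and `span_f1`, as well as the toy tag sequences, are assumptions introduced here.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); hypothetical helper type
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))
            start, etype = None, None
            # Assumption: a stray "I-" tag is treated as starting a new span.
            if tag.startswith(("B-", "I-")):
                start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 for one gold/predicted tag sequence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example (invented tags, not taken from E-NER or CoNLL-2003)
gold_tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
pred_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
score = span_f1(gold_tags, pred_tags)
```

In this toy case one of the two predicted spans matches one of the two gold spans, so precision and recall are both 0.5. A cross-domain comparison of the kind described above would compute such a score for a model trained on one corpus and a model trained on the other, both evaluated on the same test collection.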