Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach built on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate the approaches with respect to the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variations of the IOB tagging format. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all three approaches perform well in terms of F1 scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
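To make the notions of joint labelling and tagging format concrete, the following minimal sketch encodes a nested postal address as one joint label per token, under both IOB (here, the IOB2 variant) and IO tagging. The entity names (`ADDRESS`, `STREET`, `NUMBER`), the `+` separator, and the helper functions are illustrative assumptions, not the label set or encoding used in this work.

```python
from typing import List, Tuple

# (start, end_exclusive, label): a hypothetical span representation
Span = Tuple[int, int, str]


def tag_level(n_tokens: int, spans: List[Span], scheme: str = "IOB") -> List[str]:
    """Tag one nesting level. With "IOB" (IOB2 variant) the first token of
    each entity gets a B- prefix; with "IO" every entity token gets I-."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            prefix = "B" if (scheme == "IOB" and i == start) else "I"
            tags[i] = f"{prefix}-{label}"
    return tags


def joint_labels(levels: List[List[str]]) -> List[str]:
    """Merge the per-level tags of each token into a single joint label,
    so that a flat token classifier can predict all levels at once."""
    return ["+".join(column) for column in zip(*levels)]


# Illustrative directory-like entry: a name followed by an address.
tokens = ["Dupont", "rue", "de", "Rivoli", "12"]
outer = [(1, 5, "ADDRESS")]                    # top-level entity
inner = [(1, 4, "STREET"), (4, 5, "NUMBER")]   # entities nested inside it


def encode(scheme: str) -> List[str]:
    n = len(tokens)
    return joint_labels([tag_level(n, outer, scheme), tag_level(n, inner, scheme)])


if __name__ == "__main__":
    for scheme in ("IOB", "IO"):
        print(scheme, list(zip(tokens, encode(scheme))))
```

In this toy entry, IO tagging yields fewer distinct joint labels than IOB, because it drops the begin/inside distinction. The trade-off is that IO cannot separate two adjacent entities of the same type.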