Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques, such as text classification, text summarisation and relation extraction (RE), are commonly used to achieve this. Among these techniques, RE is the most common, since it can be used directly to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, in which ML models are trained on annotated datasets. However, few annotated datasets exist for RE, since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is compiled automatically by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and we evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but, as we discuss at the end of this paper, it can be useful for other purposes as well.
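The alignment idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: plain case-insensitive string matching stands in for NER, and the names `Fact`, `LabelledPair` and `align`, the relation labels, and the example sentences are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One structured record, e.g. a row drawn from a knowledge base."""
    subject: str
    relation: str  # hypothetical label, not necessarily one used in the dataset
    value: str


@dataclass(frozen=True)
class LabelledPair:
    """A sentence annotated with an entity pair and the relation assumed to hold."""
    sentence: str
    subject: str
    value: str
    relation: str


def align(sentences, facts):
    """Label a sentence with a fact's relation when it mentions both
    the fact's subject and its value.

    Assumption: substring matching replaces the NER step, so this is
    far less precise than a real entity recogniser would be.
    """
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        for fact in facts:
            if fact.subject.lower() in lowered and fact.value.lower() in lowered:
                pairs.append(
                    LabelledPair(sentence, fact.subject, fact.value, fact.relation)
                )
    return pairs


# Illustrative inputs only; not taken from the dataset.
facts = [
    Fact("Marie Curie", "birthplace", "Warsaw"),
    Fact("Marie Curie", "occupation", "physicist"),
]
sentences = [
    "Marie Curie was born in Warsaw in 1867.",
    "She shared the 1903 Nobel Prize in Physics.",
    "Marie Curie was a physicist and chemist.",
]
pairs = align(sentences, facts)
```

In this sketch the second sentence is skipped because it never names the subject explicitly. Limits of this kind are one reason automatically aligned data is normally checked against a manually annotated gold standard.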