Semantic annotation of long texts, such as novels, remains an open challenge in Natural Language Processing (NLP). This research investigates the problem of detecting person entities and assigning them unique identities, i.e., recognizing people (especially main characters) in novels. We present a method for person entity linking (named entity recognition and disambiguation) and new testing datasets. The datasets comprise 1,300 sentences from 13 classic novels of different genres, manually annotated by a novel reader. Our process of identifying literary characters in a text, implemented in protagonistTagger, comprises two stages: (1) named entity recognition (NER) of persons, and (2) named entity disambiguation (NED), which matches each recognized person with the literary character's full name using approximate text matching. The protagonistTagger achieves precision and recall above 83% on the prepared testing sets. Finally, we gathered a corpus of 13 full-text novels tagged with protagonistTagger, comprising more than 35,000 mentions of literary characters.
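The disambiguation stage can be illustrated with a minimal sketch. It assumes the NER stage has already produced person mentions and that a list of full character names is available. The function names, the 0.6 threshold, and the use of Python's `difflib.SequenceMatcher` are assumptions made for illustration; the abstract does not specify which approximate matching algorithm protagonistTagger uses.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (stand-in for approximate matching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_mention(mention: str, characters: list[str], threshold: float = 0.6) -> Optional[str]:
    """Return the full character name that best matches a mention, or None.

    Hypothetical helper: the threshold is an illustrative value, not one
    reported by the authors.
    """
    best_name, best_score = None, 0.0
    for full_name in characters:
        # Compare against the full name and each of its parts, so that a
        # surname-only mention can still match the full name.
        candidates = [full_name] + full_name.split()
        score = max(similarity(mention, c) for c in candidates)
        if score > best_score:
            best_name, best_score = full_name, score
    return best_name if best_score >= threshold else None


characters = ["Elizabeth Bennet", "Fitzwilliam Darcy", "Charles Bingley"]
print(link_mention("Darcy", characters))  # Fitzwilliam Darcy
```

A mention that resembles no known character falls below the threshold and is left unlinked, which trades some recall for precision.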