After decades of massive digitisation, an unprecedented number of historical documents are available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining. The next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet named entity recognition (NER) systems are heavily challenged by diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.