Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts poses a major challenge. To obtain machine-readable corpora, historical texts are usually scanned and processed with optical character recognition (OCR); as a result, historical corpora contain errors. Moreover, entities such as locations or organizations can change over time, which poses a further challenge. Overall, historical texts exhibit several peculiarities that differ greatly from modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for labeled data by using unlabeled data to pretrain a language model. We propose hmBERT, a historical multilingual BERT-based language model, and publicly release it in several sizes. Furthermore, we evaluate the capability of hmBERT by solving downstream NER as part of this year's HIPE-2022 shared task and provide detailed analysis and insights. On the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.
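To illustrate how a released hmBERT checkpoint can be put to work for downstream NER, the following is a minimal sketch using the Hugging Face transformers library. It assumes the model is published on the Hub under the ID dbmdz/bert-base-historic-multilingual-cased and uses a generic coarse-grained PER/LOC/ORG label set in the BIO scheme; both the model ID and the label set are assumptions, and the actual HIPE-2022 datasets define their own annotation schemes.

```python
# Sketch: loading hmBERT with a token-classification head for historical NER.
# Model ID and label set are assumptions, not taken from the abstract itself.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "dbmdz/bert-base-historic-multilingual-cased"  # assumed Hub ID

# Coarse-grained entity types (persons, locations, organizations) in BIO scheme.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=len(labels)
)

# Tag a short historical German sentence. Note: the classification head is
# randomly initialized here; in practice it is fine-tuned on labeled NER
# data (e.g., the HIPE-2022 training splits) before producing useful tags.
inputs = tokenizer("Der Kaiser reiste nach Wien .", return_tensors="pt")
logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{labels[pred]}")
```

This reflects the standard pretrain-then-fine-tune recipe the abstract describes: the expensive pretraining on unlabeled historical corpora is done once, and only the lightweight classification head plus fine-tuning requires labeled data.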