Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts poses a significant challenge. To obtain machine-readable corpora, historical texts are usually scanned and processed with Optical Character Recognition (OCR); as a result, historical corpora contain errors. In addition, entities such as locations and organizations can change over time, which poses a further challenge. Overall, historical texts exhibit several peculiarities that differ greatly from modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for large amounts of labeled data by using unlabeled data to pretrain a language model. We propose hmBERT, a historical multilingual BERT-based language model, and release the model in several versions of different sizes. Furthermore, we evaluate the capability of hmBERT on downstream NER as part of this year's HIPE-2022 shared task and provide detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.
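As a minimal sketch of how a released hmBERT checkpoint could be fine-tuned for historical NER with Hugging Face transformers (this is not the authors' official pipeline; the checkpoint identifier `dbmdz/bert-base-historic-multilingual-cased` and the BIO label set below are assumptions for illustration):

```python
# Sketch: load an assumed hmBERT checkpoint for token classification (NER).
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dbmdz/bert-base-historic-multilingual-cased"  # assumed checkpoint id

# Example BIO label set for persons, locations, and organizations;
# the actual HIPE-2022 label inventory may differ.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tag a historical sentence (possibly containing OCR noise).
inputs = tokenizer("Die Stadt Berlyn ward im Jahre 1237 erwähnt.",
                   return_tensors="pt")
outputs = model(**inputs)
predicted_label_ids = outputs.logits.argmax(dim=-1)
```

In practice, the freshly initialized classification head would first be trained on a labeled NER corpus before the predictions become meaningful; the point here is only that the pretrained historical language model supplies the representations, so far less labeled data is needed.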