We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ~5.1 billion tokens. The language model architectures include static models (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances on different time slices for BERT. Our models have already been used in various downstream tasks, where they consistently improved performance. In this paper, we describe how the models were created and outline their reuse potential.
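To illustrate the reuse potential, the following is a minimal sketch of how models of these four types are typically loaded in Python with gensim, Hugging Face transformers, and flair. All file paths and model names below are hypothetical placeholders, not the actual published artefacts.

```python
# Hedged sketch: paths/model names are placeholders, not the released models.
from gensim.models import Word2Vec, FastText        # static embeddings
from transformers import AutoTokenizer, AutoModel   # BERT-style contextualized model
from flair.embeddings import FlairEmbeddings        # Flair character-level LM

# Static models: query nearest neighbours in the historical vector space.
w2v = Word2Vec.load("path/to/word2vec_1760-1900.model")      # hypothetical path
ft = FastText.load("path/to/fasttext_pre1850.model")         # hypothetical path
print(w2v.wv.most_similar("machine", topn=5))

# Contextualized BERT instance (e.g., one of the time-slice models).
tokenizer = AutoTokenizer.from_pretrained("path/to/bert_1760-1900")  # hypothetical
bert = AutoModel.from_pretrained("path/to/bert_1760-1900")           # hypothetical

# Flair embeddings trained on the same historical corpus.
flair_lm = FlairEmbeddings("path/to/flair_1760-1900.pt")             # hypothetical
```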