This paper examines the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on metadata-enhanced input data, we systematically evaluate these models on a set of tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that exposing a language model to relevant metadata has a beneficial impact and may even produce more robust and fairer models.
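To make the metadata-insertion idea concrete, here is a minimal sketch of one plausible strategy, assuming a HuggingFace-style MLM fine-tuning pipeline. The field names (year, politics, place) and the [SEP]-separated prefix format are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch (not the authors' code): prepending temporal, political and
# geographical metadata to newspaper text before masked-language-model
# fine-tuning of DistilBERT. Field names and prefix format are assumptions.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def encode_with_metadata(example):
    # Prepend metadata as plain tokens in front of the article text,
    # e.g. "1855 [SEP] liberal [SEP] london [SEP] <article text>".
    prefix = (f"{example['year']} [SEP] {example['politics']} [SEP] "
              f"{example['place']} [SEP] ")
    return tokenizer(prefix + example["text"],
                     truncation=True, max_length=512)

# Standard MLM collation masks 15% of tokens at random, so the metadata
# tokens themselves can be masked and predicted -- the mechanism behind
# time masking, extended here to other metadata fields.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
```

Under this setup, the metadata mask-filling evaluation mentioned in the abstract would amount to masking a prefix token (e.g. the year) and scoring the model's predictions for it.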