We present models that complete missing text in transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE–100 CE). Because the tablets have deteriorated, scholars often rely on contextual cues to fill in the missing parts manually, a subjective and time-consuming process. We observe that this challenge can be formulated as a masked language modelling task, used mostly as a pretraining objective for contextualized language models. We then develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that despite data scarcity (1M tokens), we can achieve state-of-the-art performance on missing-token prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
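To make the masked language modelling formulation and the hit@5 metric concrete, here is a minimal Python sketch using the HuggingFace `transformers` fill-mask pipeline. The model name, the example passage, and the gold token are illustrative placeholders, not the paper's Akkadian-specific models or data.

```python
# Illustrative sketch of masked-token completion and the hit@5 metric.
# Assumes an off-the-shelf multilingual model; the authors' Akkadian
# models and training data are not used here.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A damaged passage: the broken sign is replaced by the model's mask token.
# (English stand-in for an Akkadian transliteration.)
text = "the king of [MASK] built this temple"

# Ask for the five most likely completions; the prediction counts as a
# hit@5 if the gold token appears anywhere in this top-5 list.
candidates = fill_mask(text, top_k=5)
gold = "Assyria"  # hypothetical gold token for this toy example
hit_at_5 = any(c["token_str"].strip() == gold for c in candidates)
print([c["token_str"] for c in candidates], "hit@5:", hit_at_5)
```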