Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research on computational approaches to their expansion is scarce. Yet abbreviations pose particular challenges for computational tasks such as handwritten text recognition (HTR) and natural language processing. The ultimate aim of pre-processing is often to get from a digitised image of the source to a normalised text, in which the abbreviations are expanded. We explore different setups to obtain such a normalised text: either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each relying on a specialist model for recognition, word segmentation or normalisation. The case studies considered here are drawn from the medieval Latin tradition.
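As a purely illustrative sketch of the decomposed setup, the snippet below chains a word segmentation step and a normalisation step over text that a recognition model is assumed to have already produced. Everything in it is an assumption made for illustration: the function names, the whitespace-based segmentation and the small lookup table `EXPANSIONS` are hypothetical stand-ins for the specialist models, not the method studied here.

```python
import unicodedata

# Hypothetical toy table of Latin abbreviations and their expansions.
# A real normalisation step would be a trained model, not a lookup.
EXPANSIONS = {
    "dñs": "dominus",
    "scs": "sanctus",
    "⁊": "et",
    "ꝑ": "per",
}
# Compare keys in NFC form so composed and decomposed accents match.
EXPANSIONS = {unicodedata.normalize("NFC", k): v for k, v in EXPANSIONS.items()}


def segment(recognised: str) -> list[str]:
    """Stand-in for word segmentation: split on whitespace."""
    return recognised.split()


def normalise(tokens: list[str]) -> list[str]:
    """Stand-in for normalisation: expand tokens found in the table."""
    expanded = []
    for token in tokens:
        key = unicodedata.normalize("NFC", token)
        expanded.append(EXPANSIONS.get(key, token))
    return expanded


def pipeline(recognised: str) -> str:
    """Chain the two steps on the (assumed) output of a recognition model."""
    return " ".join(normalise(segment(recognised)))


if __name__ == "__main__":
    print(pipeline("dñs ⁊ scs"))
```

The direct setup has no counterpart to these intermediate functions: the recognition model itself would be trained to emit the expanded text.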