Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. When dealing with historical manuscripts, the main challenges stem from the state of preservation of the paper support, the variability of the handwriting -- even that of a single author over a wide time span -- and the scarcity of data for ancient, poorly represented languages. To foster research on this topic, we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of ancient Italian manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic split and a date-based split that takes into account the age of the author. The first setting is intended for studying HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which no training data are available. For both configurations, we analyze quantitative and qualitative characteristics, also in comparison with other line-level HTR benchmarks, and report the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at \url{https://aimagelab.ing.unimore.it/go/lam}.