The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. When dealing with historical manuscripts, the main challenges are the state of preservation of the paper support, the variability of the handwriting (even that of a single author over a wide time span), and the scarcity of data from ancient, poorly represented languages. To foster research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of ancient Italian manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic split and a date-based split that takes into account the age of the author. The first setting is intended for studying HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which no training data are available. For both configurations, we analyze quantitative and qualitative characteristics, also in comparison with other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
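
The idea behind the date-based configuration can be illustrated with a minimal sketch. It is not the authors' actual splitting procedure: the `Line` record, its field names, the `date_based_split` helper, and all years below are hypothetical and chosen only for illustration. The sketch assumes each text line carries the year it was written, and it sends every line dated inside a held-out period to the test set, so that a model is evaluated on periods it never saw during training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    image_path: str  # hypothetical field: path to the line image
    text: str        # hypothetical field: ground-truth transcription
    year: int        # hypothetical field: year the line was written


def date_based_split(lines, held_out_periods):
    """Send lines dated inside any held-out (start, end) period to the test set.

    Both ends of each period are inclusive. All other lines go to training.
    """
    train, test = [], []
    for line in lines:
        in_held_out = any(start <= line.year <= end
                          for start, end in held_out_periods)
        (test if in_held_out else train).append(line)
    return train, test


# Toy records with made-up years; none of this is taken from the dataset.
toy = [
    Line("a.png", "esempio uno", 1700),
    Line("b.png", "esempio due", 1715),
    Line("c.png", "esempio tre", 1730),
    Line("d.png", "esempio quattro", 1745),
]
train, test = date_based_split(
    toy, held_out_periods=[(1710, 1720), (1740, 1750)]
)
print([line.image_path for line in train])
print([line.image_path for line in test])
```

A basic split, by contrast, would assign lines to training and test sets without regard to the year, so both sets would cover the same time periods.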

