The Transformer has quickly become the dominant architecture for various pattern recognition tasks thanks to its capacity to model long-range dependencies. However, Transformers are data-hungry models that require large datasets for training. In Handwritten Text Recognition (HTR), collecting a massive amount of labeled data is a complicated and expensive task. In this paper, we propose a lite Transformer architecture for full-page, multi-script handwriting recognition. The proposed model offers three advantages. First, to address the common problem of data scarcity, the lite Transformer can be trained on a reasonable amount of data, which is the case for most public HTR datasets, without the need for external data. Second, it learns the reading order at page level thanks to a curriculum learning strategy, allowing it to avoid line segmentation errors, exploit a larger context, and reduce the need for costly segmentation annotations. Third, it can easily be adapted to other scripts through a simple transfer learning process using only page-level labeled images. Extensive experiments on datasets covering different scripts (French, English, Spanish, and Arabic) demonstrate the effectiveness of the proposed model.
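To make the page-level curriculum idea concrete, the following is a minimal, hypothetical sketch (not the authors' code): training targets start with a single text line and are progressively extended to more lines until whole pages are used, so the model gradually learns the reading order. The names `PageSample`, `max_lines_for_epoch`, and the schedule constants are illustrative assumptions only.

```python
# Hypothetical sketch of a page-level curriculum schedule for HTR training.
# Not the paper's implementation; names and constants are assumptions.
from dataclasses import dataclass


@dataclass
class PageSample:
    page_id: str
    lines: list  # line transcriptions in reading order (images omitted here)


def max_lines_for_epoch(epoch: int, start: int = 1, step_every: int = 5,
                        page_max: int = 25) -> int:
    """Curriculum schedule: allow one line at first, then a few more every
    `step_every` epochs, until full pages (`page_max` lines) are reached."""
    return min(page_max, start + epoch // step_every)


def make_training_target(page: PageSample, epoch: int) -> str:
    """Concatenate the first k lines (k given by the schedule) into one
    target string, so the model incrementally learns the page reading order."""
    k = max_lines_for_epoch(epoch)
    return "\n".join(page.lines[:k])


if __name__ == "__main__":
    page = PageSample("demo-page", [f"line {i}" for i in range(1, 26)])
    for epoch in (0, 5, 50, 200):
        target = make_training_target(page, epoch)
        print(f"epoch {epoch}: {len(target.splitlines())} lines in target")
```

Under this kind of schedule, early epochs resemble line-level HTR, while later epochs expose the model to full pages without any line segmentation at inference time.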