In this paper, we study the problem of text line recognition. Unlike most approaches targeting specific domains such as scene text or handwritten documents, we investigate the general problem of developing a universal architecture that can extract text from any image, regardless of source or input modality. We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs), and conduct extensive experiments to compare their accuracy and performance on widely used public datasets of scene and handwritten text. We find that a combination that has so far received little attention in the literature, namely a Self-Attention encoder coupled with a CTC decoder, combined with an external language model and trained on both public and internal data, outperforms all the others in both accuracy and computational complexity. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length, a requirement for universal line recognition. Using an internal dataset collected from multiple sources, we also expose the limitations of current public datasets in evaluating the accuracy of line recognizers, as their relatively narrow image-width and sequence-length distributions do not allow one to observe the quality degradation of the Transformer approach when it is applied to the transcription of long lines.
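To make the winning combination concrete, the following is a minimal sketch of a Self-Attention encoder feeding a CTC decoder, written in PyTorch. The convolutional backbone, layer sizes, and class names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# a convolutional backbone, a self-attention encoder, and a per-frame
# linear head trained with CTC. Positional encodings are omitted for
# brevity; a real model would add relative or sinusoidal positions.
import torch
import torch.nn as nn

class SelfAttentionCTCRecognizer(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 4):
        super().__init__()
        # Backbone collapses the image height so each remaining
        # horizontal position becomes one sequence element.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # -> (B, d_model, 1, W')
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # CTC head: per-frame class logits, plus one blank symbol.
        self.head = nn.Linear(d_model, num_classes + 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 1, H, W) grayscale line crops of arbitrary width.
        feats = self.backbone(images).squeeze(2).transpose(1, 2)  # (B, W', d_model)
        feats = self.encoder(feats)
        return self.head(feats).log_softmax(-1)  # (B, W', num_classes + 1)

# Usage: train with nn.CTCLoss on the log-probabilities (transposed
# to time-major), then greedy- or beam-decode with a language model.
model = SelfAttentionCTCRecognizer(num_classes=80)
logits = model(torch.randn(2, 1, 32, 400))  # two 32 px tall lines
```

Because the attention and CTC head operate per frame with no fixed-length decoder state, this design accepts lines of arbitrary width, which is the property the abstract contrasts with Transformer decoders.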