Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

Deep learning has driven rapid progress in text recognition in recent years. However, existing text recognition methods mainly target English texts and overlook the pivotal role of Chinese texts. Chinese is another widely spoken language, and Chinese text recognition has an extensive application market. Based on our observations, we attribute the scarce attention paid to Chinese text recognition to the lack of reasonable dataset construction standards, unified evaluation methods, and results for existing baselines. To fill this gap, we manually collect Chinese text datasets from publicly available competitions, projects, and papers, and divide them into four categories: scene, web, document, and handwriting datasets. Furthermore, we evaluate a series of representative text recognition methods on these datasets with unified evaluation methods to provide experimental results. Analyzing these results, we find, surprisingly, that state-of-the-art baselines for recognizing English texts do not perform well in Chinese scenarios. We believe numerous challenges remain to be explored, owing to the characteristics of Chinese texts, which differ considerably from those of English texts. The code and datasets are publicly available at https://github.com/FudanVI/benchmarking-chinese-text-recognition.

