Deep learning has flourished in recent years and, with it, text recognition has developed rapidly. However, existing text recognition methods are mainly designed for English text. Chinese is another widely spoken language, so Chinese text recognition (CTR) has extensive application markets. Based on our observations, we attribute the scarce attention paid to CTR to the lack of reasonable dataset construction standards, unified evaluation protocols, and results for existing baselines. To fill this gap, we manually collect CTR datasets from publicly available competitions, projects, and papers. According to the application scenario, we divide the collected datasets into four categories: scene, web, document, and handwriting. In addition, we standardize the evaluation protocols for CTR. Under these unified evaluation protocols, we evaluate a series of representative text recognition methods on the collected datasets to provide baselines. The experimental results show that the baselines perform worse on the CTR datasets than on English datasets, because the characteristics of Chinese text differ markedly from those of Latin script. Moreover, we observe that introducing radical-level supervision as an auxiliary task further boosts the performance of the baselines. The code and datasets are publicly available at https://github.com/FudanVI/benchmarking-chinese-text-recognition.
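To make the unified evaluation protocols concrete, below is a minimal Python sketch of two metrics commonly reported for text recognition, sequence accuracy (ACC) and a normalized-edit-distance score. The exact normalization rules used by the benchmark are defined in the repository above, so the `edit_distance` and `evaluate` helpers here are illustrative assumptions rather than the official implementation.

```python
# Minimal sketch of two common text recognition metrics: sequence accuracy
# and a normalized-edit-distance score. The benchmark's exact normalization
# rules live in its repository; the details below are assumptions.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a single dynamic-programming row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # delete ca
                                     dp[j - 1] + 1,        # insert cb
                                     prev + (ca != cb))    # substitute
    return dp[len(b)]

def evaluate(predictions: list[str], labels: list[str]) -> dict:
    """Return exact-match accuracy and a normalized-edit-distance score."""
    correct, ned_sum = 0, 0.0
    for pred, gt in zip(predictions, labels):
        correct += int(pred == gt)
        ned_sum += edit_distance(pred, gt) / max(len(pred), len(gt), 1)
    n = len(labels)
    return {
        "ACC": correct / n,        # exact-match sequence accuracy
        "NED": 1.0 - ned_sum / n,  # 1 minus mean normalized edit distance (higher is better)
    }

print(evaluate(["深度学习", "文本识别"], ["深度学习", "文本识列"]))
```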
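Likewise, the following is a hypothetical PyTorch sketch of how radical-level supervision could be attached as an auxiliary objective next to the main character-level loss. The shared encoder, the two CTC heads, and the weight `aux_weight` are assumptions made for illustration, not the exact architecture used in the paper.

```python
# Hypothetical sketch: radical-level supervision as an auxiliary task.
# The shared encoder, the two CTC heads, and `aux_weight` are illustrative
# assumptions, not the benchmark's exact design.
import torch
import torch.nn as nn

class CTRWithRadicalAux(nn.Module):
    def __init__(self, num_chars: int, num_radicals: int, feat_dim: int = 256):
        super().__init__()
        # Shared visual encoder (placeholder CNN over grayscale text crops).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width as time axis
        )
        self.char_head = nn.Linear(feat_dim, num_chars + 1)        # +1 for CTC blank
        self.radical_head = nn.Linear(feat_dim, num_radicals + 1)  # auxiliary branch

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images).squeeze(2).permute(2, 0, 1)  # (T, B, C)
        return self.char_head(feats), self.radical_head(feats)

def training_loss(model, images, char_targets, radical_targets,
                  char_lens, radical_lens, input_lens, aux_weight: float = 0.3):
    """Character-level CTC loss plus a weighted radical-level CTC loss."""
    char_logits, radical_logits = model(images)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    char_loss = ctc(char_logits.log_softmax(-1), char_targets, input_lens, char_lens)
    radical_loss = ctc(radical_logits.log_softmax(-1), radical_targets, input_lens, radical_lens)
    return char_loss + aux_weight * radical_loss
```

Both heads read the same encoder features, so the radical branch only adds a small classification layer at training time and can be dropped at inference; `aux_weight` balances the character-level objective against the radical-level one.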