Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking. This paper presents a framework for a Traditional Chinese synthetic data engine that aims to improve text recognition model performance. We generated over 20 million synthetic samples and collected over 7,000 manually labeled samples, TC-STR 7k-word, as a benchmark. Experimental results show that a text recognition model achieves much better accuracy either when trained from scratch on our synthetic data or when further fine-tuned on TC-STR 7k-word.