Information surrounds people in modern life. Text is a highly efficient form of information that people have used to communicate for centuries. However, automated recognition of text in the wild remains a challenging problem. The major limitation for a deep learning (DL) system is the lack of training data. To achieve competitive performance, the training set must contain many samples that replicate real-world cases. While there are many high-quality datasets for English text recognition, there are no available datasets for the Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in the wild. We also publish a synthetic dataset and code to reproduce the generation process.