Scene text recognition (STR) is a challenging task in computer vision due to the large number of possible text appearances in natural scenes. Most STR models rely on synthetic datasets for training since there are no sufficiently large, publicly available labelled real datasets. Because STR models are evaluated on real data, the mismatch between training and testing data distributions results in poor model performance, especially on challenging text affected by noise, artifacts, geometry, structure, etc. In this paper, we introduce STRAug, a set of 36 image augmentation functions designed for STR. Each function mimics certain text image properties that can be found in natural scenes, caused by camera sensors, or induced by signal processing operations, but that are poorly represented in the training dataset. When applied to strong baseline models using RandAugment, STRAug significantly increases the overall absolute accuracy of STR models across regular and irregular test datasets, by as much as 2.10% on Rosetta, 1.48% on R2AM, 1.30% on CRNN, 1.35% on RARE, 1.06% on TRBA and 0.89% on GCRNN. The diversity and simplicity of the API provided by STRAug functions enable easy replication and validation of existing data augmentation methods for STR. STRAug is available at https://github.com/roatienza/straug.
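To illustrate what applying augmentations "using RandAugment" means, the following is a minimal sketch of a RandAugment-style policy: a fixed number of operations is sampled uniformly from a pool, and each is applied at a shared magnitude. It does not use the STRAug API. The three operations (`add_noise`, `shift_right`, `darken`), the toy list-of-lists image format, and the parameter names `num_ops` and `mag` are assumptions made purely for illustration.

```python
import random

# Toy grayscale "image": a list of rows, each a list of ints in [0, 255].
# The three operations below are hypothetical stand-ins, not STRAug functions.

def add_noise(img, mag, rng):
    """Perturb each pixel by a random offset whose range grows with mag."""
    spread = 10 * mag
    return [[min(255, max(0, p + rng.randint(-spread, spread))) for p in row]
            for row in img]

def shift_right(img, mag, rng):
    """Shift every row right by mag pixels, padding the left with zeros."""
    k = min(mag, len(img[0]))
    return [[0] * k + row[:len(row) - k] for row in img]

def darken(img, mag, rng):
    """Scale brightness down; a larger mag gives a darker image."""
    factor = max(0.0, 1.0 - 0.1 * mag)
    return [[int(p * factor) for p in row] for row in img]

OPS = [add_noise, shift_right, darken]

def rand_augment(img, num_ops=2, mag=3, rng=None):
    """Apply num_ops operations sampled uniformly (with replacement) from OPS."""
    if rng is None:
        rng = random.Random()
    for op in rng.choices(OPS, k=num_ops):
        img = op(img, mag, rng)
    return img
```

In this sketch, calling `rand_augment(img, num_ops=2, mag=3)` on each training sample yields a differently distorted copy every time, and the two scalars `num_ops` and `mag` are the only policy settings to tune. A real pipeline would replace the toy pool with actual image operations.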