Automated recognition of text in scenes has been a research challenge for years, largely because text appearance varies arbitrarily in perspective distortion, text line curvature, text style, and imaging artifacts. Recent deep networks can learn representations that are robust to imaging artifacts and text style changes, but they still struggle with scene text that has perspective and curvature distortions. This paper presents an end-to-end trainable scene text recognition system (ESIR) that iteratively removes perspective distortion and text line curvature, driven by better scene text recognition performance. An innovative rectification network is developed that employs a novel line-fitting transformation to estimate the pose of text lines in scenes. In addition, an iterative rectification pipeline is developed in which scene text distortions are corrected step by step towards a fronto-parallel view. ESIR is also robust to parameter initialization, and its training needs only scene text images and word-level annotations, as required by most scene text recognition systems. Extensive experiments over a number of public datasets show that the proposed ESIR rectifies scene text distortions accurately and achieves superior recognition performance both on normal scene text images and on those suffering from perspective and curvature distortions.
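The idea of iterative rectification can be illustrated with a small geometric toy. The sketch below is not the ESIR network: it assumes the middle line of a text line is already available as sample points, fits a polynomial to it with NumPy, and removes only a fraction of the fitted curvature at each pass. The function names, the polynomial order, and the `strength` factor are all hypothetical choices made for illustration; in ESIR the pose estimate is learned and driven by recognition performance rather than computed in closed form.

```python
import numpy as np

def fit_midline(xs, ys, order=2):
    """Least-squares polynomial fit of a text line's middle line (toy stand-in)."""
    return np.polyfit(xs, ys, order)

def rectify_step(xs, ys, order=2, strength=0.5):
    """Remove a fraction `strength` of the fitted curvature from the points."""
    coeffs = fit_midline(xs, ys, order)
    fitted = np.polyval(coeffs, xs)
    # Subtract the fitted curve relative to its mean, so the line
    # flattens around its own average height instead of drifting.
    return ys - strength * (fitted - fitted.mean())

def iterative_rectify(xs, ys, iterations=5, order=2, strength=0.5):
    """Apply the partial correction repeatedly and record the residual bend."""
    history = [float(np.ptp(ys))]  # peak-to-peak height of the middle line
    for _ in range(iterations):
        ys = rectify_step(xs, ys, order, strength)
        history.append(float(np.ptp(ys)))
    return ys, history

# A curved and tilted "text line" sampled at 21 points (assumed input).
xs = np.linspace(-1.0, 1.0, 21)
ys = 0.4 * xs**2 + 0.2 * xs + 1.0

flat, history = iterative_rectify(xs, ys)
print(history)  # residual bend after each pass
```

With `strength=0.5` each pass leaves half of the remaining bend, so the recorded residual shrinks steadily towards a flat line. This mirrors, in a very reduced form, the motivation for correcting distortion over several small steps instead of one large one.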