The prevalent perspectives on scene text recognition are sequence-to-sequence (seq2seq) and segmentation. In this paper, we propose a new perspective on scene text recognition, in which we model scene text recognition as an image classification problem. Based on the image classification perspective, a scene text recognition model is proposed, named CSTR. The CSTR model consists of a series of convolutional layers and a global average pooling layer at the end, followed by independent multi-class classification heads, each of which predicts the corresponding character of the word sequence in the input image. The CSTR model is easy to train using parallel cross-entropy losses. CSTR is as simple as image classification models like ResNet \cite{he2016deep}, which makes it easy to implement, and the fully convolutional neural network architecture makes it efficient to train and deploy. We demonstrate the effectiveness of the classification perspective on scene text recognition with thorough experiments. Furthermore, CSTR achieves nearly state-of-the-art performance on six public benchmarks covering both regular and irregular text. The code will be available at https://github.com/Media-Smart/vedastr.
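The prediction stage described above (global average pooling followed by independent per-position classification heads) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature-map shape, the maximum word length (25), and the charset size (37) are assumed values chosen for the example.

```python
import numpy as np

def cstr_head(feature_map, weights, biases):
    """Sketch of CSTR's classification stage.

    feature_map: (C, H, W) array from the convolutional backbone.
    weights/biases: one (num_classes, C) matrix and (num_classes,) vector
    per character position, i.e. one independent multi-class head each.
    """
    # Global average pooling collapses the spatial dimensions to a (C,) vector.
    pooled = feature_map.mean(axis=(1, 2))
    # Each head independently classifies one character position in parallel.
    logits = [w @ pooled + b for w, b in zip(weights, biases)]
    # Greedy decoding: take the argmax class per position.
    return [int(np.argmax(l)) for l in logits]

# Illustrative setup: 64 channels, 8x32 feature map, 25 positions, 37 classes
# (26 letters + 10 digits + a blank/end token -- an assumed charset).
rng = np.random.default_rng(0)
fmap = rng.normal(size=(64, 8, 32))
ws = [rng.normal(size=(37, 64)) for _ in range(25)]
bs = [np.zeros(37) for _ in range(25)]
preds = cstr_head(fmap, ws, bs)
```

Because every head sees the same pooled feature vector and is trained with its own cross-entropy loss, all positions are predicted in a single parallel forward pass, with no recurrent decoding step.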