In recent years, scene text recognition has commonly been treated as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and attentional sequence recognition (Attn) are the two prevailing approaches to this problem, but each fails in some scenarios. CTC focuses on each individual character but is weak at modeling semantic dependencies within the text. Attn-based methods model contextual semantics better but tend to overfit when training data are limited. In this paper, we design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the respective weaknesses of CTC and Attn, our method applies both, each with its own modules in one of two supervised branches, so that the branches complement each other. Moreover, spatial and channel attention mechanisms are introduced to suppress background noise and extract valid foreground information. Finally, a simple rectification network is implemented to rectify irregular text. ReADS can be trained end-to-end and requires only word-level annotations. Extensive experiments on various benchmarks verify the effectiveness of ReADS, which achieves state-of-the-art performance.
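As an illustration of the double supervision idea, the following minimal sketch computes a joint objective from a CTC branch and an attention branch for a single word. The abstract does not state how the two losses are combined, so the additive form, the `weight` parameter, the blank index, and all function names here are assumptions of this sketch rather than the formulation used in ReADS. End-of-sequence handling in the attention branch is omitted for brevity.

```python
import math

BLANK = 0  # class index reserved for the CTC blank (assumption of this sketch)


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_loss(log_probs, target):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: one list of per-class log-probabilities per time step.
    target:    list of class indices, without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for c in target:
        ext += [c, BLANK]
    size = len(ext)

    # alpha[s]: log-probability of all partial alignments ending at ext[s]
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # A blank may be skipped only between two different characters
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = log_sum_exp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last character or on the trailing blank
    tail = [alpha[size - 1]]
    if size > 1:
        tail.append(alpha[size - 2])
    return -log_sum_exp(*tail)


def attn_loss(step_log_probs, target):
    """Summed per-step cross-entropy for an attention decoder."""
    return -sum(lp[c] for lp, c in zip(step_log_probs, target))


def double_supervised_loss(ctc_log_probs, attn_log_probs, target, weight=1.0):
    """Hypothetical joint objective: CTC loss plus weighted attention loss."""
    return ctc_loss(ctc_log_probs, target) + weight * attn_loss(attn_log_probs, target)
```

In this sketch the CTC term sums over all frame-level alignments of the word, while the attention term scores one prediction per character; training on their sum is one plausible way for the two branches to supervise a shared feature extractor.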