Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models stems from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model operating on noisy input. Correspondingly, we propose ABINet, an autonomous, bidirectional and iterative network for scene text recognition. Firstly, the autonomous principle blocks gradient flow between the vision and language models to enforce explicit language modeling. Secondly, we propose a novel bidirectional cloze network (BCN) as the language model, built on bidirectional feature representation. Thirdly, we propose an iterative-correction execution manner for the language model, which effectively alleviates the impact of noisy input. Additionally, based on an ensemble of iterative predictions, we propose a self-training method that learns effectively from unlabeled images. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Moreover, ABINet trained with ensemble self-training shows promising progress toward human-level recognition. Code is available at https://github.com/FangShancheng/ABINet.
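To make the autonomous and iterative principles concrete, the following PyTorch-style sketch shows one way gradient blocking and iterative correction can be wired together. The module names, the fusion interface, and the iteration count are hypothetical placeholders introduced for illustration only; the actual ABINet implementation is in the repository linked above.

```python
import torch
import torch.nn as nn


class ABINetSketch(nn.Module):
    """Minimal sketch (not the official implementation) of the autonomous and
    iterative principles, assuming generic vision, language and fusion modules."""

    def __init__(self, vision_model: nn.Module, language_model: nn.Module,
                 fusion: nn.Module, num_iters: int = 3):
        super().__init__()
        self.vision_model = vision_model      # images -> character probabilities
        self.language_model = language_model  # character probabilities -> refined probabilities
        self.fusion = fusion                  # combines vision and language predictions
        self.num_iters = num_iters            # number of correction iterations (assumed value)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # The vision prediction serves as the initial, possibly noisy, text estimate.
        vision_probs = self.vision_model(images)
        probs = vision_probs
        for _ in range(self.num_iters):
            # Autonomous principle: detach() blocks gradients from flowing out of the
            # language model into the vision model, enforcing explicit language modeling.
            lang_probs = self.language_model(probs.detach())
            # Iterative correction: the fused prediction is fed back to the language
            # model in the next iteration, progressively suppressing noisy input.
            probs = self.fusion(vision_probs, lang_probs)
        return probs
```

Under this reading, each iteration re-runs the language model on a cleaner estimate, while the detached input keeps the vision and language branches autonomously trainable.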