Although text recognition has evolved significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios due to complex backgrounds, varying fonts, uncontrolled illumination, distortions, and other artefacts. This is because such models rely solely on visual information for text recognition and thus lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a complementary role to purely visual cues. More specifically, we exploit semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, predictions should be refined in a stage-wise manner. Accordingly, our key contribution is the design of a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels must be bypassed for end-to-end training. While the first stage predicts using visual features alone, subsequent stages refine these predictions using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to handle varying character sizes, yielding better performance and faster convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
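The stage-wise refinement idea can be illustrated with a deliberately minimal numpy sketch. This is not the paper's architecture: all shapes, weight matrices, and names (`W_vis`, `W_sem`, `E`) are hypothetical, and the attention machinery is omitted. It only shows the core trick the abstract alludes to: instead of feeding the non-differentiable argmax character labels back into later stages, each refinement stage consumes the *expected* character embedding under the previous stage's softmax distribution, which keeps the whole unrolled decoder differentiable end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy stage-wise decoder (hypothetical shapes/names, not the paper's model).
rng = np.random.default_rng(0)
T, D, V = 5, 8, 10                     # sequence length, feature dim, vocab size
visual = rng.normal(size=(T, D))       # per-position visual features
W_vis = rng.normal(size=(D, V)) * 0.1  # visual-only classifier
E = rng.normal(size=(V, D)) * 0.1      # character embedding table
W_sem = rng.normal(size=(D, V)) * 0.1  # joint visual-semantic classifier

logits = visual @ W_vis                # stage 1: prediction from visual features alone
for stage in range(2):                 # later stages: joint visual-semantic refinement
    p = softmax(logits)                # (T, V) soft character distribution
    soft_emb = p @ E                   # expected embedding: differentiable, no argmax
    # residual connection: refine the previous stage's logits rather than replace them
    logits = logits + (visual + soft_emb) @ W_sem

pred = logits.argmax(-1)               # discrete readout, used only at inference time
```

In a trained model the soft embedding lets gradients from later stages flow back through earlier predictions; a hard argmax at each stage would block them.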