Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR because it is strong at capturing long-term dependencies, which appear to be prominent in scene text images. Many researchers have used the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods exploit long-term dependencies only mid-way through the encoding process. Although the vision transformer (ViT) can capture such dependencies at an early stage, it remains largely unexploited in STR. This work proposes a transformer-only model as a simple baseline that outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement are identified. First, the first decoded character has the lowest prediction accuracy. Second, images of different original aspect ratios respond differently to the patch resolution, whereas ViT employs only one fixed patch resolution. To explore these areas, Pure Transformer with Integrated Experts (PTIE) is proposed. PTIE is a transformer model that can process multiple patch resolutions and decode in both the original and the reverse character order. It is evaluated on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and obtains state-of-the-art results on most benchmarks.
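The two ideas behind PTIE, multiple patch resolutions and decoding in both character orders, can be illustrated with a minimal sketch. This is not the authors' implementation: the 32x128 input size, the 8x4 and 4x8 patch shapes, and the helper names `patchify` and `decoding_targets` are assumptions chosen only for illustration, since the abstract does not state them.

```python
from typing import List, Tuple


def patchify(image: List[List[float]], patch_h: int, patch_w: int) -> List[List[float]]:
    """Split a 2-D grayscale image (a list of rows) into flattened,
    non-overlapping patches, scanned row by row as in a ViT-style encoder."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patch = [image[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            patches.append(patch)
    return patches


def decoding_targets(text: str) -> Tuple[str, str]:
    """Return the label in original and in reverse character order,
    i.e. the two target sequences a bidirectional decoder would be trained on."""
    return text, text[::-1]


# Hypothetical 32x128 (height x width) input image; values are placeholders.
image = [[0.0] * 128 for _ in range(32)]

# Two assumed patch resolutions: a tall patch and a wide patch.
tall_tokens = patchify(image, patch_h=8, patch_w=4)   # grid of 4 x 32 patches
wide_tokens = patchify(image, patch_h=4, patch_w=8)   # grid of 8 x 16 patches

forward, backward = decoding_targets("text")
```

With these assumed sizes, both patch shapes yield the same number of tokens and the same number of values per token, so a single transformer could in principle consume either; what differs is how each token covers the image. In the reverse-order target, the last character of the word is decoded first, so the character that is predicted first (and, per the abstract, least accurately) differs between the two decoding directions.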