Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.
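The Multi-Granularity Prediction idea above can be sketched as follows: one ground-truth word yields label sequences at three granularities (character, BPE-style subword, WordPiece-style subword), so the recognizer learns subword structure without a separate language model. This is an illustrative sketch, not the authors' code; the tiny vocabularies and the greedy longest-match segmenter below are stand-in assumptions for real BPE/WordPiece merge tables.

```python
def char_labels(word):
    """Character-level targets: one class per character (the conventional STR output space)."""
    return list(word)

def greedy_subword_labels(word, vocab):
    """Greedy longest-match segmentation, a simplified stand-in for
    BPE/WordPiece tokenization: take the longest vocabulary piece at
    each position, falling back to a single character."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single char is the fallback
                pieces.append(piece)
                i = j
                break
    return pieces

# Toy subword vocabularies (hypothetical, for illustration only).
BPE_VOCAB = {"cof", "fee", "ing", "tion"}
WORDPIECE_VOCAB = {"coff", "ee", "ing"}

word = "coffee"
print(char_labels(word))                             # ['c', 'o', 'f', 'f', 'e', 'e']
print(greedy_subword_labels(word, BPE_VOCAB))        # ['cof', 'fee']
print(greedy_subword_labels(word, WORDPIECE_VOCAB))  # ['coff', 'ee']
```

At inference, MGP-STR fuses the predictions from the three granularities; the coarser subword targets implicitly encode linguistic regularities that pure character-level supervision lacks.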