A novel scene text recognizer based on a Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer from NLP, the proposed method (named Levenshtein OCR, LevOCR for short) explores an alternative way of automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, progressively approximating the ground truth. The refinement process is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change, and good interpretability. Quantitative experiments clearly demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and qualitative analyses verify the effectiveness and advantages of the proposed LevOCR algorithm. Code will be released soon.
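To make the delete-then-insert refinement concrete, the following is a minimal, illustrative sketch of the kind of expert (oracle) policy that imitation learning typically distills in Levenshtein-style models: deletion removes predicted characters that do not lie on the longest common subsequence with the ground truth, and insertion fills in the missing target characters. The function names `oracle_delete` and `oracle_insert` are hypothetical and are not part of the LevOCR codebase; the actual model predicts these edits from fused vision-language features rather than from the ground truth.

```python
def _lcs_keep(pred, target):
    """Indices of pred that lie on a longest common subsequence with target."""
    m, n = len(pred), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == target[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    keep, i, j = set(), m, n
    while i > 0 and j > 0:  # backtrack through the DP table
        if pred[i - 1] == target[j - 1]:
            keep.add(i - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return keep

def oracle_delete(pred, target):
    """Deletion step: drop characters of pred that are not on the LCS."""
    keep = _lcs_keep(pred, target)
    return "".join(c for i, c in enumerate(pred) if i in keep)

def oracle_insert(kept, target):
    """Insertion step: walk the target and fill the gaps left by deletion.

    Because `kept` is a subsequence of `target`, one insertion round
    recovers the ground truth exactly.
    """
    out, k = [], 0
    for ch in target:
        if k < len(kept) and kept[k] == ch:
            k += 1          # character already present after deletion
        out.append(ch)      # otherwise ch is an inserted character
    return "".join(out)
```

For example, refining the misread prediction "hcllo" toward "hello" first deletes the spurious 'c' ("hcllo" → "hllo"), then inserts the missing 'e' ("hllo" → "hello"), mirroring the two character-level operations described above.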