Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated to different feature map channels, which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., rectangular bounding boxes) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
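The grouping step described in the abstract, where attention weights place each text instance on its own channel and the results are collected as a batch of instance features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `group_instance_features`, the tensor shapes, and the choice of a sigmoid for instance attention and a spatial softmax for character attention are all assumptions made for illustration only.

```python
import numpy as np


def group_instance_features(features, instance_logits, char_logits):
    """Pool a shared feature map into per-instance character features.

    Hypothetical shapes (assumptions, not taken from the paper):
      features:        (C, H, W)    shared backbone feature map
      instance_logits: (K, H, W)    one channel per candidate text instance
      char_logits:     (K, L, H, W) one channel per character slot of each instance
    Returns an array of shape (K, L, C): a batch of K instances, each with
    L character-level feature vectors ready for a sequence decoder.
    """
    K, L, H, W = char_logits.shape

    # Instance attention: sigmoid, so each pixel is weighted independently.
    inst_att = 1.0 / (1.0 + np.exp(-instance_logits))  # (K, H, W)

    # Character attention: softmax over all spatial positions, so the
    # weights of each character slot sum to 1 over the image.
    flat = char_logits.reshape(K, L, H * W)
    flat = flat - flat.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(flat)
    char_att = (exp / exp.sum(axis=-1, keepdims=True)).reshape(K, L, H, W)

    # Combine both attentions and take the attention-weighted sum of the
    # shared features; no RoI cropping or resampling is involved.
    att = inst_att[:, None, :, :] * char_att  # (K, L, H, W)
    return np.einsum("klhw,chw->klc", att, features)
```

In this sketch each of the K channels plays the role of one text instance, so the output can be passed to a decoder as an ordinary batch. A full model would also have to learn the attention maps and decide which channels contain text, which the sketch does not attempt.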