End-to-end scene text spotting has made significant progress due to the intrinsic synergy between text detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons as a prerequisite, which are much more expensive than single-point annotations. For the first time, we demonstrate that scene text spotting models can be trained with extremely low-cost single-point annotations using the proposed framework, termed SPTS v2. SPTS v2 retains the advantages of the auto-regressive Transformer with an Instance Assignment Decoder (IAD), which sequentially predicts the center points of all text instances within a single predicted sequence, while employing a Parallel Recognition Decoder (PRD) to recognize text in parallel. These two decoders share the same parameters and are interactively connected through a simple but effective information-transmission process that passes gradients and information between them. Comprehensive experiments on various existing benchmark datasets demonstrate that SPTS v2 outperforms previous state-of-the-art single-point text spotters with fewer parameters while achieving 14x faster inference. Most importantly, within the scope of SPTS v2, extensive experiments further reveal an important phenomenon: the single point is the optimal setting for scene text spotting compared with non-point, rectangular bounding-box, and polygonal bounding-box annotations. Such an attempt provides a significant opportunity for scene text spotting applications beyond the realms of existing paradigms. Code is available at https://github.com/shannanyinxiang/SPTS.
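The abstract describes a shared-parameter design in which one Transformer decoder is run auto-regressively for instance assignment (IAD) and again, in parallel, for recognition (PRD). The following minimal PyTorch sketch illustrates only that weight-sharing idea; it is not the official implementation, and names such as `SharedDecoder`, the vocabulary split (coordinate bins plus characters), and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """A single Transformer decoder reused by both IAD and PRD (assumed design)."""
    def __init__(self, vocab_size: int, d_model: int = 256,
                 nhead: int = 8, num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory, causal: bool):
        x = self.embed(tokens)
        mask = None
        if causal:  # IAD pass: auto-regressive over the center-point sequence
            L = tokens.size(1)
            mask = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                         device=tokens.device), diagonal=1)
        # PRD pass (causal=False): all positions attend freely, so the
        # transcriptions of all instances can be decoded in parallel.
        return self.head(self.decoder(x, memory, tgt_mask=mask))

# Usage sketch: IAD sequentially emits discretized (x, y) center-point tokens
# for all instances in one sequence; PRD then decodes every instance's text
# with the same weights. The vocabulary split and shapes below are assumptions.
dec = SharedDecoder(vocab_size=1100)             # e.g. 1000 coord bins + characters
memory = torch.randn(1, 196, 256)                # placeholder image features
points = torch.randint(0, 1000, (1, 8))          # 4 center points decoded so far
next_point_logits = dec(points, memory, causal=True)
queries = torch.randint(1000, 1100, (4, 25))     # per-instance recognition queries
char_logits = dec(queries, memory.repeat(4, 1, 1), causal=False)
```

In this sketch the same `decoder` weights serve both roles: a causal mask for the sequential point prediction, no mask for per-instance recognition. Replacing auto-regressive character decoding with a single parallel pass is what would account for the kind of inference speed-up the abstract reports.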