Typical text spotters follow a two-stage strategy: they first detect the precise boundary of a text instance and then perform recognition within the located text region. While this strategy has achieved substantial progress, it has two underlying limitations. 1) Recognition performance depends heavily on the precision of detection, so errors can propagate from detection to recognition. 2) The RoI cropping that bridges detection and recognition introduces background noise and loses information when pooling or interpolating from feature maps. In this work, we propose the single-shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them through shared positive anchor points. Consequently, our method can recognize text instances correctly even when their precise boundaries are challenging to detect. Additionally, our method substantially reduces the annotation cost for text detection. Extensive experiments on regular-shaped and arbitrary-shaped benchmarks demonstrate that SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.
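The idea of running detection and recognition in parallel and bridging them through shared positive anchor points can be illustrated with a toy sketch. This is not the SRSTS architecture: the function names `positive_anchors` and `spot`, the confidence threshold, and the dictionary-based "heads" are all hypothetical stand-ins chosen for illustration. The only point it tries to show is that the recognition output is looked up by anchor rather than cropped from a detected box, so a missing or imprecise box does not block recognition.

```python
from typing import Dict, List, Optional, Tuple

Anchor = Tuple[int, int]            # (row, col) on a coarse feature grid (assumed layout)
Box = Tuple[int, int, int, int]     # (x1, y1, x2, y2), a hypothetical box format


def positive_anchors(conf: List[List[float]], thresh: float = 0.5) -> List[Anchor]:
    """Return grid locations whose confidence exceeds the (assumed) threshold."""
    return [
        (r, c)
        for r, row in enumerate(conf)
        for c, value in enumerate(row)
        if value > thresh
    ]


def spot(
    conf: List[List[float]],
    det_out: Dict[Anchor, Box],
    rec_out: Dict[Anchor, str],
    thresh: float = 0.5,
) -> List[Dict[str, object]]:
    """Pair the two branch outputs through the shared positive anchors.

    det_out and rec_out stand in for two parallel branches. Neither
    branch reads the other's output; they only share the anchor key.
    """
    results = []
    for anchor in positive_anchors(conf, thresh):
        box: Optional[Box] = det_out.get(anchor)   # may be missing or imprecise
        text: Optional[str] = rec_out.get(anchor)  # does not depend on the box
        results.append({"anchor": anchor, "box": box, "text": text})
    return results


if __name__ == "__main__":
    conf_map = [[0.1, 0.9, 0.2], [0.3, 0.1, 0.8]]
    boxes = {(0, 1): (10, 4, 60, 20)}              # detection missed anchor (1, 2)
    texts = {(0, 1): "EXIT", (1, 2): "OPEN"}
    for item in spot(conf_map, boxes, texts):
        print(item)
```

In this toy setting the anchor at `(1, 2)` has no box, yet its text is still returned, which mirrors the claim that recognition can succeed when precise boundaries are hard to detect. A two-stage pipeline would instead need the box first in order to crop features for the recognizer.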