End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. How the relationship between the two sub-tasks is handled plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and from low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries encode the requisite text semantics and locations, and can thus be further decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel, solving the sub-tasks of text spotting in a unified framework. We also introduce a text-matching criterion that delivers more accurate supervisory signals and thus enables more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is compatible with line annotations, which cost much less to produce than polygons. The code will be released.
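The idea of ordered points decoded by parallel heads can be illustrated with a minimal NumPy sketch. It is not the DeepSolo implementation: the cubic Bezier center line, the sizes (`num_points`, `dim`, `num_classes`), the head output widths, the random stand-in for decoder output, and the helper names `sample_center_line` and `decode_point_queries` are all assumptions made for illustration.

```python
import numpy as np


def sample_center_line(control_points: np.ndarray, num_points: int) -> np.ndarray:
    """Sample ordered points on a cubic Bezier curve.

    The Bezier form is an assumed center-line parameterization; the abstract
    only states that the character sequence is represented as ordered points.
    """
    t = np.linspace(0.0, 1.0, num_points)[:, None]  # shape (N, 1)
    p0, p1, p2, p3 = control_points                 # each has shape (2,)
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)                          # shape (N, 2)


def decode_point_queries(queries: np.ndarray, heads: dict) -> dict:
    """Apply independent linear heads to the same decoded point queries."""
    return {name: queries @ w + b for name, (w, b) in heads.items()}


rng = np.random.default_rng(0)
num_points, dim, num_classes = 25, 256, 37  # hypothetical sizes

# Ordered reference points for one text instance (normalized coordinates).
control_points = np.array([[0.1, 0.5], [0.3, 0.4], [0.6, 0.6], [0.9, 0.5]])
ref_points = sample_center_line(control_points, num_points)

# Random stand-in for the point queries after a decoder pass.
queries = rng.standard_normal((num_points, dim))


def linear(out_dim: int):
    """Randomly initialized (weight, bias) pair for one illustrative head."""
    return rng.standard_normal((dim, out_dim)) * 0.01, np.zeros(out_dim)


# One simple head per output named in the abstract; widths are assumptions.
heads = {
    "center_line": linear(2),       # (x, y) per point
    "boundary": linear(4),          # top and bottom boundary points
    "script": linear(num_classes),  # character logits per point
    "confidence": linear(1),        # per-point score
}
outputs = decode_point_queries(queries, heads)

for name, value in outputs.items():
    print(name, value.shape)
```

Because every head reads the same `queries` array, detection outputs (center line, boundary) and recognition outputs (script) come from one shared representation, which is the unification the abstract describes.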