Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding-box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that scene text spotting models can be trained with an extremely low-cost annotation: a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, achieving state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that points can be annotated much more easily, or even generated automatically, than bounding boxes, which require precise positions. We believe that such a pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than previously possible. The code will be publicly available.
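To make the sequence formulation concrete, the sketch below illustrates one plausible way to serialize single-point annotations and their transcriptions into a flat sequence of discrete tokens for an auto-regressive Transformer. This is not the authors' released code: the bin count, character vocabulary, token layout, and the `quantize`/`serialize` helpers are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of serializing
# (point, transcription) pairs into a discrete token sequence.

NUM_BINS = 1000                      # assumed number of coordinate quantization bins
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_OFFSET = NUM_BINS               # character tokens follow coordinate tokens
EOS = CHAR_OFFSET + len(CHARSET)     # hypothetical end-of-sequence token

def quantize(v: float, size: int) -> int:
    """Map a pixel coordinate to a discrete bin in [0, NUM_BINS)."""
    return min(int(v / size * NUM_BINS), NUM_BINS - 1)

def serialize(instances, img_w: int, img_h: int) -> list[int]:
    """Flatten (x, y, text) instances into one target token sequence."""
    seq = []
    for x, y, text in instances:
        # Each instance contributes its quantized point coordinates...
        seq += [quantize(x, img_w), quantize(y, img_h)]
        # ...followed by the tokens of its transcription.
        seq += [CHAR_OFFSET + CHARSET.index(c) for c in text.lower() if c in CHARSET]
    seq.append(EOS)
    return seq

# Example: the word "STOP" annotated with a single point (320, 240) in a 640x480 image.
print(serialize([(320, 240, "STOP")], 640, 480))
# -> [500, 500, 1018, 1019, 1014, 1015, 1036]
```

Under this kind of construction, a Transformer decoder can predict detection (the point tokens) and recognition (the character tokens) in one pass, which is what allows the imprecise single-point annotation to replace bounding boxes.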