Almost all scene text spotting (detection and recognition) methods rely on costly box annotation (e.g., text-line boxes, word-level boxes, and character-level boxes). For the first time, we demonstrate that scene text spotting models can be trained with an extremely low-cost annotation: a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task, akin to language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that it is much easier to annotate, or to generate automatically, than a bounding box that requires precise positions. We believe that such a pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than previously possible.
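To make the sequence formulation concrete, the sketch below shows one plausible way to serialize point annotations into a discrete token sequence: each instance becomes its quantized (x, y) point followed by character tokens, with all instances concatenated and closed by an end-of-sequence token. The bin count, vocabulary, and token layout here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the token-sequence construction implied by the abstract.
# Each text instance becomes [x_bin, y_bin, char_1, ..., char_k]; instances
# are concatenated and terminated with a single <eos> token. NUM_BINS, CHARS,
# and the token offsets are assumed values for illustration only.

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
NUM_BINS = 1000                      # coordinate quantization bins (assumed)
CHAR_OFFSET = NUM_BINS               # character tokens follow coordinate tokens
EOS = CHAR_OFFSET + len(CHARS)       # end-of-sequence token id

def quantize(v: float, size: int) -> int:
    """Map a pixel coordinate to a discrete bin in [0, NUM_BINS - 1]."""
    return min(int(v / size * NUM_BINS), NUM_BINS - 1)

def build_target_sequence(instances, img_w: int, img_h: int) -> list:
    """instances: list of ((x, y), transcription) single-point annotations."""
    seq = []
    for (x, y), text in instances:
        seq.append(quantize(x, img_w))   # x token
        seq.append(quantize(y, img_h))   # y token
        seq.extend(CHAR_OFFSET + CHARS.index(c) for c in text.lower())
    seq.append(EOS)
    return seq

# Example: two text instances in a 640x480 image.
tokens = build_target_sequence(
    [((120.0, 56.0), "stop"), ((400.0, 300.0), "exit")], 640, 480
)
print(tokens)
```

At inference, an auto-regressive Transformer would emit such a sequence token by token, so detection (the point tokens) and recognition (the character tokens) fall out of a single decoding pass.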