For deploying scene-text spotting systems on mobile platforms, lightweight models with low computational cost are preferable. In concept, end-to-end (E2E) text spotting is suitable for such purposes because it performs text detection and recognition in a single model. However, current state-of-the-art E2E methods rely on heavy feature extractors, recurrent sequence modeling, and complex shape aligners to pursue accuracy, which keeps their computational cost high. We explore the opposite direction: how far can we go without bells and whistles in E2E text spotting? To this end, we propose a text-spotting method that consists of simple convolutions and a few post-processing steps, named Context-Free TextSpotter. Experiments on standard benchmarks show that Context-Free TextSpotter achieves real-time text spotting on a GPU with only three million parameters, making it the smallest and fastest among existing deep text spotters, with an acceptable degradation in transcription quality compared to heavier models. Further, we demonstrate that our text spotter can run on a smartphone with affordable latency, which is valuable for building stand-alone OCR applications.
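To give a sense of the model scale the abstract claims, below is a minimal sketch, not the authors' actual architecture, of a fully convolutional text spotter in the three-million-parameter regime: a small convolutional backbone produces a per-pixel text score map and per-pixel character-class logits, and detection and recognition are read off these maps by simple post-processing rather than by recurrent decoders or shape aligners. All layer widths, the class count, and the module names here are assumptions for illustration.

```python
# A minimal sketch (NOT the paper's architecture) of a lightweight,
# fully convolutional text spotter. Layer widths and the character set
# size are hypothetical; the point is the parameter budget.
import torch
import torch.nn as nn

NUM_CLASSES = 63  # hypothetical: 62 alphanumeric characters + background


class TinyTextSpotter(nn.Module):
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()

        def block(cin: int, cout: int, stride: int = 1) -> nn.Sequential:
            # Plain conv-BN-ReLU block; no recurrence, no attention.
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.backbone = nn.Sequential(
            block(3, 32, stride=2),
            block(32, 64, stride=2),
            block(64, 128, stride=2),
            block(128, 256),
            block(256, 256),
            block(256, 256),
            block(256, 256),
        )
        # One 1x1 head predicts a text-region score map, another predicts
        # per-pixel character classes; simple post-processing would turn
        # these maps into boxes and transcriptions.
        self.text_head = nn.Conv2d(256, 1, 1)
        self.char_head = nn.Conv2d(256, num_classes, 1)

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)
        return torch.sigmoid(self.text_head(f)), self.char_head(f)


model = TinyTextSpotter()
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f}M")  # ~2.2M at these widths
```

At these widths the sketch lands at roughly 2.2M parameters, the same order as the 3M figure quoted in the abstract, which illustrates how a detector-plus-recognizer built from plain convolutions can stay far below the size of backbones like ResNet-50 (~25M parameters).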