The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks by leveraging pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, which focuses on Turning the CLIP Model directly for text detection without the pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of the labeled data, we significantly improve the performance of the baseline method by an average of 22% in terms of F-measure on 4 benchmarks. (3) By incorporating the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released.