Recently, vision-language joint representation learning has proven highly effective in a variety of scenarios. In this paper, we adapt vision-language joint learning specifically for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities, vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training in order to enhance the performance of scene text detectors. To this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, together with three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model produces more informative representations with richer semantics, which readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm significantly improves the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
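To make the pre-training setup concrete, the following is a minimal sketch, not the authors' released implementation, of how the three pretext losses named in the abstract could be combined on top of an image encoder, a text encoder and a cross-modal encoder. The encoder widths, the vocabulary size, the InfoNCE temperature, and the simple binary word-in-image head are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of ITC + MLM + WIP pre-training (assumed layer sizes/heads).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLPretrainSketch(nn.Module):
    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        # Image encoder: placeholder conv backbone producing one global feature.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Text encoder: token embedding followed by a small Transformer.
        self.embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        # Cross-modal encoder: fuses text tokens with the image feature.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.mlm_head = nn.Linear(dim, vocab)  # masked language modeling
        self.wip_head = nn.Linear(dim, 2)      # word-in-image prediction (binary)

    def forward(self, images, tokens, mlm_labels, wip_labels):
        img = F.normalize(self.image_encoder(images), dim=-1)   # (B, D)
        txt_tokens = self.text_encoder(self.embed(tokens))      # (B, L, D)
        txt = F.normalize(txt_tokens.mean(dim=1), dim=-1)       # (B, D)

        # ITC: symmetric InfoNCE over the in-batch image-text similarity matrix.
        sim = img @ txt.t() / 0.07
        targets = torch.arange(sim.size(0), device=sim.device)
        loss_itc = (F.cross_entropy(sim, targets) +
                    F.cross_entropy(sim.t(), targets)) / 2

        # Cross-modal fusion: image feature prepended to the text token sequence.
        fused = self.cross_encoder(
            torch.cat([img.unsqueeze(1), txt_tokens], dim=1))

        # MLM: recover masked tokens from the fused representation.
        loss_mlm = F.cross_entropy(
            self.mlm_head(fused[:, 1:]).flatten(0, 1),
            mlm_labels.flatten(), ignore_index=-100)

        # WIP: per-word prediction of whether the word appears in the image.
        loss_wip = F.cross_entropy(
            self.wip_head(fused[:, 1:]).flatten(0, 1),
            wip_labels.flatten(), ignore_index=-100)

        return loss_itc + loss_mlm + loss_wip
```

The sketch returns the unweighted sum of the three losses as a single pre-training objective; the image encoder would then be transferred to a downstream detector such as EAST or PSENet.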