This paper explores a multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer (ATTR), which is designed to represent texts in scene images with a multi-scale self-attention mechanism. Starting from an image pyramid with multiple resolutions, features are first extracted at different scales with shared weights and then fed into an encoder-decoder Transformer architecture. The multi-scale image representations are robust and contain rich information about text content at various sizes. The text Transformer aggregates these features to learn interactions across different scales and improve the text representation. The proposed method detects scene texts by representing each text instance as an individual binary mask, which is tolerant of curved texts and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.
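To make the described pipeline concrete, the following is a minimal PyTorch sketch of the core idea: an image pyramid processed by one shared-weight backbone, with the resulting multi-scale features flattened into tokens and aggregated by Transformer self-attention. All module and parameter names (`MultiScaleAggregator`, `scales`, `d_model`, the toy convolutional backbone) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of shared-weight multi-scale feature extraction followed by
# cross-scale self-attention. Assumed architecture, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAggregator(nn.Module):
    """Extract features from an image pyramid with one shared backbone,
    then let a Transformer encoder attend within and across scales."""

    def __init__(self, d_model=256, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        # One backbone reused for every pyramid level (shared weights).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, image):                      # image: (B, 3, H, W)
        tokens = []
        for s in self.scales:
            resized = F.interpolate(image, scale_factor=s,
                                    mode="bilinear", align_corners=False)
            feat = self.backbone(resized)          # same weights per scale
            tokens.append(feat.flatten(2).transpose(1, 2))  # (B, HW, C)
        # Concatenating tokens from all pyramid levels lets self-attention
        # model interactions across scales, not just within one resolution.
        return self.encoder(torch.cat(tokens, dim=1))

x = torch.randn(1, 3, 256, 256)
print(MultiScaleAggregator()(x).shape)  # (1, num_tokens, 256)
```

In the full method, the aggregated tokens would feed a Transformer decoder whose queries produce one binary mask per text instance; the sketch stops at the cross-scale encoding stage, which is the part the abstract specifies.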