Arbitrary-shaped scene text detection is a challenging task due to the wide variation of text in font, size, color, and orientation. Most existing regression-based methods regress the masks or contour points of text regions to model text instances. However, regressing complete masks incurs high training complexity, and contour points are insufficient to capture the details of highly curved text. To address these limitations, we propose a novel lightweight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode text masks as compact vectors. Further, considering the imbalanced number of training samples among pyramid layers, we employ only a single-level head for top-down prediction. To model multi-scale text in a single-level head, we introduce a novel positive sampling strategy that treats the shrunk text region as positive samples, and design a feature awareness module (FAM) for spatial and scale awareness that fuses rich contextual information and focuses on more significant features. Moreover, we propose a segmented non-maximum suppression (S-NMS) method that can filter low-quality mask regressions. Extensive experiments on four challenging datasets demonstrate that TextDCT obtains competitive performance in both accuracy and efficiency. Specifically, TextDCT achieves an F-measure of 85.1 at 17.2 frames per second (FPS) on CTW1500 and an F-measure of 84.9 at 15.1 FPS on Total-Text.
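To make the idea of encoding a text mask as a compact DCT vector concrete, the following is a minimal sketch and not the TextDCT implementation. It assumes a square binary mask, an orthonormal 2D DCT-II, and a simplified selection rule that keeps the top-left `keep` x `keep` block of low-frequency coefficients; the actual coefficient ordering, mask resolution, and vector length used by TextDCT are not stated in the abstract. The function names `dct_matrix`, `encode_mask`, and `decode_mask` are hypothetical, and only the standard library is used so the block stays self-contained.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II basis matrix as nested lists."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def encode_mask(mask, keep):
    """Encode a square binary mask as a vector of keep*keep DCT coefficients.

    Assumption: low frequencies are selected as the top-left keep x keep
    block, a simplification chosen for this illustration.
    """
    n = len(mask)
    c = dct_matrix(n)
    coeffs = matmul(matmul(c, mask), transpose(c))  # 2D DCT: C * M * C^T
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


def decode_mask(vector, n, keep, threshold=0.5):
    """Rebuild an n x n binary mask from the truncated coefficient vector."""
    coeffs = [[0.0] * n for _ in range(n)]
    for idx, value in enumerate(vector):
        coeffs[idx // keep][idx % keep] = value
    c = dct_matrix(n)
    recon = matmul(matmul(transpose(c), coeffs), c)  # inverse: C^T * X * C
    return [[1 if x >= threshold else 0 for x in row] for row in recon]


# Toy 16 x 16 mask: a filled disc standing in for a text region.
N = 16
mask = [[1 if (r - 7.5) ** 2 + (c - 7.5) ** 2 <= 36 else 0 for c in range(N)]
        for r in range(N)]
vector = encode_mask(mask, keep=6)        # 36 numbers instead of 256 pixels
approx = decode_mask(vector, N, keep=6)   # lossy reconstruction of the mask
```

With `keep` equal to the mask size, the transform is invertible and the mask is recovered exactly; smaller values of `keep` trade reconstruction detail for a shorter regression target, which is the trade-off the abstract alludes to when it contrasts compact vectors with full-mask regression.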