Recently, Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, have become popular in scene text detection. However, these methods, built upon the detection transformer framework, may achieve sub-optimal training efficiency and performance due to coarse positional query modeling. In addition, the point label form exploited in previous works implies the human reading order, which, from our observation, impedes detection robustness. To address these challenges, this paper proposes a concise Dynamic Point Text DEtection TRansformer network, termed DPText-DETR. Specifically, DPText-DETR directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, to improve the spatial inductive bias of non-local self-attention in the Transformer, we present an Enhanced Factorized Self-Attention module which provides point queries within each instance with circular shape guidance. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on detection robustness in real-world scenarios, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments demonstrate the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR.