Advances in deep learning have driven rapid progress in scene text detection. Among methods built on convolutional networks, segmentation-based ones have drawn extensive attention because they excel at detecting text instances of arbitrary shapes and extreme aspect ratios. However, these bottom-up methods are limited by the performance of their segmentation models. In this paper, we propose DPTNet (Dual-Path Transformer Network), a simple yet effective architecture that models both global and local information for the scene text detection task. We further propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism, so that the attention path and the convolutional path provide complementary clues to each other. Moreover, we develop a bi-directional interaction module across the two paths that exchanges these clues in the channel and spatial dimensions. We also upgrade the concentration operation by adding an extra multi-head attention layer to it. Our DPTNet achieves state-of-the-art results on the MSRA-TD500 dataset and competitive results on other standard benchmarks in terms of both detection accuracy and speed.
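To make the dual-path idea concrete, the following is a minimal toy sketch in plain Python, not the DPTNet implementation. It assumes a 1D sequence of feature vectors instead of image feature maps, a fixed smoothing kernel for the convolutional path, single-head self-attention with identity projections for the attention path, and sigmoid gates for the interaction. The direction of each gate (channel clues from the convolutional path to the attention path, spatial clues in the opposite direction) and the fusion by plain concatenation are also assumptions; the abstract does not specify them. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_path(tokens, kernel=(0.25, 0.5, 0.25)):
    """Local path: per-channel 1D convolution with zero padding.
    The fixed kernel is an assumption; a real model would learn it."""
    n, dim, half = len(tokens), len(tokens[0]), len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                for c in range(dim):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out

def attention_path(tokens):
    """Global path: single-head self-attention.
    Identity query/key/value projections are an assumption."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(dim)])
    return out

def dual_path_block(tokens):
    """Run both paths in parallel, exchange gates, then concatenate."""
    local = conv_path(tokens)
    glob = attention_path(tokens)
    n, dim = len(tokens), len(tokens[0])
    # Assumed direction: conv -> attention, one gate per channel.
    channel_gate = [sigmoid(sum(row[c] for row in local) / n)
                    for c in range(dim)]
    # Assumed direction: attention -> conv, one gate per position.
    spatial_gate = [sigmoid(sum(row) / dim) for row in glob]
    glob_out = [[g * x for g, x in zip(channel_gate, row)] for row in glob]
    local_out = [[spatial_gate[i] * x for x in row]
                 for i, row in enumerate(local)]
    # Assumed fusion: concatenate the two paths along the channel axis.
    return [l + g for l, g in zip(local_out, glob_out)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = dual_path_block(tokens)
print(len(fused), len(fused[0]))  # prints "3 4"
```

Each output position carries the locally smoothed features followed by the globally mixed ones, so the channel count doubles. The extra multi-head attention layer that the abstract adds to the fusion step is not modeled here.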