Recently, many works have attempted to introduce transformers into computer vision tasks, with promising results. Unlike classic convolutional networks, which extract features within a local receptive field, transformers can adaptively aggregate similar features from a global view using the self-attention mechanism. For object detection, the Feature Pyramid Network (FPN) introduces feature interaction across layers and has proven extremely important. However, its interaction is still performed in a local manner, which leaves considerable room for improvement. Since the transformer was originally designed for NLP tasks, directly shifting its input from text to images incurs unaffordable computation and memory overhead. In this paper, we employ a linearized attention function to overcome the above problems and build a novel architecture, named Content-Augmented Feature Pyramid Network (CA-FPN), which introduces a global content extraction module and deeply integrates it with FPN through lightweight linear transformers. Moreover, lightweight transformers further ease the application of the multi-head attention mechanism. Most importantly, our CA-FPN can be readily plugged into existing FPN-based models. Extensive experiments on the challenging COCO object detection dataset demonstrate that our CA-FPN significantly outperforms competitive baselines without bells and whistles. Code will be made publicly available.
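The abstract does not spell out which linearized attention function is used, but the motivation, replacing the quadratic cost of softmax attention over all H x W image tokens with a cost linear in the number of tokens, can be illustrated with a common linearization based on the feature map phi(x) = elu(x) + 1 (as in Katharopoulos et al.). The sketch below is an assumption for illustration only, not the paper's actual module; the function name `linear_attention` and the tensor layout are hypothetical.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linearized attention: O(n) in token count n instead of O(n^2).

    q, k, v: (batch, heads, n, dim) tensors, where n = H * W for a
    flattened feature map. Instead of materializing the (n x n)
    attention matrix softmax(QK^T), we apply a positive feature map
    phi and use associativity: phi(Q) (phi(K)^T V), so the (dim x dim)
    key-value summary is computed first.
    """
    q = F.elu(q) + 1  # phi(Q): entries strictly positive
    k = F.elu(k) + 1  # phi(K)
    # (dim x dim) key-value summary, cost O(n * dim^2)
    kv = torch.einsum('bhnd,bhne->bhde', k, v)
    # Per-query normalizer, replacing the softmax denominator
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
```

Because the (n x n) attention map is never formed, memory and compute grow linearly with the number of pyramid-level tokens rather than quadratically, which is what makes a transformer-style global interaction affordable across FPN levels.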