While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe two further issues that affect vision transformers' performance, i.e., the enlarging self-attention maps and the amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. CrossFormer incorporating PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code will be available at: https://github.com/cheerss/CrossFormer.
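To make the cross-scale embedding idea concrete, the following is a minimal PyTorch sketch of a CEL-style layer: each output token is formed by sampling patches of several sizes around the same location and concatenating their projections. The specific kernel sizes, stride, and equal channel split are illustrative assumptions, not the exact CrossFormer configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Sketch of a cross-scale embedding layer: one token blends patches of
    multiple scales. Kernel sizes / channel split are assumptions here."""

    def __init__(self, in_chans=3, embed_dim=96, kernel_sizes=(4, 8, 16, 32), stride=4):
        super().__init__()
        # Split the embedding dimension across the kernel sizes (equal split assumed).
        dims = [embed_dim // len(kernel_sizes)] * len(kernel_sizes)
        dims[0] += embed_dim - sum(dims)  # absorb any remainder
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, d, kernel_size=k, stride=stride, padding=(k - stride) // 2)
            for k, d in zip(kernel_sizes, dims)
        )

    def forward(self, x):
        # All convolutions share the same stride, so their outputs align spatially
        # and can be concatenated along the channel dimension.
        return torch.cat([proj(x) for proj in self.projs], dim=1)

tokens = CrossScaleEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 96, 56, 56]): each token mixes 4x4 ... 32x32 patches
```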