Recent studies show that Transformers have a strong capability for building long-range dependencies, yet are incompetent at capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism that adopts a parallel convolution/max-pooling path and a self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play a greater role in capturing high-frequency details while top layers play a greater role in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation. For example, our iFormer-S reaches a top-1 accuracy of 83.4% on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly better than the much bigger Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
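The channel splitting and frequency ramp ideas can be illustrated with a small, dependency-free sketch. This is not the released iFormer code: the function names, the linear ramp, the 0.75 and 0.25 ratios, and the 1D toy signals are all assumptions made for illustration, and the global mean below is only a crude stand-in for the self-attention path (the convolution branch of the high-frequency path is omitted for brevity).

```python
def frequency_ramp(total_channels, num_layers,
                   start_high_ratio=0.75, end_high_ratio=0.25):
    """Return one (high, low) channel split per layer.

    Assumption: the share of channels given to the high-frequency mixer
    falls linearly from start_high_ratio to end_high_ratio with depth.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    splits = []
    for layer in range(num_layers):
        # t runs from 0.0 (bottom layer) to 1.0 (top layer).
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        ratio = start_high_ratio + t * (end_high_ratio - start_high_ratio)
        high = round(total_channels * ratio)
        splits.append((high, total_channels - high))
    return splits


def max_pool_1d(signal, window=3):
    """Local max over a sliding window (stride 1, window clipped at edges)."""
    half = window // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def global_mean(signal):
    """Stand-in for the low-frequency path: every position sees all others."""
    mean = sum(signal) / len(signal)
    return [mean] * len(signal)


def toy_inception_mixer(channels, high):
    """Split channels, mix each group with its own operator, concatenate."""
    high_part, low_part = channels[:high], channels[high:]
    mixed_high = [max_pool_1d(c) for c in high_part]   # local detail
    mixed_low = [global_mean(c) for c in low_part]     # global context
    return mixed_high + mixed_low


if __name__ == "__main__":
    for depth, (high, low) in enumerate(frequency_ramp(96, 4)):
        print(f"layer {depth}: high={high} low={low}")
    print(toy_inception_mixer([[1, 5, 2, 0], [4, 0, 0, 0]], high=1))
```

Because each operator only sees its own slice of the channels, the cost of the global path scales with the number of channels assigned to it, which is one plausible reading of why the split is described as more efficient than running both operators on all channels.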