Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, yet the self-attention computation in the Transformer scales quadratically with respect to the number of input patches. Existing solutions therefore commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such an over-aggressive down-sampling design is not invertible and inevitably drops information, especially the high-frequency components of objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (\textbf{Wave-ViT}) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen the self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments on multiple vision tasks (e.g., image recognition, object detection, and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at \url{https://github.com/YehLi/ImageNetModel}.
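To make the idea concrete, the following is a minimal, self-contained PyTorch sketch, not the authors' implementation: the names \texttt{WaveAttention}, \texttt{haar\_dwt} and \texttt{haar\_idwt} are illustrative, and the Haar basis is chosen only for simplicity. It shows self-attention whose keys/values are computed on an invertible $2\times 2$ wavelet down-sampling of the feature map; the paired inverse transform is included to demonstrate that the down-sampling is lossless (the paper additionally uses inverse wavelet transforms to aggregate local context, which this sketch omits).
\begin{verbatim}
import torch
import torch.nn as nn


def haar_dwt(x):
    """2x2 Haar DWT: (B, C, H, W) -> (B, 4C, H/2, W/2).
    Invertible, so no information is dropped (H and W must be even)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency average
    lh = (a + b - c - d) / 2   # high-frequency details
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)


def haar_idwt(y):
    """Exact inverse of haar_dwt: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    ll, lh, hl, hh = torch.chunk(y, 4, dim=1)
    B, C, Hh, Wh = ll.shape
    x = torch.zeros(B, C, 2 * Hh, 2 * Wh, dtype=y.dtype, device=y.device)
    x[:, :, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[:, :, 0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[:, :, 1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[:, :, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


class WaveAttention(nn.Module):
    """Self-attention whose keys/values are computed on a wavelet-
    down-sampled (hence invertible, lossless) copy of the feature map."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Fuse the four wavelet sub-bands back to `dim` channels.
        self.reduce = nn.Conv2d(4 * dim, dim, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                      # N == H * W
        q = (self.q(x)
             .reshape(B, N, self.num_heads, C // self.num_heads)
             .transpose(1, 2))                 # (B, heads, N, C/heads)

        # Invertible down-sampling of the spatial map for keys/values:
        # 4x fewer tokens, with detail re-packed into channels
        # rather than discarded.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        sub = self.reduce(haar_dwt(feat))      # (B, C, H/2, W/2)
        sub = sub.flatten(2).transpose(1, 2)   # (B, N/4, C)
        kv = (self.kv(sub)
              .reshape(B, -1, 2, self.num_heads, C // self.num_heads)
              .permute(2, 0, 3, 1, 4))
        k, v = kv.unbind(0)                    # each (B, heads, N/4, C/heads)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
\end{verbatim}
A quick check that \texttt{haar\_idwt(haar\_dwt(x))} reproduces \texttt{x} confirms the lossless property that motivates replacing average pooling in the key/value path: the $4\times$ reduction in key/value tokens comes from re-packing spatial detail into channels rather than discarding it.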