To allow image analysis in resource-constrained scenarios without compromising generalizability, we introduce WaveMix -- a novel and flexible neural framework that requires less GPU RAM (memory) and compute (latency) than CNNs and transformers. In addition to using convolutional layers that exploit shift-invariant image statistics, the proposed framework uses multi-level two-dimensional discrete wavelet transform (2D-DWT) modules to exploit scale-invariance and edge sparseness, which gives it the following advantages. Firstly, the fixed weights of the wavelet modules do not add to the parameter count while reorganizing information based on these image priors. Secondly, the wavelet modules scale the spatial extents of feature maps by integral powers of $\frac{1}{2}\times\frac{1}{2}$, which reduces the memory and latency required for forward and backward passes. Finally, a multi-level 2D-DWT expands the receptive field per layer more quickly than pooling (which we do not use) and is a more effective spatial token mixer. WaveMix also generalizes better than other token-mixing models, such as ConvMixer, MLP-Mixer, PoolFormer, random filters, and the Fourier basis, because the wavelet transform is much better suited for image decomposition and spatial token mixing. WaveMix is a flexible model that performs well on multiple image tasks without needing architectural modifications. WaveMix achieves a semantic segmentation mIoU of 83% on the Cityscapes validation set, outperforming transformer- and CNN-based architectures. We also demonstrate the advantages of WaveMix for classification on multiple datasets and show that WaveMix establishes new state-of-the-art results on the Places-365, EMNIST, and iNAT-mini datasets.
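The two key properties claimed above -- that the wavelet module has fixed, non-learned weights (adding no parameters) and that each DWT level scales feature maps by $\frac{1}{2}\times\frac{1}{2}$ -- can be illustrated with a minimal single-level 2D Haar DWT sketch. This is an illustrative stand-in, not the paper's implementation (the actual architecture and wavelet choice may differ):

```python
# Minimal sketch: single-level 2D Haar DWT on one feature-map channel.
# The filters are fixed constants (no learnable parameters), and each of
# the four output subbands is half the input size in both H and W.

def haar_dwt2(img):
    """img: H x W list of lists (H, W even).
    Returns subbands (LL, LH, HL, HH), each H/2 x W/2."""
    def haar_rows(m):
        # 1D Haar along each row: low-pass (pair averages), high-pass (pair differences)
        lo, hi = [], []
        for row in m:
            lo.append([(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)])
            hi.append([(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)])
        return lo, hi

    def transpose(m):
        return [list(c) for c in zip(*m)]

    lo, hi = haar_rows(img)            # filter along width
    ll, lh = haar_rows(transpose(lo))  # then along height
    hl, hh = haar_rows(transpose(hi))
    return transpose(ll), transpose(lh), transpose(hl), transpose(hh)

if __name__ == "__main__":
    x = [[float(i + j) for j in range(8)] for i in range(8)]  # 8 x 8 input
    ll, lh, hl, hh = haar_dwt2(x)
    print(len(ll), len(ll[0]))  # each subband is 4 x 4: spatial extent scaled by 1/2 x 1/2
```

Because the subbands are half-resolution, subsequent layers operate on a quarter of the pixels, which is the source of the memory and latency savings; stacking levels compounds this by further powers of $\frac{1}{2}\times\frac{1}{2}$.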