Gains in the ability of neural networks to generalize on image analysis tasks have come at the cost of increased numbers of parameters and layers, larger dataset sizes, more training and test computation, and more GPU RAM. We introduce a new architecture -- WaveMix-Lite -- that can generalize on par with contemporary transformers and convolutional neural networks (CNNs) while needing fewer resources. WaveMix-Lite uses a 2D discrete wavelet transform (2D-DWT) to efficiently mix spatial information from pixels. WaveMix-Lite appears to be a versatile and scalable architectural framework that can be used for multiple vision tasks, such as image classification and semantic segmentation, without requiring significant architectural changes, unlike transformers and CNNs. It is able to meet or exceed several accuracy benchmarks while training on a single GPU. For instance, it achieves state-of-the-art accuracy on five EMNIST datasets, outperforms CNNs and transformers on ImageNet-1K (64$\times$64 images), and achieves an mIoU of 75.32% on the Cityscapes validation set, while using less than one-fifth the number of parameters and half the GPU RAM of comparable CNNs or transformers. Our experiments show that while the convolutional elements of neural architectures exploit the shift-invariance property of images, new types of layers (e.g., the wavelet transform) can exploit additional properties of images, such as scale-invariance and the finite spatial extents of objects.
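To make the spatial-mixing idea concrete, the sketch below implements one level of a 2D Haar DWT in NumPy and stacks the resulting sub-bands as channels, in the spirit of the token-mixing step described above. This is an illustrative assumption, not the paper's implementation: the actual WaveMix-Lite block wraps the transform with learned convolutions and an MLP, and the wavelet family, normalization, and band handling here are chosen for simplicity.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar DWT.

    Returns the four half-resolution sub-bands (LL, LH, HL, HH),
    which jointly mix information across 2x2 pixel neighborhoods.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2  # approximation (low-pass in both axes)
    lh = (a + b - c - d) / 2  # horizontal detail
    hl = (a - b + c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

# Illustrative mixing step: decompose an image and stack the bands
# as channels, halving spatial resolution while widening the channel dim.
img = np.random.rand(32, 32)
bands = np.stack(haar_dwt2(img), axis=0)  # shape (4, 16, 16)
```

Because the transform with this normalization is orthonormal, the total energy of the image is preserved across the four bands, and spatial resolution drops by 2x per level, which is one reason a wavelet-based mixer can be cheaper in RAM than full self-attention.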