Models equipped with multi-head self-attention (MHSA) have achieved notable performance in computer vision. However, their computational complexity is quadratic in the number of pixels in the input feature maps, resulting in slow processing, especially for high-resolution images. A new type of token mixer has been proposed as an alternative to MHSA to circumvent this problem: an FFT-based token mixer that, like MHSA, operates globally but with lower computational complexity. Despite its attractive properties, however, the FFT-based token mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token mixer called the dynamic filter, along with DFFormer and CDFFormer, image recognition models that use dynamic filters, to close the gaps above. CDFFormer achieves a Top-1 accuracy of 85.0%, close to that of hybrid architectures combining convolution and MHSA. Further wide-ranging experiments and analysis, including object detection and semantic segmentation, demonstrate that these models are competitive with state-of-the-art architectures; their throughput and memory efficiency on high-resolution image recognition are not much different from ConvFormer's and far superior to CAFormer's. Our results indicate that the dynamic filter is a token-mixer option that should be seriously considered. The code is available at https://github.com/okojoalg/dfformer
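The FFT-based token mixing referred to above replaces the O(N^2) pairwise interactions of self-attention with element-wise filtering in the frequency domain, at O(N log N) cost in the number of tokens N. The following is a minimal NumPy sketch of that idea in the style of GFNet-like global filtering; it is an illustration of the general mechanism, not the paper's exact dynamic-filter implementation, and the function and variable names are our own.

```python
import numpy as np

def fft_token_mixer(x, complex_filter):
    """Globally mix spatial tokens by filtering in the frequency domain.

    x:              feature map of shape (H, W, C)
    complex_filter: complex weights of shape (H, W//2 + 1, C), matching
                    the half-spectrum produced by rfft2 on real input
    """
    freq = np.fft.rfft2(x, axes=(0, 1))          # spatial -> frequency domain
    freq = freq * complex_filter                 # element-wise global filtering
    return np.fft.irfft2(freq, s=x.shape[:2], axes=(0, 1))  # back to spatial

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 64))

# An all-ones filter leaves the spectrum unchanged, so the FFT round-trip
# reconstructs the input exactly (up to floating-point error).
identity = np.ones((14, 14 // 2 + 1, 64), dtype=np.complex128)
y = fft_token_mixer(x, identity)
print(np.allclose(x, y))  # True
```

In an actual model the filter weights would be learned (and, for the dynamic-filter variant, generated per input rather than static), but the sketch shows why the cost scales as O(N log N): the only token-to-token interaction happens inside the FFT itself.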