Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which combines the Gamba Cell, an adaptation of Mamba for 2D spatial structures, with a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while retaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks, and the Gated Fusion Module enables seamless interaction between the two branches. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
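The block structure described above, an SSM-style branch and an MHSA branch combined by a gated fusion, can be illustrated with a minimal sketch. The class and parameter names below (GambaBlock, ssm_branch, gate, d_model, n_heads) are illustrative assumptions rather than the authors' implementation, and the SSM branch is a simple placeholder for the actual Gamba Cell / Mamba scan.

```python
# Hypothetical sketch of a Gamba-style bottleneck block: an SSM-like branch and
# an MHSA branch fused by a learned sigmoid gate. Not the authors' code.
import torch
import torch.nn as nn

class GambaBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Placeholder for the Gamba Cell (Mamba adapted to 2D spatial structure).
        self.ssm_branch = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU())
        # Standard multi-head self-attention branch.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gated fusion: a sigmoid gate weights the contribution of each branch.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model), where tokens is a flattened 2D feature map.
        s = self.ssm_branch(x)
        a, _ = self.attn(x, x, x)
        g = self.gate(torch.cat([s, a], dim=-1))
        return g * s + (1.0 - g) * a  # gated mixture of the two branches

if __name__ == "__main__":
    feats = torch.randn(2, 196, 64)   # e.g. a 14x14 feature map, flattened
    out = GambaBlock()(feats)
    print(out.shape)                  # torch.Size([2, 196, 64])
```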