Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods built on CNNs and Transformers have achieved strong performance, their effectiveness degrades under fast motion, low light, or high dynamic range because of the limitations of frame-based cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet they lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both the spatial and temporal dimensions. This design improves cross-modal alignment and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showing its promise for efficient, scalable, and robust multimodal perception.