Object segmentation for robotic grasping under dynamic conditions often faces challenges such as occlusion, low light, motion blur, and variance in object size. To address these challenges, we propose a deep learning network that fuses two types of visual signal: event-based data and RGB frame data. The proposed Bimodal SegNet has two distinct encoders, one for each input signal, and a spatial pyramid pooling module with atrous convolutions. The encoders capture rich contextual information by pooling the concatenated features at multiple resolutions, while the decoder recovers sharp object boundaries. We evaluate the proposed method on five distinct image degradation challenges, namely occlusion, blur, brightness, trajectory, and scale variance, using the Event-based Segmentation (ESD) Dataset. The results show a 6-10\% improvement in segmentation accuracy over state-of-the-art methods in terms of mean intersection over union and pixel accuracy. The model code is available at https://github.com/sanket0707/Bimodal-SegNet.git
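To illustrate the dual-encoder design described above, the following is a minimal PyTorch sketch of a bimodal segmentation network: one encoder per modality, feature concatenation, atrous spatial pyramid pooling over the fused features, and a decoder that upsamples to per-pixel class predictions. It is not the authors' exact architecture; module names, channel widths, layer counts, and dilation rates are illustrative assumptions.

```python
# Minimal sketch of a dual-encoder segmentation network with atrous spatial
# pyramid pooling (ASPP). All sizes and layer choices are assumptions, not the
# published Bimodal SegNet configuration.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions followed by 2x downsampling (SegNet-style encoder stage).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ASPP(nn.Module):
    # Atrous spatial pyramid pooling: parallel dilated convolutions at several rates,
    # concatenated and projected back to a single feature map.
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class BimodalSegNetSketch(nn.Module):
    def __init__(self, rgb_ch=3, event_ch=2, num_classes=6):
        super().__init__()
        # One encoder per modality: RGB frames and an event representation.
        self.rgb_enc = nn.Sequential(conv_block(rgb_ch, 64), conv_block(64, 128))
        self.evt_enc = nn.Sequential(conv_block(event_ch, 64), conv_block(64, 128))
        # Fuse by concatenation, then pool context at multiple dilation rates.
        self.aspp = ASPP(256, 256)
        # Decoder upsamples back to input resolution and predicts per-pixel classes.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, rgb, events):
        fused = torch.cat([self.rgb_enc(rgb), self.evt_enc(events)], dim=1)
        return self.decoder(self.aspp(fused))

# Usage: a 256x256 RGB frame paired with a 2-channel event tensor (hypothetical shapes).
rgb = torch.randn(1, 3, 256, 256)
events = torch.randn(1, 2, 256, 256)
print(BimodalSegNetSketch()(rgb, events).shape)  # torch.Size([1, 6, 256, 256])
```

The sketch keeps the two modalities separate until the bottleneck so that each encoder can learn modality-specific features, and relies on the dilated branches of the ASPP module to aggregate context at several effective receptive fields before decoding.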