With the increasing demand for autonomous systems, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient enough for potential real-time applications. In this paper, we propose the Context Aggregation Network, a dual-branch convolutional neural network with significantly lower computational cost than the state of the art, while maintaining competitive prediction accuracy. Building upon existing dual-branch architectures for high-speed semantic segmentation, we design a cheap high-resolution branch for effective spatial detailing and a context branch with lightweight versions of global aggregation and local distribution blocks, capable of capturing both the long-range and local contextual dependencies required for accurate semantic segmentation at low computational overhead. We evaluate our method on two semantic segmentation datasets, namely the Cityscapes dataset and the UAVid dataset. On the Cityscapes test set, our model achieves state-of-the-art results with an mIoU of 75.9% at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. On the UAVid dataset, our network achieves an mIoU of 63.5% at a high execution speed (15 FPS).
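For intuition, the sketch below shows a generic dual-branch segmentation network of the kind the abstract describes: a shallow high-resolution branch for spatial detail and a deeper, downsampled context branch whose output is fused before the prediction head. The class names, layer widths, and the global-pooling re-weighting used as a stand-in for the global aggregation and local distribution blocks are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialBranch(nn.Module):
    """Shallow, high-resolution branch for spatial detail (assumed: three strided convs to 1/8 scale)."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 48, 3, stride=2, padding=1), nn.BatchNorm2d(48), nn.ReLU(inplace=True),
            nn.Conv2d(48, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)  # 1/8 resolution features with fine spatial detail


class ContextBranch(nn.Module):
    """Deeper, low-resolution branch for long-range context.

    The global average pooling plus 1x1 conv re-weighting here is only a
    simplified stand-in for the paper's global aggregation / local
    distribution blocks.
    """
    def __init__(self, out_ch=64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid()
        )

    def forward(self, x):
        feat = self.down(x)                  # 1/16 resolution features
        return feat * self.global_ctx(feat)  # re-weight channels by global context


class DualBranchSegNet(nn.Module):
    """Minimal dual-branch segmentation model (illustrative only)."""
    def __init__(self, num_classes=19, ch=64):
        super().__init__()
        self.spatial = SpatialBranch(ch)
        self.context = ContextBranch(ch)
        self.head = nn.Conv2d(2 * ch, num_classes, 1)

    def forward(self, x):
        s = self.spatial(x)
        c = self.context(x)
        # Upsample context features to the spatial branch resolution and fuse.
        c = F.interpolate(c, size=s.shape[2:], mode="bilinear", align_corners=False)
        logits = self.head(torch.cat([s, c], dim=1))
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear", align_corners=False)


if __name__ == "__main__":
    net = DualBranchSegNet(num_classes=19)          # 19 classes, as in Cityscapes
    out = net(torch.randn(1, 3, 512, 1024))         # Cityscapes-like input size
    print(out.shape)                                # torch.Size([1, 19, 512, 1024])
```

The design choice this illustrates is the one the abstract relies on: spatial detail is recovered cheaply at high resolution while expensive context modelling runs on heavily downsampled features, which is what keeps the overall computational cost low.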