Video semantic segmentation aims to generate accurate semantic maps for each video frame. To this end, many works are dedicated to integrating diverse information from consecutive frames to enhance the features used for prediction, which usually requires a feature alignment procedure based on estimated optical flow. However, optical flow is inevitably inaccurate, which introduces noise into the feature fusion and further leads to unsatisfactory segmentation results. In this paper, to tackle the misalignment issue, we propose a spatial-temporal fusion (STF) module that models dense pairwise relationships among multi-frame features. Different from previous methods, STF uniformly and adaptively fuses features at different spatial and temporal positions, and avoids error-prone optical flow estimation. Besides, we further exploit feature refinement within a single frame and propose a novel memory-augmented refinement (MAR) module to tackle difficult predictions near semantic boundaries. Specifically, MAR stores the boundary features and prototypes extracted from the training samples, which together form a task-specific memory, and uses them to refine the features during inference. Essentially, MAR moves hard features closer to their most likely category and thus makes them more discriminative. We conduct extensive experiments on Cityscapes and CamVid, and the results show that our proposed methods significantly outperform previous methods and achieve state-of-the-art performance. Code and pretrained models are available at https://github.com/jfzhuang/ST_Memory.
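To make the STF idea concrete, below is a minimal PyTorch sketch of the kind of attention-style fusion the abstract describes: the current frame's features attend to every spatial position of every frame in a clip, so fusion weights come from learned pairwise similarities rather than optical-flow correspondences. The class name, the single-head design, and all dimensions are illustrative assumptions, not the released STF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalFusion(nn.Module):
    """Illustrative sketch (not the authors' exact STF module):
    dense pairwise attention between the current frame and all
    spatial-temporal positions in the clip, with no optical flow."""

    def __init__(self, channels: int, key_dim: int = 64):
        super().__init__()
        self.query = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.key = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.key_dim = key_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W); the last frame is the one being segmented.
        T, C, H, W = feats.shape
        q = self.query(feats[-1:]).flatten(2)[0].t()          # (HW, D)
        k = self.key(feats).flatten(2).permute(1, 0, 2).reshape(self.key_dim, -1)  # (D, T*HW)
        v = self.value(feats).flatten(2).permute(1, 0, 2).reshape(C, -1)           # (C, T*HW)
        # Each current-frame position adaptively weights every
        # position of every frame, instead of warping by flow.
        attn = F.softmax(q @ k * self.key_dim ** -0.5, dim=-1)  # (HW, T*HW)
        fused = (attn @ v.t()).t().reshape(C, H, W)
        return feats[-1] + fused  # residual connection

# Usage on hypothetical backbone features: 3 frames, 256 channels.
stf = SpatialTemporalFusion(channels=256)
clip = torch.randn(3, 256, 32, 64)
fused = stf(clip)  # (256, 32, 64)
```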
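Likewise, a hedged sketch of the memory-based refinement idea: assuming the task-specific memory is simplified to one prototype vector per class (the paper's memory also stores boundary features), each pixel feature is softly pulled toward its most similar prototype at inference time, which is one way to move hard features closer to their most likely category. `MemoryAugmentedRefinement`, `alpha`, and the random prototype initialization are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedRefinement(nn.Module):
    """Illustrative sketch of prototype-based refinement; the real MAR
    module builds its memory from training samples, not random init."""

    def __init__(self, num_classes: int, channels: int, alpha: float = 0.5):
        super().__init__()
        # Placeholder memory; would be filled from training data.
        self.register_buffer("prototypes", torch.randn(num_classes, channels))
        self.alpha = alpha  # interpolation strength toward the memory

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (C, H, W) -> treat each pixel as a feature vector.
        C, H, W = feats.shape
        x = feats.flatten(1).t()  # (HW, C)
        # Cosine similarity to every class prototype, softly normalized.
        sim = F.softmax(
            F.normalize(x, dim=1) @ F.normalize(self.prototypes, dim=1).t(),
            dim=-1,
        )                                    # (HW, num_classes)
        refined = sim @ self.prototypes      # soft prototype assignment
        x = (1 - self.alpha) * x + self.alpha * refined
        return x.t().reshape(C, H, W)
```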