In recent years, video semantic segmentation has made great progress with the help of advanced deep neural networks. However, two main challenges remain, \ie, information inconsistency and high computational cost. To address these difficulties, we propose a novel motion-state alignment framework for video semantic segmentation that maintains both motion and state consistency. In this framework, we first construct a motion alignment branch equipped with an efficient decoupled transformer to capture dynamic semantics, guaranteeing region-level temporal consistency. Then, a state alignment branch composed of a stage transformer is designed to enrich the feature space of the current frame, extracting static semantics and achieving pixel-level state consistency. Next, through a semantic assignment mechanism, a region descriptor for each semantic category is obtained from the dynamic semantics and linked with pixel descriptors from the static semantics. Benefiting from the alignment of these two kinds of information, the proposed method extracts dynamic and static semantics in a targeted way, so that video semantic regions are segmented consistently and located precisely with low computational complexity. Extensive experiments on the Cityscapes and CamVid datasets show that the proposed approach outperforms state-of-the-art methods, validating the effectiveness of the motion-state alignment framework.
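To make the semantic assignment step concrete, the following is a minimal sketch of how region descriptors (one per category, from the motion branch) could be linked to pixel descriptors (from the state branch), assuming a simple cosine-affinity matching. The function name, tensor shapes, and the normalization choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def semantic_assignment(pixel_desc: torch.Tensor, region_desc: torch.Tensor) -> torch.Tensor:
    """Link per-pixel descriptors (static semantics) to per-class region
    descriptors (dynamic semantics) via a cosine-like feature affinity.

    pixel_desc:  (B, C, H, W) pixel descriptors from the state branch
    region_desc: (B, K, C) one region descriptor per semantic category,
                 from the motion branch
    returns:     (B, K, H, W) per-pixel class scores
    """
    B, C, H, W = pixel_desc.shape
    K = region_desc.shape[1]
    # Flatten spatial dimensions: (B, C, H*W)
    pixels = pixel_desc.flatten(2)
    # L2-normalize both descriptor sets so the dot product acts as
    # a cosine affinity (an assumption for this sketch)
    pixels = F.normalize(pixels, dim=1)
    regions = F.normalize(region_desc, dim=2)
    # Affinity of every region descriptor with every pixel: (B, K, H*W)
    scores = torch.bmm(regions, pixels)
    return scores.view(B, K, H, W)

# Example usage with hypothetical shapes (19 Cityscapes classes):
pixel_desc = torch.randn(2, 256, 64, 128)   # static semantics
region_desc = torch.randn(2, 19, 256)       # dynamic semantics
logits = semantic_assignment(pixel_desc, region_desc)  # (2, 19, 64, 128)
```

Under this reading, each pixel is classified by the region descriptor it aligns with most strongly, which is one way the framework could obtain consistent region-level labels without recomputing dynamic semantics per pixel.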