Video semantic segmentation (VSS) is a computationally expensive task due to per-frame prediction on high-frame-rate videos. Recent work has proposed compact models or adaptive network strategies for efficient VSS. However, these approaches overlook a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg reduces the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of the motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with a local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% of the computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
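The CReFF steps described above (motion-vector warping followed by local-attention aggregation) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes integer motion-vector offsets and a plain dot-product attention over a k×k window, whereas the actual module operates on learned features inside the segmentation network. Function names and conventions here are hypothetical.

```python
import numpy as np

def warp_features(feat, mv):
    """Backward-warp a feature map (C, H, W) using integer motion
    vectors (2, H, W); mv[0] = horizontal offset, mv[1] = vertical
    offset (an assumed convention for compressed-video MVs)."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(xs + mv[0], 0, W - 1)  # clamp sources to the frame
    src_y = np.clip(ys + mv[1], 0, H - 1)
    return feat[:, src_y, src_x]

def local_attention_fuse(lr_feat, warped_hr_feat, k=3):
    """Selectively aggregate warped high-resolution features with the
    low-resolution features via dot-product attention over a k x k
    local window (a simplification of the paper's learned attention)."""
    C, H, W = lr_feat.shape
    pad = k // 2
    padded = np.pad(warped_hr_feat, ((0, 0), (pad, pad), (pad, pad)),
                    mode="edge")
    out = np.empty_like(lr_feat)
    for y in range(H):
        for x in range(W):
            window = padded[:, y:y + k, x:x + k].reshape(C, -1)  # (C, k*k)
            scores = lr_feat[:, y, x] @ window                   # (k*k,)
            weights = np.exp(scores - scores.max())              # softmax
            weights /= weights.sum()
            out[:, y, x] = window @ weights                      # weighted sum
    return out
```

With zero motion vectors, `warp_features` is an identity mapping, which matches the intuition that a static scene needs no alignment before fusion.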