Real-time semantic segmentation of high-resolution videos is challenging due to strict speed requirements. Recent approaches exploit inter-frame continuity to reduce redundant computation by warping feature maps across adjacent frames, which greatly speeds up inference. However, their accuracy drops significantly owing to imprecise motion estimation and error accumulation. In this paper, we propose to introduce a simple and effective correction stage right after the warping stage, forming a framework named Tamed Warping Network (TWNet) that aims to improve the accuracy and robustness of warping-based models. Experimental results on the Cityscapes dataset show that with the correction, the accuracy (mIoU) increases significantly from 67.3% to 71.6%, while the speed edges down from 65.5 FPS to 61.8 FPS. For non-rigid categories such as "human" and "object", the IoU improvements exceed 18 percentage points.
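To make the warp-then-correct pipeline concrete, below is a minimal PyTorch sketch of the two stages the abstract describes: flow-based feature warping from a previous frame, followed by a lightweight correction that fuses the warped features with cheap features computed on the current frame. This is an illustration under our own assumptions, not the paper's actual architecture; the names warp_features and CorrectionModule, the residual-fusion design, and the channel arguments are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_features(feat, flow):
    # Warp previous-frame features to the current frame via backward flow.
    # feat: (N, C, H, W) features from the previous (key) frame.
    # flow: (N, 2, H, W) flow mapping current-frame pixels back into feat.
    n, _, h, w = feat.shape
    # Base sampling grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class CorrectionModule(nn.Module):
    # Hypothetical correction stage: refines warped features using shallow
    # features from the current frame, so errors from imprecise motion
    # estimation do not accumulate across frames.
    def __init__(self, channels, shallow_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + shallow_channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, warped, shallow):
        # Residual correction: the warped features act as a prior that the
        # current-frame evidence adjusts rather than replaces.
        return warped + self.fuse(torch.cat([warped, shallow], dim=1))

In use, only a shallow feature extractor and the correction module run on non-key frames, so most of the heavy backbone computation is still amortized across the video; the correction trades a small amount of speed (65.5 to 61.8 FPS in the reported results) for a large accuracy gain.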