In video segmentation, generating temporally consistent results across frames is as important as achieving frame-wise accuracy. Existing methods rely either on optical-flow regularization or on fine-tuning with test data to attain temporal consistency. However, optical flow is not always available or reliable, and it is expensive to compute; fine-tuning the original model at test time is likewise computationally costly. This paper presents AuxAdapt, an efficient, intuitive, and unsupervised online adaptation method for improving the temporal consistency of most neural network models. It requires no optical flow and takes only one pass over the video. Since inconsistency mainly arises from the model's uncertainty in its output, we propose an adaptation scheme in which the model learns from its own segmentation decisions as it streams a video, allowing it to produce more confident and temporally consistent labels for similar-looking pixels across frames. For stability and efficiency, we leverage a small auxiliary segmentation network (AuxNet) to assist with this adaptation. More specifically, AuxNet readjusts the decision of the original segmentation network (MainNet) by adding its own estimate to MainNet's. At every frame, only AuxNet is updated via back-propagation while MainNet is kept fixed. We extensively evaluate our test-time adaptation approach on standard video benchmarks, including Cityscapes, CamVid, and KITTI. The results demonstrate that our approach provides label-wise accurate, temporally consistent, and computationally efficient adaptation, with more than a 5-fold reduction in overhead compared to state-of-the-art test-time adaptation methods.
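To make the adaptation scheme concrete, the following is a minimal PyTorch-style sketch of the online loop described above: a frozen MainNet, a small trainable AuxNet whose logits are added to MainNet's, and a self-training update on the fused prediction's own hard labels. The network definitions, optimizer choice, learning rate, and the `video_frames` iterable are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of AuxAdapt's per-frame test-time adaptation loop.
# Only AuxNet receives gradient updates; MainNet stays fixed throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


def aux_adapt_stream(main_net: nn.Module,
                     aux_net: nn.Module,
                     video_frames,            # iterable of (1, 3, H, W) tensors (assumed)
                     lr: float = 1e-4):       # learning rate is an assumed placeholder
    """One pass over a video, adapting AuxNet online and yielding per-frame labels."""
    main_net.eval()                           # MainNet is kept fixed
    for p in main_net.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.SGD(aux_net.parameters(), lr=lr)

    for frame in video_frames:
        # Fuse the two decisions by adding AuxNet's logits to MainNet's.
        with torch.no_grad():
            main_logits = main_net(frame)     # (1, C, H, W)
        aux_logits = aux_net(frame)           # (1, C, H, W)
        fused_logits = main_logits + aux_logits

        # Self-training target: the model's own (fused) segmentation decision.
        pseudo_label = fused_logits.argmax(dim=1).detach()   # (1, H, W)

        # Back-propagate only through AuxNet so the fused prediction becomes
        # more confident (and hence more temporally consistent) on similar pixels.
        loss = F.cross_entropy(fused_logits, pseudo_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        yield pseudo_label                    # segmentation output for this frame
```

In this sketch, gradients flow only through `aux_logits` because MainNet's output is computed under `torch.no_grad()`, which mirrors the abstract's statement that only AuxNet is updated while MainNet remains fixed.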