Video semantic segmentation requires exploiting the complex temporal relations between frames of the video sequence. Previous works usually rely on accurate optical flow to leverage these temporal relations, which suffers from heavy computational cost. In this paper, we propose a Temporal Memory Attention Network (TMANet) to adaptively integrate the long-range temporal relations over the video sequence based on the self-attention mechanism, without exhaustive optical flow prediction. Specifically, we construct a memory from several past frames to store the temporal information of the current frame. We then propose a temporal memory attention module to capture the relation between the current frame and the memory, enhancing the representation of the current frame. Our method achieves new state-of-the-art performance on two challenging video semantic segmentation datasets, reaching 80.3% mIoU on Cityscapes and 76.5% mIoU on CamVid with ResNet-50.
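To make the described mechanism concrete, the following is a minimal PyTorch sketch of a memory-based attention block in the spirit of the temporal memory attention module: the query is derived from the current-frame features, while keys and values come from a memory stacked from several past frames. Layer choices, channel sizes, and the residual fusion are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class TemporalMemoryAttention(nn.Module):
    """Sketch of memory-based self-attention between a current frame and
    a memory of past-frame features (hypothetical configuration)."""

    def __init__(self, channels: int, key_channels: int = 64):
        super().__init__()
        # 1x1 projections; exact dimensions are an assumption
        self.query_conv = nn.Conv2d(channels, key_channels, kernel_size=1)
        self.key_conv = nn.Conv3d(channels, key_channels, kernel_size=1)
        self.value_conv = nn.Conv3d(channels, channels, kernel_size=1)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (B, C, H, W) features of the current frame
        # memory:  (B, C, T, H, W) features of T past frames
        b, c, h, w = current.shape

        q = self.query_conv(current).flatten(2)              # (B, Ck, HW)
        k = self.key_conv(memory).flatten(2)                 # (B, Ck, T*HW)
        v = self.value_conv(memory).flatten(2)                # (B, C,  T*HW)

        # Attention of every current-frame position over all memory positions
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, HW, T*HW)
        agg = (attn @ v.transpose(1, 2)).transpose(1, 2)      # (B, C, HW)
        agg = agg.view(b, c, h, w)

        # Residual fusion of the aggregated temporal context
        return current + self.out_conv(agg)


if __name__ == "__main__":
    tma = TemporalMemoryAttention(channels=256)
    cur = torch.randn(2, 256, 32, 64)        # current-frame features
    mem = torch.randn(2, 256, 3, 32, 64)     # memory built from 3 past frames
    print(tma(cur, mem).shape)               # torch.Size([2, 256, 32, 64])
```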