This paper proposes a novel deep learning-based video object matting method that achieves temporally coherent matting results. Its key component is an attention-based temporal aggregation module that carries the strength of image matting networks over to video matting. This module computes temporal correlations for pixels adjacent to each other along the time axis in feature space, which is robust against motion noise. We also design a novel loss term to train the attention weights, which drastically boosts video matting performance. In addition, we show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network with a sparse set of user-annotated keyframes. To facilitate the training of the video matting and trimap generation networks, we construct a large-scale video matting dataset with 80 training and 28 validation foreground video clips with ground-truth alpha mattes. Experimental results show that our method generates high-quality alpha mattes for various videos featuring appearance change, occlusion, and fast motion. Our code and dataset can be found at: https://github.com/yunkezhang/TCVOM
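To make the idea of temporal aggregation concrete, below is a minimal sketch of per-pixel attention over adjacent frames' features, written in PyTorch. It is not the authors' implementation: the function name `temporal_attention`, the choice of plain scaled dot-product attention, and the use of only two neighboring frames are all illustrative assumptions.

```python
# A minimal sketch (not the paper's actual module) of attention-based
# temporal feature aggregation: pixels of the current frame attend to
# pixels of the adjacent frames in feature space.
import torch

def temporal_attention(feat_prev, feat_cur, feat_next):
    """Aggregate features from adjacent frames via per-pixel attention.

    Each input tensor has shape (C, H, W). The current frame's pixel
    features act as queries; the neighboring frames' pixel features
    act as keys and values.
    """
    c, h, w = feat_cur.shape
    q = feat_cur.reshape(c, -1).t()                        # (HW, C)
    kv = torch.cat([feat_prev.reshape(c, -1),
                    feat_next.reshape(c, -1)], dim=1).t()  # (2HW, C)
    # Scaled dot-product attention weights over neighbor pixels.
    attn = torch.softmax(q @ kv.t() / c ** 0.5, dim=-1)    # (HW, 2HW)
    aggregated = attn @ kv                                  # (HW, C)
    return aggregated.t().reshape(c, h, w)

# Usage with random features:
f_prev, f_cur, f_next = (torch.randn(16, 8, 8) for _ in range(3))
out = temporal_attention(f_prev, f_cur, f_next)
print(out.shape)  # torch.Size([16, 8, 8])
```

In the paper's setting, such attention weights would additionally be supervised by the proposed loss term; the sketch above shows only the forward aggregation step.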