This paper proposes a novel deep learning-based video object matting method that achieves temporally coherent matting results. Its key component is an attention-based temporal aggregation module that leverages the strength of image matting networks for video matting. This module computes temporal correlations in feature space between pixels that are adjacent along the time axis, making it robust to motion noise. We also design a novel loss term to train the attention weights, which drastically boosts video matting performance. In addition, we show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network on a sparse set of user-annotated keyframes. To facilitate the training of the video matting and trimap generation networks, we construct a large-scale video matting dataset with 80 training and 28 validation foreground video clips with ground-truth alpha mattes. Experimental results show that our method can generate high-quality alpha mattes for various videos featuring appearance change, occlusion, and fast motion. Our code and dataset can be found at https://github.com/yunkezhang/TCVOM.
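To illustrate the idea of computing temporal correlations between feature-space neighbours, below is a minimal sketch in PyTorch. The module name `TemporalAggregationSketch`, the local search window size, and the query/key projections are assumptions for illustration only; they do not reproduce the paper's actual architecture or its attention loss.

```python
# Minimal sketch: attention-based temporal feature aggregation between two
# adjacent frames, assuming a PyTorch setting. Hypothetical module, not the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAggregationSketch(nn.Module):
    def __init__(self, channels: int, window: int = 3):
        super().__init__()
        self.window = window                      # spatial search window in the adjacent frame
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_t: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
        # feat_t, feat_prev: (B, C, H, W) features of the current and previous frame.
        b, c, h, w = feat_t.shape
        q = self.query(feat_t).view(b, c, 1, h * w)                     # (B, C, 1, HW)
        k = F.unfold(self.key(feat_prev), kernel_size=self.window,
                     padding=self.window // 2)                          # (B, C*k*k, HW)
        k = k.view(b, c, self.window * self.window, h * w)              # (B, C, k*k, HW)
        # Correlation of each pixel with its temporal neighbours, softmax-normalized.
        attn = torch.softmax((q * k).sum(dim=1) / c ** 0.5, dim=1)      # (B, k*k, HW)
        v = F.unfold(feat_prev, kernel_size=self.window,
                     padding=self.window // 2).view(b, c, self.window * self.window, h * w)
        aggregated = (attn.unsqueeze(1) * v).sum(dim=2).view(b, c, h, w)
        return aggregated
```

The softmax-weighted sum pulls in features from the previous frame only where they correlate with the current pixel, which is the intuition behind aggregating temporally adjacent features while staying robust to motion noise.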