We pose video object segmentation as spectral graph clustering in space and time, with one graph node for each pixel and edges forming local space-time neighborhoods. We claim that the strongest cluster in this video graph represents the salient object. We start by introducing a novel and efficient method based on 3D filtering for approximating the spectral solution (the principal eigenvector of the graph's adjacency matrix) without explicitly building the matrix. This key property enables a fast parallel GPU implementation, orders of magnitude faster than classical approaches for computing the eigenvector. Our motivation for a spectral space-time clustering approach, unique in the video semantic segmentation literature, is that such clustering is dedicated to preserving object consistency over time, which we evaluate using our novel segmentation consistency measure. We further show how to efficiently learn the solution over multiple input feature channels. Finally, we extend the formulation of our approach beyond the segmentation task, into the realm of object tracking. In extensive experiments we show significant improvements over top methods, as well as over powerful ensembles that combine them, achieving state-of-the-art results on multiple benchmarks, for both tracking and segmentation.
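The matrix-free idea above can be sketched as power iteration in which each multiplication by the (never materialized) adjacency matrix is replaced by a local 3D filter over the video volume. This is only an illustrative sketch under simplifying assumptions: uniform box-filter edge weights over a fixed neighborhood, zero padding at the volume boundary, and the helper name `leading_eigenvector` is hypothetical, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def leading_eigenvector(video_shape, n_iters=50, size=5, seed=0):
    """Sketch: approximate the principal eigenvector of a space-time graph
    whose edges connect pixels within a local 3D neighborhood.

    Because the adjacency operator acts locally, adjacency * x reduces to a
    3D box filter over the (T, H, W) volume, so the matrix is never built.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(video_shape)  # random nonnegative initialization
    for _ in range(n_iters):
        # matrix-free matvec: average over each pixel's space-time neighborhood
        x = uniform_filter(x, size=size, mode="constant")
        x /= np.linalg.norm(x)   # renormalize each power-iteration step
    return x

# Toy example on a tiny 8-frame, 32x32 video volume
seg = leading_eigenvector((8, 32, 32))
print(seg.shape)
```

Each iteration is a single separable 3D convolution, which is what makes a highly parallel GPU implementation natural: the same local filter is applied independently at every voxel.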