The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape, and texture can change dramatically, preserving virtually nothing of the original except the identity itself. Yet this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close this gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications to the top-performing baseline that improve its performance by better modeling spatio-temporal information. More broadly, we hope this work will stimulate discussion on learning more robust video object representations.