Existing video object segmentation (VOS) benchmarks focus on short-term videos, which last only about 3-5 seconds and in which objects remain visible most of the time. Such videos are poorly representative of practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. In this paper, we present a new benchmark dataset and evaluation methodology named LVOS, which consists of 220 videos with a total duration of 421 minutes. To the best of our knowledge, LVOS is the first densely annotated long-term VOS dataset. The videos in LVOS last 1.59 minutes on average, 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term object reappearance and cross-temporal similar objects. Moreover, we provide additional language descriptions to encourage the exploration of integrating linguistic and visual features for video object segmentation. Based on LVOS, we assess existing video object segmentation algorithms and propose a Diverse Dynamic Memory network (DDMemory), which consists of three complementary memory banks to exploit temporal information adequately. The experimental results demonstrate the strengths and weaknesses of prior methods, pointing out promising directions for further study. Our objective is to provide the community with a large and varied benchmark to boost the advancement of long-term VOS. Data and code are available at \url{https://lingyihongfd.github.io/lvos.github.io/}.