This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video containing multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, which limits their real-world applicability. We propose a transition-mimicking data augmentation strategy (TMA), which enables cross-shot generalization from single-shot data and thereby alleviates the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.