Semi-supervised video object segmentation is the task of segmenting a target object in a video sequence given only a mask annotation in the first frame. This limited information makes the task extremely challenging. Most of the best-performing previous methods adopt either matching-based transductive reasoning or online inductive learning. However, the former is less discriminative for similar instances, while the latter makes insufficient use of spatio-temporal information. In this work, we propose to integrate transductive and inductive learning into a unified framework that exploits their complementarity for accurate and robust video object segmentation. The proposed approach consists of two functional branches. The transduction branch adopts a lightweight transformer architecture to aggregate rich spatio-temporal cues, while the induction branch performs online inductive learning to obtain discriminative target information. To bridge these two diverse branches, a two-head label encoder is introduced to learn a suitable target prior for each of them. The generated mask encodings are further forced to be disentangled so that they better retain their complementarity. Extensive experiments on several prevalent benchmarks show that, without the need for synthetic training data, the proposed approach sets a series of new state-of-the-art records. Code is available at https://github.com/maoyunyao/JOINT.
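As a rough illustration of the two-head label encoder and the disentanglement constraint described above, the following minimal sketch maps one flattened first-frame mask to two encodings, one per branch, and scores how strongly they overlap. It is not the released implementation: the abstract does not specify the form of the disentanglement term, so the squared cosine similarity used here is an assumption, and all names (`two_head_label_encoder`, `disentangle_penalty`, `make_head`) and the random weights standing in for learned layers are hypothetical.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_head(out_dim, in_dim, rng):
    """Hypothetical layer: random weights standing in for learned ones."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def two_head_label_encoder(mask, shared, head_t, head_i):
    """Encode one flattened mask into two encodings, one per branch."""
    # Shared trunk followed by ReLU, then one projection per branch.
    hidden = [max(0.0, h) for h in linear(shared, mask)]
    return linear(head_t, hidden), linear(head_i, hidden)


def disentangle_penalty(a, b, eps=1e-12):
    """Assumed penalty: squared cosine similarity (0 if orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (dot / (norm_a * norm_b + eps)) ** 2


rng = random.Random(0)
mask = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]  # toy flattened mask
shared = make_head(6, len(mask), rng)
head_t = make_head(4, 6, rng)  # head feeding the transduction branch
head_i = make_head(4, 6, rng)  # head feeding the induction branch

enc_t, enc_i = two_head_label_encoder(mask, shared, head_t, head_i)
penalty = disentangle_penalty(enc_t, enc_i)
```

In a training loop, a term like `penalty` would be added to the segmentation loss so that the two heads are pushed toward encoding different aspects of the target; how the actual method weights or formulates this term is not stated in the abstract.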