Referring video object segmentation (R-VOS) aims to segment object masks in a video given a linguistic expression referring to the object. It is a recently introduced task that has attracted growing research attention. However, all existing works make a strong assumption: the object depicted by the expression must exist in the video, i.e., the expression and video must share an object-level semantic consensus. This assumption is often violated in real-world applications, where an expression may be queried against an unrelated video, and existing methods consistently fail on such false queries because they rely on the assumption. In this work, we emphasize that studying semantic consensus is necessary to improve the robustness of R-VOS. Accordingly, we pose an extended task that removes the semantic consensus assumption from R-VOS, named Robust R-VOS ($\mathrm{R}^2$-VOS). The $\mathrm{R}^2$-VOS task is essentially related to the joint modeling of the primary R-VOS task and its dual problem (text reconstruction). We embrace the observation that the embedding spaces exhibit relational consistency through the cycle of text-video-text transformation, which connects the primary and dual problems. We leverage this cycle consistency to discriminate the semantic consensus, thus advancing the primary task. Parallel optimization of the primary and dual problems is enabled by introducing an early grounding medium. A new evaluation dataset, $\mathrm{R}^2$-Youtube-VOS, is collected to measure the robustness of R-VOS models against unpaired videos and expressions. Extensive experiments demonstrate that our method not only identifies negative pairs of unrelated expressions and videos, but also improves segmentation accuracy for positive pairs with superior disambiguating ability. Our model achieves state-of-the-art performance on Ref-DAVIS17, Ref-Youtube-VOS, and the novel $\mathrm{R}^2$-Youtube-VOS dataset.
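To make the cycle-consistency idea concrete, the following is a minimal sketch (not the authors' implementation) of how relational consistency between original text embeddings and embeddings reconstructed through a text-video-text cycle could be used to score semantic consensus. The encoder stand-ins, tensor shapes, and the threshold `tau` are illustrative assumptions.

```python
# Hedged sketch: scoring semantic consensus via text-video-text cycle consistency.
# All names, shapes, and the threshold below are assumptions for illustration.
import torch
import torch.nn.functional as F


def relational_matrix(x: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix over a batch of embeddings of shape (B, D)."""
    x = F.normalize(x, dim=-1)
    return x @ x.t()


def cycle_consistency_score(text_emb: torch.Tensor,
                            cycled_text_emb: torch.Tensor) -> torch.Tensor:
    """
    Compare the relational structure of the original text embeddings with that of
    embeddings reconstructed through the text -> video -> text cycle. A small
    discrepancy suggests the expression and video share an object-level semantic
    consensus (positive pair); a large one flags an unpaired (negative) query.
    """
    r_orig = relational_matrix(text_emb)
    r_cycle = relational_matrix(cycled_text_emb)
    # Per-sample discrepancy: mean absolute difference of each row of the relation matrices.
    return (r_orig - r_cycle).abs().mean(dim=-1)


if __name__ == "__main__":
    B, D = 4, 256                                      # 4 expressions, 256-d embeddings (assumed)
    text_emb = torch.randn(B, D)                       # stand-in for the text encoder output
    cycled_emb = text_emb + 0.05 * torch.randn(B, D)   # stand-in for the dual (text reconstruction) branch
    score = cycle_consistency_score(text_emb, cycled_emb)
    tau = 0.1                                          # hypothetical decision threshold
    is_positive_pair = score < tau                     # True: keep segmentation; False: reject the query
    print(score, is_positive_pair)
```

In such a scheme, the consensus score could gate the segmentation output: predictions for queries judged negative are suppressed, while positive pairs proceed through the primary R-VOS branch.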