This paper proposes a framework for interactive video object segmentation (VOS) in the wild, where users iteratively select a few frames to annotate. Based on these user annotations, a segmentation algorithm then refines the masks. The previous interactive VOS paradigm selects the frame with the worst evaluation score, but computing that score requires the ground truth, which is unavailable at test time. In contrast, we argue that the frame with the worst evaluation score is not necessarily the most valuable frame, i.e., the one yielding the largest performance improvement across the video. We therefore formulate frame selection in interactive VOS as a Markov Decision Process and train an agent to recommend frames within a deep reinforcement learning framework. The learned agent automatically identifies the most valuable frame, making the interactive setting practical in the wild. Experimental results on public datasets demonstrate the effectiveness of our learned agent without any changes to the underlying VOS algorithms. Our data, code, and models are available at https://github.com/svip-lab/IVOS-W.
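The MDP formulation above can be illustrated with a minimal, hypothetical sketch: the state is a per-frame quality estimate, the action is the frame index recommended for annotation, and the reward is the overall quality gain after the VOS model refines the masks. All function names (`recommend_frame`, `refine_masks`) and the greedy policy are illustrative stand-ins, not the paper's actual learned agent or API.

```python
def recommend_frame(quality, annotated):
    """Greedy stand-in for the learned RL policy: pick the
    least-confident frame that has not yet been annotated."""
    candidates = [i for i in range(len(quality)) if i not in annotated]
    return min(candidates, key=lambda i: quality[i])

def refine_masks(quality, frame):
    """Toy surrogate for the underlying VOS algorithm: annotating a
    frame fixes its mask and propagates a smaller gain to neighbors."""
    new_q = list(quality)
    new_q[frame] = 1.0
    for j in (frame - 1, frame + 1):
        if 0 <= j < len(new_q):
            new_q[j] = min(1.0, new_q[j] + 0.2)
    return new_q

def interactive_vos(quality, rounds=3):
    """One interactive session: each round the agent acts (recommends a
    frame), the environment transitions (masks are refined), and the
    reward is the video-wide quality improvement."""
    annotated = set()
    for _ in range(rounds):
        action = recommend_frame(quality, annotated)  # agent's action
        annotated.add(action)
        new_quality = refine_masks(quality, action)   # state transition
        reward = sum(new_quality) - sum(quality)      # per-round reward
        quality = new_quality
    return quality, annotated
```

In the paper's setting the greedy heuristic above is exactly what is replaced: a deep RL agent learns which recommendation maximizes long-term reward, which need not be the currently worst-scoring frame.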