While current methods for interactive Video Object Segmentation (iVOS) rely on scribble-based interactions to generate precise object masks, we propose a Click-based interactive Video Object Segmentation (CiVOS) framework that reduces the required user workload as far as possible. CiVOS is built on decoupled modules for user interaction and mask propagation. The interaction module converts click-based interactions into an object mask, which the propagation module then propagates to the remaining frames. Additional user interactions refine the object mask. The approach is extensively evaluated on the popular interactive DAVIS dataset, with the unavoidable replacement of its scribble-based interactions by click-based counterparts. During evaluation, we consider several click-generation strategies to reflect varied user input, and we adjust the DAVIS performance metric to allow a hardware-independent comparison. The presented CiVOS pipeline achieves competitive results while requiring a lower user workload.
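The decoupled design described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `Click`, `interaction_module`, `propagation_module`, and `interaction_round` are hypothetical, and both modules are toy stand-ins (painting a small square, copying a mask to every frame) for the learned networks an actual system would use. Only the control flow, one click refining one frame's mask followed by propagation to the other frames, mirrors the description.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask, mask[y][x] in {0, 1}


@dataclass
class Click:
    frame: int      # index of the frame the user clicked on
    y: int
    x: int
    positive: bool  # True: pixel belongs to the object, False: it does not


def interaction_module(mask: Mask, click: Click, radius: int = 1) -> Mask:
    """Toy stand-in: paint a (2*radius+1)-sided square around the click."""
    value = 1 if click.positive else 0
    out = [row[:] for row in mask]
    height, width = len(out), len(out[0])
    for y in range(max(0, click.y - radius), min(height, click.y + radius + 1)):
        for x in range(max(0, click.x - radius), min(width, click.x + radius + 1)):
            out[y][x] = value
    return out


def propagation_module(masks: List[Mask], source: int) -> List[Mask]:
    """Toy stand-in: copy the source-frame mask to every frame."""
    return [[row[:] for row in masks[source]] for _ in masks]


def interaction_round(masks: List[Mask], click: Click) -> List[Mask]:
    """One round: refine the clicked frame, then propagate the result."""
    updated = list(masks)
    updated[click.frame] = interaction_module(masks[click.frame], click)
    return propagation_module(updated, click.frame)


def empty_video(num_frames: int, height: int, width: int) -> List[Mask]:
    """All-background masks for a video of the given size."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


# A positive click creates the mask; a later negative click refines it.
masks = empty_video(num_frames=3, height=5, width=5)
masks = interaction_round(masks, Click(frame=0, y=2, x=2, positive=True))
masks = interaction_round(masks, Click(frame=1, y=1, x=1, positive=False))
```

Because the two modules only communicate through masks, either one could be swapped out without touching the other, which is the practical benefit of the decoupling the abstract refers to.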