Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting a prominent object in a video without manual guidance. In other words, the network must detect the accurate region of the target object in a sequence of RGB frames without prior knowledge. Recent work on UVOS can be divided into two approaches: appearance-based and appearance-motion-based methods. Appearance-based methods exploit inter-frame correlation to capture the target object that commonly appears across a sequence. However, because they compute correlation between randomly paired frames, these methods do not consider the motion of the target object. Appearance-motion-based methods, on the other hand, fuse appearance features from RGB frames with motion features from optical flow. Motion cues provide useful information, since salient objects typically exhibit distinctive motion within a sequence. However, these approaches are limited by their heavy dependency on optical flow. In this paper, we propose a novel framework for UVOS that addresses the aforementioned limitations of both approaches in terms of both time and scale. Temporal Alignment Fusion aligns the saliency information of adjacent frames with the target frame, so that the information of adjacent frames can be leveraged. Scale Alignment Decoder precisely predicts the target object mask by aggregating differently scaled feature maps via continuous mapping with an implicit neural representation. We present experimental results on the public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method. Furthermore, our method outperforms the state-of-the-art methods on DAVIS 2016.
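To illustrate the continuous-mapping idea behind the Scale Alignment Decoder, below is a minimal PyTorch sketch. Everything here (the class name `ScaleAlignmentSketch`, the MLP depth, the bilinear sampling) is an illustrative assumption rather than the paper's implementation: feature maps at different scales are sampled at one shared continuous coordinate grid and decoded per pixel by a coordinate-conditioned MLP, in the spirit of implicit neural representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAlignmentSketch(nn.Module):
    """Minimal sketch: decode multi-scale features via a shared continuous
    coordinate grid and a per-pixel MLP (implicit-neural-representation style).
    All names and design details are illustrative assumptions, not the
    paper's actual Scale Alignment Decoder."""

    def __init__(self, in_channels, hidden=64):
        super().__init__()
        # Per-coordinate MLP over concatenated multi-scale features + (x, y).
        self.mlp = nn.Sequential(
            nn.Linear(sum(in_channels) + 2, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),  # one saliency logit per pixel
        )

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H_i, W_i) maps at different scales.
        B = feats[0].shape[0]
        H, W = out_size
        # Shared continuous coordinates in [-1, 1] at the output resolution.
        ys = torch.linspace(-1.0, 1.0, H, device=feats[0].device)
        xs = torch.linspace(-1.0, 1.0, W, device=feats[0].device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        grid = grid.flip(-1)                     # (H, W, 2) in (x, y) order
        grid = grid.unsqueeze(0).expand(B, H, W, 2)
        # Sample every scale at the same continuous locations, aligning all
        # features to one grid regardless of their native resolutions.
        sampled = [F.grid_sample(f, grid, mode="bilinear", align_corners=False)
                   for f in feats]
        fused = torch.cat(sampled, dim=1).permute(0, 2, 3, 1)  # (B, H, W, sum C_i)
        logits = self.mlp(torch.cat([fused, grid], dim=-1))    # (B, H, W, 1)
        return logits.permute(0, 3, 1, 2)                      # (B, 1, H, W)

# Toy usage: three backbone scales decoded to a single 96x96 mask.
decoder = ScaleAlignmentSketch(in_channels=[256, 128, 64])
feats = [torch.randn(2, 256, 12, 12),
         torch.randn(2, 128, 24, 24),
         torch.randn(2, 64, 48, 48)]
mask_logits = decoder(feats, out_size=(96, 96))  # -> torch.Size([2, 1, 96, 96])
```

Sampling all scales at one shared continuous grid is what makes this fusion resolution-agnostic: the output size becomes a free parameter rather than being tied to any single feature map's native resolution.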