Several unsupervised and self-supervised approaches have been developed in recent years to learn visual features from large-scale unlabeled datasets. Their main drawback, however, is that these methods struggle to recognize features of the same object when it is simply rotated or the camera viewpoint changes. To overcome this limitation while exploiting a useful source of supervision, we leverage video object tracks. Following the intuition that two patches in the same track should have similar representations in the learned feature space, we adopt an unsupervised clustering-based approach and constrain such representations to share the same cluster label, since they likely depict the same object or object part. Experimental results on two downstream tasks across different datasets demonstrate the effectiveness of our Online Deep Clustering with Video Track Consistency (ODCT) approach over prior work that did not leverage temporal information. In addition, we show that exploiting an unsupervised, class-agnostic, yet noisy track generator yields better accuracy than relying on costly and precise track annotations.
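The track-consistency constraint described above can be illustrated with a minimal sketch: patches are first assigned to their nearest cluster centroid, and then all patches belonging to the same track are forced to share a single label via a majority vote. This is only an illustration under simplifying assumptions (hard nearest-centroid assignment, a `track_consistent_labels` helper we name here for exposition); the actual ODCT method operates within an online deep clustering framework.

```python
import numpy as np

def track_consistent_labels(features, centroids, track_ids):
    """Assign cluster labels to patch features, then enforce that all
    patches within the same video track share one label (majority vote).
    This is an illustrative simplification of the track-consistency idea."""
    # Squared Euclidean distance from each patch feature to each centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)  # initial per-patch cluster assignment

    # Enforce track consistency: every patch in a track takes the
    # majority cluster label of that track.
    consistent = labels.copy()
    for t in np.unique(track_ids):
        mask = track_ids == t
        majority = np.bincount(labels[mask]).argmax()
        consistent[mask] = majority
    return consistent

# Toy example: two centroids; track 0 has one noisy patch near cluster 1,
# which the majority vote pulls back to cluster 0.
features = np.array([[0.0, 0.0], [0.5, 0.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
track_ids = np.array([0, 0, 0, 1])
print(track_consistent_labels(features, centroids, track_ids))  # → [0 0 0 1]
```

In a full pipeline these consistent labels would serve as pseudo-labels for training the feature extractor, so that patches of the same tracked object are pulled toward a common representation.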