This paper presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representations and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture the complex video dynamics. In addition, despite the recent progress in video-based contrastive learning, few have attempted to learn a self-supervised, clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that can better reflect the temporally-varying properties of faces in videos, and we also propose a video-centralised self-supervised framework to train the transformer model. We further investigate face clustering in egocentric videos, a fast-emerging area that has not yet been studied in the face clustering literature. To this end, we present and release the first large-scale egocentric video face clustering dataset, named EasyCom-Clustering. We evaluate our proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show that our video-centralised transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.
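To make the contrast between the two aggregation strategies concrete, below is a minimal PyTorch sketch of frame-level average pooling versus transformer-based video-level aggregation with a learnable [CLS]-style readout. This is an illustrative assumption, not the paper's implementation; all module names, dimensions, and hyperparameters are hypothetical.

```python
# Minimal sketch (illustrative, not the authors' implementation):
# aggregating per-frame face features into one video-level embedding,
# first by average pooling, then by self-attention over frames.
import torch
import torch.nn as nn


class AvgPoolAggregator(nn.Module):
    """Baseline: mean of frame features along the temporal axis."""

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        return frame_feats.mean(dim=1)


class TransformerAggregator(nn.Module):
    """Self-attention over frames, read out via a learnable [CLS] token."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4,
                 num_layers: int = 2) -> None:
        super().__init__()
        # Hypothetical [CLS] token prepended to the frame sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, frame_feats], dim=1))
        return out[:, 0]  # video-level embedding at the [CLS] position


if __name__ == "__main__":
    feats = torch.randn(8, 16, 256)  # 8 face tracks, 16 frames each
    print(AvgPoolAggregator()(feats).shape)      # torch.Size([8, 256])
    print(TransformerAggregator()(feats).shape)  # torch.Size([8, 256])
```

Unlike the pooled mean, the attention-based readout can weight frames unequally, which is one way a transformer could capture temporally-varying face appearance within a track.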