Action segmentation refers to inferring the boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic, unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that groups semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph that takes time progression into account is sufficient to form semantically and temporally consistent clusters of frames, where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our approach also outperforms weakly-supervised methods by large margins on four of these datasets. Interestingly, we also achieve better results than many fully-supervised methods that have reported results on these datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH
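To make the core idea concrete, below is a minimal sketch of how a temporally-weighted 1-nearest-neighbor graph over video frames could be built. This is an illustrative assumption, not the authors' implementation (see the linked repository for the actual TW-FINCH code): it assumes per-frame feature embeddings, uses cosine distance scaled by normalized temporal distance, and the function name `tw_1nn_graph` is hypothetical.

```python
import numpy as np

def tw_1nn_graph(features):
    """Sketch: temporally-weighted 1-NN graph over video frames.

    Pairwise feature distance is scaled by normalized temporal distance,
    so each frame's nearest neighbor is both visually similar and close
    in time.

    features: (T, D) array of per-frame embeddings, one row per frame
              in temporal order.
    Returns:  (T,) array giving the index of each frame's 1-NN under
              the temporally weighted distance.
    """
    T = features.shape[0]
    # Cosine-style feature distance via L2-normalized embeddings.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    feat_dist = 1.0 - f @ f.T                                  # (T, T)
    # Normalized temporal distance |i - j| / T.
    idx = np.arange(T)
    time_dist = np.abs(idx[:, None] - idx[None, :]) / T       # (T, T)
    # Temporally weighted distance; a frame is never its own neighbor.
    dist = feat_dist * time_dist
    np.fill_diagonal(dist, np.inf)
    return dist.argmin(axis=1)
```

Under this sketch, the connected components of the (symmetrized) 1-NN graph give an initial partition of the frames into temporally contiguous, visually coherent segments; applying the same linking step recursively over cluster means would yield a hierarchy of coarser partitions, in the spirit of hierarchical first-neighbor clustering.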