Action segmentation refers to inferring the boundaries of semantically consistent visual concepts in a video and is an important requirement for many video understanding tasks. Supervised approaches have achieved encouraging performance on this and related tasks, but they require a large volume of detailed frame-level annotations. We present a fully automatic, unsupervised approach for segmenting the actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that groups semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest-neighbor graph that accounts for time progression is sufficient to form semantically and temporally consistent clusters of frames, where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH
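To make the core idea concrete, below is a minimal sketch of constructing a temporally-weighted 1-nearest-neighbor graph over per-frame features. The exact weighting used by TW-FINCH is defined in the paper and released code; here we assume, for illustration only, that the pairwise feature distance is scaled by the normalized temporal separation between frames, so each frame's 1-NN is biased toward frames that are both visually similar and close in time. The function name and weighting below are hypothetical.

```python
import numpy as np

def temporally_weighted_1nn(features: np.ndarray) -> np.ndarray:
    """Return, for each frame, the index of its temporally-weighted 1-NN.

    features: (T, D) array of per-frame feature vectors.
    """
    T = features.shape[0]
    # Pairwise squared Euclidean distances between frame features: (T, T).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Temporal separation between frames, normalized by video length
    # (assumed weighting, not necessarily the paper's exact formula).
    idx = np.arange(T)
    time_gap = np.abs(idx[:, None] - idx[None, :]) / T
    # Scale feature distance by temporal separation, so temporally close
    # and visually similar frames have the smallest weighted distance.
    weighted = dist * time_gap
    # Exclude self-matches before taking the argmin.
    np.fill_diagonal(weighted, np.inf)
    return np.argmin(weighted, axis=1)
```

Linking each frame to its returned neighbor and taking the connected components of the resulting graph yields an initial set of temporally coherent frame clusters; FINCH-style hierarchical clustering then merges these recursively into fewer, coarser segments.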