Temporal action segmentation in untrimmed videos has recently gained increasing attention. However, annotating action classes and frame-wise boundaries is extremely time-consuming and cost-intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across, videos. Based on the identified clusters, we decode each video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and demonstrates state-of-the-art unsupervised action segmentation results.
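Since the abstract names the three training signals only at a high level, the following is a minimal sketch, not the authors' implementation, of how relative time prediction, feature reconstruction, and a sequence-to-sequence objective could be combined on per-frame features. All module and variable names (`TemporalEmbedding`, `time_head`, the unweighted loss sum) are illustrative assumptions.

```python
# Hypothetical sketch of a temporal embedding network trained with three signals:
# (1) relative time prediction, (2) feature reconstruction, (3) a sequence-to-sequence
# branch that predicts the next embedding from temporal context.
import torch
import torch.nn as nn


class TemporalEmbedding(nn.Module):
    def __init__(self, feat_dim=64, emb_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.decoder = nn.Linear(emb_dim, feat_dim)        # feature reconstruction head
        self.time_head = nn.Linear(emb_dim, 1)             # relative timestamp in [0, 1]
        self.seq2seq = nn.GRU(emb_dim, emb_dim, batch_first=True)  # sequential context

    def forward(self, frames):                 # frames: (T, feat_dim) for one video
        z = self.encoder(frames)               # per-frame embeddings
        recon = self.decoder(z)                # reconstructed input features
        rel_t = torch.sigmoid(self.time_head(z)).squeeze(-1)  # predicted relative time
        ctx, _ = self.seq2seq(z.unsqueeze(0))  # context-dependent embeddings
        return z, recon, rel_t, ctx.squeeze(0)


def training_losses(model, frames):
    # Ground-truth relative timestamps are simply the normalized frame indices.
    T = frames.shape[0]
    target_t = torch.linspace(0.0, 1.0, T)
    z, recon, rel_t, ctx = model(frames)
    loss_time = nn.functional.mse_loss(rel_t, target_t)            # relative time prediction
    loss_recon = nn.functional.mse_loss(recon, frames)             # feature reconstruction
    loss_seq = nn.functional.mse_loss(ctx[:-1], z[1:].detach())    # predict the next embedding
    return loss_time + loss_recon + loss_seq                       # equal weighting assumed
```

In this sketch, clustering (e.g., K-means on the learned embeddings, first within and then across videos) and the decoding of cluster assignments into temporally coherent segments would follow as separate post-processing steps, as described in the abstract.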