In this paper, we propose a novel, fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself, without requiring any training data. Our method is a deep metric learning approach built on a shallow network with a triplet loss operating on similarity distributions, together with a novel triplet selection strategy that effectively models temporal and semantic priors to discover actions in the new representation space. As a result, we recover temporal boundaries in the learned action representations more accurately than existing unsupervised approaches. The proposed method is evaluated on two widely used benchmark datasets for the action segmentation task and achieves competitive performance by applying a generic clustering algorithm to the learned representations.
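To make the temporal prior concrete, the following is a minimal, hypothetical sketch: frames close in time are treated as positives for an anchor frame, distant frames as negatives, and a standard triplet margin loss pulls positives closer than negatives. It uses plain Euclidean distance for simplicity; the paper's actual loss operates on similarity distributions, and all function names here are illustrative assumptions, not the authors' implementation.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two feature vectors (assumption:
    # the paper's loss instead compares similarity distributions).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Hinge loss: push the negative at least `margin` farther from
    # the anchor than the positive.
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

def select_triplets(frames, near=1, far=10):
    # Temporal prior (illustrative): a frame `near` steps away is
    # likely the same action (positive); a frame `far` steps away
    # is likely a different action (negative).
    triplets = []
    for i in range(len(frames)):
        p, n = i + near, i + far
        if n < len(frames):
            triplets.append((frames[i], frames[p], frames[n]))
    return triplets
```

For example, with anchor `[0, 0]`, positive `[0, 1]`, and negative `[0, 3]`, the positive is already more than `margin` closer, so the loss is zero; a harder negative such as `[0, 1.5]` yields a positive loss that drives the embedding update.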