We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works, where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport. In particular, we incorporate a temporal regularization term, which preserves the temporal order of the activity, into the standard optimal transport module for computing pseudo-label cluster assignments. The temporal optimal transport module enables our approach to learn effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, despite having significantly lower memory requirements.
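The idea of adding a temporal regularization term to optimal transport can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian-shaped diagonal prior, the uniform marginals, the KL-style combination of scores and prior, and the names `temporal_prior`, `temporal_ot_assignments`, `rho`, and `sigma` are all assumptions made for illustration. Given frame-to-cluster scores for one mini-batch, it computes soft pseudo-label assignments with Sinkhorn iterations while favoring assignments in which early frames map to early clusters.

```python
import numpy as np


def temporal_prior(n_frames, n_clusters, sigma=0.3):
    """Hypothetical prior that peaks where the normalized frame position
    matches the normalized cluster position (i.e., along the diagonal)."""
    frame_pos = (np.arange(n_frames) + 0.5) / n_frames
    cluster_pos = (np.arange(n_clusters) + 0.5) / n_clusters
    dist = np.abs(frame_pos[:, None] - cluster_pos[None, :])
    prior = np.exp(-dist**2 / (2.0 * sigma**2))
    return prior / prior.sum()


def temporal_ot_assignments(scores, rho=0.1, sigma=0.3, n_iters=50):
    """Soft pseudo-labels for one mini-batch of temporally ordered frames.

    scores: (n_frames, n_clusters) similarities between frame embeddings
            and cluster prototypes (assumed to be given).
    Returns a non-negative (n_frames, n_clusters) matrix that sums to 1.
    """
    n_frames, n_clusters = scores.shape
    # Assumed objective: maximize <Q, scores> - rho * KL(Q || prior),
    # whose solution has the form diag(u) * prior * exp(scores / rho) * diag(v).
    log_kernel = scores / rho + np.log(temporal_prior(n_frames, n_clusters, sigma))
    log_kernel -= log_kernel.max()  # shift for numerical stability
    q = np.exp(log_kernel)

    row_marginal = np.full(n_frames, 1.0 / n_frames)      # each frame has equal mass
    col_marginal = np.full(n_clusters, 1.0 / n_clusters)  # equal-sized clusters
    for _ in range(n_iters):
        # Sinkhorn: alternately rescale rows and columns to match the marginals.
        q *= (row_marginal / q.sum(axis=1))[:, None]
        q *= (col_marginal / q.sum(axis=0))[None, :]
    return q
```

Hard pseudo-labels for the mini-batch can then be read off with `temporal_ot_assignments(scores).argmax(axis=1)`. Because only the current mini-batch of scores is needed, a sketch like this fits the online setting described above; the equal-sized cluster marginal is a simplification and would not suit activities whose steps have very different durations.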