We present a novel approach for unsupervised activity segmentation which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works, where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points in the embedding space, while distant video frames are mapped to points far apart. The combination of these two components yields effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, while requiring significantly less memory.
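To make the temporal optimal transport idea concrete, the sketch below shows one plausible instantiation: standard Sinkhorn-Knopp optimal transport whose kernel is biased by a Gaussian prior peaked along the diagonal, so that earlier frames prefer earlier clusters. This is a minimal illustration, not the paper's exact formulation; the parameters `rho`, `sigma`, `eps`, and the uniform-marginal assumption are ours.

```python
import torch

def temporal_optimal_transport(scores, rho=0.1, sigma=1.0, eps=0.1, n_iters=50):
    """Illustrative sketch: optimal transport with a diagonal temporal prior.

    scores: (N, K) frame-to-cluster similarity logits for one mini-batch video,
            with frames in temporal order.
    Returns a transport plan Q (N, K) matching uniform marginals, with mass
    biased toward the diagonal so cluster assignments respect temporal order.
    """
    N, K = scores.shape
    # 2-D Gaussian prior peaked where normalized frame index i/N is close to
    # normalized cluster index j/K, encoding "frames visit clusters in order".
    i = torch.arange(N, dtype=torch.float32)[:, None] / N
    j = torch.arange(K, dtype=torch.float32)[None, :] / K
    prior = torch.exp(-((i - j) ** 2) / (2 * sigma ** 2))
    prior = prior / prior.sum()

    # Kernel mixes the model's scores (temperature eps) with the temporal
    # prior; rho controls the strength of the temporal regularization.
    kernel = torch.exp(scores / eps) * prior ** rho

    # Sinkhorn-Knopp iterations toward uniform marginals over frames/clusters.
    r = torch.full((N,), 1.0 / N)
    c = torch.full((K,), 1.0 / K)
    u, v = torch.ones(N), torch.ones(K)
    for _ in range(n_iters):
        u = r / (kernel @ v)
        v = c / (kernel.t() @ u)
    return u[:, None] * kernel * v[None, :]
```

Since each row of Q carries total mass 1/N, multiplying Q by N gives per-frame soft pseudo-labels that balance the model's similarity scores against the temporal order of the activity.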
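The temporal coherence loss can likewise be sketched as a simple contrastive objective over frame pairs: frames within a small temporal window are pulled together, while temporally distant frames are pushed beyond a margin. The window size, margin, and margin-based form below are illustrative assumptions, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(embeddings, window=4, margin=1.0):
    """Illustrative sketch of a contrastive temporal coherence loss.

    embeddings: (N, D) frame embeddings in temporal order (assumes N > window).
    Frames within `window` steps are positives (pulled close in the embedding
    space); frames farther apart are negatives (pushed beyond `margin`).
    """
    N = embeddings.shape[0]
    t = torch.arange(N)
    dist = torch.cdist(embeddings, embeddings)            # pairwise L2 distances
    gap = (t[:, None] - t[None, :]).abs()                 # temporal separation
    pos = (gap > 0) & (gap <= window)                     # temporal neighbors
    neg = gap > window                                    # temporally distant
    pos_loss = dist[pos].pow(2).mean()                    # pull neighbors close
    neg_loss = F.relu(margin - dist[neg]).pow(2).mean()   # push distant apart
    return pos_loss + neg_loss
```

Under these assumptions, the two terms directly implement the stated behavior: neighboring frames are mapped to nearby points, and distant frames are mapped to points far apart in the embedding space.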