Temporal action segmentation is the task of classifying each frame of a video with an action label. However, annotating every frame in a large corpus of videos to construct a comprehensive supervised training dataset is prohibitively expensive. Thus, in this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. To do this, we leverage self-supervised video classification approaches to perform unsupervised feature extraction. On top of these features, we develop CAP, a novel co-occurrence action parsing algorithm that not only captures the correlations among sub-actions underlying the structure of activities, but also estimates the temporal trajectories of the sub-actions in an accurate and general way. We evaluate on both classic datasets (Breakfast, 50Salads) and an emerging fine-grained action dataset (FineGym) with more complex activity structures and more similar sub-actions. Results show that our method achieves state-of-the-art performance on all three datasets, with up to 22\% improvement, and can even outperform some weakly-supervised approaches, demonstrating its effectiveness and generalizability.
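To make the two-stage pipeline described above more concrete, the sketch below clusters per-frame self-supervised features into sub-actions and then collects corpus-level precedence statistics to order them. This is only an illustrative stand-in under assumed names (`parse_corpus`, a k-means clustering step, a simple precedence matrix); it is not the paper's CAP algorithm, whose actual formulation is given in the method section.

```python
# Illustrative sketch only: a generic "cluster, then order by co-occurrence" pipeline.
# The clustering step and the precedence statistic here are assumptions, not CAP itself.
import numpy as np
from sklearn.cluster import KMeans

def parse_corpus(features_per_video, n_subactions=8, seed=0):
    """features_per_video: list of (T_i, D) arrays of per-frame self-supervised features."""
    # Pool all frames and assign each one to a sub-action cluster.
    all_feats = np.concatenate(features_per_video, axis=0)
    frame_labels = KMeans(n_clusters=n_subactions, n_init=10,
                          random_state=seed).fit_predict(all_feats)

    # Split the pooled cluster assignments back into one label sequence per video.
    lengths = [f.shape[0] for f in features_per_video]
    per_video = np.split(frame_labels, np.cumsum(lengths)[:-1])

    # Corpus-level co-occurrence/precedence statistics: precedence[a, b] counts the
    # videos in which sub-action a tends to occur before sub-action b.
    precedence = np.zeros((n_subactions, n_subactions))
    for seq in per_video:
        centers = [np.where(seq == c)[0].mean() if (seq == c).any() else np.nan
                   for c in range(n_subactions)]
        for a in range(n_subactions):
            for b in range(n_subactions):
                if a != b and not np.isnan(centers[a]) and not np.isnan(centers[b]):
                    precedence[a, b] += float(centers[a] < centers[b])

    # A canonical temporal ordering: sub-actions that precede many others come first.
    canonical_order = np.argsort(-precedence.sum(axis=1))
    return per_video, precedence, canonical_order
```

In this toy version, `per_video` plays the role of unsupervised frame-level segmentations, while `precedence` and `canonical_order` stand in for the co-occurrence structure and temporal trajectory estimates that the abstract attributes to CAP.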