Self-supervised video representation methods typically focus on the representation of temporal attributes in videos. However, the role of stationary versus non-stationary attributes is less explored: stationary features, which remain similar throughout the video, enable the prediction of video-level action classes, whereas non-stationary features, which represent temporally varying attributes, are more beneficial for downstream tasks that require fine-grained temporal understanding, such as action segmentation. We argue that a single representation capturing both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e., long video sequences and their shorter sub-sequences. Stationary features are shared between the short and long views, while non-stationary features aggregate the short views to match the corresponding long view. To empirically verify our approach, we demonstrate that our stationary features work particularly well on an action recognition downstream task, while our non-stationary features perform better on action segmentation. Furthermore, we analyse the learned representations and find that stationary features capture more temporally stable, static attributes, while non-stationary features encompass more temporally varying ones.
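The following is a minimal sketch (not the authors' released code) of the long/short-view contrastive objective described above, assuming a clip `encoder` whose output is split into a stationary part of width `d_stat` and a non-stationary remainder, and an `aggregator` that pools the non-stationary features of the short views; all of these names and shapes are illustrative assumptions.

```python
# Hypothetical sketch of the stationary / non-stationary decomposition objective.
import torch
import torch.nn.functional as F


def info_nce(query, keys, temperature=0.1):
    """Standard InfoNCE loss: each query should match the key at the same batch index."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


def decomposed_loss(encoder, aggregator, long_clip, short_clips, d_stat):
    """
    long_clip:   (B, C, T_long, H, W) long view
    short_clips: list of (B, C, T_short, H, W) shorter sub-sequences of the long view
    d_stat:      number of feature dimensions treated as stationary (assumed split)
    """
    z_long = encoder(long_clip)                               # (B, D)
    z_shorts = [encoder(clip) for clip in short_clips]        # list of (B, D)

    # Stationary part: each short view should agree with its long view.
    loss_stat = sum(
        info_nce(z[:, :d_stat], z_long[:, :d_stat]) for z in z_shorts
    ) / len(z_shorts)

    # Non-stationary part: the aggregation of the short views should match the long view.
    z_agg = aggregator(
        torch.stack([z[:, d_stat:] for z in z_shorts], dim=1)  # (B, num_short, D - d_stat)
    )
    loss_nonstat = info_nce(z_agg, z_long[:, d_stat:])

    return loss_stat + loss_nonstat
```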