This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive proxy tasks. In the same way that high-level visual information in the world changes smoothly, we believe that nearby frames in a learned representation will benefit from exhibiting similar properties. Using this assumption, we train our TCE model to encode videos such that adjacent frames lie close to each other in the embedding space and distinct videos are separated from one another. Using TCE we learn robust representations from large quantities of unlabeled video data. We thoroughly analyse and evaluate our self-supervised TCE models on the downstream task of video action recognition using multiple challenging benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN pre-trained representations on UCF101. The code and pre-trained models for this paper can be downloaded at: https://github.com/csiro-robotics/TCE
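To make the stated objective concrete, the following is a minimal, illustrative sketch of a temporal-coherence style contrastive loss: embeddings of temporally adjacent frames from the same video are pulled together while embeddings of frames from other videos are pushed apart. This is an assumption-laden simplification, not the paper's exact formulation; the function name `temporal_coherence_loss`, the hinge-with-margin form, and the margin value are all hypothetical.

```python
# Illustrative sketch only (not the paper's exact objective): pull embeddings
# of adjacent frames together, push embeddings from other videos apart.
import torch
import torch.nn.functional as F

def temporal_coherence_loss(anchor, positive, negatives, margin=0.5):
    """anchor, positive: (B, D) embeddings of adjacent frames of the same video.
    negatives: (B, N, D) embeddings of frames drawn from different videos."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity between adjacent frames (should be high).
    pos_sim = (anchor * positive).sum(dim=-1)                # (B,)
    # Cosine similarity to frames from other videos (should be low).
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives)  # (B, N)

    # Hinge-style contrastive loss: penalise negatives whose similarity
    # comes within `margin` of the positive similarity.
    return F.relu(neg_sim - pos_sim.unsqueeze(1) + margin).mean()

# Usage with random tensors standing in for a 2D-CNN encoder's frame embeddings.
if __name__ == "__main__":
    B, N, D = 8, 16, 128
    loss = temporal_coherence_loss(torch.randn(B, D), torch.randn(B, D),
                                   torch.randn(B, N, D))
    print(loss.item())
```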