Recently, pretext-task based methods have been proposed one after another for self-supervised video feature learning, while contrastive learning methods have also achieved strong performance. New methods usually beat previous ones by claiming to capture "better" temporal information. However, the experimental settings differ among these methods, making it hard to conclude which is actually better. A comparison would be far more convincing if each method were pushed as close to its performance limit as possible. In this paper, we start from one pretext-task baseline and explore how far it can go when combined with contrastive learning, data pre-processing, and data augmentation. Through extensive experiments, we find a proper setting that yields large improvements over the baselines, indicating that a joint optimization framework can boost both the pretext task and contrastive learning. We denote this joint optimization framework as Pretext-Contrastive Learning (PCL). Two other pretext-task baselines are used to validate the effectiveness of PCL. Under the same training protocol, we easily outperform current state-of-the-art methods, demonstrating both the effectiveness and the generality of our proposal. PCL can conveniently be treated as a standard training strategy and applied to many other works in self-supervised video feature learning.
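To make the joint optimization idea concrete, the sketch below shows one plausible form of such a combined objective: a pretext-task cross-entropy term plus an InfoNCE-style contrastive term, weighted by a balancing coefficient. This is a minimal illustration under assumptions, not the paper's exact formulation; the weighting hyper-parameter `lam` and the specific contrastive form are hypothetical placeholders.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single sample.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def info_nce(anchor, positive, negatives, tau=0.1):
    # InfoNCE-style contrastive loss: pull the anchor toward its positive
    # and away from the negatives, using cosine similarity / temperature.
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    return cross_entropy(logits, 0)  # the positive sits at index 0

def pcl_loss(pretext_logits, pretext_label,
             anchor, positive, negatives, lam=1.0):
    # Joint objective (assumed form): pretext-task loss plus a
    # lam-weighted contrastive loss, optimized together.
    return (cross_entropy(pretext_logits, pretext_label)
            + lam * info_nce(anchor, positive, negatives))
```

In practice both terms would be computed from the same backbone features within one training step, so gradients from the pretext task and the contrastive objective shape the representation jointly.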