BEVT: BERT Pretraining of Video Transformers

This paper studies BERT pretraining of video transformers. It is a straightforward extension, but one worth studying given the recent success of BERT pretraining of image transformers. We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) the discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks, where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves results comparable to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos relying on temporal dynamics, BEVT outperforms all alternative baselines by clear margins and achieves state-of-the-art performance with 71.4% and 87.2% Top-1 accuracy, respectively.
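
As a rough illustration of the two-stage recipe described above, the following sketch builds a training schedule that runs masked modeling on images alone and then on images and videos jointly. It is not the authors' implementation. The function names, the token counts, the mask ratios, and the choice of uniform random masking are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans, True where a token is hidden from the model.

    Uniform random masking is an assumption; the abstract does not say
    which masking strategy is used.
    """
    num_masked = int(round(num_tokens * mask_ratio))
    masked_positions = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked_positions for i in range(num_tokens)]


def training_schedule(image_steps, joint_steps):
    """Yield (stage, streams) pairs: image-only first, then image and video jointly."""
    for _ in range(image_steps):
        yield ("stage1", ("image",))
    for _ in range(joint_steps):
        yield ("stage2", ("image", "video"))


rng = random.Random(0)

# Hypothetical sizes: a 14x14 patch grid per image, 8 frames per video clip.
image_mask = random_mask(num_tokens=196, mask_ratio=0.4, rng=rng)
video_mask = random_mask(num_tokens=8 * 196, mask_ratio=0.5, rng=rng)

# Hypothetical step counts, kept tiny so the schedule is easy to inspect.
schedule = list(training_schedule(image_steps=2, joint_steps=3))
```

In a real training loop, each schedule entry would select which data streams contribute a masked-prediction loss at that step, so the model sees only image data before it is asked to learn temporal dynamics.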

