Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tedious and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks by arming pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to the best of our knowledge. Code will be available at https://github.com/OpenGVLab/UniFormerV2.
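The following is a minimal sketch, not the official implementation, of the high-level idea described above: a pretrained ViT block is wrapped with a lightweight local relation aggregator (here approximated by a depthwise 3D convolution over the token grid) and a global relation aggregator (here approximated by cross-attention from a learnable video query). All module names, shapes, and hyperparameters below are illustrative assumptions; the actual architecture is in the repository linked above.

```python
# Illustrative sketch only: module names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class LocalRelationAggregator(nn.Module):
    """Hypothetical local aggregator: depthwise 3D conv over (T, H, W) tokens."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.norm = nn.BatchNorm3d(dim)
        self.dwconv = nn.Conv3d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        return x + self.dwconv(self.norm(x))


class GlobalRelationAggregator(nn.Module):
    """Hypothetical global aggregator: a learnable query cross-attends to all tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (B, N, C)
        q = self.query.expand(tokens.size(0), -1, -1)
        kv = self.norm(tokens)
        out, _ = self.attn(q, kv, kv)
        return out  # (B, 1, C) video-level representation


class UniFormerV2BlockSketch(nn.Module):
    """Wrap a pretrained ViT block with local and global aggregators (illustrative)."""

    def __init__(self, vit_block, dim):
        super().__init__()
        self.local = LocalRelationAggregator(dim)
        self.vit_block = vit_block  # image-pretrained attention block, reused as-is
        self.global_agg = GlobalRelationAggregator(dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        x = self.local(x)                                  # local video modeling
        tokens = x.flatten(2).transpose(1, 2)              # (B, T*H*W, C)
        tokens = self.vit_block(tokens)                    # reuse pretrained ViT block
        video_token = self.global_agg(tokens)              # (B, 1, C)
        x = tokens.transpose(1, 2).reshape(B, C, T, H, W)
        return x, video_token


# Usage sketch: a generic transformer layer stands in for a real pretrained ViT block.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = UniFormerV2BlockSketch(block, dim=768)
feats, video_token = model(torch.randn(2, 768, 8, 14, 14))
```

In this sketch the pretrained ViT block is reused unchanged, which mirrors the paradigm of arming open-sourced ViTs with additional relation aggregators rather than pretraining a new backbone from scratch.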