We explore a new perspective on video understanding by casting the video recognition problem as an image recognition task. Our approach rearranges input video frames into super images, which allows an image classifier to be trained directly for action recognition, in exactly the same way as image classification. With such a simple idea, we show that transformer-based image classifiers alone can suffice for action recognition. In particular, our approach demonstrates strong and promising performance against SOTA methods on several public datasets including Kinetics400, Moments in Time, Something-Something V2 (SSV2), Jester, and Diving48. We also experiment with the widely used ResNet image classifiers in computer vision to further validate our idea. The results on both Kinetics400 and SSV2 are comparable to some of the best-performing CNN approaches based on spatio-temporal modeling. Our source code and models are available at https://github.com/IBM/sifar-pytorch.
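To make the core idea concrete, below is a minimal sketch of how sampled video frames could be tiled into a single super image before being fed to an ordinary image classifier. The function name `make_super_image`, the row-major grid layout, and the zero-padding of unused grid cells are illustrative assumptions, not the paper's exact construction.

```python
import math
import torch

def make_super_image(frames: torch.Tensor) -> torch.Tensor:
    """Tile T video frames (T, C, H, W) into one super image (C, g*H, g*W),
    where g = ceil(sqrt(T)). Empty grid cells are zero-padded (an assumption
    for this sketch; the paper may handle leftover cells differently).
    """
    t, c, h, w = frames.shape
    g = math.ceil(math.sqrt(t))
    # Pad with blank frames so the grid is exactly g x g.
    if g * g > t:
        pad = torch.zeros(g * g - t, c, h, w, dtype=frames.dtype)
        frames = torch.cat([frames, pad], dim=0)
    # (g*g, C, H, W) -> (g, g, C, H, W) -> (C, g, H, g, W) -> (C, g*H, g*W)
    grid = frames.view(g, g, c, h, w).permute(2, 0, 3, 1, 4)
    return grid.reshape(c, g * h, g * w)

# Example: 8 sampled 224x224 frames become a 3x3 super image that any
# standard image classifier (e.g., a Swin or ResNet model) can consume.
video = torch.randn(8, 3, 224, 224)
super_image = make_super_image(video)  # shape: (3, 672, 672)
```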