We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal: it composes the input frames into a super image and trains an image classifier to perform action recognition in exactly the same way as classifying an image. We demonstrate the viability of this idea with strong and promising performance on four public datasets, including Kinetics400, Something-Something V2, MiT, and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performing CNN approaches based on spatio-temporal modeling. Our code and models will be made available at https://github.com/IBM/sifar-pytorch.
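The core operation, composing input frames into a super image, amounts to tiling the sampled frames into a spatial grid so that a standard image classifier can consume them. A minimal NumPy sketch of this layout (the function name and grid convention here are illustrative, not the authors' implementation):

```python
import numpy as np

def make_super_image(frames: np.ndarray, grid: tuple) -> np.ndarray:
    """Tile T frames of shape (T, H, W, C) into one (gh*H, gw*W, C) super image,
    filling the grid row by row in temporal order."""
    gh, gw = grid
    t, h, w, c = frames.shape
    assert t == gh * gw, "frame count must fill the grid"
    # (gh, gw, H, W, C) -> (gh, H, gw, W, C) -> (gh*H, gw*W, C)
    return frames.reshape(gh, gw, h, w, c).transpose(0, 2, 1, 3, 4).reshape(gh * h, gw * w, c)

# e.g. 9 sampled frames of 224x224 RGB -> one 672x672 super image
frames = np.zeros((9, 224, 224, 3), dtype=np.uint8)
super_img = make_super_image(frames, (3, 3))
print(super_img.shape)  # (672, 672, 3)
```

The resulting super image is then fed to the image classifier as an ordinary input, with no temporal module in the network.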