Over the past few years, video tasks such as classification, description, summarization, and question answering have received a lot of attention. Current models for these tasks compute an encoding of the video by treating it as a sequence of images and processing every image in the sequence. For longer videos, however, this is very time consuming. In this paper, we focus on the task of video classification and aim to reduce the computational time using the idea of distillation. Specifically, we first train a teacher network which looks at all the frames in a video and computes a representation for it. We then train a student network whose objective is to process only a small fraction of the frames in the video and still produce a representation very close to the one computed by the teacher network. This smaller student network, which involves fewer computations, can then be employed at inference time for video classification. We experiment with the YouTube-8M dataset and show that the proposed student network can reduce inference time by up to 30% with only a very small drop in performance.
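The teacher–student objective described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual architecture: the frame encoder here is a fixed random linear projection (the names `W`, `encode`, and `keep_fraction` are hypothetical), whereas the paper's teacher and student are learned networks. The point is only the shape of the objective: the student sees a sampled fraction of frames and is penalized for deviating from the teacher's full-video representation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shared frame encoder: a fixed 2048 -> 256 linear projection.
W = rng.normal(size=(2048, 256)) / np.sqrt(2048)

def encode(frames):
    """Encode each frame feature and average into one video representation."""
    return (frames @ W).mean(axis=0)

def distillation_loss(frames, keep_fraction=0.3):
    """MSE between the teacher (all frames) and the student (a subset)."""
    n = frames.shape[0]
    k = max(1, int(n * keep_fraction))
    idx = np.sort(rng.choice(n, size=k, replace=False))
    teacher = encode(frames)        # teacher sees every frame
    student = encode(frames[idx])   # student sees only k sampled frames
    return float(np.mean((teacher - student) ** 2))

video = rng.normal(size=(120, 2048))  # 120 frames of 2048-d features
loss = distillation_loss(video)
```

In the actual method the student's parameters would be trained by gradient descent to minimize this loss, so that at inference time only the cheaper student (processing a fraction of the frames) is run.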