To deal with variable-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method, Class Attention in Video Transformer (CavT), which uses a single class-embedding vector to perform end-to-end learning uniformly on both variable-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) that adds multiple video sequences per video to augment the training set. BorS+CavT achieves state-of-the-art MSE not only on the EmotiW-EP dataset (0.0495) but also on the DAiSEE dataset (0.0377). The code and models have been made publicly available at https://github.com/mountainai/cavt.
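To make the class-attention idea concrete, below is a minimal PyTorch sketch, not the authors' released implementation: the module name `ClassAttentionHead`, the embedding dimension, the sigmoid regression head, and the padding-mask handling are all illustrative assumptions. It shows how a single learnable class-embedding vector can attend over per-frame/clip features of any sequence length and be read out as an engagement score.

```python
import torch
import torch.nn as nn
from typing import Optional

class ClassAttentionHead(nn.Module):
    """Minimal class-attention sketch: one learnable class token queries the
    frame/clip embeddings, so a video of any length is summarized into a
    single vector that a regression head maps to an engagement score.
    (Hypothetical module; not the CavT reference implementation.)"""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # The single class-embedding vector described in the abstract.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # assumed scalar engagement target

    def forward(self, tokens: torch.Tensor,
                pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # tokens: (B, T, dim) per-frame/clip features; T may vary per batch
        # pad_mask: (B, T) bool, True where a position is padding
        cls = self.cls_token.expand(tokens.size(0), -1, -1)   # (B, 1, dim)
        out, _ = self.attn(query=cls, key=tokens, value=tokens,
                           key_padding_mask=pad_mask)
        return torch.sigmoid(self.head(out.squeeze(1)))       # (B, 1) in [0, 1]

# Example: two videos padded to a common length of 300 clips.
feats = torch.randn(2, 300, 768)
mask = torch.zeros(2, 300, dtype=torch.bool)
mask[1, 120:] = True  # second video has only 120 valid clips
scores = ClassAttentionHead()(feats, mask)  # -> shape (2, 1)
```

Because only the class token is used as the attention query, the computation and output shape are independent of sequence length, which is what lets one model handle long and short videos uniformly.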