In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model (CLIP) and a video encoder (the SlowFast network). The CLIP embedding provides a fine-grained understanding of the objects relevant to an action, whereas the SlowFast network models temporal information within a video clip spanning a few frames. We show that the features obtained from the two encoders are complementary, and their combination outperforms the baseline on Ego4D for the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.
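As a rough illustration of the fusion described above (not the authors' exact implementation), the complementary features can be combined by concatenating pooled per-frame CLIP embeddings with a SlowFast clip feature before an anticipation head. The module name, feature dimensions, and the linear head below are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn


class VideoPlusCLIPFusion(nn.Module):
    """Hypothetical sketch: fuse frame-level CLIP embeddings with a pooled
    SlowFast clip feature for action anticipation. Dimensions, pooling, and
    the linear head are illustrative assumptions, not the paper's exact model."""

    def __init__(self, clip_dim=512, slowfast_dim=2304, num_actions=100):
        super().__init__()
        # Project the concatenated (CLIP + SlowFast) feature to action logits.
        self.head = nn.Linear(clip_dim + slowfast_dim, num_actions)

    def forward(self, clip_frame_feats, slowfast_feat):
        # clip_frame_feats: (B, T, clip_dim) -- per-frame CLIP image embeddings
        # slowfast_feat:    (B, slowfast_dim) -- pooled SlowFast clip feature
        clip_feat = clip_frame_feats.mean(dim=1)          # temporal average pooling
        fused = torch.cat([clip_feat, slowfast_feat], dim=-1)
        return self.head(fused)                           # (B, num_actions) logits


# Usage with dummy tensors (batch of 2 clips, 8 frames each).
model = VideoPlusCLIPFusion()
clip_feats = torch.randn(2, 8, 512)
sf_feat = torch.randn(2, 2304)
logits = model(clip_feats, sf_feat)
print(logits.shape)  # torch.Size([2, 100])
```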