We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of maintaining the sequential progression of observed actions while still capturing long-range dependencies, both of which are critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads; and it wins first place in the EpicKitchens-100 CVPR'21 challenge.
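To make the two jointly trained objectives concrete, below is a minimal PyTorch sketch of the idea described above: a causally masked transformer runs over per-frame features, and each position is supervised both to classify the next action and to regress the next frame's feature. All module names, hyperparameters, and the use of `nn.TransformerEncoder` here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVTSketch(nn.Module):
    """Hypothetical sketch of the AVT recipe: a causal transformer over
    frame features with a next-action head and a future-feature head."""

    def __init__(self, feat_dim=768, num_actions=2513, num_layers=6, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.backbone_head = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(feat_dim, num_actions)  # next-action logits
        self.feature_head = nn.Linear(feat_dim, feat_dim)    # predicted next-frame feature

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) features from a per-frame encoder.
        T = frame_feats.size(1)
        # Causal mask: position t attends only to frames <= t, preserving
        # the sequential progression of observed actions.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(frame_feats.device)
        h = self.backbone_head(frame_feats, mask=mask)
        return self.action_head(h), self.feature_head(h)


def joint_loss(model, frame_feats, next_action_labels):
    """Joint objective: next-action cross-entropy at every timestep plus
    regression of each successive frame's feature (an assumed MSE form)."""
    logits, pred_feats = model(frame_feats)
    cls = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), next_action_labels.reshape(-1))
    # Each position t predicts the feature of frame t+1.
    reg = F.mse_loss(pred_feats[:, :-1], frame_feats[:, 1:].detach())
    return cls + reg
```

At inference, only the last timestep's action logits would be used to anticipate the upcoming action; the feature-prediction head serves purely as an auxiliary training signal in this sketch.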