Action anticipation involves predicting future actions after observing the initial portion of a video. Typically, the observed video is processed as a whole to obtain a video-level representation of the ongoing activity, which is then used for future prediction. We introduce ANTICIPATR, which performs long-term action anticipation by leveraging segment-level representations learned from individual segments of different activities, in addition to a video-level representation. We propose a two-stage learning approach to train a novel transformer-based model that uses these two types of representations to directly predict a set of future action instances over any given anticipation duration. Results on the Breakfast, 50Salads, Epic-Kitchens-55, and EGTEA Gaze+ datasets demonstrate the effectiveness of our approach.
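To make the set-prediction idea concrete, below is a minimal, hypothetical sketch of a query-based anticipation head: learned queries attend to encoded observed-video features through a transformer decoder, and each query predicts one future action instance (a class label plus start/end offsets within the anticipation duration). All module and variable names here are illustrative assumptions, not the authors' released code or exact architecture.

```python
# Hypothetical sketch: transformer-decoder head that predicts a set of future
# action instances from an encoded observed-video representation.
import torch
import torch.nn as nn


class AnticipationHead(nn.Module):
    def __init__(self, feat_dim=512, num_queries=20, num_classes=48,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Learned queries: each one is responsible for one predicted action instance.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.class_head = nn.Linear(feat_dim, num_classes + 1)  # +1 for "no action"
        self.span_head = nn.Linear(feat_dim, 2)                 # (start, end) in [0, 1]

    def forward(self, video_tokens):
        # video_tokens: (B, T, feat_dim) encoded observed-video features
        batch_size = video_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch_size, -1, -1)
        decoded = self.decoder(queries, video_tokens)   # (B, num_queries, feat_dim)
        class_logits = self.class_head(decoded)
        spans = self.span_head(decoded).sigmoid()       # normalized to the anticipation duration
        return class_logits, spans


# Usage: predict a set of future action instances from encoded observed features.
head = AnticipationHead()
observed = torch.randn(2, 100, 512)      # batch of encoded observed tokens
class_logits, spans = head(observed)
print(class_logits.shape, spans.shape)   # (2, 20, 49) and (2, 20, 2)
```

In this style of set prediction, each query directly emits one candidate future action, so the model can cover an arbitrary anticipation horizon in a single forward pass rather than unrolling frame-by-frame predictions.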