The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike previous autoregressive models, the proposed method learns to predict the whole sequence of future actions via parallel decoding, enabling more accurate and faster inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.
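To make the contrast with autoregressive decoding concrete, the following is a minimal sketch of the parallel-decoding idea described above: a fixed set of learned query tokens cross-attends to globally encoded frame features and is decoded into the entire future action sequence in a single forward pass, with no causal mask and no step-by-step generation. This is an illustrative assumption, not the paper's implementation; all module names, dimensions, and heads (e.g., `ParallelAnticipationDecoder`, `action_head`, `duration_head`, `num_queries=8`) are hypothetical.

```python
import torch
import torch.nn as nn

class ParallelAnticipationDecoder(nn.Module):
    """Illustrative sketch of parallel (non-autoregressive) decoding for
    long-term action anticipation. A fixed set of learned query tokens
    attends to encoded frame features and is decoded into a sequence of
    future actions in one pass, rather than one token at a time."""

    def __init__(self, num_queries=8, d_model=256, num_classes=48,
                 nhead=8, num_layers=2):
        super().__init__()
        # One learnable token per future action segment to predict.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_classes)   # action class logits
        self.duration_head = nn.Linear(d_model, 1)           # relative segment duration

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, d_model), e.g. produced by a
        # transformer encoder with global self-attention over all frames.
        b = frame_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # No causal mask: every query attends to all frames and to every
        # other query, so the whole future sequence is predicted in parallel.
        h = self.decoder(q, frame_features)
        return self.action_head(h), self.duration_head(h).squeeze(-1)

# Usage: encode 2 clips of 100 frames each, then anticipate 8 future segments.
decoder = ParallelAnticipationDecoder()
feats = torch.randn(2, 100, 256)
logits, durations = decoder(feats)
print(logits.shape, durations.shape)  # torch.Size([2, 8, 48]) torch.Size([2, 8])
```

Because the queries are decoded jointly rather than conditioned on previously generated tokens, inference cost does not grow with the length of the anticipated sequence, which is the speed advantage over autoregressive models noted in the abstract.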