To anticipate how a human will act in the future, it is essential to understand the human's intention, since it guides the person towards a certain goal. In this paper, we propose a hierarchical architecture that assumes a sequence of human actions (low-level) can be derived from the human intention (high-level). Based on this, we address the Long-Term Action Anticipation (LTA) task in egocentric videos. Our framework first extracts two levels of human information from the N observed videos of human actions through a Hierarchical Multi-task MLP Mixer (H3M). Then, we constrain the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates K stable predictions of the next Z=20 actions that the observed human might perform. By leveraging human intention as high-level information, we claim that our model anticipates more time-consistent actions in the long term, thus improving the results over baseline methods in the EGO4D Challenge. This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges by providing more plausible anticipated sequences and improving the anticipation of nouns and overall actions. The code is available at https://github.com/Evm7/ego4dlta-icvae.
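To give a concrete picture of the two-stage design, the sketch below wires a simplified H3M-style multi-task head to an intention-conditioned sampler that produces K candidate futures of Z=20 actions. It is a minimal illustration under assumptions, not the released implementation: the module names, class counts (verb, noun, and intention vocabularies), mean pooling, and the GRU decoder are placeholders chosen for brevity; the actual H3M and I-CVAE architectures are those in the repository linked above.

```python
# Hypothetical sketch of the H3M -> I-CVAE pipeline (placeholder sizes and layers).
import torch
import torch.nn as nn


class H3M(nn.Module):
    """Sketch of the Hierarchical Multi-task MLP Mixer: from N observed clip features,
    predict low-level actions (verb, noun) per clip and one high-level intention."""

    def __init__(self, feat_dim=768, hidden=512, n_verbs=100, n_nouns=400, n_intentions=50):
        super().__init__()
        self.mixer = nn.Sequential(  # stand-in for the MLP-Mixer backbone
            nn.LayerNorm(feat_dim), nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.verb_head = nn.Linear(hidden, n_verbs)             # low-level task
        self.noun_head = nn.Linear(hidden, n_nouns)             # low-level task
        self.intention_head = nn.Linear(hidden, n_intentions)   # high-level task

    def forward(self, clip_feats):                      # clip_feats: (B, N, feat_dim)
        h = self.mixer(clip_feats)                      # (B, N, hidden)
        verbs = self.verb_head(h)                       # per-clip verb logits
        nouns = self.noun_head(h)                       # per-clip noun logits
        intention = self.intention_head(h.mean(dim=1))  # one intention per observed video
        return verbs, nouns, intention


class ICVAE(nn.Module):
    """Sketch of the Intention-Conditioned VAE sampler: draw K stochastic futures of
    Z actions, conditioned on the observed actions and the predicted intention."""

    def __init__(self, hidden=512, n_verbs=100, n_nouns=400, n_intentions=50, z_future=20, latent=64):
        super().__init__()
        self.z_future = z_future
        self.latent = latent
        self.cond = nn.Linear(n_verbs + n_nouns + n_intentions, hidden)
        self.decoder = nn.GRU(latent + hidden, hidden, batch_first=True)
        self.verb_out = nn.Linear(hidden, n_verbs)
        self.noun_out = nn.Linear(hidden, n_nouns)

    def sample(self, verbs, nouns, intention, k=5):
        # Condition on pooled observed-action logits plus the intention logits.
        c = self.cond(torch.cat([verbs.mean(1), nouns.mean(1), intention], dim=-1))  # (B, hidden)
        preds = []
        for _ in range(k):                              # K stochastic futures
            z = torch.randn(c.size(0), self.z_future, self.latent)
            inp = torch.cat([z, c.unsqueeze(1).expand(-1, self.z_future, -1)], dim=-1)
            h, _ = self.decoder(inp)                    # (B, Z, hidden)
            preds.append((self.verb_out(h).argmax(-1), self.noun_out(h).argmax(-1)))
        return preds                                    # K (verb, noun) sequences of length Z


if __name__ == "__main__":
    feats = torch.randn(2, 8, 768)                      # B=2 videos, N=8 observed clips
    h3m, icvae = H3M(), ICVAE()
    verbs, nouns, intention = h3m(feats)
    futures = icvae.sample(verbs, nouns, intention, k=5)
    print(len(futures), futures[0][0].shape)            # 5 candidate futures, each (2, 20)
```

In this sketch the intention acts purely as conditioning for the decoder, which is what makes the K sampled futures agree on the overall goal while varying in the individual actions; the real I-CVAE additionally learns an approximate posterior over the latent variable during training.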