To anticipate how a human will act in the future, it is essential to understand the human intention, since it guides the human towards a certain goal. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) can be derived from the human intention (high-level). Based on this, we address the Long-Term Action Anticipation (LTA) task in egocentric videos. Our framework first extracts two levels of human information from the N observed video clips of human actions through a Hierarchical Multi-task MLP Mixer (H3M). Then, we condition the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE), which generates K stable predictions of the next Z=20 actions that the observed human might perform. By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term, thus improving over baseline methods in the EGO4D Challenge. This work ranked first in the EGO4D LTA Challenge by providing more plausible anticipated sequences and improving the anticipation of nouns and overall actions. The code is available at https://github.com/Evm7/ego4dlta-icvae.