Egocentric action anticipation is the task of predicting, from egocentric video, a future action that the camera wearer will perform. Although the task has recently attracted the attention of the research community, current approaches assume that the input videos are "trimmed", meaning that a short video sequence is sampled a fixed time before the beginning of the action. We argue that, despite recent advances in the field, trimmed action anticipation has limited applicability in real-world scenarios, where systems must handle "untrimmed" video inputs and cannot assume that the exact moment at which the action will begin is known at test time. To overcome these limitations, we propose an untrimmed action anticipation task which, similarly to temporal action detection, assumes that the input video is untrimmed at test time, while still requiring predictions to be made before the actions actually take place. We design an evaluation procedure for methods addressing this novel task and compare several baselines on the EPIC-KITCHENS-100 dataset. Experiments show that the performance of current models designed for trimmed action anticipation is very limited and that more research on this task is required.
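The distinction between the two settings can be illustrated with a minimal sketch. It is not the evaluation procedure proposed in the paper; the function names, the 1-second anticipation time, the 2-second clip length, and the 0.5-second stride are all hypothetical values chosen for illustration. In the trimmed setting the observed clip is positioned relative to a known action start, whereas in the untrimmed setting predictions are requested at regular timestamps with no knowledge of when actions begin.

```python
from typing import List, Tuple


def trimmed_observation(
    action_start: float,
    anticipation_time: float = 1.0,  # hypothetical value, in seconds
    clip_length: float = 2.0,        # hypothetical value, in seconds
) -> Tuple[float, float]:
    """Trimmed setting: the observed clip ends a fixed time before a
    KNOWN action start. Returns (clip_start, clip_end) in seconds."""
    clip_end = max(0.0, action_start - anticipation_time)
    clip_start = max(0.0, clip_end - clip_length)
    return clip_start, clip_end


def untrimmed_prediction_times(
    video_duration: float,
    stride: float = 0.5,             # hypothetical value, in seconds
) -> List[float]:
    """Untrimmed setting: no action start is known, so predictions are
    requested at regular timestamps across the whole video."""
    n_steps = int(video_duration / stride)
    return [round((i + 1) * stride, 6) for i in range(n_steps)]


if __name__ == "__main__":
    # Trimmed: requires the ground-truth start time (10.0 s) as input.
    print(trimmed_observation(10.0))
    # Untrimmed: only the video itself (here, its duration) is available.
    print(untrimmed_prediction_times(2.0))
```

Note that `trimmed_observation` takes the action start time as an argument, which is exactly the information that the untrimmed setting does not make available at test time.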