This paper addresses the problem of anticipating the future location of the next active object in a given egocentric video clip, i.e., the object with which contact might occur, before any action takes place. The problem is considerably challenging, as we aim to estimate the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called ``time to contact'' (TTC) segment. Many methods have been proposed to anticipate a person's action based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first-person motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next active object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+, and Ego4D, and provide annotations for the first two. Our approach outperforms relevant baseline methods. We also conduct ablation studies to assess the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.