Spatio-temporal Human-Object Interaction (ST-HOI) detection aims to detect HOIs in videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a wide variety of objects, e.g., holding and touching dozens of household items while cleaning. However, existing video benchmarks for whole body-object interaction usually cover only a limited set of object classes. Here, we introduce a new benchmark built on AVA: Discovering Interacted Objects (DIO), which includes 51 interactions and 1,000+ objects. Accordingly, we propose an ST-HOI learning task that expects vision systems to track human actors, detect interactions, and simultaneously discover the interacted objects. Even though today's detectors and trackers excel at object detection and tracking, they perform unsatisfactorily when localizing the diverse or unseen objects in DIO. This reveals a profound limitation of current vision systems and poses a great challenge. We therefore explore how to leverage spatio-temporal cues for object discovery and devise a Hierarchical Probe Network (HPN) that discovers interacted objects using hierarchical spatio-temporal human and context cues. In extensive experiments, HPN demonstrates impressive performance. Data and code are available at https://github.com/DirtyHarryLYL/HAKE-AVA.