Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, even humans can hardly infer temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where neighboring frames play an essential role. Nevertheless, conventional HOI methods operating only on static images have been used to predict such temporal-related interactions, which amounts to guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline fails on video-based HOI detection due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which exploits temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. Finally, we construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
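To make the "correctly-localized visual features" idea concrete, the following is a minimal sketch (not the authors' code), assuming a PyTorch/torchvision pipeline: region features are RoIAligned from each frame using that frame's own box from the tracked trajectory, rather than reusing a single keyframe box across all frames. The function name `trajectory_features`, the 7x7 pooling size, and the 1/16 spatial scale are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (assumed, not the authors' implementation) of extracting
# trajectory-aligned visual features: each box is pooled from the feature
# map of the frame it actually belongs to.
import torch
from torchvision.ops import roi_align


def trajectory_features(frame_feats: torch.Tensor,
                        traj_boxes: torch.Tensor,
                        spatial_scale: float = 1.0 / 16) -> torch.Tensor:
    """frame_feats: [T, C, H, W] backbone features for T frames.
    traj_boxes:  [T, 4] (x1, y1, x2, y2) box of one tracked instance per frame.
    Returns a [T, C] temporal sequence of RoI features along the trajectory."""
    T = frame_feats.shape[0]
    # Prefix each box with its frame index so roi_align pools from the
    # correct frame's feature map: [T, 5] rows of (frame_idx, x1, y1, x2, y2).
    idx = torch.arange(T, dtype=traj_boxes.dtype).unsqueeze(1)
    rois = torch.cat([idx, traj_boxes], dim=1)
    pooled = roi_align(frame_feats, rois, output_size=(7, 7),
                       spatial_scale=spatial_scale, aligned=True)  # [T, C, 7, 7]
    return pooled.mean(dim=(2, 3))  # spatially pooled -> [T, C]


# Toy usage: 8 frames, 256-channel feature maps, a dummy static box.
feats = torch.randn(8, 256, 14, 14)
boxes = torch.tensor([[32.0, 32.0, 128.0, 128.0]]).repeat(8, 1)
print(trajectory_features(feats, boxes).shape)  # torch.Size([8, 256])
```

The resulting [T, C] sequence can then be fed to a temporal module together with trajectory and pose cues; the key point of the sketch is pooling from per-frame feature maps, which avoids the feature-inconsistency issue that arises when one keyframe's box is applied to every frame.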