We introduce the task of weakly supervised learning for detecting human-object interactions in videos. The task poses unique challenges: a system is given neither the types of human-object interactions present in a video nor the actual spatiotemporal locations of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that jointly associates spatiotemporal regions in a video with an action and object vocabulary, and that encourages temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we present a dataset of over 6.5k videos with human-object interaction annotations that were semi-automatically curated from sentence captions associated with the videos. On this dataset, our model improves over weakly supervised baselines adapted to our task.
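To make the flavor of such an objective concrete, the sketch below shows one plausible instantiation, not the paper's actual formulation: a multiple-instance-style contrastive term that associates candidate regions with an action/object word vocabulary using only video-level labels, plus a temporal-continuity term that keeps the appearance features of a tracked region close across consecutive frames. All tensor shapes, function names, the temperature value, and the max-over-regions pooling are illustrative assumptions.

```python
# Minimal sketch, NOT the paper's formulation: shapes, names, and the
# max-over-regions pooling choice are assumptions for illustration.
import torch
import torch.nn.functional as F

def weak_association_loss(region_feats, word_embs, video_label_ids, tau=0.07):
    """Associate N candidate regions with a V-word action/object vocabulary
    using only video-level labels: each labeled word is scored by its
    best-matching region (max over regions), then trained with BCE."""
    region_feats = F.normalize(region_feats, dim=-1)   # (N, d)
    word_embs = F.normalize(word_embs, dim=-1)         # (V, d)
    sim = region_feats @ word_embs.t() / tau           # (N, V) similarities
    video_logits = sim.max(dim=0).values               # (V,) best region per word
    target = torch.zeros_like(video_logits)
    target[video_label_ids] = 1.0                      # words present in caption
    return F.binary_cross_entropy_with_logits(video_logits, target)

def temporal_continuity_loss(feats_t, feats_t1):
    """Self-supervised term: features of the same tracked region in
    consecutive frames should stay close (smooth appearance over time)."""
    return F.mse_loss(F.normalize(feats_t, dim=-1),
                      F.normalize(feats_t1, dim=-1))

# Toy usage with random features (hypothetical dimensions); in practice the
# region and word features would come from trainable video/text encoders.
N, V, d = 8, 20, 256
loss = (weak_association_loss(torch.randn(N, d), torch.randn(V, d),
                              torch.tensor([3, 7]))
        + 0.5 * temporal_continuity_loss(torch.randn(N, d),
                                         torch.randn(N, d)))
print(loss.item())
```

The max-pooling step is what makes the supervision weak: no region is ever labeled directly, and the model must discover which region supports each caption-derived word on its own.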