In videos that contain actions performed unintentionally, the agents depicted do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Inculcating this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action through a teleological lens. To validate the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [15]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotation. Since annotating temporal segments is expensive, we propose a weakly supervised algorithm that localizes the goal-directed as well as unintentional temporal regions of a video using only video-level labels. In particular, we employ an attention-based strategy that identifies the temporal regions contributing most to a classification task. Meanwhile, our proposed overlap regularization encourages the model to attend to distinct portions of the video when inferring the goal-directed and unintentional activities, while preserving their temporal ordering. Extensive quantitative experiments verify the validity of our localization method. We further conduct a video captioning experiment that demonstrates the proposed localization module does indeed assist teleological action understanding.
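To make the described approach concrete, below is a minimal sketch of an attention-based weak localizer with an overlap regularizer, assuming a PyTorch setting with precomputed per-segment video features. The module names (TwoBranchLocalizer, overlap_regularizer), the feature dimension, and the exact form of the regularizer are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, not the paper's actual code: two attention branches pool
# segment features for goal-directed vs. unintentional classification, and a
# regularizer (assumed form) discourages overlapping attention while
# encouraging the goal-directed region to precede the unintentional one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchLocalizer(nn.Module):
    """Per-segment attention for a goal-directed branch and an unintentional
    branch; classification is done on attention-pooled features."""
    def __init__(self, feat_dim=1024, n_goal=44, n_unint=30):
        super().__init__()
        self.att_goal = nn.Linear(feat_dim, 1)    # attention logits per segment
        self.att_unint = nn.Linear(feat_dim, 1)
        self.cls_goal = nn.Linear(feat_dim, n_goal)
        self.cls_unint = nn.Linear(feat_dim, n_unint)

    def forward(self, x):  # x: (B, T, D) segment features
        a_g = torch.softmax(self.att_goal(x).squeeze(-1), dim=1)   # (B, T)
        a_u = torch.softmax(self.att_unint(x).squeeze(-1), dim=1)  # (B, T)
        pooled_g = torch.einsum('bt,btd->bd', a_g, x)  # attention-weighted pooling
        pooled_u = torch.einsum('bt,btd->bd', a_u, x)
        return self.cls_goal(pooled_g), self.cls_unint(pooled_u), a_g, a_u

def overlap_regularizer(a_g, a_u):
    """Penalize overlap between the two attention maps and enforce that the
    goal-directed mass precedes the unintentional mass (assumed form)."""
    overlap = (a_g * a_u).sum(dim=1).mean()  # shared attention mass
    T = a_g.size(1)
    t = torch.arange(T, dtype=a_g.dtype, device=a_g.device)
    center_g = (a_g * t).sum(dim=1)          # attention-weighted temporal center
    center_u = (a_u * t).sum(dim=1)
    order = F.relu(center_g - center_u).mean()  # zero when goal comes first
    return overlap + order
```

During training, the classification losses from both branches would be combined with the regularizer, e.g. `loss = ce_goal + ce_unint + lam * overlap_regularizer(a_g, a_u)`; the attention weights then serve directly as the weakly supervised localization signal, since only video-level labels are used.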