Learning an egocentric action recognition model from video data is challenging because of distractors in the background (e.g., irrelevant objects). Integrating object information into an action model is therefore beneficial. Existing methods often use a generic object detector to identify and represent the objects in the scene, but two important issues remain. First, learning a good object representation still requires high-quality object class annotations for the target domain (dataset). Second, previous methods are tightly coupled with existing action models and must be retrained jointly with the object representation, which makes integration costly and inflexible. To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector. Instead of augmenting object regions individually as in conventional self-supervised learning, we view the action process as a means of natural data transformations with unique spatio-temporal continuity and exploit the inherent relationships among per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100 and EGTEA, show that our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
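The abstract does not specify the SOS training objective, so the following is only an illustrative sketch of the general idea of learning over per-video object sets, not the authors' loss. It assumes a generic supervised-contrastive-style objective in which object regions detected in the same video act as positives for one another, in place of individually augmented views. The names `set_contrastive_loss`, `l2_normalize`, `video_ids`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def set_contrastive_loss(embeddings, video_ids, temperature=0.1):
    """Assumed set-level contrastive loss (illustrative, not the paper's).

    embeddings: list of feature vectors, one per detected object region.
    video_ids:  list of the same length; regions sharing an id come from
                the same video and are treated as positives.
    Returns the mean loss over regions that have at least one positive.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other region.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if video_ids[j] == video_ids[i]]
        if not positives:
            continue  # a region alone in its video contributes nothing
        # Numerically stable log-sum-exp over all other regions.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a same-video region.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy features: regions 0 and 1 are similar, regions 2 and 3 are similar.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(set_contrastive_loss(emb, [0, 0, 1, 1]))  # sets match the features
    print(set_contrastive_loss(emb, [0, 1, 0, 1]))  # sets cut across them
```

In this toy setup the first call should give the lower loss, because the per-video sets agree with the feature similarity. A real pipeline would compute the embeddings with a trainable encoder over the detected hand-object contact regions and minimize such a loss by gradient descent.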