In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach that effectively utilizes the unlabeled data. Video action detection requires both action class prediction and spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency, and 2) gradient smoothness. Both constraints exploit the temporal continuity of action in videos and are found to be effective for utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we show the effectiveness of the proposed approach for video object segmentation on Youtube-VOS, which demonstrates its generalization capability. The proposed approach achieves competitive performance using merely 20% of the annotations on UCF101-24 when compared with recent fully supervised methods. On UCF101-24, it improves the score by +8.9% and +11% at 0.5 f-mAP and v-mAP respectively, compared to the supervised approach.
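The three consistency terms named above can be sketched as simple penalties on model outputs. This is an illustrative sketch only, not the paper's exact formulation: the function names, the use of mean-squared error, and the second-order form of the smoothness term are assumptions made for the example.

```python
import numpy as np

def classification_consistency(p1, p2):
    # MSE between class predictions for a clip and its augmented view
    # p1, p2: (num_classes,) probability vectors
    return float(np.mean((p1 - p2) ** 2))

def temporal_coherency(masks):
    # Penalize abrupt frame-to-frame changes in predicted localization maps.
    # masks: (T, H, W) per-frame action localization maps
    diffs = masks[1:] - masks[:-1]
    return float(np.mean(diffs ** 2))

def gradient_smoothness(masks):
    # Penalize large second-order temporal gradients of the localization maps,
    # encouraging smooth motion of the predicted action region over time.
    first = np.gradient(masks, axis=0)
    second = np.gradient(first, axis=0)
    return float(np.mean(np.abs(second)))
```

For a static, unchanging prediction all three penalties are zero, so only temporally inconsistent predictions on unlabeled clips are penalized.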