Given a text description, Temporal Language Grounding (TLG) aims to localize the temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires comprehensive understanding of both sentence semantics and video content. Previous works either tackle this task in a fully-supervised setting, which requires a large amount of temporal annotations, or in a weakly-supervised setting, which usually cannot achieve satisfactory performance. Since manual annotations are expensive, we cope with limited annotations by tackling TLG in a semi-supervised way that incorporates self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal Language Grounding (S^4TLG). S^4TLG consists of two parts: (1) a pseudo label generation module that adaptively produces instant pseudo labels for unlabeled samples based on predictions from a teacher model; (2) a self-supervised feature learning module with inter-modal and intra-modal contrastive losses that learns video feature representations under the constraints of video content consistency and video-text alignment. We conduct extensive experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets. The results demonstrate that our proposed S^4TLG achieves competitive performance compared to fully-supervised state-of-the-art methods while requiring only a small portion of temporal annotations.
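To make the inter-modal contrastive objective concrete, below is a minimal NumPy sketch of a symmetric InfoNCE-style loss between video and text embeddings, where matched (video, text) pairs in a batch are positives and all other pairings are negatives. This is an illustrative sketch, not the paper's exact formulation; the function names, the temperature value, and the embedding shapes are all assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def inter_modal_nce(video_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE: the matched (video_i, text_i) pair is the
    positive; every other pairing in the batch is a negative.
    video_emb, text_emb: (B, D) arrays of paired embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each video segment's embedding toward its paired sentence embedding and pushes it away from the other sentences in the batch, which is one common way to realize the video-text alignment constraint mentioned above.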