Given a text description, Temporal Language Grounding (TLG) aims to localize the temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently challenging, as it requires a comprehensive understanding of both the video content and the text sentence. Previous works tackle this task either in a fully supervised setting, which requires a large amount of manual annotation, or in a weakly supervised setting, which cannot achieve satisfactory performance. To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework. STLG consists of two parts: (1) a pseudo label generation module that produces adaptive instant pseudo labels for unlabeled data based on predictions from a teacher model; (2) a self-supervised feature learning module with two sequential perturbations, i.e., time lagging and time scaling, that improves the video representation through inter-modal and intra-modal contrastive learning. We conduct experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets, and the results demonstrate that the proposed STLG framework achieves performance competitive with fully supervised state-of-the-art methods while using only a small portion of the temporal annotations.
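The abstract names the two sequential perturbations but does not define them. As a rough illustration only, the sketch below shows one plausible reading: time lagging shifts a frame-feature sequence along the time axis, and time scaling resamples it at a different speed. The function names `time_lag` and `time_scale`, the edge-padding behavior, and the nearest-frame resampling are all assumptions made for this example, not the authors' implementation.

```python
import math

def time_lag(frames, shift):
    """Hypothetical time lagging: delay the sequence by `shift` steps.

    `frames` is a list of per-frame features. Positions that would fall
    outside the sequence repeat the first or last frame (an assumption).
    """
    T = len(frames)
    return [frames[min(max(i - shift, 0), T - 1)] for i in range(T)]

def time_scale(frames, factor):
    """Hypothetical time scaling: play the sequence at `factor` times speed.

    Output position i shows the original frame at floor(i * factor),
    clamped to the last frame, so the length stays the same.
    Assumes factor > 0.
    """
    T = len(frames)
    return [frames[min(math.floor(i * factor), T - 1)] for i in range(T)]

# Toy sequence where each "feature" is just its frame index.
frames = list(range(10))
lagged = time_lag(frames, 2)       # the same content appears 2 steps later
faster = time_scale(frames, 2.0)   # the content is compressed in time
slower = time_scale(frames, 0.5)   # the content is stretched in time
```

Under this reading, a perturbed view keeps the semantics of the original clip while changing where and how fast they occur, which is the kind of paired view that contrastive learning can compare against the original.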