Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing methods have achieved remarkable progress on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages the semantic information of the entire query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module that extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance for composing activities in the video via a video-based semantic aggregation module. Finally, a foreground attention branch filters out redundant background activities and refines the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.
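To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of the three components named above (language semantic mining, video-based semantic aggregation, and foreground attention). The module name, feature dimensions, the use of learnable semantic prototypes, and the thresholding step are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Hedged sketch of the DSCNet idea, assuming PyTorch and clip-level video features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSemanticClusteringSketch(nn.Module):
    def __init__(self, dim=512, num_clusters=16):
        super().__init__()
        # Language semantic mining (simplified): semantic prototypes that stand in
        # for the implicit semantics mined from the whole query set.
        self.prototypes = nn.Parameter(torch.randn(num_clusters, dim))
        # Foreground attention branch: scores each video clip as foreground.
        self.fg_head = nn.Linear(dim, 1)

    def forward(self, video_feats):
        # video_feats: (T, dim) clip-level features of one video.
        # Video-based semantic aggregation: compose the possible activity by
        # attending each clip to the mined language semantics.
        sim = F.softmax(video_feats @ self.prototypes.t(), dim=-1)    # (T, K)
        composed = sim @ self.prototypes                               # (T, dim)
        # Foreground attention filters redundant background activities.
        fg_score = torch.sigmoid(self.fg_head(composed)).squeeze(-1)  # (T,)
        return composed, fg_score

# Usage: score 64 clips of a video and keep high-scoring clips as segment candidates.
model = DeepSemanticClusteringSketch()
feats = torch.randn(64, 512)
composed, fg = model(feats)
candidate_clips = (fg > 0.5).nonzero().squeeze(-1)
```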