Video Moment Retrieval from Text Queries via Single Frame Annotation

Video moment retrieval aims to find the start and end timestamps of a moment (a segment of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly because the annotator must watch the whole moment. Weakly supervised methods rely only on paired videos and queries, but their performance is relatively poor. In this paper, we look more closely at the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only a single random frame, which we refer to as a "glance", within the temporal boundary that a fully supervised annotation would mark. We argue this is beneficial because, compared with weak supervision, it adds trivial annotation cost while offering more potential for performance. Under the glance annotation setting, we propose a contrastive learning method named Video moment retrieval via Glance Annotation (ViGA). ViGA cuts the input video into clips and contrasts clips with queries, assigning glance-guided, Gaussian-distributed weights to all clips. Our extensive experiments indicate that ViGA outperforms state-of-the-art weakly supervised methods by a large margin and is even comparable to fully supervised methods in some cases.
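
The glance-guided weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `clip_index_for_timestamp` and `glance_gaussian_weights`, the use of equal-length clips, the unnormalized Gaussian centered on the glance clip, and the width parameter `sigma` are all hypothetical and may differ from what ViGA actually does.

```python
import math


def clip_index_for_timestamp(timestamp, video_duration, num_clips):
    """Map a glance timestamp (in seconds) to the index of the clip containing it.

    Assumes the video is cut into `num_clips` equal-length clips (an assumption,
    not something stated in the abstract).
    """
    clip_length = video_duration / num_clips
    # Clamp so that a timestamp equal to the video duration falls in the last clip.
    return min(int(timestamp // clip_length), num_clips - 1)


def glance_gaussian_weights(num_clips, glance_index, sigma):
    """Return one weight per clip, peaking at the glance clip.

    Hypothetical weighting: an unnormalized Gaussian over clip indices,
    so the glance clip gets weight 1.0 and weights decay with distance.
    """
    if num_clips <= 0:
        raise ValueError("num_clips must be positive")
    if not 0 <= glance_index < num_clips:
        raise ValueError("glance_index must be a valid clip index")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - glance_index) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_clips)
    ]


# Example: a 40-second video cut into 8 clips, with a glance at 12.5 seconds.
glance_clip = clip_index_for_timestamp(12.5, 40.0, 8)
weights = glance_gaussian_weights(8, glance_clip, sigma=1.5)
```

In a contrastive setup, weights of this kind could scale how strongly each clip counts as a positive for the paired query, so clips near the glance contribute most and distant clips contribute little. How the weights enter the loss is not specified in the abstract.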
