Temporal sentence grounding aims to localize, in an untrimmed video, the timestamps of the event described by a natural-language query. The existing fully-supervised setting achieves strong performance but incurs high annotation costs, while the weakly-supervised setting uses cheap labels but performs poorly. To pursue high performance at lower annotation cost, this paper introduces an intermediate partially-supervised setting, in which only short-clip or even single-frame labels are available during training. To take full advantage of partial labels, we propose a novel quadruple constraint pipeline that comprehensively shapes event-query aligned representations, covering intra- and inter-sample as well as uni- and multi-modal relations. The intra- and inter-sample constraints raise intra-cluster compactness and inter-cluster separability, while the uni- and multi-modal constraints enable event-background separation and event-query gathering. To further improve performance through explicit grounding optimization, we introduce a partial-full union framework that bridges the pipeline with an additional fully-supervised branch, so as to benefit from its grounding gains while remaining robust to partial annotations. Extensive experiments and ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision and the superior performance of our method.
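The event-query gathering and event-background separation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so the code below swaps in a generic InfoNCE-style contrastive loss as a stand-in. The function names, the `temperature` value, the toy features, and the way background frames are chosen are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def event_query_loss(frame_feats, query_feat, event_idx, background_idx,
                     temperature=0.1):
    """InfoNCE-style loss (assumed stand-in, not the paper's formula).

    Frames in event_idx are pulled toward the query feature; frames in
    background_idx are pushed away from it.
    """
    pos = [math.exp(cosine(frame_feats[i], query_feat) / temperature)
           for i in event_idx]
    neg = [math.exp(cosine(frame_feats[i], query_feat) / temperature)
           for i in background_idx]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


# Toy features (hypothetical). Frame 1 plays the role of a single-frame label;
# frames 0 and 3 are assumed to be background.
frames = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.2], [-1.0, 0.0]]
query = [1.0, 0.0]

aligned = event_query_loss(frames, query, event_idx=[1], background_idx=[0, 3])
misaligned = event_query_loss(frames, query, event_idx=[3], background_idx=[0, 1])
```

In this toy setup the loss is lower when the labeled frame is the one whose feature points in the same direction as the query, which is the behavior an event-query alignment constraint is meant to encourage. How background frames are selected under partial labels is left open here, since the abstract does not specify it.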