Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as new boundaries. 2) Reasoning bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
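The boundary bias described above can be made concrete with a minimal sketch (not the authors' code; the video length, sample count, and annotated boundaries below are hypothetical): uniform sparse sampling keeps only a fixed number of frames, so the annotated start/end frames are usually dropped and the nearest sampled frames silently become shifted "new boundaries".

```python
# Illustrative sketch of the boundary bias targeted by SSRN (hypothetical
# numbers; not the authors' implementation).

def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Indices of a fixed number of uniformly spaced sampled frames."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def snapped_boundary(sampled: list[int], frame: int) -> int:
    """The sampled frame closest to an annotated boundary frame."""
    return min(sampled, key=lambda s: abs(s - frame))

total_frames = 1000                          # untrimmed video (hypothetical)
sampled = uniform_sample(total_frames, 32)   # sparse sampling keeps 32 frames
start, end = 137, 486                        # annotated boundaries (hypothetical)

# The annotated frames are generally absent from the sparse set, so the model
# only ever sees the shifted boundaries below during frame-query reasoning.
new_start = snapped_boundary(sampled, start)  # 140, shifted by 3 frames
new_end = snapped_boundary(sampled, end)      # 484, shifted by 2 frames
```

With coarser sampling the shift grows, which is why SSRN generates additional contextual frames around the sampled ones and uses soft boundary labels rather than treating the shifted frames as hard ground truth.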