Current methods for video activity localisation over time implicitly assume that the activity temporal boundaries labelled for model training are determinate and precise. However, in unscripted natural videos, different activities mostly transition smoothly into one another, so it is intrinsically ambiguous to label precisely when an activity starts and ends. Such uncertainties in temporal labelling are currently ignored in model training, resulting in models learning mis-matched video-text correlations that generalise poorly at test time. In this work, we solve this problem by introducing Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries, towards modelling universally interpretable video-text correlation that is tolerant to the underlying temporal uncertainties in pre-fixed annotations. Specifically, we construct elastic boundaries adaptively by mining frame-wise temporal endpoints that maximise the alignment between video segments and query sentences. To enable both more accurate matching (segment content attention) and more robust localisation (segment elastic boundaries), we optimise the selection of frame-wise endpoints subject to segment-wise contents through a novel Guided Attention mechanism. Extensive experiments on three video activity localisation benchmarks demonstrate compellingly EMB's advantages over existing methods that do not model such uncertainty.
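To make the endpoint-mining idea concrete, here is a minimal toy sketch, not the authors' EMB implementation: the function name, the `tolerance` parameter, and the inner-versus-outer scoring objective are all illustrative assumptions. Given per-frame video-text alignment scores produced by some matching model, it relaxes a labelled segment into elastic boundaries by searching nearby frame-wise endpoints that best separate segment content from background.

```python
import numpy as np

def elastic_boundaries(frame_scores, anchor_start, anchor_end, tolerance=5):
    """Toy illustration of elastic boundary mining (hypothetical helper).

    frame_scores: 1-D array of per-frame alignment scores with the query,
                  assumed to come from some video-text matching model.
    anchor_*:     the (possibly uncertain) labelled segment endpoints.
    tolerance:    how far (in frames) an endpoint may drift from the label.
    Returns the (start, end) span maximising a simple contrast objective:
    mean score inside the segment minus mean score outside it.
    """
    n = len(frame_scores)
    total = frame_scores.sum()
    best_score, best_span = -np.inf, (anchor_start, anchor_end)
    # Search frame-wise endpoints within `tolerance` of the labelled ones.
    for s in range(max(0, anchor_start - tolerance),
                   min(n - 1, anchor_start + tolerance) + 1):
        for e in range(max(s + 1, anchor_end - tolerance),
                       min(n, anchor_end + tolerance) + 1):
            inner = frame_scores[s:e].mean()
            n_out = n - (e - s)
            outer = (total - frame_scores[s:e].sum()) / n_out if n_out else 0.0
            score = inner - outer
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span

# Example: the high-scoring frames 2-4 are recovered even though the
# label (1, 4) drifted from them.
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.85, 0.3, 0.1])
print(elastic_boundaries(scores, anchor_start=1, anchor_end=4, tolerance=2))  # (2, 5)
```

In the paper the endpoint selection is instead optimised jointly with segment-wise content via the proposed Guided Attention mechanism; this exhaustive local search only illustrates the elastic-boundary objective in isolation.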