Weakly-supervised Temporal Action Localization (WTAL) aims to detect the temporal intervals of action instances when only video-level action labels are available during training. The key challenge is to distinguish the segments of interest from the background segments, which are unlabelled even at the video level. While previous works treat the background as a "curse", we consider it a "blessing". Specifically, we first use causal analysis to show that common localization errors are due to an unobserved and un-enumerated confounder that resides ubiquitously in visual recognition. Then, we propose a Temporal Smoothing PCA-based (TS-PCA) deconfounder, which exploits the unlabelled background to model an observed substitute for the confounder and thereby remove the confounding effect. Note that the proposed deconfounder is model-agnostic and non-intrusive, and hence can be applied to any WTAL method without modification. Through extensive experiments on four state-of-the-art WTAL methods, we show that the deconfounder improves all of them on two public datasets: THUMOS-14 and ActivityNet-1.3.
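
To make the idea of a PCA-based substitute confounder with temporal smoothing more concrete, the following is a minimal sketch and not the method as specified in the paper. It assumes that each video is given as a (T, D) array of per-segment features, that the substitute is taken to be the top principal-component projections of those features, and that smoothing is a simple moving average along time. The function name `ts_pca_substitute` and the parameters `n_components` and `window` are hypothetical.

```python
import numpy as np


def ts_pca_substitute(features, n_components=2, window=3):
    """Illustrative sketch only (assumed formulation, not the paper's).

    features: (T, D) array of per-segment features for one video,
              covering both action and unlabelled background segments.
    Returns a (T, k) array of temporally smoothed principal-component
    projections, used here as a stand-in for a substitute confounder.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)

    # PCA via SVD: the rows of vt are the principal directions and
    # u * s gives the projection of each segment onto them.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, s.shape[0])
    latent = u[:, :k] * s[:k]  # shape (T, k)

    # Assumed smoothing: moving average over time with edge padding,
    # so the output keeps one row per segment.
    pad = window // 2
    padded = np.pad(latent, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(k)],
        axis=1,
    )
    return smoothed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_features = rng.normal(size=(20, 8))  # 20 segments, 8-dim features
    z = ts_pca_substitute(video_features, n_components=2, window=3)
    print(z.shape)
```

Because the sketch only consumes segment features and returns an extra per-segment signal, it leaves the underlying localization model untouched, which is the sense in which such a component can be non-intrusive.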