Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use both appearance and motion features, but they do not exploit them properly, relying instead on simple concatenation or score-level fusion. In this work, we argue that features extracted from a pretrained extractor, e.g., I3D, are not specific to the WS-TAL task, so feature re-calibration is needed to reduce task-irrelevant information redundancy. We therefore propose a cross-modal consensus network (CO2-Net) to tackle this problem. CO2-Net introduces two identical cross-modal consensus modules (CCMs), each of which uses a cross-modal attention mechanism to filter out task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we treat the attention weights derived from each CCM as pseudo targets for the attention weights derived from the other CCM, which keeps the predictions of the two CCMs consistent and forms a mutual learning scheme. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method and achieve state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
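To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the CO2-Net implementation: it assumes no learned layers, uses temporal averaging as the "global information" of the main modality, uses the raw auxiliary feature at each snippet as the "local information", and combines them with a sigmoid into channel-wise weights. The function names (`recalibrate`, `snippet_attention`, `mutual_loss`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(main, aux):
    """Re-weight the channels of `main` using `aux` (assumed form, no learned layers).

    main, aux: lists of T snippets, each a list of D channel values.
    Returns (recalibrated_main, channel_weights), both T x D.
    """
    T, D = len(main), len(main[0])
    # Global information of the main modality: temporal average per channel.
    global_ctx = [sum(main[t][d] for t in range(T)) / T for d in range(D)]
    # Channel-wise weights in (0, 1): global context of the main modality
    # combined with the local auxiliary feature at each snippet.
    weights = [[sigmoid(global_ctx[d] * aux[t][d]) for d in range(D)]
               for t in range(T)]
    # Suppress channels that receive a low weight.
    recal = [[weights[t][d] * main[t][d] for d in range(D)] for t in range(T)]
    return recal, weights


def snippet_attention(feats):
    # Hypothetical stand-in for an attention unit: one weight per snippet.
    return [sigmoid(sum(f) / len(f)) for f in feats]


def mutual_loss(att, pseudo_target):
    # Mean squared difference; during training the pseudo target would be
    # treated as a constant (no gradient), which plain Python cannot express.
    return sum((a - b) ** 2 for a, b in zip(att, pseudo_target)) / len(att)


# Toy features: T = 2 snippets, D = 3 channels per modality.
rgb = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
flow = [[1.0, 0.5, -2.0], [0.0, 1.0, 0.5]]

# Two identical modules with the roles of the modalities swapped.
rgb_recal, rgb_w = recalibrate(rgb, flow)    # RGB is main, flow is auxiliary
flow_recal, flow_w = recalibrate(flow, rgb)  # flow is main, RGB is auxiliary

att_rgb = snippet_attention(rgb_recal)
att_flow = snippet_attention(flow_recal)

# Each stream's attention serves as the pseudo target of the other.
loss = mutual_loss(att_rgb, att_flow) + mutual_loss(att_flow, att_rgb)
```

In a real model the global and local terms would pass through learned layers and the loss would be minimized by gradient descent; the sketch only shows how the data flows between the two modalities.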