Video activity localization aims to understand the semantic content of long untrimmed videos and to retrieve actions of interest. The retrieved action, together with its start and end locations, can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary locations of activities is highly challenging because temporal activities are continuous in time and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of an event is subjective, which may confuse the model. To alleviate this boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. We then attempt to reverse this process by boundary denoising, which allows the localizer to predict activities with precise boundaries and leads to faster convergence. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, with far fewer predictions than other current methods.
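The training step described above, generating noisy action spans from the ground truth with a controlled noise scale and then learning to recover the true boundaries, can be illustrated with a minimal sketch. The abstract does not specify the noise model, so the Gaussian jitter on span center and width used here is an assumption, and the names `make_noisy_spans`, `span_l1`, and `temporal_iou` are hypothetical helpers rather than part of DenoiseLoc.

```python
import random


def make_noisy_spans(gt_spans, noise_scale, copies, duration, seed=0):
    """Return (noisy_span, gt_span) pairs usable as denoising training targets.

    Assumption: Gaussian jitter on the span center and width, scaled by
    `noise_scale`. The noise model actually used in the paper may differ.
    """
    rng = random.Random(seed)
    eps = 1e-3 * duration  # minimum span length kept after clamping
    pairs = []
    for start, end in gt_spans:
        center, width = (start + end) / 2, end - start
        for _ in range(copies):
            # Shift the center and rescale the width relative to the true span.
            c = center + rng.gauss(0.0, noise_scale) * width
            w = width * max(1e-3, 1.0 + rng.gauss(0.0, noise_scale))
            # Clamp to the video extent while keeping start < end.
            s = min(max(c - w / 2, 0.0), duration - eps)
            e = min(max(c + w / 2, s + eps), duration)
            pairs.append(((s, e), (start, end)))
    return pairs


def span_l1(pred, target):
    """L1 distance between two (start, end) spans, a simple denoising loss."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])


def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = [(10.0, 20.0), (30.0, 45.0)]  # ground-truth spans in seconds
    for noisy, target in make_noisy_spans(gt, 0.1, copies=2, duration=60.0):
        print(noisy, target, span_l1(noisy, target), temporal_iou(noisy, target))
```

In this sketch a localizer would receive each noisy span as input and be trained to output the paired ground-truth span. A larger `noise_scale` lowers the expected overlap between the two and makes the denoising task harder.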