Temporal action detection (TAD) with end-to-end training often suffers from a huge demand for computing resources due to long video durations. In this work, we propose an efficient temporal action detector (ETAD) that can be trained directly from video frames with extremely low GPU memory consumption. Our main idea is to minimize and balance the heavy computation of features and gradients in each training iteration. We propose to sequentially forward the snippet frames through the video encoder and to backpropagate only a small, necessary portion of the gradients to update the encoder. To further reduce the computational redundancy in training, we dynamically sample only a small subset of proposals during training. Moreover, we study various sampling strategies and ratios for both the encoder and the detector. ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. On ActivityNet-1.3, ETAD reaches 38.25% average mAP after 18 hours of end-to-end training with only 1.3 GB of memory consumption per video. Our code will be publicly released.
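The following is a minimal sketch, not the authors' implementation, of the memory-saving idea described above: snippets are encoded sequentially, and gradients flow back into the video encoder only through a small, randomly sampled subset of snippets. The `TinyEncoder`/`TinyDetector` modules, the input shapes, and the 10% backpropagation ratio are illustrative assumptions rather than details taken from the paper.

```python
# Sketch (assumed, not the authors' code): sequential snippet encoding with
# partial gradient backpropagation into the encoder.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a video snippet encoder (e.g. a 3D CNN backbone)."""
    def __init__(self, in_dim=3 * 8 * 32 * 32, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, feat_dim))

    def forward(self, snippet):            # snippet: (1, C, T, H, W)
        return self.net(snippet)           # -> (1, feat_dim)

class TinyDetector(nn.Module):
    """Stand-in for a TAD head operating on the snippet feature sequence."""
    def __init__(self, feat_dim=256, num_classes=200):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):              # feats: (num_snippets, feat_dim)
        return self.head(feats)

def train_step(encoder, detector, snippets, labels, backprop_ratio=0.1):
    """One iteration: sequential forward + partial gradient backpropagation."""
    num_snippets = snippets.shape[0]
    # Randomly pick the small subset of snippets that will carry gradients
    # back into the encoder (the sampling strategy and ratio are free choices).
    num_bp = max(1, int(num_snippets * backprop_ratio))
    bp_idx = set(torch.randperm(num_snippets)[:num_bp].tolist())

    feats = []
    for i in range(num_snippets):          # sequential forward, one snippet at a time
        snippet = snippets[i : i + 1]
        if i in bp_idx:
            feats.append(encoder(snippet))     # keep the graph for this snippet only
        else:
            with torch.no_grad():              # no stored activations -> low memory
                feats.append(encoder(snippet))
    feats = torch.cat(feats, dim=0)        # (num_snippets, feat_dim)

    logits = detector(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                        # encoder grads come only from bp_idx
    return loss.item()

# Toy usage: 20 snippets of shape (3, 8, 32, 32) with per-snippet labels.
encoder, detector = TinyEncoder(), TinyDetector()
snippets = torch.randn(20, 3, 8, 32, 32)
labels = torch.randint(0, 200, (20,))
print(train_step(encoder, detector, snippets, labels))
```

The same sampling idea extends to the detector side: scoring only a dynamically sampled subset of proposals per iteration instead of the full proposal set.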