Untrimmed video understanding tasks such as temporal action detection (TAD) often demand enormous computing resources. Because of long video durations and limited GPU memory, most action detectors can only operate on pre-extracted features rather than the original videos, and they still require substantial computation to achieve high detection performance. To alleviate this heavy computation problem in TAD, we first propose an efficient action detector with detector proposal sampling, based on the observation that performance saturates at a small number of proposals. The detector incorporates several important techniques, such as LSTM-boosted temporal aggregation and cascaded proposal refinement, to achieve high detection quality at low computational cost. To enable joint optimization of this action detector and the feature encoder, we also propose encoder gradient sampling, which selectively back-propagates through video snippets and greatly reduces GPU memory consumption. With the two sampling strategies and the effective detector, we build a unified framework for efficient end-to-end temporal action detection (ETAD), making real-world untrimmed video understanding tractable. ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3. Notably, on ActivityNet-1.3, it reaches 37.78% average mAP while requiring only 6 minutes of training time and 1.23 GB of memory when trained on pre-extracted features. With end-to-end training, it reduces the GPU memory footprint by more than 70% compared with traditional end-to-end methods while achieving even higher performance (38.21% average mAP). The code is available at https://github.com/sming256/ETAD.
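The two sampling strategies can be illustrated with a minimal sketch. It is not the ETAD implementation: the abstract does not say how proposals or snippets are chosen, so uniform random sampling is assumed here, and the function names `sample_proposals` and `sample_gradient_snippets` and the `ratio` parameter are hypothetical.

```python
import random


def sample_proposals(num_snippets, ratio, seed=0):
    """Keep a fraction of all (start, end) candidate proposals.

    Assumption: proposals are all snippet-index pairs with start < end,
    and the subset is drawn uniformly at random.
    """
    rng = random.Random(seed)
    candidates = [
        (start, end)
        for start in range(num_snippets)
        for end in range(start + 1, num_snippets + 1)
    ]
    keep = max(1, int(len(candidates) * ratio))
    return sorted(rng.sample(candidates, keep))


def sample_gradient_snippets(num_snippets, ratio, seed=0):
    """Pick the snippet indices that would receive encoder gradients.

    Assumption: snippets are chosen uniformly at random; the remaining
    snippets would be encoded in the forward pass only.
    """
    rng = random.Random(seed)
    keep = max(1, int(num_snippets * ratio))
    return sorted(rng.sample(range(num_snippets), keep))


if __name__ == "__main__":
    proposals = sample_proposals(num_snippets=8, ratio=0.25)
    grad_snippets = sample_gradient_snippets(num_snippets=8, ratio=0.25)
    print(len(proposals), "proposals kept:", proposals)
    print("snippets with gradients:", grad_snippets)
```

In an actual training loop, the snippets outside the selected set would typically be encoded without building a computation graph (for example, under `torch.no_grad()` in PyTorch), which is where the memory saving would come from.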