Temporal action localization (TAL) requires long-form reasoning to predict actions of varying lengths and complex content. Given limited GPU memory, training TAL end-to-end (i.e., from videos to predictions) on such long-form videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, which consequently limits localization performance. In this work, to unlock the potential of TAL networks, we propose Re2TAL, a novel end-to-end method that rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output, so that the bulky intermediate activations can be cleared from memory during training. Instead of designing a single type of reversible module, we propose a network rewiring mechanism that transforms any module with a residual connection into a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks can easily be obtained from existing and even future model designs, and (2) the reversible models require much less training effort because they reuse the pretrained parameters of their original non-reversible counterparts. Re2TAL reaches 37.01% average mAP, a new state-of-the-art record on ActivityNet-v1.3, and 64.9% mAP at tIoU=0.5 on THUMOS-14, without using optical flow.
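The invertibility that enables clearing activations can be illustrated with a minimal numerical sketch of a RevNet-style coupling. Note this is a simplified illustration, not the paper's exact rewiring: the residual branch `f` below is a hypothetical stand-in for a pretrained sub-module, and the two-stream split is one common coupling form.

```python
import numpy as np

def f(x):
    # Hypothetical stand-in for a pretrained residual branch
    # (its parameters would be reused unchanged by the rewiring).
    return 0.5 * np.tanh(x)

def reversible_forward(x1, x2):
    # Coupled forward pass over two activation streams.
    y1 = x2
    y2 = x1 + f(x2)
    return y1, y2

def reversible_inverse(y1, y2):
    # Exactly recover the inputs from the outputs, so intermediate
    # activations need not be stored for the backward pass.
    x2 = y1
    x1 = y2 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because the inverse is exact (a subtraction of the same `f` output), activations can be recomputed on the fly during backpropagation instead of being cached, which is what makes end-to-end training on long videos feasible.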