Temporal Action Localization (TAL) methods typically operate on feature sequences from a frozen snippet encoder pretrained on the Trimmed Action Classification (TAC) task, resulting in a task discrepancy problem. While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end fine-tuning, both approaches demand substantial memory and computation. In this work, we introduce the Soft-Landing (SoLa) strategy, an efficient yet effective framework that bridges the transferability gap between the pretrained encoder and downstream tasks by incorporating a lightweight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module: it learns through inter-frame Similarity Matching, which uses the frame interval as its supervisory signal and thus eliminates the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency.
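The Similarity Matching idea can be illustrated with a minimal sketch: a lightweight module transforms frozen snippet features, and pairwise similarities of the outputs are matched to a target that depends only on the frame interval. All names, the linear stand-in for the SoLa module, and the exponential-decay target below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen snippet features: T snippets, D-dim each.
T, D = 8, 16
frozen_feats = rng.normal(size=(T, D))

# Stand-in for a lightweight "SoLa module" (here just a linear map).
W = rng.normal(size=(D, D)) * 0.1
feats = frozen_feats @ W

# L2-normalize and compute the pairwise cosine-similarity matrix.
z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim = z @ z.T

# Supervisory signal from frame intervals alone (no temporal labels):
# snippets that are closer in time get a higher target similarity.
idx = np.arange(T)
interval = np.abs(idx[:, None] - idx[None, :])
target = np.exp(-interval / 2.0)  # assumed decay shape

# Unsupervised training objective: match similarities to the target.
loss = np.mean((sim - target) ** 2)
```

In practice the module's parameters (here `W`) would be optimized to minimize this loss while the encoder stays frozen, which is what keeps the memory and compute footprint small.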