Existing temporal action localization (TAL) works rely on a large number of training videos with exhaustive segment-level annotation, preventing them from scaling to new classes. As a solution to this problem, few-shot TAL (FS-TAL) aims to adapt a model to a new class represented by as few as a single video. Existing FS-TAL methods assume trimmed training videos for new classes. However, this setting is not only unnatural, since actions are typically captured in untrimmed videos, but it also ignores background video segments containing vital contextual cues for foreground action segmentation. In this work, we first propose a new FS-TAL setting that uses untrimmed training videos. Further, we propose a novel FS-TAL model that maximizes knowledge transfer from the training classes while enabling the model to dynamically adapt to both the new class and each video of that class simultaneously. This is achieved by introducing a query-adaptive Transformer in the model. Extensive experiments on two action localization benchmarks demonstrate that our method significantly outperforms all state-of-the-art alternatives in both single-domain and cross-domain scenarios. The source code is available at https://github.com/sauradip/fewshotQAT
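To make the core idea of query-video adaptation concrete, below is a minimal, hypothetical PyTorch sketch of a cross-attention block in the spirit of a query-adaptive Transformer: features of the untrimmed query video attend to few-shot support-video features, so the representation is conditioned on both the new class and the specific query video. All names, dimensions, and design choices here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class QueryAdaptiveAttention(nn.Module):
    """Illustrative sketch (not the paper's model): snippet features of an
    untrimmed query video cross-attend to support-video features, adapting
    the query representation to the few-shot class."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, T_q, dim) snippet features of the untrimmed query video
        # support_feats: (B, T_s, dim) snippet features of the few-shot support video(s)
        adapted, _ = self.attn(query_feats, support_feats, support_feats)
        # Residual connection + layer norm, Transformer-style
        return self.norm(query_feats + adapted)

# Toy usage: one query video of 128 snippets, one support video of 64 snippets.
block = QueryAdaptiveAttention()
q = torch.randn(1, 128, 256)
s = torch.randn(1, 64, 256)
out = block(q, s)  # (1, 128, 256) query features conditioned on the support class
```

Because attention is computed per query video at inference time, such a block adapts to each individual video of the new class without any weight updates, which is one plausible way to realize the "dynamic adaptation" described above.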