Online action detection aims to accurately predict the action in the current frame from long historical observations, while also demanding real-time inference on streaming video. In this paper, we advocate a novel and efficient principle for online action detection: in each window, update only the latest and oldest historical representations and reuse the intermediate ones, which have already been computed. Based on this principle, we introduce a window-based cascade Transformer with a circular historical queue, which conducts multi-stage attention and cascade refinement on each window. We also explore the association between online action detection and its offline counterpart, action segmentation, which we use as an auxiliary task. We find that this extra supervision helps discriminative history clustering and acts as feature augmentation for better training of the classifier and the cascade refinement. Our proposed method achieves state-of-the-art performance on three challenging datasets: THUMOS'14, TVSeries, and HDD. Code will be made available after acceptance.
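The reuse principle described in the abstract can be illustrated with a fixed-size ring buffer. The following is a minimal sketch and not the authors' implementation: the names `CircularHistoryQueue`, `toy_encode`, and `encode_calls` are hypothetical, the encoder is a stand-in for whatever per-frame feature extractor is used, and the multi-stage attention and cascade refinement are not modeled. It only shows that each new frame costs one encoder call, with the newest representation overwriting the oldest slot and the intermediate slots left untouched.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def toy_encode(frame: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a per-frame encoder: returns the frame mean."""
    return [sum(frame) / len(frame)]


class CircularHistoryQueue:
    """Fixed-size ring buffer of per-frame representations (illustrative only)."""

    def __init__(self, capacity: int, encode: Callable[[Sequence[float]], Vector]):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.encode = encode
        self.slots: List[Vector] = []
        self.head = 0          # index of the oldest entry once the buffer is full
        self.encode_calls = 0  # counts how often the encoder actually runs

    def push(self, frame: Sequence[float]) -> None:
        """Encode only the incoming frame; intermediate slots are reused as-is."""
        rep = self.encode(frame)
        self.encode_calls += 1
        if len(self.slots) < self.capacity:
            self.slots.append(rep)
        else:
            # Overwrite the oldest slot with the newest representation,
            # then advance the head so it points at the new oldest entry.
            self.slots[self.head] = rep
            self.head = (self.head + 1) % self.capacity

    def window(self) -> List[Vector]:
        """Return the stored representations ordered from oldest to newest."""
        return self.slots[self.head:] + self.slots[:self.head]


# Usage: stream five single-value "frames" through a window of size three.
queue = CircularHistoryQueue(capacity=3, encode=toy_encode)
for t in range(5):
    queue.push([float(t)])
current_window = queue.window()
```

In this sketch the cost per incoming frame is constant in the window length, which is the property the abstract's principle relies on; any attention over `current_window` would be a separate step.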