Traditional video action detectors typically adopt a two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and cannot capture context information outside the bounding box. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and thus suffer from inferior performance or slow convergence. In this paper, we propose a new one-stage sparse action detector, termed STMixer. STMixer is based on two core designs. First, we present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative features from the entire spatiotemporal domain. Second, we devise a dual-branch feature mixing module, which allows our STMixer to dynamically attend to and mix video features along the spatial and temporal dimensions respectively for better feature decoding. Coupling these two designs with a video backbone yields an efficient end-to-end action detector. Without bells and whistles, our STMixer obtains state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.
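The two core designs can be illustrated with a minimal sketch. This is not the paper's actual STMixer implementation; all shapes, weight matrices, and variable names below are hypothetical, and the sampling is simplified to nearest-neighbor lookup. The sketch only shows the data flow: each query predicts free-form (t, y, x) sampling locations over the whole spatiotemporal volume (rather than RoI-aligned features inside a box), and query-conditioned dynamic weights then mix the sampled features along two separate axes.

```python
import numpy as np

# Hypothetical shapes, for illustration only.
T, H, W, C = 8, 16, 16, 64   # spatiotemporal feature volume from a video backbone
N, P = 4, 32                  # N queries, each sampling P points
D = 32                        # query embedding dimension

rng = np.random.default_rng(0)
features = rng.standard_normal((T, H, W, C))
queries = rng.standard_normal((N, D))

# --- Adaptive feature sampling (sketch) ---
# Each query linearly predicts P normalized (t, y, x) locations, so sampling
# can reach anywhere in the volume, not just inside an actor box.
W_loc = rng.standard_normal((D, P * 3)) * 0.01
loc = 1.0 / (1.0 + np.exp(-(queries @ W_loc)))           # squash to (0, 1)
loc = loc.reshape(N, P, 3) * np.array([T - 1, H - 1, W - 1])
idx = np.round(loc).astype(int)                           # nearest-neighbor sampling
sampled = features[idx[..., 0], idx[..., 1], idx[..., 2]]  # (N, P, C)

# --- Dual-branch dynamic mixing (sketch) ---
# Query-conditioned weights mix the sampled features along the channel axis
# and along the sampled-point axis, standing in for the spatial and temporal
# mixing branches described in the abstract.
W_ch = rng.standard_normal((D, C * C)) * 0.01
W_pt = rng.standard_normal((D, P * P)) * 0.01
M_ch = (queries @ W_ch).reshape(N, C, C)                  # per-query channel mixer
M_pt = (queries @ W_pt).reshape(N, P, P)                  # per-query point mixer
mixed = np.einsum('npc,ncd->npd', sampled, M_ch)          # mix across channels
mixed = np.einsum('npc,npq->nqc', mixed, M_pt)            # mix across points
print(mixed.shape)  # (4, 32, 64)
```

The decoded per-query features (`mixed`) would then feed classification and localization heads; in the real detector, sampling and mixing are learned end-to-end with the backbone.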