This work aims to advance temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems when applied directly to TAD: insufficient exploration of inter-query relations in the decoder, inadequate classification training due to a limited number of training samples, and unreliable classification scores at inference. To address these problems, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves state-of-the-art performance on THUMOS14 with much lower computational costs than previous methods. In addition, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/React.
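To make the idea of attention "guided by relations among queries" concrete, the following is a minimal sketch of self-attention restricted by a boolean relation mask. It is not the ReAct implementation: the function name `relational_attention`, the mask convention, and the pass-through behavior for queries with no related partner are all assumptions made for illustration, and the abstract does not specify how relations between queries are defined, so the mask is simply taken as an input.

```python
import math


def relational_attention(queries, relation_mask):
    """Self-attention over action queries, restricted by a relation mask.

    queries: list of N feature vectors (lists of floats), each of length d.
    relation_mask: N x N booleans; relation_mask[i][j] is True when query i
        may attend to query j. How relations are computed is NOT specified
        here (hypothetical input supplied by the caller).
    Returns a list of N updated feature vectors.
    """
    n = len(queries)
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for i in range(n):
        # Indices of the queries that query i is allowed to attend to.
        allowed = [j for j in range(n) if relation_mask[i][j]]
        if not allowed:
            # Assumption: a query with no related partner is left unchanged.
            outputs.append(list(queries[i]))
            continue
        # Scaled dot-product scores against the allowed queries only.
        scores = [
            scale * sum(a * b for a, b in zip(queries[i], queries[j]))
            for j in allowed
        ]
        # Numerically stable softmax over the allowed set.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the allowed queries' features.
        outputs.append([
            sum(w * queries[j][k] for w, j in zip(weights, allowed))
            for k in range(d)
        ])
    return outputs
```

In this sketch, the mask alone decides which queries exchange information; a query related to exactly one other query simply copies that query's features, which makes the effect of the mask easy to inspect on small inputs.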