Temporal action proposal generation is an important and challenging task in video understanding, which aims to detect all temporal segments containing action instances of interest. Existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary-matching strategies. This paper presents a simple, end-to-end learnable framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-like architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original Transformer detection framework (DETR). First, to deal with the slowness prior in videos, we replace the original Transformer encoder with a boundary-attentive module to better capture temporal information. Second, due to ambiguous temporal boundaries and relatively sparse annotations, we present a relaxed matching loss to relieve the strict criterion of assigning a single prediction to each ground-truth instance. Finally, we devise a three-branch head to further improve proposal confidence estimation by explicitly predicting proposal completeness. Extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both temporal action proposal generation and temporal action detection. Moreover, thanks to its simple design, RTD-Net requires no non-maximum suppression post-processing and is thus more efficient than previous proposal generation methods. The code will be available at \url{https://github.com/MCG-NJU/RTD-Action}.
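To make the three-branch head concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module name ThreeBranchHead, the embedding sizes, and the exact branch layouts are illustrative assumptions. It maps each decoder query embedding to normalized (start, end) boundaries, a foreground classification score, and a completeness score used to refine proposal confidence.

```python
import torch
import torch.nn as nn


class ThreeBranchHead(nn.Module):
    """Hypothetical sketch of a three-branch prediction head on top of
    Transformer decoder query embeddings: boundary regression,
    foreground classification, and completeness estimation."""

    def __init__(self, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Boundary branch: regresses normalized (start, end) in [0, 1].
        self.boundary_branch = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2), nn.Sigmoid(),
        )
        # Classification branch: foreground vs. background score per query.
        self.cls_branch = nn.Linear(embed_dim, 1)
        # Completeness branch: how fully the proposal covers an instance.
        self.completeness_branch = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, embed_dim) decoder outputs.
        boundaries = self.boundary_branch(queries)                      # (B, N, 2)
        cls_logits = self.cls_branch(queries).squeeze(-1)               # (B, N)
        completeness = self.completeness_branch(queries).squeeze(-1)    # (B, N)
        return boundaries, cls_logits, completeness


if __name__ == "__main__":
    head = ThreeBranchHead()
    outputs = head(torch.randn(2, 32, 256))  # e.g., 32 proposal queries per video
    print([t.shape for t in outputs])
```

Keeping the branches as small MLPs over shared query embeddings mirrors DETR-style prediction heads; the completeness score is the extra signal the abstract describes for more reliable proposal confidence estimation.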