Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. Existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-like architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original Transformer detection framework (DETR). First, to deal with the slowness prior in videos, we replace the original Transformer encoder with a boundary-attentive module to better capture long-range temporal information. Second, due to ambiguous temporal boundaries and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criterion of a single assignment to each ground truth. Finally, we devise a three-branch head to further improve proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action.
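The relaxed matching scheme can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's exact formulation: it uses a temporal-IoU matching cost, scipy's Hungarian solver for the DETR-style one-to-one assignment, and an illustrative tIoU threshold for the relaxation; all names and values are hypothetical.

```python
# Hedged sketch of relaxed matching for temporal proposals.
# On top of a DETR-style one-to-one Hungarian assignment, any proposal
# whose tIoU with some ground-truth segment exceeds a threshold is also
# treated as positive, relieving the strict single-assignment criterion.
# All function names and the threshold value are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment


def t_iou(props, gts):
    """Pairwise temporal IoU between segments.

    props: (N, 2) array of [start, end]; gts: (M, 2) array of [start, end].
    Returns an (N, M) matrix.
    """
    starts = np.maximum(props[:, None, 0], gts[None, :, 0])
    ends = np.minimum(props[:, None, 1], gts[None, :, 1])
    inter = np.clip(ends - starts, 0.0, None)
    union = (props[:, None, 1] - props[:, None, 0]) \
        + (gts[None, :, 1] - gts[None, :, 0]) - inter
    return inter / np.clip(union, 1e-8, None)


def relaxed_match(props, gts, thresh=0.7):
    """Return a boolean positive mask over proposals.

    Proposals picked by the one-to-one Hungarian assignment are positive,
    and (relaxation) so is any proposal with max tIoU >= thresh.
    """
    iou = t_iou(props, gts)
    rows, _ = linear_sum_assignment(-iou)      # maximize total tIoU
    positive = np.zeros(len(props), dtype=bool)
    positive[rows] = True                      # strict one-to-one matches
    positive |= iou.max(axis=1) >= thresh      # relaxed extra positives
    return positive
```

Under a strict one-to-one scheme, a near-duplicate proposal with tIoU 0.95 to an already-matched ground truth would be pushed toward the background class; the relaxation keeps such high-quality predictions as positives during training.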