Temporal action detection (TAD) is extensively studied in the video understanding community, typically by following object detection pipelines from images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-know baseline given the current state of complex designs and low efficiency in TAD. In our simple baseline (BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We empirically investigate the existing techniques for each component of this baseline and, more importantly, perform end-to-end training over the entire pipeline thanks to its simple design. Our BasicTAD yields an astonishingly strong RGB-only baseline that is very close to state-of-the-art methods with two-stream inputs. In addition, we further improve BasicTAD by preserving more temporal and spatial information in the network representations (termed BasicTAD Plus). Empirical results demonstrate that BasicTAD Plus is very efficient and significantly outperforms previous methods on the THUMOS14 and FineAction datasets. Our approach can serve as a strong baseline for TAD. The code will be released at https://github.com/MCG-NJU/BasicTAD.
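The four-stage decomposition named in the abstract (data sampling, backbone, neck, detection head) can be sketched as a simple composition of stages. The sketch below is purely illustrative: all function names, shapes, and operations are hypothetical stand-ins (e.g. a mean-pooled feature in place of a real CNN backbone), not the actual BasicTAD implementation.

```python
import numpy as np

def sample_frames(video, num_frames=8):
    """Data sampling: uniformly pick a fixed number of frames from a clip."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def backbone(frames):
    """Stand-in per-frame feature extractor (a 2D/3D CNN in practice)."""
    # Collapse spatial dimensions into one feature value per frame.
    return frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)

def neck(features):
    """Stand-in temporal neck: smooth features across neighboring frames."""
    kernel = np.ones(3) / 3.0
    return np.convolve(features[:, 0], kernel, mode="same")[:, None]

def head(features):
    """Stand-in detection head: an "actionness" score per temporal location."""
    return 1.0 / (1.0 + np.exp(-features))  # sigmoid squashes to (0, 1)

video = np.random.rand(64, 4, 4, 3)  # 64 frames of 4x4 RGB pixels
scores = head(neck(backbone(sample_frames(video))))
print(scores.shape)  # (8, 1): one score per sampled frame
```

Because each stage is an ordinary function, the whole pipeline is differentiable end to end in a real framework, which is what makes single-stage, end-to-end training of the kind described above possible.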