In this report, we present our solution for the task of temporal action localization (detection) (task 1) in the ActivityNet Challenge 2020. The purpose of this task is to temporally localize the intervals where actions of interest occur and to predict the action categories in a long untrimmed video. Our solution mainly consists of three components: 1) feature encoding: we apply three kinds of backbones, including TSN [7], SlowFast [3] and I3D [1], all of which are pretrained on the Kinetics dataset [2]; with these models, we extract snippet-level video representations; 2) proposal generation: we choose BMN [5] as our baseline, based on which we design a Cascade Boundary Refinement Network (CBR-Net) to perform proposal detection; the CBR-Net mainly contains two modules: a temporal feature encoding module, which applies a BiLSTM to encode long-term temporal information, and a CBR module, which aims to refine proposal boundaries under different parameter settings; 3) action localization: in this stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal. Moreover, we apply different ensemble strategies to improve the performance of the designed solution, by which we achieve 42.788% on the testing set of the ActivityNet v1.3 dataset in terms of the mean Average Precision metric.
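To make the proposal-refinement idea concrete, the following is a minimal PyTorch sketch of a cascade boundary refinement pipeline in the spirit of the described CBR-Net: a BiLSTM encodes long-term temporal context over snippet-level features, and a stack of regression heads iteratively refines the boundaries of proposals produced by a BMN-style generator. All module names, dimensions, and the pooling scheme here are hypothetical illustrations, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CBRNetSketch(nn.Module):
    """Illustrative sketch: BiLSTM temporal encoding + cascaded boundary
    refinement heads. Hypothetical dimensions and structure, for exposition only."""

    def __init__(self, feat_dim=2048, hidden_dim=256, num_stages=3):
        super().__init__()
        # Temporal feature encoding: BiLSTM over the snippet sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Cascade of refinement heads; each stage predicts an offset
        # (d_start, d_end) that adjusts the current proposal boundaries.
        self.refine_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                          nn.ReLU(),
                          nn.Linear(hidden_dim, 2))
            for _ in range(num_stages)
        ])

    def forward(self, snippet_feats, proposals):
        # snippet_feats: (B, T, feat_dim) snippet-level video representations
        # proposals: (B, N, 2) normalized (start, end) from a BMN-style baseline
        context, _ = self.bilstm(snippet_feats)          # (B, T, 2*hidden_dim)
        B, T, _ = context.shape
        refined = proposals.clone()
        for head in self.refine_heads:
            # Pool temporal context inside each current proposal (mean over span).
            pooled = []
            for b in range(B):
                rows = []
                for s, e in refined[b]:
                    lo = int(s.clamp(0, 1) * (T - 1))
                    hi = max(lo + 1, int(e.clamp(0, 1) * (T - 1)) + 1)
                    rows.append(context[b, lo:hi].mean(dim=0))
                pooled.append(torch.stack(rows))
            pooled = torch.stack(pooled)                 # (B, N, 2*hidden_dim)
            # Apply the stage's offset and keep boundaries in [0, 1].
            refined = (refined + head(pooled)).clamp(0, 1)
        return refined
```

In this sketch each cascade stage sees the proposals produced by the previous stage, so boundary errors can be corrected progressively rather than in a single regression step.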