This technical report presents our solution to the HACS Temporal Action Localization Challenge 2021, Weakly-Supervised Learning Track. The goal of weakly-supervised temporal action localization is to temporally locate and classify actions of interest in untrimmed videos given only video-level labels. We adopt the two-stream consensus network (TSCN) as the main framework in this challenge. The TSCN consists of a two-stream base model training procedure and a pseudo ground truth learning procedure. Base model training encourages each model to make reliable predictions from a single modality (i.e., RGB or optical flow). The predictions of the two streams are then fused to generate a pseudo ground truth, which is in turn used as supervision to train the base models. On the HACS v1.1.1 dataset, without fine-tuning the feature-extraction I3D models, our method achieves an average mAP of 22.20% on the validation set and 21.68% on the testing set. Our solution ranked 2nd in this challenge, and we hope our method can serve as a baseline for future academic research.
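To make the pseudo ground truth idea concrete, the following is a minimal sketch of how per-snippet attention from the RGB and optical-flow streams might be fused and turned into a supervision signal. This section does not state the exact fusion rule or loss, so the weighted average, the threshold, the binary cross-entropy loss, and all names here (`fuse_attention`, `hard_pseudo_ground_truth`, `beta`, `threshold`) are assumptions for illustration only, not the implementation used in the challenge.

```python
import math
from typing import List, Sequence


def fuse_attention(rgb: Sequence[float], flow: Sequence[float],
                   beta: float = 0.5) -> List[float]:
    """Weighted average of per-snippet attention from the two streams.

    beta is a hypothetical fusion weight (assumed, not from the report).
    """
    if len(rgb) != len(flow):
        raise ValueError("both streams must cover the same snippets")
    return [beta * r + (1.0 - beta) * f for r, f in zip(rgb, flow)]


def hard_pseudo_ground_truth(fused: Sequence[float],
                             threshold: float = 0.5) -> List[int]:
    """Binarize fused attention; 1 marks a snippet treated as action.

    The threshold value is an assumption for illustration.
    """
    return [1 if a > threshold else 0 for a in fused]


def binary_cross_entropy(pred: Sequence[float], target: Sequence[int],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between one stream's attention and the
    pseudo ground truth; predictions are clipped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


# Toy attention values for four snippets (made up for this example).
rgb_attention = [0.9, 0.2, 0.7, 0.1]
flow_attention = [0.8, 0.4, 0.2, 0.1]

fused = fuse_attention(rgb_attention, flow_attention)
pseudo = hard_pseudo_ground_truth(fused)

# Each stream is then supervised by the shared pseudo ground truth.
rgb_loss = binary_cross_entropy(rgb_attention, pseudo)
flow_loss = binary_cross_entropy(flow_attention, pseudo)
```

In this toy case the third snippet is attended to by the RGB stream but not by the flow stream, so the fused value falls below the threshold and the pseudo ground truth marks it as background. The RGB stream is therefore penalized for that snippet, which illustrates how a consensus signal can correct one modality using the other.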