Distracted driving causes thousands of deaths per year, and applying deep-learning methods to prevent these tragedies has become a crucial problem. In Track 3 of the 6th AI City Challenge, the organizers provide a high-quality video dataset with dense action annotations. Due to the small data scale and unclear action boundaries, the dataset poses a unique challenge: precisely localizing all the different actions and classifying their categories. In this paper, we make full use of the multi-view synchronization among videos and conduct robust Multi-View Practice (MVP) for driving action localization. To avoid overfitting, we fine-tune SlowFast, pre-trained on Kinetics-700, as the feature extractor. The features of different views are then passed to ActionFormer to generate candidate action proposals. To precisely localize all the actions, we design elaborate post-processing, including model voting, threshold filtering, and duplication removal. The results show that our MVP is robust for driving action localization, achieving a 28.49% F1-score on the Track 3 test set.
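To make the post-processing stage concrete, the following is a minimal sketch of threshold filtering and duplication removal over candidate action proposals. The thresholds, the dict-based proposal format, and the temporal-IoU-based suppression rule are illustrative assumptions, not the paper's exact procedure (which also includes model voting across views).

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(proposals, score_thr=0.3, iou_thr=0.5):
    """proposals: list of dicts with 'start', 'end', 'score', 'label'.

    Illustrative values: score_thr and iou_thr are assumed, not the
    thresholds used in the paper.
    """
    # Threshold filtering: drop low-confidence proposals.
    kept = [p for p in proposals if p["score"] >= score_thr]
    # Duplication removal: greedily keep the highest-scoring proposal
    # among temporally overlapping same-class proposals (a temporal NMS).
    kept.sort(key=lambda p: p["score"], reverse=True)
    out = []
    for p in kept:
        seg = (p["start"], p["end"])
        if all(p["label"] != q["label"]
               or temporal_iou(seg, (q["start"], q["end"])) < iou_thr
               for q in out):
            out.append(p)
    return out
```

For example, a second "phone_call" proposal that heavily overlaps a higher-scoring one is suppressed, while a proposal below the score threshold is dropped outright.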