Driver distractions are known to be the dominant cause of road accidents. While monitoring systems can detect non-driving-related activities and help reduce the associated risks, they must be both accurate and efficient to be applicable. Unfortunately, state-of-the-art methods prioritize accuracy at the expense of latency, because they leverage cross-view and multimodal videos in which consecutive frames are highly similar. In this paper, we therefore pursue time-effective detection models by neglecting the temporal relations between video frames, and we investigate the importance of each sensing modality in detecting drivers' activities. Experiments demonstrate that 1) our proposed algorithms run in real time and achieve similar performance (97.5\% AUC-PR) with significantly reduced computation compared with video-based models; 2) the top view with the infrared channel is more informative than any other single modality. Furthermore, we enhance the DAD dataset by manually annotating its test set to enable multi-class classification. We also thoroughly analyze how the type and placement of visual sensors influence the prediction of each class. The code and the new labels will be released.
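To make the evaluation protocol concrete, the sketch below illustrates how per-frame detection scores can be evaluated with AUC-PR, the metric reported above. This is a minimal illustration under our own assumptions, not the paper's released code: the scores and labels are synthetic placeholders standing in for the output of any single-frame classifier that scores frames independently, ignoring temporal relations.

```python
# Minimal sketch (assumption, not the authors' implementation): computing
# AUC-PR over per-frame detection scores. Labels and scores are synthetic
# stand-ins for the output of a single-frame distraction classifier.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Placeholder data: 1 = non-driving-related activity, 0 = normal driving.
labels = rng.integers(0, 2, size=1000)
# Simulated detector scores, correlated with the labels plus noise.
scores = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, size=1000), 0.0, 1.0)

# Because each frame is scored independently, inference cost is one forward
# pass per frame rather than per video clip, which is what enables real-time
# operation in the frame-based setting described above.
auc_pr = average_precision_score(labels, scores)  # area under the PR curve
print(f"AUC-PR: {auc_pr:.3f}")
```

Here `average_precision_score` is scikit-learn's standard summary of the precision-recall curve; it is a common way to report AUC-PR, though the paper may compute the area differently.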