Video-based human activity recognition has progressed remarkably with the rise of deep learning, but this progress has been slower for the downstream task of driver behavior understanding. Understanding the situation inside the vehicle cabin is essential for Advanced Driver Assistance Systems (ADAS), as it enables identifying distraction, predicting the driver's intent, and leads to more convenient human-vehicle interaction. At the same time, driver observation systems face substantial obstacles: they need to capture driver states at different granularities, while the complexity of secondary activities grows with rising automation and increased driver freedom. Furthermore, a model is rarely deployed under conditions identical to those of its training set, as sensor placements and types vary from vehicle to vehicle, constituting a substantial obstacle to real-life deployment of data-driven models. In this work, we present a novel vision-based framework for recognizing secondary driver behaviors based on vision transformers and an additional augmented feature distribution calibration module. This module operates in the latent feature space, enriching and diversifying the training set at the feature level in order to improve generalization to novel data appearances (e.g., sensor changes) and overall feature quality. Our framework consistently yields better recognition rates, surpassing previous state-of-the-art results on the public Drive&Act benchmark at all granularity levels. Our code is publicly available at https://github.com/KPeng9510/TransDARC.
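To make the idea of feature-distribution calibration concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): for a class with few latent features, statistics are borrowed from the most similar base classes to estimate a calibrated Gaussian, from which additional synthetic features are sampled to enrich and diversify the training set at the feature level. All function and variable names, the blending scheme, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate_and_sample(feats, base_means, base_covs, k=2, n_aug=5, alpha=0.2):
    """Hypothetical sketch of feature-level distribution calibration.

    Given a handful of latent features for one class, borrow the
    statistics of the k most similar base classes to estimate a
    calibrated Gaussian, then sample extra synthetic features from it.
    """
    mu = feats.mean(axis=0)
    # find the k base classes whose mean features are closest to this class
    dists = np.linalg.norm(base_means - mu, axis=1)
    nearest = np.argsort(dists)[:k]
    # calibrated statistics: blend the class mean with base-class statistics
    mu_cal = (mu + base_means[nearest].sum(axis=0)) / (k + 1)
    cov_cal = base_covs[nearest].mean(axis=0) + alpha * np.eye(feats.shape[1])
    # draw augmented features from the calibrated distribution
    aug = rng.multivariate_normal(mu_cal, cov_cal, size=n_aug)
    return np.vstack([feats, aug])

# toy demo: 3 base classes with 8-dimensional latent features
d = 8
base_means = rng.normal(size=(3, d))
base_covs = np.stack([np.eye(d) * s for s in (0.5, 1.0, 1.5)])
novel_feats = rng.normal(size=(4, d))  # few latent samples of one class
enriched = calibrate_and_sample(novel_feats, base_means, base_covs)
print(enriched.shape)  # (9, 8): 4 original plus 5 sampled features
```

A downstream classifier head would then be trained on the enriched feature set instead of the raw features alone, which is what "diversifying the training set at the feature level" amounts to in this sketch.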