Robust detection of moving vehicles is a critical task for any autonomously operating outdoor robot or self-driving vehicle. Most modern approaches to this task rely on training image-based detectors on large-scale vehicle detection datasets such as nuScenes or the Waymo Open Dataset. Providing manual annotations is an expensive and laborious exercise that does not scale well in practice. To tackle this problem, we propose a self-supervised approach that leverages audio-visual cues to detect moving vehicles in videos. Our approach employs contrastive learning to localize vehicles in images from corresponding pairs of images and recorded audio. In extensive experiments on a real-world dataset, we demonstrate that our approach provides accurate detections of moving vehicles without requiring manual annotations. We furthermore show that our model can serve as a teacher to supervise an audio-only detection model. This student model is invariant to illumination changes and thus effectively bridges the domain gap inherent to models that rely exclusively on vision.
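The abstract does not detail the contrastive objective, so the following is a minimal sketch of how image-audio contrastive localization could look, assuming an InfoNCE-style loss over a batch in which each image is paired with the audio recorded alongside it. All function and tensor names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_localization_loss(img_feats, audio_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of image/audio pairs (hypothetical sketch).

    img_feats: (B, D, H, W) spatial image features from a visual backbone
    audio_emb: (B, D) per-clip audio embeddings
    Matching pairs share the batch index; all other pairings act as negatives.
    """
    B, D, H, W = img_feats.shape
    img_feats = F.normalize(img_feats, dim=1)
    audio_emb = F.normalize(audio_emb, dim=1)

    # Cosine similarity of every audio embedding with every spatial location
    # of every image: shape (B_audio, B_image, H, W).
    sim = torch.einsum('ad,bdhw->abhw', audio_emb, img_feats)

    # Max-pool over space: an image "agrees" with an audio clip if some region
    # does. The pre-pooling map doubles as a localization heatmap at test time.
    logits = sim.flatten(2).max(dim=2).values / temperature  # (B, B)

    targets = torch.arange(B, device=logits.device)
    # Symmetric cross-entropy: audio-to-image and image-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under these assumptions, the spatial similarity map of an image with its own audio clip (`sim[i, i]` before pooling) can be thresholded at inference time to localize the sounding vehicle; max-pooling during training is one common design choice that keeps the loss clip-level while still encouraging spatially peaked responses.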
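The teacher-student step can likewise be sketched as standard pseudo-label distillation: the frozen audio-visual teacher labels each clip, and an audio-only student regresses those labels. The interfaces below (teacher, student, heatmap shapes) are hypothetical placeholders rather than the paper's architecture.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, audio, optimizer):
    """One training step of audio-only distillation (hypothetical sketch).

    teacher: maps (images, audio) -> (B, 1, H, W) detection logits (frozen)
    student: maps audio alone     -> (B, 1, H, W) detection logits
    """
    with torch.no_grad():
        pseudo = teacher(images, audio).sigmoid()  # soft pseudo-labels

    pred = student(audio)                          # audio-only prediction
    loss = F.binary_cross_entropy_with_logits(pred, pseudo)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student never sees pixels, its predictions are unaffected by illumination, which is the property the abstract highlights for bridging the domain gap of vision-only models.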