We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance, but at the cost of substantial inference time, typically above 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: First, a convolutional prior that can be regarded as a conditional positional embedding. Second, local and dilated global self-attention for spatial feature interaction. Third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection, achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer parameters than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
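To make the three per-stage concepts more concrete, the following is a minimal, hypothetical PyTorch sketch of one RVT-style stage. It is an illustration under simplifying assumptions, not the authors' implementation: the class name `RVTStage`, the partition size, the use of `nn.MultiheadAttention` for both windowed and grid attention, and a single `nn.LSTMCell` shared across spatial positions are all assumptions made here for brevity (the paper's backbone additionally contains MLPs and other details omitted below).

```python
# Hypothetical sketch of one RVT-style stage (assumed names and hyperparameters).
import torch
import torch.nn as nn


class RVTStage(nn.Module):
    def __init__(self, in_ch, dim, heads=4, part=4):
        super().__init__()
        self.part = part  # window / grid size; assumes H and W are divisible by it
        # 1) Convolutional prior: strided conv that downsamples and, via its padding,
        #    injects position information like a conditional positional embedding.
        self.down = nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1)
        # 2) Spatial interaction: local (windowed) and dilated global (grid) self-attention.
        self.norm_local = nn.LayerNorm(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_global = nn.LayerNorm(dim)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # 3) Recurrent temporal aggregation: one LSTM cell applied per spatial position.
        self.lstm = nn.LSTMCell(dim, dim)

    def forward(self, x, state=None):
        x = self.down(x)                           # (B, dim, H, W)
        B, C, H, W = x.shape
        x = x.permute(0, 2, 3, 1)                  # (B, H, W, C)
        p = self.part

        # Local self-attention inside non-overlapping p x p windows.
        w = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)
        n = self.norm_local(w)
        w = w + self.local_attn(n, n, n, need_weights=False)[0]
        x = w.view(B, H // p, W // p, p, p, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

        # Dilated global self-attention: each group holds p x p tokens strided across the map.
        g = x.view(B, p, H // p, p, W // p, C).permute(0, 2, 4, 1, 3, 5).reshape(-1, p * p, C)
        n = self.norm_global(g)
        g = g + self.global_attn(n, n, n, need_weights=False)[0]
        x = g.view(B, H // p, W // p, p, p, C).permute(0, 3, 1, 4, 2, 5).reshape(B, H, W, C)

        # Recurrent temporal feature aggregation, shared over all pixels.
        tokens = x.reshape(B * H * W, C)
        if state is None:
            state = (torch.zeros_like(tokens), torch.zeros_like(tokens))
        h, c = self.lstm(tokens, state)
        out = h.view(B, H, W, C).permute(0, 3, 1, 2)   # (B, dim, H, W)
        return out, (h, c)


# Example usage: process a sequence of event representations while carrying recurrent state
# (the 20-channel input is an assumed event-histogram encoding, chosen only for illustration).
stage = RVTStage(in_ch=20, dim=64)
state = None
for t in range(5):
    ev = torch.randn(2, 20, 64, 64)        # one event tensor per time step
    feat, state = stage(ev, state)          # feat: (2, 64, 32, 32); state keeps temporal context
```

The key latency-oriented choice this sketch tries to convey is that temporal context is kept in a compact recurrent state rather than by re-processing a long history of events at every step, so each detection only requires a single forward pass over the newest event tensor.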