Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (~100 ms) may lead to outdated perception, which is detrimental to safe operation. Recent streaming perception works propose processing LiDAR slices directly, compensating for the narrow field of view (FoV) of a slice by reusing features from preceding slices. These works, however, are all single-modality and rely on past information that may itself be outdated. Meanwhile, images from high-frequency cameras can support streaming models, as they provide a wider FoV than a LiDAR slice. However, this difference in FoV complicates sensor fusion. To address this research gap, we propose a camera-LiDAR streaming 3D object detection framework that uses camera images instead of past LiDAR slices to provide up-to-date, dense, and wide context for streaming perception. The proposed method outperforms prior streaming models on the challenging nuScenes benchmark. It also outperforms powerful full-scan detectors while being much faster. Our method is shown to be robust to missing camera images, narrow LiDAR slices, and slight camera-LiDAR miscalibration.
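To make the streaming setup concrete, the sketch below simulates the timing relationship the abstract describes: a 10 Hz LiDAR emits wedge-shaped azimuth slices, and each slice is paired with the most recent frame from a higher-frequency camera rather than with features from past slices. This is a minimal illustration under assumed sensor rates, not the paper's actual pipeline; names such as `detect_on_slice` and `latest_camera_frame` are hypothetical placeholders.

```python
from dataclasses import dataclass

# Assumed sensor rates for illustration (not from the paper).
LIDAR_SCAN_PERIOD_S = 0.1   # ~100 ms for a full 360-degree sweep
SLICES_PER_SCAN = 8         # wedge-shaped slices per revolution
CAMERA_PERIOD_S = 1 / 30    # e.g., a 30 Hz camera

@dataclass
class LidarSlice:
    t: float             # timestamp at the end of slice acquisition
    azimuth_deg: tuple   # (start, end) of the wedge's narrow FoV

@dataclass
class CameraFrame:
    t: float             # camera frame timestamp

def latest_camera_frame(frames, t):
    """Pick the newest camera frame captured no later than time t."""
    candidates = [f for f in frames if f.t <= t]
    return max(candidates, key=lambda f: f.t) if candidates else None

def detect_on_slice(lidar_slice, camera_frame):
    # Placeholder for the fused detector: the wide-FoV image supplies
    # dense, up-to-date context for the narrow LiDAR wedge.
    age_ms = (lidar_slice.t - camera_frame.t) * 1e3
    print(f"slice az={lidar_slice.azimuth_deg}: fused image is {age_ms:.1f} ms old")

# Simulate one LiDAR revolution alongside a stream of camera frames.
slices = [
    LidarSlice(
        t=(i + 1) * LIDAR_SCAN_PERIOD_S / SLICES_PER_SCAN,
        azimuth_deg=(i * 360 / SLICES_PER_SCAN, (i + 1) * 360 / SLICES_PER_SCAN),
    )
    for i in range(SLICES_PER_SCAN)
]
frames = [CameraFrame(t=k * CAMERA_PERIOD_S) for k in range(10)]

for s in slices:
    # Streaming detection: act on each slice as soon as it arrives,
    # rather than waiting ~100 ms for the full scan to complete.
    detect_on_slice(s, latest_camera_frame(frames, s.t))
```

Under these assumed rates, the context image paired with each slice is at most one camera period old, which is the intuition behind using camera images rather than stale features from preceding LiDAR slices.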