We present an end-to-end method for object detection and trajectory prediction that utilizes multi-view representations of LiDAR returns and camera images. In this work, we recognize the strengths and weaknesses of different view representations, and we propose an efficient and generic fusion method that aggregates the benefits of all views. Our model builds on a state-of-the-art Bird's-Eye View (BEV) network that fuses voxelized features from a sequence of historical LiDAR data as well as a rasterized high-definition map to perform detection and prediction tasks. We extend this model with additional LiDAR Range-View (RV) features that use the raw LiDAR information in its native, non-quantized representation. The RV feature map is projected into BEV and fused with the BEV features computed from the LiDAR data and the high-definition map. The fused features are then further processed to output the final detections and trajectories, all within a single end-to-end trainable network. In addition, this framework allows the RV fusion of LiDAR and camera data to be performed in a straightforward and computationally efficient manner. The proposed multi-view fusion approach improves on the state of the art both on proprietary large-scale real-world data collected by a fleet of self-driving vehicles and on the public nuScenes data set, with minimal increase in computational cost.
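The central operation described above, projecting per-point range-view features into the BEV grid and fusing them with the BEV feature map, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the function and class names (project_rv_to_bev, MultiViewFusion), the max-pooling scatter rule, and the single 3x3 convolution used for mixing are all assumptions made for illustration.

```python
# Illustrative sketch of RV-to-BEV projection and fusion (hypothetical
# names and layer choices; the paper's actual architecture may differ).
import torch
import torch.nn as nn


def project_rv_to_bev(rv_feats, rv_uv, bev_rc, bev_shape):
    """Scatter per-point range-view features onto a BEV grid.

    rv_feats:  (C, H_rv, W_rv) range-view feature map
    rv_uv:     (N, 2) long tensor, (row, col) of each LiDAR point in the RV image
    bev_rc:    (N, 2) long tensor, (row, col) of each point in the BEV grid
    bev_shape: (H_bev, W_bev)
    """
    C = rv_feats.shape[0]
    H, W = bev_shape
    # Gather each LiDAR point's feature vector from the RV map.
    point_feats = rv_feats[:, rv_uv[:, 0], rv_uv[:, 1]]          # (C, N)
    # Scatter into the BEV grid; points landing in the same cell are
    # max-pooled (an assumed pooling rule), and empty cells stay zero.
    flat_idx = (bev_rc[:, 0] * W + bev_rc[:, 1]).expand(C, -1)   # (C, N)
    bev = torch.zeros(C, H * W, dtype=rv_feats.dtype)
    bev.scatter_reduce_(1, flat_idx, point_feats, reduce="amax",
                        include_self=False)
    return bev.view(C, H, W)


class MultiViewFusion(nn.Module):
    """Concatenate projected RV features with BEV features, then mix."""

    def __init__(self, c_bev, c_rv, c_out):
        super().__init__()
        self.mix = nn.Conv2d(c_bev + c_rv, c_out, kernel_size=3, padding=1)

    def forward(self, bev_feats, rv_feats, rv_uv, bev_rc):
        # bev_feats: (C_bev, H, W); a single frame is shown for clarity.
        rv_in_bev = project_rv_to_bev(rv_feats, rv_uv, bev_rc,
                                      bev_feats.shape[-2:])
        fused = torch.cat([bev_feats, rv_in_bev], dim=0)  # (C_bev + C_rv, H, W)
        return self.mix(fused.unsqueeze(0)).squeeze(0)    # (C_out, H, W)
```

Because the projection only gathers and scatters features at known point locations, it adds little computation on top of the BEV backbone, which is consistent with the minimal-overhead claim; camera features rendered into the RV image could be fused the same way.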