In this work, we propose \textit{MVFuseNet}, a novel end-to-end method for joint object detection and motion forecasting from a temporal sequence of LiDAR data. Most existing methods operate in a single view, projecting the data into either the range view (RV) or the bird's eye view (BEV). In contrast, we propose a method that effectively utilizes both RV and BEV for spatio-temporal feature learning in a temporal fusion network, as well as for multi-scale feature learning in the backbone network. Further, we propose a novel sequential fusion approach that effectively utilizes multiple views in the temporal fusion network. We show the benefits of our multi-view approach for the tasks of detection and motion forecasting on two large-scale self-driving datasets, achieving state-of-the-art results. Furthermore, we show that \textit{MVFuseNet} scales well to large operating ranges while maintaining real-time performance.
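To make the multi-view idea concrete, the sketch below rasterizes per-point LiDAR features into both an RV grid and a BEV grid, processes each view with a small convolutional block, and re-projects the RV features into BEV for fusion. This is a minimal illustration under assumed grid sizes, channel counts, and module names (\texttt{MultiViewFusionSketch}, \texttt{scatter\_to\_grid} are hypothetical); it is not the authors' implementation, whose temporal fusion and backbone details are not specified in the abstract.

```python
# Minimal sketch of one RV->BEV fusion step, loosely following the multi-view
# idea described in the abstract. All shapes and module names are assumptions.
import torch
import torch.nn as nn


def scatter_to_grid(feats, idx, grid_hw, channels):
    """Scatter per-point features (N, C) into a (C, H, W) grid by flat cell index."""
    h, w = grid_hw
    grid = feats.new_zeros(channels, h * w)
    grid.index_add_(1, idx, feats.t())  # sum-pool all points falling in a cell
    return grid.view(channels, h, w)


class ConvBlock(nn.Sequential):
    def __init__(self, c_in, c_out):
        super().__init__(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))


class MultiViewFusionSketch(nn.Module):
    """One RV->BEV fusion step (hypothetical grids: RV 64x512, BEV 128x128)."""
    def __init__(self, c=32):
        super().__init__()
        self.rv_net = ConvBlock(c, c)
        self.bev_net = ConvBlock(c, c)
        self.fuse = ConvBlock(2 * c, c)  # concatenate RV-in-BEV with native BEV

    def forward(self, point_feats, rv_idx, bev_idx):
        c = point_feats.shape[1]
        # Process each view independently.
        rv = self.rv_net(scatter_to_grid(point_feats, rv_idx, (64, 512), c)[None])
        bev = self.bev_net(scatter_to_grid(point_feats, bev_idx, (128, 128), c)[None])
        # Gather RV features back to the points, then re-scatter them into BEV,
        # so both views contribute to a single BEV feature map.
        pts_from_rv = rv[0].flatten(1)[:, rv_idx].t()               # (N, C)
        rv_in_bev = scatter_to_grid(pts_from_rv, bev_idx, (128, 128), c)[None]
        return self.fuse(torch.cat([rv_in_bev, bev], dim=1))        # fused BEV map


# Usage with random data: 1000 LiDAR points carrying 32-dim features.
pts = torch.randn(1000, 32)
rv_idx = torch.randint(0, 64 * 512, (1000,))
bev_idx = torch.randint(0, 128 * 128, (1000,))
fused = MultiViewFusionSketch(32)(pts, rv_idx, bev_idx)
print(fused.shape)  # torch.Size([1, 32, 128, 128])
```

In a full temporal fusion network, a step of this kind would be applied sequentially across the LiDAR sweeps so that features from each time step are accumulated in the BEV frame before the detection and forecasting heads.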