Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which confronts the inherent difficulties of ill-posed monocular depth learning. Instead of relying solely on monocular depth, in this work we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometric correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the fields of view of all cameras around the ego vehicle as a unified view, namely the surround view, and conduct temporal stereo matching on it. The resulting geometric correspondence between different frames from STS is combined with the monocular depth to yield the final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection, notably for medium- and long-distance objects. On BEVDepth with a ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed with a larger backbone and a larger image resolution, demonstrating its effectiveness.
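The fusion described above (a temporal-stereo matching cost combined with a monocular depth estimate to produce the final per-pixel depth distribution) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function name `temporal_stereo_depth` and its tensor shapes are hypothetical, the homography-based warping of previous-frame features to the current view is assumed to have been done already, and logit-space addition is just one simple fusion choice.

```python
import numpy as np

def temporal_stereo_depth(cur_feat, prev_feat_warped, mono_logits):
    """Hypothetical sketch: fuse temporal-stereo matching with monocular depth.

    cur_feat:         (C, H, W)    current-frame image features
    prev_feat_warped: (D, C, H, W) previous-frame features warped into the
                                   current view under D depth hypotheses
                                   (the warping step itself is omitted here)
    mono_logits:      (D, H, W)    monocular depth logits over the same
                                   D depth hypotheses
    returns:          (D, H, W)    fused per-pixel depth distribution
    """
    # Stereo matching cost: feature correlation per depth hypothesis.
    # A hypothesis close to the true depth warps the previous frame into
    # alignment with the current one, giving a high correlation score.
    stereo_cost = np.einsum('chw,dchw->dhw', cur_feat, prev_feat_warped)

    # Fuse stereo and monocular cues in logit space (one simple choice),
    # then normalize over depth hypotheses with a stable softmax.
    fused = stereo_cost + mono_logits
    fused -= fused.max(axis=0, keepdims=True)
    prob = np.exp(fused)
    prob /= prob.sum(axis=0, keepdims=True)
    return prob
```

The expected depth per pixel could then be read out as a probability-weighted sum over the D candidate depths.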