Multi-view 3D object detection (MV3D-Det) in Bird-Eye-View (BEV) has drawn extensive attention due to its low cost and high efficiency. Although new algorithms for camera-only 3D object detection are continuously being proposed, most of them risk drastic performance degradation when the domain of the input images differs from that of the training data. In this paper, we first analyze the causes of the domain gap for the MV3D-Det task. Under the covariate shift assumption, we find that the gap is mainly attributable to the feature distribution in BEV, which is determined by the quality of both the depth estimation and the 2D image feature representation. To acquire a robust depth prediction, we propose to decouple the depth estimation from the intrinsic parameters of the camera (i.e., the focal length) by converting the prediction of metric depth into that of scale-invariant depth, and we perform dynamic perspective augmentation to increase the diversity of the extrinsic parameters (i.e., the camera poses) by utilizing homography. Moreover, we modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic. Without bells and whistles, our approach, namely DG-BEV, successfully alleviates the performance drop on the unseen target domain without impairing the accuracy of the source domain. Extensive experiments on various public datasets, including Waymo, nuScenes, and Lyft, demonstrate the generalization ability and effectiveness of our approach. To the best of our knowledge, this is the first systematic study to explore a domain generalization method for MV3D-Det.
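The depth-decoupling idea follows directly from the pinhole model: an object of physical size $S$ that spans $s$ pixels at depth $d$ satisfies $d = f \cdot S / s$, so the ratio $d/f$ does not depend on the focal length. Below is a minimal PyTorch sketch of such a normalization; `F_REF` and the function names are illustrative choices for this sketch, not the paper's actual API.

```python
import torch

# Hypothetical reference focal length (pixels); any constant works as long
# as it is applied consistently at training and test time.
F_REF = 1000.0

def to_scale_invariant(metric_depth: torch.Tensor, focal: torch.Tensor) -> torch.Tensor:
    """Normalize metric depth targets (meters) by the per-camera focal length.

    Pinhole geometry gives depth = focal * object_size / pixel_size, so
    depth / focal is invariant to the camera intrinsics.
    metric_depth: (B, H, W) depth maps; focal: (B,) focal lengths in pixels.
    """
    return metric_depth * F_REF / focal.view(-1, 1, 1)

def to_metric(si_depth: torch.Tensor, focal: torch.Tensor) -> torch.Tensor:
    """Invert the normalization with the *test* camera's focal length so
    the BEV lifting step still receives metric depth."""
    return si_depth * focal.view(-1, 1, 1) / F_REF
```

Under this scheme the depth head would be supervised on `to_scale_invariant(gt_depth, focal)`, and at inference the prediction would be mapped back to metric depth with the target camera's own focal length before features are lifted to BEV.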
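Dynamic perspective augmentation can be sketched as warping the image with the homography induced by a small random camera rotation, $H = K R K^{-1}$, while composing the same rotation into the extrinsics so that image and pose remain geometrically consistent. The NumPy/OpenCV sketch below assumes a world-to-camera 4x4 extrinsic and an OpenCV-style pinhole intrinsic matrix K; the sampling scheme and names are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
import cv2

def perspective_augment(img, K, extrinsic, max_angle_deg=5.0, rng=np.random):
    """Apply a random small camera rotation as a homography warp.

    A pure rotation R of the camera maps pixels through H = K @ R @ inv(K);
    composing R into the world-to-camera extrinsic keeps the image and the
    camera pose consistent, diversifying extrinsics without new data.
    """
    # Sample a small axis-angle rotation (illustrative sampling scheme).
    rvec = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    R, _ = cv2.Rodrigues(rvec)                 # axis-angle -> rotation matrix
    H = K @ R @ np.linalg.inv(K)               # induced image homography
    h, w = img.shape[:2]
    warped = cv2.warpPerspective(img, H, (w, h))
    new_extrinsic = extrinsic.copy()
    new_extrinsic[:3, :3] = R @ extrinsic[:3, :3]  # X_cam' = R @ X_cam
    new_extrinsic[:3, 3] = R @ extrinsic[:3, 3]
    return warped, new_extrinsic
```

For the adversarial part, one standard way to realize "pseudo-domains plus an adversarial loss" is a domain classifier trained through a gradient reversal layer, as in DANN; the sketch below assumes pseudo-domain labels derived from the modified focal lengths and is not claimed to be the paper's exact loss.

```python
class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the
    backward pass, so the feature extractor learns to fool the domain head."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage sketch: `domain_head` and `pseudo_domain_id` are hypothetical names.
# domain_logits = domain_head(GradReverse.apply(bev_feat, 1.0))
# adv_loss = torch.nn.functional.cross_entropy(domain_logits, pseudo_domain_id)
```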