In advanced paradigms of autonomous driving, learning Bird's Eye View (BEV) representations from surrounding views is crucial for multi-task frameworks. However, existing methods based on depth estimation or camera-driven attention are not robust to noisy camera parameters, facing two main challenges: accurate depth prediction and calibration. In this work, we present a completely calibration-free Multi-Camera Calibration Free Transformer (CFT) for robust BEV representation, which explores an implicit view-to-BEV mapping that does not rely on camera intrinsics and extrinsics. To guide better feature learning from image views to BEV, CFT mines latent 3D information in BEV via our designed position-aware enhancement (PA). Instead of camera-driven point-wise or global transformation, we propose view-aware attention, which restricts interaction to more relevant regions, lowers computation cost by reducing redundant computation, and promotes convergence. CFT achieves 49.7% NDS on the nuScenes detection task leaderboard, and is the first work to remove camera parameters while remaining comparable to other geometry-guided methods. Without temporal input or other modalities, CFT achieves the second-highest performance with a smaller image input of 1600 × 640. Thanks to the view-aware attention variant, CFT reduces memory and transformer FLOPs relative to vanilla attention by about 12% and 60%, respectively, while improving NDS by 1.0%. Moreover, its natural robustness to noisy camera parameters makes CFT more competitive.
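To make the view-aware attention idea concrete, the following is a minimal sketch (not the authors' implementation) of cross-attention from BEV queries to multi-view image features in which each query attends only to an assigned subset of camera views; the class name, tensor layout, and the precomputed query-to-view mask are assumptions for illustration.

```python
# Minimal sketch of view-aware cross-attention: each BEV query interacts
# only with features from its assigned camera views, pruning redundant
# query-key pairs compared with attending to all views globally.
import torch
import torch.nn as nn


class ViewAwareAttention(nn.Module):
    """Cross-attention from BEV queries to multi-view image features,
    masked so that a query only sees the views assigned to it."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, bev_queries, view_feats, query_view_mask):
        # bev_queries:     (B, Nq, C)      BEV query embeddings
        # view_feats:      (B, V, HW, C)   flattened per-view image features
        # query_view_mask: (Nq, V) bool    True where a query may attend to a view
        #                  (assumed to give every query at least one visible view)
        B, V, HW, C = view_feats.shape
        keys = view_feats.reshape(B, V * HW, C)
        # Expand the per-view mask to per-token keys; nn.MultiheadAttention
        # treats True in attn_mask as a blocked position, hence the negation.
        attn_mask = ~query_view_mask.repeat_interleave(HW, dim=1)  # (Nq, V*HW)
        out, _ = self.attn(bev_queries, keys, keys, attn_mask=attn_mask)
        return out
```

In such a sketch, the mask keeps the dense attention kernel but zeroes out query-view pairs that cannot contribute, which is one plausible way to realize the reported savings in memory and transformer FLOPs without using camera parameters.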