In this technical report, we present our solution, dubbed MV-FCOS3D++, for the Camera-Only 3D Detection track in the Waymo Open Dataset Challenge 2022. For multi-view camera-only 3D detection, methods based on bird's-eye-view or 3D geometric representations can leverage the stereo cues in the overlapped regions between adjacent views and directly perform 3D detection without hand-crafted post-processing. However, such methods lack direct semantic supervision for the 2D backbones, which can be complemented by pretraining simple monocular-based detectors. Our solution is a multi-view framework for 4D detection following this paradigm. It is built upon a simple monocular detector, FCOS3D++, pretrained only with object annotations from Waymo, and converts multi-view features to a 3D grid space to detect 3D objects thereon. A dual-path neck for single-frame understanding and temporal stereo matching is devised to incorporate multi-frame information. Our method finally achieves 49.75% mAPL with a single model and wins 2nd place in the WOD challenge, without any LiDAR-based depth supervision during training. The code will be released at https://github.com/Tai-Wang/Depth-from-Motion.
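To make the 2D-to-3D feature transformation described above concrete, the following is a minimal PyTorch-style sketch of lifting per-view backbone features into a shared 3D grid by projecting grid points into each camera and sampling the corresponding 2D features. It is an illustration under assumed tensor layouts, not the released implementation; the function name `lift_multiview_features` and all argument conventions are hypothetical.

```python
import torch
import torch.nn.functional as F


def lift_multiview_features(feats, intrinsics, extrinsics, grid_xyz):
    """Lift per-view 2D features into a shared 3D grid (hypothetical sketch).

    feats:      (V, C, H, W)  2D backbone features for V camera views
    intrinsics: (V, 3, 3)     camera intrinsic matrices
    extrinsics: (V, 4, 4)     world-to-camera transforms
    grid_xyz:   (X, Y, Z, 3)  3D grid point coordinates in the world frame
    returns:    (C, X, Y, Z)  voxel features averaged over the views seeing each point
    """
    V, C, H, W = feats.shape
    X, Y, Z, _ = grid_xyz.shape
    pts = grid_xyz.reshape(-1, 3)                                # (N, 3), N = X*Y*Z
    pts_h = torch.cat([pts, pts.new_ones(len(pts), 1)], dim=1)   # homogeneous coords

    vol = feats.new_zeros(C, pts.shape[0])
    hit = feats.new_zeros(1, pts.shape[0])
    for v in range(V):
        cam = (extrinsics[v] @ pts_h.T)[:3]          # (3, N) points in camera frame
        depth = cam[2].clamp(min=1e-5)               # avoid division by zero
        uv = intrinsics[v] @ (cam / depth)           # perspective projection to pixels
        # normalize pixel coordinates to [-1, 1] for grid_sample
        u = uv[0] / (W - 1) * 2 - 1
        w_ = uv[1] / (H - 1) * 2 - 1
        grid = torch.stack([u, w_], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(feats[v:v + 1], grid, align_corners=True)  # (1, C, 1, N)
        # keep only points in front of the camera and inside the image
        valid = ((cam[2] > 0) & (u.abs() <= 1) & (w_.abs() <= 1)).float()
        vol += sampled[0, :, 0] * valid
        hit += valid
    return (vol / hit.clamp(min=1)).reshape(C, X, Y, Z)
```

Because adjacent Waymo cameras overlap, a grid point visible in two views receives features from both, which is what allows the 3D detection head operating on this grid to exploit stereo cues without explicit post-processing.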