3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and unbounded classes. Towards a more comprehensive perception of the 3D scene, in this paper we propose SurroundOcc, a method that predicts 3D occupancy from multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision at multiple levels. To obtain dense occupancy prediction, we design a pipeline that generates dense occupancy ground truth without expensive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately, then adopt Poisson Reconstruction to fill the holes and voxelize the resulting mesh to get dense occupancy labels. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc
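The final step of the ground-truth pipeline, turning reconstructed geometry into dense voxel labels, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes labeled points (e.g. sampled from the Poisson-reconstructed mesh) are binned into a fixed voxel grid, with each occupied voxel taking the majority semantic label of its points; the function name, ranges, and label conventions are all hypothetical.

```python
from collections import Counter, defaultdict

import numpy as np


def voxelize_occupancy(points, labels, pc_range, voxel_size):
    """Bin labeled 3D points into a dense occupancy grid.

    points: (N, 3) float array of xyz coordinates.
    labels: (N,) int array of semantic labels (nonzero).
    pc_range: (xmin, ymin, zmin, xmax, ymax, zmax) scene bounds.
    voxel_size: edge length of a cubic voxel.
    Returns an int64 grid where 0 means free space and each occupied
    voxel holds the majority label of the points that fall inside it.
    """
    lo = np.array(pc_range[:3], dtype=float)
    hi = np.array(pc_range[3:], dtype=float)
    dims = np.round((hi - lo) / voxel_size).astype(int)
    occ = np.zeros(dims, dtype=np.int64)  # 0 = unoccupied

    # Map each point to its voxel index and drop out-of-range points.
    idx = np.floor((points - lo) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, labels = idx[keep], labels[keep]

    # Majority vote per voxel.
    votes = defaultdict(Counter)
    for ijk, lab in zip(map(tuple, idx), labels):
        votes[ijk][int(lab)] += 1
    for ijk, counter in votes.items():
        occ[ijk] = counter.most_common(1)[0][0]
    return occ
```

In practice the mesh-to-voxel step (and the Poisson reconstruction before it) would use a 3D library such as Open3D; the sketch above only shows the label-assignment logic.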