In this paper, we propose PETRv2, a unified framework for 3D perception from multi-view images. Building on PETR, PETRv2 explores the effectiveness of temporal modeling, which uses the temporal information of previous frames to boost 3D object detection. More specifically, we extend the 3D position embedding (3D PE) in PETR for temporal modeling. The 3D PE achieves temporal alignment of object positions across different frames. A feature-guided position encoder is further introduced to improve the data adaptability of the 3D PE. To support high-quality BEV segmentation, PETRv2 provides a simple yet effective solution by adding a set of segmentation queries. Each segmentation query is responsible for segmenting one specific patch of the BEV map. PETRv2 achieves state-of-the-art performance on 3D object detection and BEV segmentation. A detailed robustness analysis is also conducted on the PETR framework. We hope PETRv2 can serve as a unified framework for 3D perception.
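As a rough illustration of the segmentation-query idea, the sketch below stitches per-query patch predictions back into a single BEV map. It assumes the BEV map is a square grid split into equal, non-overlapping patches in row-major order, one per query. The function name `assemble_bev_map`, the layout, and the toy sizes are hypothetical choices for illustration and are not taken from PETRv2.

```python
from typing import List

# One patch: patch_size rows, each holding patch_size class labels.
Patch = List[List[int]]


def assemble_bev_map(patches: List[Patch], grid: int, patch_size: int) -> List[List[int]]:
    """Stitch per-query patch predictions into one square BEV map.

    Assumption (not from the paper): query q covers the patch at
    row q // grid and column q % grid of a grid x grid layout.
    """
    if len(patches) != grid * grid:
        raise ValueError("expected exactly one patch per segmentation query")

    side = grid * patch_size
    bev = [[0] * side for _ in range(side)]

    for q, patch in enumerate(patches):
        top = (q // grid) * patch_size   # first BEV row covered by query q
        left = (q % grid) * patch_size   # first BEV column covered by query q
        for r in range(patch_size):
            for c in range(patch_size):
                bev[top + r][left + c] = patch[r][c]
    return bev


# Toy example: 4 queries on a 2 x 2 grid; each query fills its 2 x 2 patch
# with its own index so the layout is easy to inspect.
patches = [[[q] * 2 for _ in range(2)] for q in range(4)]
bev_map = assemble_bev_map(patches, grid=2, patch_size=2)
for row in bev_map:
    print(row)
```

In a real model each patch would come from a query's decoded output rather than a constant fill. The point here is only the fixed query-to-patch correspondence.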