3D visual perception tasks based on multi-camera images are essential for autonomous driving systems. Recent work in this field performs 3D object detection by taking multi-view images as input and iteratively refining object queries (object proposals) through cross-attention to multi-view features. However, the backbone features themselves are never updated with multi-view information; they remain a mere collection of single-image backbone outputs. We therefore propose 3M3D: a Multi-view, Multi-path, Multi-representation approach to 3D object detection, in which both the multi-view features and the query features are updated to enhance the representation of the scene at both a fine panoramic level and a coarse global level. First, we update the multi-view features with multi-view axis self-attention, which incorporates panoramic information into the multi-view features and improves understanding of the global scene. Second, we update the multi-view features with self-attention over ROI (Region of Interest) windows, which encodes local fine-grained details into the features. This helps exchange information not only along the multi-view axis but also along the other spatial dimensions. Lastly, we exploit multiple representations of queries in different domains to further boost performance: sparse floating queries are used alongside dense BEV (Bird's Eye View) queries, and the results are post-processed to filter duplicate detections. We demonstrate performance improvements over our baselines on the nuScenes benchmark dataset.
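To make the first component more concrete, the sketch below shows one plausible way to implement self-attention along the multi-view axis: for every spatial location, the tokens from the N camera views attend to each other so that panoramic context can flow between views. This is only a minimal illustration under assumed conventions (PyTorch, features of shape (B, N_views, C, H, W), a hypothetical class name MultiViewAxisSelfAttention), not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiViewAxisSelfAttention(nn.Module):
    """Illustrative self-attention applied along the camera-view axis.

    For each spatial location, the N view tokens attend to one another,
    letting panoramic information propagate across the multi-view features.
    (Hypothetical sketch; not the paper's reference code.)
    """

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N_views, C, H, W) multi-view backbone features
        B, N, C, H, W = feats.shape
        # Treat each spatial location independently; the sequence axis is the view axis.
        x = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, N, C)
        out, _ = self.attn(x, x, x)          # attend across the N views
        x = self.norm(x + out)               # residual connection + layer norm
        return x.reshape(B, H, W, N, C).permute(0, 3, 4, 1, 2)

# Example usage: 6 surround-view cameras, 256-d features on a 16x44 grid
feats = torch.randn(2, 6, 256, 16, 44)
module = MultiViewAxisSelfAttention(embed_dim=256)
updated = module(feats)
print(updated.shape)  # torch.Size([2, 6, 256, 16, 44])
```

The ROI-window self-attention described next would follow the same pattern but attend over local spatial windows instead of the view axis.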