Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

In this paper, we propose a new paradigm, named Historical Object Prediction (HoP), for multi-view 3D detection that leverages temporal information more effectively. The HoP approach is straightforward: given the current timestamp t, we generate a pseudo Bird's-Eye View (BEV) feature for timestamp t-k from its adjacent frames and use this feature to predict the object set at timestamp t-k. Our approach is motivated by the observation that forcing the detector to capture both the spatial location and the temporal motion of objects at historical timestamps leads to more accurate BEV feature learning. First, we design short-term and long-term temporal decoders, which generate the pseudo BEV feature for timestamp t-k without using the camera images of that timestamp. Second, an additional object decoder is attached to predict the object targets from the generated pseudo BEV feature. Note that we perform HoP only during training, so the proposed method introduces no extra overhead during inference. As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks, including BEVFormer and the BEVDet series. Furthermore, the auxiliary HoP approach is complementary to prevalent temporal modeling methods, leading to significant performance gains. We conduct extensive experiments on the nuScenes dataset, using the representative methods BEVFormer and BEVDet4D-Depth as baselines, to evaluate the effectiveness of HoP. HoP achieves 68.5% NDS and 62.4% mAP with ViT-L on the nuScenes test set, outperforming all the 3D object detectors on the leaderboard. Code will be available at https://github.com/Sense-X/HoP.

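
The training-only control flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `pseudo_bev`, `decode_objects`, `step`, and `hop_weight` are hypothetical, the BEV feature is reduced to a flat list of floats, and the temporal decoder is replaced by a plain average of the two neighbouring frames. The sketch only aims to show the two properties the abstract states: the feature for timestamp t-k is rebuilt without reading frame t-k itself, and the auxiliary branch runs only when training.

```python
from typing import Dict, List, Optional, Tuple

Bev = List[float]  # toy stand-in for a flattened BEV feature map


def pseudo_bev(history: Dict[int, Bev], t: int, k: int) -> Bev:
    """Toy stand-in for the temporal decoders (assumption: simple averaging).

    Rebuilds the BEV feature of frame t-k from its neighbours t-k-1 and
    t-k+1. The real frame t-k is never read.
    """
    prev_f = history[t - k - 1]
    next_f = history[t - k + 1]
    return [(a + b) / 2.0 for a, b in zip(prev_f, next_f)]


def decode_objects(bev: Bev) -> float:
    """Toy stand-in for an object decoder: returns one scalar 'prediction'."""
    return sum(bev) / len(bev)


def step(
    history: Dict[int, Bev],
    t: int,
    k: int,
    target_t: float,
    target_tk: float,
    training: bool,
    hop_weight: float = 0.5,  # hypothetical loss weight, not from the paper
) -> Tuple[float, Optional[float]]:
    """Return (prediction at t, total loss); the loss is None at inference."""
    pred_t = decode_objects(history[t])
    if not training:
        # Inference: the HoP branch is skipped, so it adds no overhead.
        return pred_t, None

    main_loss = (pred_t - target_t) ** 2

    # Auxiliary HoP branch: predict the objects of the historical frame t-k
    # from a pseudo feature built without that frame's own data.
    pred_tk = decode_objects(pseudo_bev(history, t, k))
    hop_loss = (pred_tk - target_tk) ** 2

    return pred_t, main_loss + hop_weight * hop_loss
```

In a real detector the averaging would be replaced by the learned short-term and long-term temporal decoders, and the scalar losses by the detection losses of the host framework; only the branching on `training` is meant to carry over.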