Transformer-based approaches have driven recent progress in multi-camera 3D detection in both academia and industry. In a vanilla transformer architecture, queries are randomly initialised and optimised over the whole dataset, without considering the differences among input frames. In this work, we propose to leverage the predictions from an image backbone, which is often highly optimised for 2D tasks, as priors for the transformer part of a 3D detection network. The method works by (1) augmenting image feature maps with 2D priors, (2) sampling query locations via ray-casting along 2D box centroids, and (3) initialising query features with object-level image features. Experimental results show that 2D priors not only help the model converge faster, but also improve over the baseline approach by up to 12% in terms of average precision.
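Step (2) can be illustrated with a minimal sketch of casting a ray through each 2D box centroid and sampling candidate 3D query locations along it. This is not the authors' implementation: it assumes a pinhole camera model, and the function name `cast_ray_queries`, its parameters, and the choice of fixed depth samples are hypothetical, introduced only for illustration.

```python
import numpy as np


def cast_ray_queries(centroids_px, K, cam_to_ego, depths):
    """Lift 2D box centroids to candidate 3D query locations (hypothetical helper).

    centroids_px: (N, 2) pixel coordinates (u, v) of 2D box centres.
    K:            (3, 3) pinhole intrinsics (assumed camera model).
    cam_to_ego:   (4, 4) homogeneous camera-to-ego transform.
    depths:       (D,) candidate depths along the camera z-axis.
    Returns an (N, D, 3) array of points in the ego frame.
    """
    centroids_px = np.asarray(centroids_px, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    cam_to_ego = np.asarray(cam_to_ego, dtype=np.float64)

    n = centroids_px.shape[0]
    # Homogeneous pixel coordinates (u, v, 1) for every centroid.
    pix_h = np.concatenate([centroids_px, np.ones((n, 1))], axis=1)
    # Back-project through the intrinsics: each row is a ray with z = 1.
    rays = pix_h @ np.linalg.inv(K).T                    # (N, 3)
    # Scale every ray by every candidate depth.
    pts_cam = rays[:, None, :] * depths[None, :, None]   # (N, D, 3)
    # Move the sampled points from the camera frame to the ego frame.
    R, t = cam_to_ego[:3, :3], cam_to_ego[:3, 3]
    return pts_cam @ R.T + t
```

With intrinsics of focal length 1000 px and principal point (800, 450), a centroid at pixel (900, 450) gives the ray direction (0.1, 0, 1), so a depth sample of 10 places a query at (1, 0, 10) in the camera frame. How the depth samples are chosen, and how the resulting points are turned into transformer queries, is not specified in the abstract and is left open here.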