Though 3D object detection from point clouds has achieved rapid progress in recent years, the lack of flexible and high-performance proposal refinement remains a great hurdle for existing state-of-the-art two-stage detectors. Previous works on refining 3D proposals have relied on human-designed components such as keypoint sampling, set abstraction, and multi-scale feature fusion to produce powerful 3D object representations. Such methods, however, have limited ability to capture rich contextual dependencies among points. In this paper, we leverage a high-quality region proposal network and a channel-wise Transformer architecture to constitute our two-stage 3D object detection framework (CT3D) with minimal hand-crafted design. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses the proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. Extensive experiments demonstrate that our CT3D method has superior performance and excellent scalability. Remarkably, CT3D achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark, outperforming state-of-the-art 3D detectors.
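To make the channel-wise re-weighting idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: alongside the standard decoder attention that assigns one scalar weight per key, per-channel weights over the keys are computed and used to modulate the aggregation, so different feature channels can attend to different points. All function names, shapes, and the specific combination rule here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channelwise_decode(q, K, V):
    """Toy channel-wise re-weighted decoding for one query.

    q: (d,) query vector; K, V: (N, d) keys and values for N points.
    Returns a (d,) aggregated feature where standard scalar attention
    weights are modulated per channel (illustrative only, not CT3D's
    exact formulation).
    """
    d = q.shape[-1]
    # standard attention: one scalar weight per key (point)
    s = softmax(q @ K.T / np.sqrt(d), axis=-1)        # (N,)
    # channel-wise weights: softmax over keys, separately per channel
    c = softmax((q * K) / np.sqrt(d), axis=0)         # (N, d)
    # re-weight: modulate scalar weights channel by channel
    w = s[:, None] * c                                # (N, d)
    w = w / w.sum(axis=0, keepdims=True)              # renormalize per channel
    return (w * V).sum(axis=0)                        # (d,)
```

In a full decoder this would run per query and per head; the point of the sketch is only that the attention map gains a channel dimension, letting multi-level contexts be merged with finer granularity than a single scalar weight per point.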