We propose PSFormer, an effective point transformer model for 3D salient object detection. PSFormer is an encoder-decoder network that takes full advantage of transformers to model contextual information in both multi-scale point-wise and scene-wise manners. In the encoder, we develop a Point Context Transformer (PCT) module to capture regional contextual features at the point level; PCT contains two different transformers to mine the relationships among points. In the decoder, we develop a Scene Context Transformer (SCT) module to learn context representations at the scene level; SCT contains both Upsampling-and-Transformer blocks and Multi-context Aggregation units to integrate the global semantics and multi-level features from the encoder into the global scene context. Experiments show clear improvements of PSFormer over its competitors and validate that PSFormer is more robust to challenging cases such as small objects, multiple objects, and objects with complex structures.
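To make the encoder-decoder layout concrete, the following is a minimal PyTorch sketch of the structure described above, not the authors' implementation: the module names PointContextTransformer and SceneContextTransformer, the use of standard multi-head attention, and all dimensions and stage counts are illustrative assumptions.

```python
# Hypothetical sketch of the PSFormer encoder-decoder layout described in the abstract.
# All module internals, shapes, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class PointContextTransformer(nn.Module):
    """Assumed stand-in for PCT: self-attention over per-point features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, C) point features
        ctx, _ = self.attn(x, x, x)            # relate every point to all others
        return self.norm(x + ctx)


class SceneContextTransformer(nn.Module):
    """Assumed stand-in for SCT: fuse decoder features with an encoder skip
    connection into a scene-level context representation."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, dec, skip):              # both (B, N, C)
        ctx, _ = self.attn(dec, skip, skip)    # query decoder feats against the skip
        return self.fuse(torch.cat([dec + ctx, skip], dim=-1))


class PSFormerSketch(nn.Module):
    """Toy encoder-decoder: two PCT stages, two SCT stages, per-point saliency head."""
    def __init__(self, in_dim=3, dim=64):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.enc1 = PointContextTransformer(dim)
        self.enc2 = PointContextTransformer(dim)
        self.dec2 = SceneContextTransformer(dim)
        self.dec1 = SceneContextTransformer(dim)
        self.head = nn.Linear(dim, 1)          # per-point saliency score

    def forward(self, pts):                    # pts: (B, N, 3) xyz coordinates
        f0 = self.embed(pts)
        f1 = self.enc1(f0)                     # point-level context, stage 1
        f2 = self.enc2(f1)                     # point-level context, stage 2
        d2 = self.dec2(f2, f1)                 # scene-level fusion with skip f1
        d1 = self.dec1(d2, f0)                 # scene-level fusion with skip f0
        return torch.sigmoid(self.head(d1))    # (B, N, 1) saliency map


if __name__ == "__main__":
    saliency = PSFormerSketch()(torch.rand(2, 1024, 3))
    print(saliency.shape)                      # torch.Size([2, 1024, 1])
```

The sketch omits the multi-scale downsampling and upsampling between stages; it is only meant to show how point-level attention in the encoder and skip-connected scene-level fusion in the decoder could fit together.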