Mainstream CNN-based remote sensing (RS) image semantic segmentation approaches typically rely on massive labeled training data. Such a paradigm struggles with RS multi-view scene segmentation when only a few labeled views are available, because it does not exploit the 3D information within the scene. In this paper, we propose the Implicit Ray-Transformer (IRT), based on Implicit Neural Representation (INR), for RS scene semantic segmentation with sparse labels (such as 4-6 labels per 100 images). We explore a new way of introducing multi-view 3D structure priors to the task for accurate and view-consistent semantic segmentation. The proposed method includes a two-stage learning process. In the first stage, we optimize a neural field to encode the color and 3D structure of the remote sensing scene from multi-view images. In the second stage, we design a Ray Transformer that leverages the relations between the neural field's 3D features and 2D texture features to learn better semantic representations. Unlike previous methods that consider only a 3D prior or only 2D features, we incorporate both by broadcasting CNN texture features to the point features sampled along each ray. To verify the effectiveness of the proposed method, we construct a challenging dataset containing six synthetic sub-datasets collected from the CARLA platform and three real sub-datasets from Google Maps. Experiments show that the proposed method outperforms CNN-based methods and state-of-the-art INR-based segmentation methods in both quantitative and qualitative metrics.
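To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of the ray-level fusion described above: a 2D CNN texture feature is broadcast to the neural-field features of the points sampled along a ray, and a small transformer relates the per-point tokens before compositing a pixel-level semantic prediction. All module names, feature dimensions, and the weighted compositing step are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of the Ray Transformer fusion idea (assumed design,
# not the paper's exact architecture or hyperparameters).
import torch
import torch.nn as nn

class RayTransformer(nn.Module):
    def __init__(self, d_field=32, d_tex=64, d_model=128,
                 n_classes=13, n_heads=4, n_layers=2):
        super().__init__()
        # Project concatenated (3D point feature + broadcast 2D texture feature).
        self.proj = nn.Linear(d_field + d_tex, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, point_feats, tex_feat, weights):
        # point_feats: (B, N, d_field) neural-field features of N samples per ray
        # tex_feat:    (B, d_tex)      2D CNN feature at the ray's pixel
        # weights:     (B, N)          volume-rendering weights along the ray
        B, N, _ = point_feats.shape
        tex = tex_feat.unsqueeze(1).expand(B, N, -1)  # broadcast along the ray
        tokens = self.proj(torch.cat([point_feats, tex], dim=-1))
        tokens = self.encoder(tokens)                 # relate points via attention
        per_point_logits = self.head(tokens)          # (B, N, n_classes)
        # Composite per-point logits into a single pixel-level prediction
        # (assumed here: weighted sum with the rendering weights).
        return (weights.unsqueeze(-1) * per_point_logits).sum(dim=1)

# Usage with dummy tensors: 8 rays, 64 samples each.
model = RayTransformer()
pf = torch.randn(8, 64, 32)
tf = torch.randn(8, 64)
w = torch.softmax(torch.randn(8, 64), dim=-1)
logits = model(pf, tf, w)  # (8, 13) semantic logits, one row per ray/pixel
```

Broadcasting the single pixel-aligned CNN feature to every sampled point lets the attention layers decide how 2D texture evidence should modulate each 3D point's contribution, which is the core intuition behind combining the two feature sources.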