Recent advancements in foundation models (FMs), such as GPT-4 and LLaMA, have attracted significant attention due to their exceptional performance in zero-shot learning scenarios. Similarly, in visual learning, models such as Grounding DINO and the Segment Anything Model (SAM) have shown remarkable progress on open-set detection and instance segmentation tasks. These FMs will undoubtedly have a profound impact on a wide range of real-world visual learning tasks, ushering in a paradigm shift for developing such models. In this study, we concentrate on the remote sensing domain, where images differ markedly from those in conventional scenarios. We develop a pipeline, which we denote Text2Seg, that leverages multiple FMs to perform remote sensing image semantic segmentation guided by text prompts. We benchmark the pipeline on several widely used remote sensing datasets and present preliminary results that demonstrate its effectiveness. Through this work, we aim to provide insights into maximizing the applicability of visual FMs to specific domains with minimal model tuning. The code is available at https://github.com/Douglas2Code/Text2Seg.
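To make the described pipeline concrete, below is a minimal sketch of how a text-prompted segmentation flow of this kind might be wired together: an open-set detector proposes boxes for the text prompt, and SAM converts those boxes into masks. This is an illustrative assumption, not the Text2Seg implementation (see the linked repository for that); the Grounding DINO step is represented by a hypothetical `detect_boxes` placeholder, the SAM calls use the public `segment_anything` API, and the checkpoint and file paths are assumptions.

```python
# Sketch of a text-prompt-guided segmentation pipeline: text prompt -> boxes -> masks.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor


def detect_boxes(image: np.ndarray, text_prompt: str) -> np.ndarray:
    """Hypothetical stand-in for an open-set detector such as Grounding DINO.

    Expected to return an (N, 4) array of XYXY pixel boxes for regions matching
    `text_prompt`. Replace with a real detector inference call.
    """
    raise NotImplementedError


def text_prompted_masks(image_path: str, text_prompt: str) -> list:
    image = np.array(Image.open(image_path).convert("RGB"))

    # 1) Text prompt -> candidate boxes (open-set detection step).
    boxes = detect_boxes(image, text_prompt)

    # 2) Boxes -> masks with SAM (checkpoint path is an assumption).
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    masks = []
    for box in boxes:
        # predict() returns masks of shape (1, H, W) with multimask_output=False.
        mask, _, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(mask[0])
    return masks


if __name__ == "__main__":
    # Hypothetical usage on a remote sensing tile.
    masks = text_prompted_masks("tile_0001.png", "building")
    print(f"Found {len(masks)} building masks")
```

The design choice illustrated here, prompting a segmentation model with boxes produced by a text-conditioned detector, is what allows semantic labels to be attached to SAM's otherwise class-agnostic masks with no additional model tuning.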