Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models have not yet been sufficiently explored in remote sensing (RS). In this paper, we employ plain vision transformers with approximately 100 million parameters, making the first attempt to propose large vision models tailored to RS tasks and to investigate how such large models perform. To handle the large image sizes and the arbitrarily oriented objects in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which significantly reduces the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared with existing advanced methods. Further experiments demonstrate the advantages of our models in terms of computational complexity and data efficiency in transfer learning.
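To make the claimed cost reduction concrete, here is a minimal NumPy sketch (not the authors' implementation) contrasting full attention over all tokens with attention restricted to local windows. The window size, feature dimensions, and the simple axis-aligned partition are illustrative assumptions; the paper's rotated varied-size windows additionally learn per-window scale, offset, and rotation, which this sketch omits.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(feat):
    # feat: (H, W, C) -> flatten all H*W tokens and attend globally.
    # Cost scales as O((H*W)^2 * C).
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)
    attn = softmax(x @ x.T / np.sqrt(C))
    return (attn @ x).reshape(H, W, C)

def window_attention(feat, win=8):
    # feat: (H, W, C) with H, W divisible by win; attention is
    # restricted to each win x win window, so cost scales as
    # O((H*W) * win^2 * C) -- linear in the number of tokens.
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for i in range(0, H, win):
        for j in range(0, W, win):
            x = feat[i:i + win, j:j + win].reshape(win * win, C)
            attn = softmax(x @ x.T / np.sqrt(C))
            out[i:i + win, j:j + win] = (attn @ x).reshape(win, win, C)
    return out

# On a 64x64 feature map, full attention compares 4096^2 token pairs,
# while 8x8 windows compare 64 windows * 64^2 pairs: a 64x reduction.
rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 64, 16)).astype(np.float32)
out = window_attention(feat, win=8)
print(out.shape)  # (64, 64, 16)
```

In the paper's formulation, each window's geometry (scale, offset, rotation) is predicted from the window features themselves, so the partition above would be replaced by sampling key/value tokens from a learned rotated region rather than a fixed grid.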