In recent years, the success of large-scale vision-language models (VLMs) such as CLIP has led to their increased use in various computer vision tasks. These models enable zero-shot inference through carefully crafted instructional text prompts, without task-specific supervision. However, the potential of VLMs for generalization tasks in remote sensing (RS) has not been fully realized. To address this research gap, we propose a novel image-conditioned prompt learning strategy called the Visual Attention Parameterized Prompt Learning Network (APPLeNet). APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks. To achieve this, APPLeNet combines visual content features obtained from different layers of the vision encoder with style properties obtained from the feature statistics of domain-specific batches. An attention-driven injection module is further introduced to generate visual tokens from this information. We also introduce an anti-correlation regularizer to ensure discrimination among the token embeddings, since this visual information is combined with the textual tokens. To validate APPLeNet, we curated four available RS benchmarks and introduced experimental protocols and datasets for three domain generalization tasks. Our results consistently outperform the relevant literature, and the code is available at https://github.com/mainaksingha01/APPLeNet.
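To make the pipeline described above concrete, the following is a minimal sketch (not the authors' released code) of how an attention-driven injection module might fuse multi-scale content features with batch-level style statistics into visual tokens, together with an anti-correlation regularizer over those tokens. All module names (`AttentionInjection`, `anti_correlation_loss`), feature dimensions, the number of tokens, and the use of per-batch mean and standard deviation as style statistics are illustrative assumptions; the actual implementation is in the repository linked above.

```python
# A hedged sketch, assuming a CLIP-style vision encoder that exposes
# per-layer (multi-scale) global features of a common dimension.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionInjection(nn.Module):
    """Fuse multi-scale content features and batch style statistics into visual tokens."""

    def __init__(self, feat_dim: int = 512, num_tokens: int = 4, num_heads: int = 4):
        super().__init__()
        # Learnable queries, one per generated visual token (count is an assumption).
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, multi_scale_feats: list) -> torch.Tensor:
        # multi_scale_feats: list of (B, feat_dim) features from different encoder layers.
        content = torch.stack(multi_scale_feats, dim=1)            # (B, L, D) content primitives
        # Style as batch-level feature statistics (mean/std), broadcast to every sample.
        style = torch.cat([content.mean(dim=0, keepdim=True),
                           content.std(dim=0, keepdim=True)], dim=1)
        style = style.expand(content.size(0), -1, -1)              # (B, 2L, D)
        kv = torch.cat([content, style], dim=1)                    # keys/values: content + style
        q = self.queries.unsqueeze(0).expand(content.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                           # attention-driven injection
        return self.proj(tokens)                                   # (B, num_tokens, D) visual tokens


def anti_correlation_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among tokens so they stay discriminative."""
    t = F.normalize(tokens, dim=-1)                                # (B, M, D)
    sim = torch.bmm(t, t.transpose(1, 2))                          # (B, M, M) cosine similarities
    off_diag = sim - torch.eye(t.size(1), device=t.device)         # discard self-similarity
    return off_diag.pow(2).mean()


if __name__ == "__main__":
    B, D = 8, 512
    feats = [torch.randn(B, D) for _ in range(3)]                  # e.g. three encoder stages
    injector = AttentionInjection(feat_dim=D)
    visual_tokens = injector(feats)                                # would be merged with text tokens
    reg = anti_correlation_loss(visual_tokens)
    print(visual_tokens.shape, reg.item())
```

In this reading, the generated visual tokens would then be concatenated with the learnable textual prompt tokens before the text encoder, and the regularizer would be added to the classification objective; both choices are stated here only at the level of detail the abstract provides.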