Deterministic prompt learning has recently emerged as a promising approach for various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, this approach yields limited performance on dense prediction tasks, which must handle more complex and diverse objects, since a single deterministic description cannot sufficiently represent an entire image. In this paper, we present a novel probabilistic prompt learning method that fully exploits vision-language knowledge for dense prediction tasks. First, we introduce learnable class-agnostic attribute prompts that describe universal attributes shared across object classes. These attribute prompts are combined with class information and visual-context knowledge to define a class-specific textual distribution. Text representations sampled from this distribution guide the dense prediction task through a probabilistic pixel-text matching loss, enhancing the stability and generalization capability of the proposed method. Extensive experiments on different dense prediction tasks, together with ablation studies, demonstrate the effectiveness of the proposed method.
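The abstract describes the pipeline only at a high level; the following is a minimal PyTorch sketch of one possible reading of it, not the authors' implementation. All names and design details here are assumptions introduced for illustration: the module `ProbabilisticPromptHead`, the learnable `attr_prompts`, the `to_mu`/`to_logvar` heads, the use of a global average-pooled visual context, the Gaussian reparameterization, and the softmax temperature.

```python
# Minimal sketch (assumed shapes and names, not the paper's code): learnable
# class-agnostic attribute prompts are fused with frozen class text embeddings
# and a visual context vector to parameterize a per-class Gaussian over text
# representations; sampled representations are matched against pixel features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticPromptHead(nn.Module):
    """Hypothetical head: builds a class-specific Gaussian over text
    representations and scores pixels against samples drawn from it."""

    def __init__(self, num_attrs, dim, num_samples=4):
        super().__init__()
        # Learnable class-agnostic attribute prompts shared by all classes.
        self.attr_prompts = nn.Parameter(torch.randn(num_attrs, dim) * 0.02)
        # Map (class + attributes + visual context) to mean / log-variance.
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        self.num_samples = num_samples

    def forward(self, class_emb, pixel_feat):
        """
        class_emb:  (C, D)       frozen text embeddings, one per class
        pixel_feat: (B, D, H, W) dense visual features from the image encoder
        returns:    (B, C, H, W) pixel-text matching logits
        """
        # Global visual context, one vector per image.
        context = pixel_feat.mean(dim=(2, 3))                           # (B, D)
        # Combine class information, attribute prompts, and visual context.
        attr = self.attr_prompts.mean(dim=0)                            # (D,)
        fused = class_emb[None] + attr[None, None] + context[:, None]   # (B, C, D)
        mu = self.to_mu(fused)                                          # (B, C, D)
        logvar = self.to_logvar(fused)                                  # (B, C, D)

        # Sample text representations with the reparameterization trick.
        std = (0.5 * logvar).exp()
        eps = torch.randn(self.num_samples, *mu.shape, device=mu.device)
        samples = F.normalize(mu[None] + std[None] * eps, dim=-1)       # (S, B, C, D)

        # Pixel-text matching: cosine similarity, averaged over samples.
        pix = F.normalize(pixel_feat, dim=1)                            # (B, D, H, W)
        logits = torch.einsum('sbcd,bdhw->sbchw', samples, pix)         # (S, B, C, H, W)
        return logits.mean(dim=0)                                       # (B, C, H, W)


if __name__ == "__main__":
    head = ProbabilisticPromptHead(num_attrs=8, dim=512)
    class_emb = F.normalize(torch.randn(19, 512), dim=-1)   # e.g. 19 classes
    pixel_feat = torch.randn(2, 512, 32, 32)
    labels = torch.randint(0, 19, (2, 32, 32))
    logits = head(class_emb, pixel_feat)
    # A simple probabilistic pixel-text matching loss: cross-entropy over
    # sample-averaged similarities with an assumed temperature of 0.07.
    loss = F.cross_entropy(logits / 0.07, labels)
    print(logits.shape, loss.item())
```

Averaging the similarities over several sampled text representations is one straightforward way to realize a probabilistic pixel-text matching objective; the loss actually used in the paper may differ in form.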