With the increasing attention paid to large vision-language models such as CLIP, significant effort has been devoted to building effective prompts. Unlike conventional methods that learn only a single prompt, we propose to learn multiple comprehensive prompts that describe diverse characteristics of each category, such as intrinsic attributes or extrinsic contexts. However, directly matching every prompt to the same visual feature is problematic, as it pushes the prompts to converge to a single point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model each image and each category as a set of visual and textual features, respectively. Then, we apply a two-stage optimization strategy to learn the prompts: in the inner loop, we minimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm; in the outer loop, we learn the prompts from the supervised data using this distance. Extensive experiments on few-shot recognition tasks demonstrate the superiority of our method.
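The inner-loop alignment described above can be sketched with a minimal entropic-regularized Sinkhorn iteration. This is an illustrative NumPy sketch only, not the paper's implementation: the function name `sinkhorn`, the toy dimensions (4 visual features, 3 prompts), and the cost matrix are assumptions, with the cost typically taken as one minus the cosine similarity between visual feature and prompt embeddings.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    cost : (M, N) cost matrix between M visual features and N prompts,
           e.g. 1 - cosine similarity.
    a, b : marginal distributions over the two feature sets (uniform here).
    Returns the transport plan T of shape (M, N).
    """
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        # alternate scaling updates: v = b / (K^T u), u = a / (K v)
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

# toy example: 4 local visual features vs 3 learned prompts
rng = np.random.default_rng(0)
C = rng.random((4, 3))                      # hypothetical cost matrix
a = np.full(4, 1 / 4)                       # uniform marginal over features
b = np.full(3, 1 / 3)                       # uniform marginal over prompts
T = sinkhorn(C, a, b)
ot_dist = (T * C).sum()                     # OT distance used as the matching score
```

In the outer loop, `ot_dist` would serve as the (differentiable, through the fixed transport plan) distance between an image and a category, driving the prompt updates from the supervised data.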