With the increasing attention to large vision-language models such as CLIP, significant effort has been devoted to building efficient prompts. Unlike conventional methods that learn a single prompt, we propose to learn multiple comprehensive prompts that describe diverse characteristics of each category, such as intrinsic attributes or extrinsic contexts. However, directly matching every prompt to the same visual feature is problematic, as it pushes the prompts to converge to a single point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model the images and the categories as visual and textual feature sets. Then we apply a two-stage optimization strategy to learn the prompts: in the inner loop, we optimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm; in the outer loop, we learn the prompts using this distance on the supervised data. Extensive experiments on the few-shot recognition task demonstrate the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.
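To make the inner loop concrete, below is a minimal NumPy sketch of the entropy-regularized Sinkhorn iterations that solve the optimal transport problem between a set of local visual features and a set of prompt features. The specific shapes (49 local visual features, 4 prompts), the cost choice (1 − cosine similarity), the uniform marginals, and all function and variable names here are illustrative assumptions for exposition, not taken from the abstract.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """Entropy-regularized OT via Sinkhorn scaling iterations.

    cost: (M, N) cost matrix, e.g. 1 - cosine similarity between
          M local visual features and N prompt features (assumed here).
    a, b: marginal distributions over the two feature sets.
    Returns the transport plan T of shape (M, N).
    """
    K = np.exp(-cost / eps)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # alternate scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # T = diag(u) K diag(v)

# Illustrative usage: 49 local visual features vs. 4 prompts.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(49, 512))
txt_feats = rng.normal(size=(4, 512))
img_feats /= np.linalg.norm(img_feats, axis=1, keepdims=True)
txt_feats /= np.linalg.norm(txt_feats, axis=1, keepdims=True)
C = 1.0 - img_feats @ txt_feats.T         # (49, 4) cosine cost
a = np.full(49, 1 / 49)                   # uniform marginals (assumed)
b = np.full(4, 1 / 4)
T = sinkhorn(C, a, b)
ot_distance = float((T * C).sum())        # matching score for the outer loop
```

In the outer loop, this OT distance would serve as the image-class matching score whose gradient (with the plan T treated as fixed) updates the learnable prompt embeddings under the supervised loss.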