With the increasing attention to pre-trained vision-language models (VLMs), \eg, CLIP, substantial effort has been devoted to many downstream tasks, especially test-time adaptation (TTA). However, previous works learn prototypes only in the textual modality, overlooking the ambiguous semantics in class names. These ambiguities yield textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce \textbf{ProtoMM}, a training-free framework that constructs multimodal prototypes to adapt VLMs at test time. By viewing a prototype as a discrete distribution over textual descriptions and visual particles, ProtoMM can combine multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the test stream flows, allowing our multimodal prototypes to continually learn from the data and enhancing their generalizability to unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, which achieves a 1.03\% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.
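As a minimal sketch of the optimal transport formulation mentioned above (the cosine cost, the entropic regularizer, and the symbols $M$, $N$, $\mathbf{p}$, $\mathbf{q}$ are illustrative assumptions rather than the paper's exact choices), the semantic distance between a class prototype $P_c$, viewed as a discrete distribution over $M$ textual-description and visual-particle embeddings $\mathbf{u}_i$ with weights $\mathbf{p}$, and a test image $Q$, represented by $N$ feature embeddings $\mathbf{v}_j$ with weights $\mathbf{q}$, could be written as
% Illustrative entropic OT sketch; cost and regularization are assumptions.
\begin{equation*}
  d(P_c, Q) \;=\; \min_{\mathbf{T}\,\in\,\Pi(\mathbf{p},\mathbf{q})}
  \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij}\,\bigl(1-\cos(\mathbf{u}_i,\mathbf{v}_j)\bigr)
  \;-\; \varepsilon H(\mathbf{T}),
\end{equation*}
where $\Pi(\mathbf{p},\mathbf{q})=\{\mathbf{T}\ge 0 : \mathbf{T}\mathbf{1}=\mathbf{p},\ \mathbf{T}^{\top}\mathbf{1}=\mathbf{q}\}$ is the set of admissible transport plans and $H(\mathbf{T})=-\sum_{i,j}T_{ij}\log T_{ij}$ is the entropy term that permits Sinkhorn-style solvers; the resulting transport plan $\mathbf{T}$ provides the importance weights between prototype elements and test-image features.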