Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
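The paper defines PRSM precisely in later sections; purely as an illustration of the general idea, the sketch below assumes stability is measured as the mean Kendall rank correlation between the image ranking retrieved for a base query and the rankings retrieved for its paraphrases. The function names (`kendall_tau`, `paraphrase_stability`) and the choice of Kendall's tau are assumptions for this sketch, not the paper's actual formulation.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    # Kendall rank correlation between two rankings of the same items:
    # (concordant pairs - discordant pairs) / total pairs, in [-1, 1].
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs

def paraphrase_stability(base_ranking, paraphrase_rankings):
    # Hypothetical stability score: average agreement between the base
    # query's retrieval ranking and each paraphrase's ranking.
    taus = [kendall_tau(base_ranking, r) for r in paraphrase_rankings]
    return sum(taus) / len(taus)
```

In practice the rankings would come from sorting a candidate image set by CLIP image-text similarity for each query variant; identical rankings yield a stability of 1.0, and a fully reversed ranking yields -1.0.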