Recently, pre-trained multimodal models such as CLIP have received a surge of attention for their exceptional ability to connect images and natural language. Their English textual representations transfer well to other languages and support promising downstream multimodal tasks across languages. Nevertheless, prior fairness discourse in vision-and-language learning has focused mainly on monolingual representational biases and has rarely scrutinized the principles of multilingual fairness in this multimodal setting, where each language is treated as a group of individuals and images provide the universal grounding that bridges languages. In this paper, we provide a nuanced understanding of individual fairness and group fairness by viewing a language as the recipient of fairness notions. We define new fairness notions in the multilingual context and analytically show that pre-trained vision-and-language representations are individually fair across languages but are not guaranteed to be group fair. Furthermore, we conduct extensive experiments to explore the prevalent group disparities across languages and protected groups, including race, gender, and age.
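The individual-versus-group distinction above can be illustrated with a minimal sketch. The embeddings below are synthetic stand-ins for a CLIP-style encoder (no real model is loaded, and the language codes and noise levels are illustrative assumptions, not results from the paper): individual fairness asks whether one image is scored near-equally against semantically equivalent captions in different languages, while group fairness asks whether aggregate performance (here, toy retrieval accuracy) is equal across languages.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# --- Individual fairness: one image, equivalent captions in two languages.
# Synthetic embeddings: both captions are small perturbations of the image
# embedding, mimicking a well-aligned multilingual text encoder.
img = rng.normal(size=8)
cap_en = img + 0.01 * rng.normal(size=8)  # "English" caption embedding
cap_de = img + 0.01 * rng.normal(size=8)  # "German" caption embedding
gap_individual = abs(cos(img, cap_en) - cos(img, cap_de))
print(f"individual similarity gap: {gap_individual:.4f}")  # near zero

# --- Group disparity: per-language retrieval accuracy can still differ
# if one language's text embeddings are noisier (e.g. low-resource).
N, d = 50, 8
imgs = rng.normal(size=(N, d))
caps = {
    "en": imgs + 0.05 * rng.normal(size=(N, d)),  # well-aligned language
    "xx": imgs + 0.80 * rng.normal(size=(N, d)),  # hypothetical noisy one
}

accs = {}
for lang, C in caps.items():
    # Row-normalize, then S[i, j] = cosine(image i, caption j).
    S = (imgs / np.linalg.norm(imgs, axis=1, keepdims=True)) @ \
        (C / np.linalg.norm(C, axis=1, keepdims=True)).T
    accs[lang] = float((S.argmax(axis=1) == np.arange(N)).mean())
    print(f"{lang} retrieval accuracy: {accs[lang]:.2f}")
```

In this toy setup the per-image similarity gap is negligible (individually fair), yet the noisier language's retrieval accuracy lags the other's, which is the kind of group disparity the experiments in the paper probe across real languages and protected groups.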