Numerous works have analyzed biases in vision models and pre-trained language models individually; however, less attention has been paid to how these biases interact in multimodal settings. This work extends text-based bias analysis methods to multimodal language models, analyzing the intra- and inter-modality associations and biases learned by these models. Specifically, we demonstrate that VL-BERT (Su et al., 2020) exhibits gender biases, often preferring to reinforce a stereotype over faithfully describing the visual scene. We demonstrate these findings on a controlled case study and extend them to a larger set of stereotypically gendered entities.