We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
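To make the kind of linear compositional structure described above concrete, the short Python sketch below approximates CLIP text embeddings of color-object phrases as a sum of a shared context vector and per-factor vectors obtained by averaging over the other factor. This is only an illustrative sketch: the prompt template, the color/object vocabulary, and the use of the Hugging Face transformers CLIP implementation with the "openai/clip-vit-base-patch32" checkpoint are assumptions for demonstration, not the paper's exact experimental setup.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Illustrative model choice (assumption); any CLIP text encoder would do.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    colors = ["red", "green", "blue"]       # hypothetical factor 1
    objects = ["car", "bird", "house"]      # hypothetical factor 2

    # Embed every (color, object) pair with a fixed prompt template.
    prompts = [f"a photo of a {c} {o}" for c in colors for o in objects]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        z = model.get_text_features(**inputs)        # (9, 512) for this checkpoint
    z = z / z.norm(dim=-1, keepdim=True)             # unit-normalize, as is standard for CLIP
    z = z.reshape(len(colors), len(objects), -1)     # z[c, o] = embedding of pair (c, o)

    # Additive approximation z[c, o] ~ mu + u_c + v_o, where mu is the grand
    # mean and u_c, v_o are centered per-factor means ("ideal word" vectors).
    mu = z.mean(dim=(0, 1))
    u = z.mean(dim=1) - mu                           # one vector per color
    v = z.mean(dim=0) - mu                           # one vector per object

    approx = mu + u[:, None, :] + v[None, :, :]
    err = (z - approx).norm(dim=-1).mean()
    print(f"mean residual norm of compositional approximation: {err:.4f}")

A small mean residual would indicate that these embeddings are approximately compositional in the sense above; the factor vectors can then be recombined (e.g., mu + u[0] + v[2] for "red house") to synthesize an embedding for a pair that was never encoded directly.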