We investigate compositional structures in vector embeddings of data from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate label representations from a text encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" which can be used to generate new concepts in an efficient way. We present a theoretical framework for understanding linear compositionality, drawing connections with mathematical representation theory and previous definitions of disentanglement. We provide theoretical and empirical evidence that ideal words provide good compositional approximations of composite concepts and can be more effective than token-based decompositions of the same concepts.
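The additive decomposition described above can be illustrated with a small numerical sketch. Here we approximate embeddings of composite labels indexed by an attribute-object grid (e.g. "red car") as a global mean plus one "ideal word" vector per attribute and per object, with each ideal word obtained by averaging over the grid. The toy embeddings are random stand-ins for a real text encoder such as CLIP; the array names, grid sizes, and embedding dimension are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr, n_obj, d = 3, 4, 16          # attributes, objects, embedding dim

# u[a, o] stands in for the text embedding of the composite label (a, o)
u = rng.normal(size=(n_attr, n_obj, d))

u0 = u.mean(axis=(0, 1))             # global mean embedding
w_attr = u.mean(axis=1) - u0         # one ideal-word vector per attribute
w_obj = u.mean(axis=0) - u0          # one ideal-word vector per object

# Compositional approximation of every label from the smaller vector set:
# u(a, o) ~ u0 + w_attr[a] + w_obj[o]
approx = u0 + w_attr[:, None, :] + w_obj[None, :, :]

# On a full balanced grid, this averaging construction is the least-squares
# projection onto additive decompositions, so the residual is orthogonal
# to the approximation.
residual = u - approx
print(abs((residual * approx).sum()) < 1e-8)   # True
```

The key efficiency point is vocabulary size: the grid has `n_attr * n_obj` composite labels, but the decomposition stores only `1 + n_attr + n_obj` vectors, from which any combination (including unseen ones) can be composed by addition.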