In this paper, we study whether representations of primitive concepts, such as colors and shapes of object parts, emerge automatically within pretrained vision-language (VL) models. We propose a two-step framework, Compositional Concept Mapping (CompMap), to investigate this. CompMap first asks a VL model to generate concept activations with text prompts from a predefined list of primitive concepts, and then learns an explicit composition model that maps the primitive concept activations (e.g., the likelihood of black tail or red wing) to composite concepts (e.g., a red-winged blackbird). We demonstrate that a composition model can be designed as a set operation, and show that a composition model is straightforward for machines to learn from ground-truth primitive concepts (as a linear classifier). We thus hypothesize that if primitive concepts indeed emerge in a VL pretrained model, its primitive concept activations can be used to learn a composition model similar to the one designed by experts. We propose a quantitative metric to measure the degree of similarity, and refer to the metric as the interpretability of the learned primitive concept representations of VL models. We also measure the classification accuracy when using the primitive concept activations and the learned composition model to predict the composite concepts, and refer to it as the usefulness metric. Our study reveals that state-of-the-art VL pretrained models learn primitive concepts that are highly useful for fine-grained visual recognition on the CUB dataset, and for compositional generalization tasks on the MIT-States dataset. However, we observe that the learned composition models have low interpretability in our qualitative analyses. Our results reveal the limitations of existing VL models, and the necessity of pretraining objectives that encourage the acquisition of primitive concepts.
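To make the idea of a composition model designed as a set operation concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes that each composite concept is described by a set of primitive concepts and that the composite score is the average activation over that set, which is one simple linear form. The concept names other than those quoted above, the activation values, and the helper names `designed_composition_matrix` and `predict` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical primitive concepts and composite classes (illustrative only).
PRIMITIVES = ["black tail", "red wing", "yellow belly", "white head"]
COMPOSITES = {
    "red-winged blackbird": {"black tail", "red wing"},
    "toy example bird": {"yellow belly", "white head"},
}


def designed_composition_matrix(primitives, composites):
    """Encode each composite concept's primitive set as one row of weights.

    Assumption: a composite is scored by averaging the activations of the
    primitives in its set, so each row holds 1/|set| for members, else 0.
    """
    index = {name: i for i, name in enumerate(primitives)}
    weights = np.zeros((len(composites), len(primitives)))
    for row, members in enumerate(composites.values()):
        for name in members:
            weights[row, index[name]] = 1.0
        weights[row] /= len(members)
    return weights


def predict(activations, weights, class_names):
    """Apply the linear composition model and return the best composite."""
    scores = weights @ activations
    return class_names[int(np.argmax(scores))], scores


W = designed_composition_matrix(PRIMITIVES, COMPOSITES)

# Made-up primitive concept activations, standing in for the output of a
# VL model prompted with each primitive concept.
activations = np.array([0.9, 0.8, 0.1, 0.2])

label, scores = predict(activations, W, list(COMPOSITES))
print(label, scores)
```

Because this designed model is a single matrix applied to the activations, it has the same form as a linear classifier, so a learned linear classifier's weights could in principle be compared against it directly.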