We propose a novel framework, ConceptX, to analyze how latent concepts are encoded in the representations learned within pre-trained language models. It uses clustering to discover the encoded concepts and explains them by aligning them with a large set of human-defined concepts. Our analysis of seven transformer language models reveals interesting insights: i) the latent space within the learned representations overlaps with different linguistic concepts to varying degrees, ii) the lower layers of the model are dominated by lexical concepts (e.g., affixation), whereas core-linguistic concepts (e.g., morphological or syntactic relations) are better represented in the middle and higher layers, iii) some encoded concepts are multi-faceted and cannot be adequately explained using the existing human-defined concepts.
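The discover-then-align pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the clustering step below uses a simple deterministic farthest-point seeding with nearest-center assignment as a stand-in for whatever clustering the framework actually uses, and the alignment step labels a cluster with a human-defined concept (e.g., a POS tag) only when a threshold fraction of its members carry that label. The function names and the 90% purity threshold are illustrative assumptions.

```python
# Hypothetical sketch of a discover-then-align concept analysis.
# Assumptions (not from the paper): farthest-point clustering as a
# stand-in for the actual clustering method; a 0.9 purity threshold
# for declaring a cluster "aligned" with a human-defined concept.
import numpy as np
from collections import Counter


def discover_concepts(vectors: np.ndarray, n_clusters: int) -> np.ndarray:
    """Cluster word representations: seed centers by repeatedly picking
    the point farthest from all current centers, then assign each
    vector to its nearest center. Deterministic and self-contained."""
    centers = [vectors[0]]
    for _ in range(n_clusters - 1):
        # Distance from each vector to its nearest existing center.
        dists = np.min(
            [np.linalg.norm(vectors - c, axis=1) for c in centers], axis=0
        )
        centers.append(vectors[dists.argmax()])
    centers = np.array(centers)
    # Assign every vector to the closest center.
    return np.linalg.norm(
        vectors[:, None] - centers[None], axis=2
    ).argmin(axis=1)


def align_clusters(cluster_labels, human_labels, threshold=0.9):
    """Explain a latent cluster with a human-defined concept if at
    least `threshold` of its members share that concept's label;
    clusters below the threshold remain unexplained (multi-faceted)."""
    aligned = {}
    for k in set(cluster_labels):
        members = [h for c, h in zip(cluster_labels, human_labels) if c == k]
        concept, count = Counter(members).most_common(1)[0]
        if count / len(members) >= threshold:
            aligned[k] = concept
    return aligned
```

Clusters that fail the purity threshold are exactly the interesting case the abstract points to: encoded concepts that no single human-defined concept adequately explains.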