Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In this work, we study the universality hypothesis by examining how small neural networks learn to implement group composition. We present a novel algorithm by which neural networks may implement composition for any finite group via mathematical representation theory. We then reverse engineer model logits and weights to show that networks consistently learn this algorithm, and confirm our understanding using ablations. By studying networks of differing architectures trained on various groups, we find mixed evidence for universality: using our algorithm, we can completely characterize the family of circuits and features that networks learn on this task, but for a given network the precise circuits learned -- as well as the order in which they develop -- are arbitrary.
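The representation-theoretic idea can be illustrated with a minimal sketch (the group, the representation, and all names below are illustrative choices, not details from the paper): for a faithful matrix representation ρ of a finite group, ρ(a)ρ(b) = ρ(ab), so a score proportional to tr(ρ(a)ρ(b)ρ(c)ᵀ) over candidate answers c is maximized exactly when c = ab. Here we use the permutation representation of S3:

```python
import numpy as np
from itertools import permutations

# Illustrative sketch on S3: rho maps each permutation to its
# permutation matrix, a faithful representation satisfying
# rho(a) @ rho(b) == rho(a*b).
elems = list(permutations(range(3)))  # the 6 elements of S3

def rho(p):
    """Permutation matrix of p: sends basis vector e_i to e_{p(i)}."""
    m = np.zeros((3, 3))
    for i, j in enumerate(p):
        m[j, i] = 1.0
    return m

def compose(a, b):
    """(a*b)(x) = a(b(x))."""
    return tuple(a[b[x]] for x in range(3))

# Scoring each candidate c by tr(rho(a) rho(b) rho(c)^T) counts the
# positions where a*b and c agree, so the true product gets the unique
# maximal score and an argmax over these "logits" recovers composition.
for a in elems:
    for b in elems:
        logits = [np.trace(rho(a) @ rho(b) @ rho(c).T) for c in elems]
        assert elems[int(np.argmax(logits))] == compose(a, b)
print("trace-based logits recover S3 composition")
```

A network need not use the permutation representation specifically; any faithful representation supports the same trace-based readout, which is what makes the construction available for every finite group.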