We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how universal neurons (neurons with consistently correlated activations across models) emerge and evolve throughout training. By analyzing five GPT-2 models at five training checkpoints, we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via cross-entropy loss. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in early and deeper layers. These findings suggest that stable and universal representational structures emerge during language model training.
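A minimal sketch of the pairwise correlation step described above, assuming activations have already been cached as (tokens x neurons) arrays for two models on a shared token set; the array shapes and the 0.5 threshold are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def correlation_matrix(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every neuron in model A and model B.

    acts_a, acts_b: arrays of shape (n_tokens, n_neurons) holding each
    neuron's activation on each token of the shared evaluation set.
    Returns an (n_neurons_a, n_neurons_b) correlation matrix.
    """
    a = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    b = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    return (a.T @ b) / a.shape[0]

def universal_neurons(acts_a: np.ndarray, acts_b: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Indices of model-A neurons whose best-matching neuron in model B
    exceeds the correlation threshold (threshold value is hypothetical)."""
    corr = correlation_matrix(acts_a, acts_b)
    best_match = np.abs(corr).max(axis=1)
    return np.where(best_match > threshold)[0]

# Example with random data standing in for cached GPT-2 Small MLP activations.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((10_000, 3072))
acts_b = rng.standard_normal((10_000, 3072))
print(universal_neurons(acts_a, acts_b).shape)
```

In practice this comparison would be repeated over all pairs of the five models, and a neuron would be flagged as universal only if it finds a highly correlated counterpart in every other model.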