Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four LLaMA-style decoder-only models (11M–570M parameters), and evaluate them on multilingual named entity recognition (Universal NER), a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our models (i) reach the same loss up to 1.6x sooner, (ii) improve F1 on Universal NER under equal compute, and (iii) make the training dynamics easy to read: the network's representations first fan out ("diversify") and later collapse into a smaller, shared subspace ("compress"). This two-stage shift appears as a rise and fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints, and WandB logs are released.
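The abstract combines first-order MAML with subset-masked LM pretraining. As a rough illustration only, the sketch below shows a generic first-order MAML outer step; the task construction (each meta-task applying a different token-subset mask to the LM loss), the `model(**batch).loss` interface (HuggingFace-style causal LM), and all hyperparameter names are assumptions, not the paper's actual implementation.

```python
import copy
import torch

def fomaml_step(model, meta_optimizer, tasks, inner_lr=1e-3, inner_steps=1):
    """One first-order MAML outer step (hypothetical sketch).

    `tasks` is an iterable of (support_batch, query_batch) pairs; here each
    task is assumed to mask a different subset of tokens in the LM objective.
    """
    meta_optimizer.zero_grad()
    for support_batch, query_batch in tasks:
        # Clone the model so inner-loop updates do not touch the meta-parameters.
        fast_model = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

        # Inner loop: adapt to this task's masked-subset LM loss.
        for _ in range(inner_steps):
            loss = fast_model(**support_batch).loss
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # First-order approximation: take gradients of the query loss at the
        # adapted weights and apply them to the meta-parameters, ignoring
        # second-order terms.
        query_loss = fast_model(**query_batch).loss
        grads = torch.autograd.grad(query_loss, fast_model.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g

    meta_optimizer.step()
```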
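The diversify-then-compress signature is read off effective-rank curves and attention-head entropy. Below is a minimal sketch of how such diagnostics are commonly computed (effective rank as the exponential of the entropy of normalised singular values; attention entropy as the mean Shannon entropy of attention rows); the exact definitions and tensor shapes used in the paper are assumptions.

```python
import torch

def effective_rank(hidden_states: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank = exp(entropy of normalised singular values).

    `hidden_states`: (num_tokens, hidden_dim) activations from one layer.
    """
    x = hidden_states - hidden_states.mean(dim=0, keepdim=True)  # centre
    s = torch.linalg.svdvals(x.float())                          # singular values
    p = s / (s.sum() + eps)                                      # normalise to a distribution
    entropy = -(p * (p + eps).log()).sum()
    return float(entropy.exp())

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-12) -> float:
    """Mean Shannon entropy of attention distributions.

    `attn_probs`: (num_heads, query_len, key_len); each row sums to 1.
    """
    h = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy per query position
    return float(h.mean())
```

Tracking both quantities per layer over training steps would produce the rise-and-fall curves the abstract describes.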