We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
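For orientation, a common way to approximate training FLOPs in comparisons like the one described above is the 6·N·D rule of thumb (roughly six FLOPs per parameter per training token). The abstract does not specify the paper's exact FLOP accounting, so the sketch below is an illustrative assumption only; the run names and numbers are hypothetical placeholders.

```python
# Hedged sketch: estimate training compute with the common 6 * N * D heuristic
# (about six FLOPs per parameter per training token) and report it next to
# test NLL. The heuristic and the placeholder runs below are assumptions for
# illustration, not the paper's exact accounting or results.

def approx_train_flops(n_params: int, n_train_tokens: int) -> float:
    """Estimate total training FLOPs as 6 * parameters * tokens."""
    return 6.0 * n_params * n_train_tokens

def nll_vs_flops_table(runs):
    """Print each run's parameter count, estimated compute, and test NLL."""
    for name, n_params, n_train_tokens, test_nll in runs:
        flops = approx_train_flops(n_params, n_train_tokens)
        print(f"{name:>12s}  params={n_params:>10,d}  "
              f"train FLOPs~{flops:.2e}  test NLL={test_nll:.3f}")

# Hypothetical placeholder runs, purely to show the comparison format.
nll_vs_flops_table([
    ("mlp",          800_000, 2_000_000, 2.10),
    ("transformer",  900_000, 2_000_000, 1.70),
])
```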