Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models match the performance of state-of-the-art LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show that this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation. We also show that LoopLM yields reasoning traces more aligned with its final outputs than explicit CoT. We hope these results demonstrate the potential of LoopLM as a novel scaling direction in the reasoning era. Our models are available at http://ouro-llm.github.io.
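To make mechanisms (i) and (ii) concrete, the following is a minimal PyTorch sketch of a weight-shared looped block with an entropy-regularized distribution over loop depths. It is an illustrative assumption, not Ouro's actual architecture or objective: the names (`LoopedLM`, `exit_head`), the per-depth exit scoring, and the entropy weight `beta` are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoopedLM(nn.Module):
    """Hypothetical looped-LM sketch: one shared transformer block applied
    recurrently in latent space, with a learned distribution over loop
    depths (how much iteration each input receives)."""

    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, max_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One weight-shared block reused at every loop step (iterative
        # latent computation), instead of max_loops distinct layers.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Hypothetical exit head: scores each depth so the model can
        # learn a per-input allocation of recurrent compute.
        self.exit_head = nn.Linear(d_model, 1)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.max_loops = max_loops

    def forward(self, tokens):
        h = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)
        ).to(h.device)  # causal attention for language modeling
        depth_scores, logits_per_depth = [], []
        for _ in range(self.max_loops):
            h = self.block(h, src_mask=mask)  # same parameters every iteration
            depth_scores.append(self.exit_head(h).mean(dim=(1, 2)))
            logits_per_depth.append(self.lm_head(h))
        # q(d | x): learned depth-allocation distribution per input.
        q_depth = F.softmax(torch.stack(depth_scores, dim=-1), dim=-1)
        return logits_per_depth, q_depth


def loss_fn(logits_per_depth, q_depth, targets, beta=0.01):
    """Expected LM loss over depths minus an entropy bonus on q(d | x);
    beta and this exact weighting are illustrative assumptions, not the
    paper's regularizer."""
    ce = torch.stack(
        [F.cross_entropy(l.transpose(1, 2), targets) for l in logits_per_depth],
        dim=-1,
    )  # (max_loops,) cross-entropy at each exit depth
    expected_ce = (q_depth * ce).sum(dim=-1).mean()
    entropy = -(q_depth * (q_depth + 1e-9).log()).sum(dim=-1).mean()
    return expected_ce - beta * entropy


# Usage: one forward/backward step on random tokens.
tokens = torch.randint(0, 1000, (2, 16))
targets = torch.randint(0, 1000, (2, 16))
model = LoopedLM()
logits_per_depth, q_depth = model(tokens)
loss_fn(logits_per_depth, q_depth, targets).backward()
```

The entropy bonus keeps the depth distribution from collapsing onto a single exit early in training, so all loop depths receive gradient signal while the model learns which inputs warrant deeper iteration.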