Large language models (LLMs) are constrained by their substantial computational cost. We introduce a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation-cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; at comparable accuracy, we obtain roughly a 40% reduction in FLOPs and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
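To make the incentive-driven objective concrete, the following is a minimal sketch (our own illustration; the gate parameterization, the FLOPs proxy, and the coefficient `lambda_cost` are assumptions rather than the paper's specified implementation) of a task loss augmented with a differentiable computation-cost term, here an expected-FLOPs penalty over sigmoid gates attached to attention heads:

```python
import torch
import torch.nn as nn


class HeadGatePenalty(nn.Module):
    """Hypothetical differentiable computation-cost term (illustration only).

    Assumes each attention head has a learnable gate in [0, 1] obtained via a
    sigmoid; expected compute is approximated as the sum of open gates times a
    per-head FLOPs estimate.
    """

    def __init__(self, num_heads: int, flops_per_head: float = 1.0):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(num_heads))
        self.flops_per_head = flops_per_head

    def gates(self) -> torch.Tensor:
        return torch.sigmoid(self.gate_logits)

    def compute_cost(self) -> torch.Tensor:
        # Differentiable proxy for computation: expected FLOPs of active heads.
        return self.flops_per_head * self.gates().sum()


def incentive_loss(task_loss: torch.Tensor,
                   penalty: HeadGatePenalty,
                   lambda_cost: float = 1e-3) -> torch.Tensor:
    """Task utility minus computation cost, written as a loss to minimize."""
    return task_loss + lambda_cost * penalty.compute_cost()
```

Sweeping `lambda_cost` over a range of values would then trade accuracy against compute, tracing the kind of Pareto frontier described above.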