Many Minds from One Model: Bayesian Transformers for Population Intelligence

Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which turn a standard Large Language Model into a Bayesian Transformer that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training a full Bayesian neural network. Sampling from this proxy yields a set of model instances with diverse behaviors that retain general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans enables population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity and better task performance than deterministic baselines.
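The mechanism described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say which normalization layers are perturbed, how the Gaussian's variance is chosen, or how predictions are aggregated. The sketch assumes an RMS-style normalization with an additive offset, a fixed diagonal Gaussian with standard deviation `sigma`, and a per-token majority vote. All function names (`rms_norm`, `sample_offset`, `toy_forward`, `population_predict`) are hypothetical. What it aims to show is the one structural point the abstract does state: each population member draws its offset noise once per sequence and reuses it for every token.

```python
import math
import random
from collections import Counter


def rms_norm(x, gain, offset, eps=1e-6):
    """RMS-normalize x, then apply a per-feature gain and a bias-like offset."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms + b for v, g, b in zip(x, gain, offset)]


def sample_offset(mean, sigma, rng):
    """Draw one offset vector from a diagonal Gaussian N(mean, sigma^2 I).

    Assumption: a single shared sigma; the abstract does not specify this.
    """
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def toy_forward(tokens, gain, offset, readout):
    """Toy stand-in for a model: normalize each token vector, score the
    classes with a linear readout, and return the argmax class per token."""
    outputs = []
    for x in tokens:
        h = rms_norm(x, gain, offset)
        scores = [sum(w * v for w, v in zip(row, h)) for row in readout]
        outputs.append(max(range(len(scores)), key=scores.__getitem__))
    return outputs


def population_predict(tokens, gain, mean_offset, readout,
                       sigma, n_members, seed=0):
    """Sample n_members model instances and aggregate their predictions."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Noise is sampled once per member and frozen for the whole sequence.
        offset = sample_offset(mean_offset, sigma, rng)
        members.append(toy_forward(tokens, gain, offset, readout))
    # Assumed aggregation rule: majority vote at each token position.
    voted = [Counter(column).most_common(1)[0][0] for column in zip(*members)]
    return members, voted


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [3.0, -1.0, 0.5]]
    gain = [1.0, 1.0, 1.0]
    mean_offset = [0.0, 0.0, 0.0]
    readout = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    members, voted = population_predict(
        tokens, gain, mean_offset, readout, sigma=0.5, n_members=5)
    print("per-member predictions:", members)
    print("population vote:", voted)
```

Setting `sigma=0.0` collapses the population to the deterministic baseline, which makes a convenient sanity check. Resampling the offset inside the token loop instead would give per-token noise, the case that the sequence-level freezing is meant to avoid.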
