Improving Recursive Transformers with Mixture of LoRAs

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). Unlike prior approaches that add fixed or externally attached adapters, MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters. We pretrain a modernised recursive architecture, ModernALBERT, that integrates rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M to 120M parameters) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
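The idea of token-conditional modulation of a shared weight can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the class name `MoLLinear`, the dense softmax router, the random initialisation of both low-rank factors (LoRA usually zero-initialises one of them), and in particular the merging rule, which simply folds the experts into the shared weight with fixed mixing coefficients. The sketch uses only the Python standard library and models a single linear map inside the FFN rather than a full GeGLU block.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MoLLinear:
    """Hypothetical shared linear map with routed LoRA experts (illustrative only)."""

    def __init__(self, d_in, d_out, num_experts, rank, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)            # shared (tied) weight
        self.router = rand_matrix(num_experts, d_in, rng)  # assumed dense router
        # Expert k contributes the low-rank update B[k] @ A[k].
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        self.B = [rand_matrix(d_out, rank, rng) for _ in range(num_experts)]

    def gates(self, x):
        """Per-token mixing weights over experts."""
        return softmax(matvec(self.router, x))

    def forward(self, x, gates=None):
        """y = W x + sum_k g_k(x) * B_k (A_k x) for one token vector x."""
        g = self.gates(x) if gates is None else gates
        y = matvec(self.W, x)
        for g_k, A_k, B_k in zip(g, self.A, self.B):
            delta = matvec(B_k, matvec(A_k, x))
            y = [y_i + g_k * d_i for y_i, d_i in zip(y, delta)]
        return y

    def merged_weight(self, mix):
        """Fold all experts into one matrix using fixed coefficients `mix`.

        This is an assumed merging rule, not the paper's procedure.
        """
        merged = [row[:] for row in self.W]
        for m_k, A_k, B_k in zip(mix, self.A, self.B):
            for i in range(len(merged)):
                for j in range(len(merged[0])):
                    merged[i][j] += m_k * sum(
                        B_k[i][r] * A_k[r][j] for r in range(len(A_k))
                    )
        return merged


layer = MoLLinear(d_in=8, d_out=8, num_experts=4, rank=2)
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 2.0]
routed = layer.forward(token)                             # token-conditional path
uniform = [0.25] * 4
merged = matvec(layer.merged_weight(uniform), token)      # router-free path
```

In this sketch the routed path changes the effective weight per token while `W` stays tied, and the merged path replaces routing with one fixed matrix. The two paths agree exactly only when the gates equal the mixing coefficients, so how well a merged model tracks the routed one depends on the merging rule actually used.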
