Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). Unlike prior approaches that attach fixed or external adapters, MoL enables token-conditional weight-space modulation of the shared FFN without untying the backbone parameters. We pretrain a modernised recursive architecture, ModernALBERT, which integrates rotary position embeddings, GeGLU activations, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M parameters) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
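To make the mechanism concrete, below is a minimal PyTorch sketch of a MoL-style feed-forward layer, assuming token-level top-k routing over LoRA experts that perturb the shared FFN's input projection. The module name `MoLFeedForward`, the rank, the number of experts, and the routing scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a Mixture-of-LoRAs (MoL) layer: a shared (parameter-tied)
# FFN whose input projection is modulated per token by routed LoRA experts.
# Hyperparameters and routing are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLFeedForward(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, n_experts=8, rank=8, top_k=2):
        super().__init__()
        # Shared FFN backbone (GeGLU omitted here for brevity).
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Per-expert low-rank factors A (d_model -> rank) and B (rank -> d_ff)
        # that add a weight-space update to the shared input projection.
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, d_ff))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        gate_logits = self.router(x)           # (batch, seq, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over selected experts

        h = self.w_in(x)                       # shared FFN path
        # Token-conditional low-rank update: sum_k w_k * (x A_{e_k}) B_{e_k}
        A = self.lora_A[idx]                   # (batch, seq, top_k, d_model, rank)
        B = self.lora_B[idx]                   # (batch, seq, top_k, rank, d_ff)
        xa = torch.einsum('bsd,bskdr->bskr', x, A)
        delta = torch.einsum('bskr,bskrf->bsf', weights.unsqueeze(-1) * xa, B)
        return self.w_out(F.gelu(h + delta))
```

Because the LoRA updates are low-rank additions to the shared weights, an expert-merging step at inference can, under this sketch's assumptions, fold the routed updates into a single adapter and avoid per-token routing overhead.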