Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is typically limited by routing policies that operate at the token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we further propose an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.
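The following is a minimal, illustrative sketch of the two mechanisms the abstract describes: (i) Instruction-Guided Routing, where a single routing decision derived from a pooled instruction embedding selects the expert council applied uniformly to every token, and (ii) an output-space orthogonality loss that decorrelates expert outputs. It is not the authors' implementation; the module and argument names (`InstructMoLELayer`, `num_experts`, `rank`, `top_k`) and the exact form of the loss are assumptions made for illustration.

```python
# Sketch only: assumed shapes and names, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstructMoLELayer(nn.Module):
    def __init__(self, dim, num_experts=4, rank=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Low-rank experts: each expert is a LoRA-style pair (A, B).
        self.A = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, dim))
        # Router conditioned on the instruction embedding, not on individual tokens.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, tokens, instr_emb):
        # tokens:    (batch, seq, dim)  hidden states of the frozen DiT block
        # instr_emb: (batch, dim)       pooled embedding of the full user instruction
        logits = self.router(instr_emb)                    # (batch, num_experts)
        weights = F.softmax(logits, dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)      # one global choice per sample
        topw = topw / topw.sum(dim=-1, keepdim=True)

        expert_outs = []
        delta = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):
            outs_b = []
            for w, e in zip(topw[b], topi[b]):
                # The same expert council is applied to every token of sample b.
                out = tokens[b] @ self.A[e] @ self.B[e]    # (seq, dim)
                outs_b.append(out)
                delta[b] = delta[b] + w * out
            expert_outs.append(torch.stack(outs_b))        # (top_k, seq, dim)
        return tokens + delta, torch.stack(expert_outs)    # residual update, expert outputs


def output_orthogonality_loss(expert_outs):
    # expert_outs: (batch, top_k, seq, dim) outputs of the selected experts.
    # Penalize off-diagonal cosine similarity between flattened expert outputs,
    # encouraging functionally diverse experts. This is one plausible reading of
    # the "output-space orthogonality loss"; the exact formulation is an assumption.
    flat = F.normalize(expert_outs.flatten(2), dim=-1)     # (batch, top_k, seq*dim)
    gram = flat @ flat.transpose(1, 2)                     # (batch, top_k, top_k)
    eye = torch.eye(gram.size(-1), device=gram.device)
    return ((gram - eye) ** 2).mean()
```

In this reading, the per-sample routing decision contrasts with token-level MoLE routing, where `self.router` would instead consume `tokens` and could select different experts for neighboring tokens, which is the source of the spatial fragmentation the abstract mentions.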