Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of the data. The result is mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head classes and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods, which mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient conflicts. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines such as vanilla fine-tuning and Mixture-of-Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels, and generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with existing diffusion pipelines.
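The abstract does not give implementation details, but the core mechanism it describes (static, prior-driven routing of each sample to a cluster-specific low-rank residual adapter inside a frozen feedforward layer, with no learned gate) can be sketched minimally. All names, shapes, and the zero-initialized up-projection below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck, n_clusters = 16, 4, 3

# Frozen base weight, standing in for a pre-trained transformer FFN layer.
W_base = rng.standard_normal((d_model, d_model)) * 0.05

# One low-rank residual adapter (down-projection, up-projection) per cluster.
# Zero-initializing the up-projection makes the residual a no-op at the start
# of fine-tuning, so the pre-trained behavior is preserved.
adapters = [
    (rng.standard_normal((d_model, d_bottleneck)) * 0.05,
     np.zeros((d_bottleneck, d_model)))
    for _ in range(n_clusters)
]

def ffn_with_adapter(x, cluster_id):
    """Base FFN output plus the residual from the sample's assigned adapter.

    Routing is static: cluster_id comes from an offline partition of the
    training set (e.g. clustering on external priors), not a learned gate.
    """
    h = x @ W_base
    down, up = adapters[cluster_id]
    return h + (x @ down) @ up

x = rng.standard_normal((2, d_model))
out_head = ffn_with_adapter(x, cluster_id=0)  # sample routed to a head cluster
out_tail = ffn_with_adapter(x, cluster_id=2)  # sample routed to a tail cluster
```

Because the up-projections start at zero, every cluster's output initially equals the frozen base output; only the small per-cluster adapter matrices are updated during fine-tuning, which is what keeps the method lightweight.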