Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, we find them fragile under downstream fine-tuning: state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, we propose a meta-learning strategy that simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
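As a rough sketch of the mathematical device named above (the paper's exact formulation may differ), the Moreau envelope of a fine-tuning loss $f$ replaces the inner optimization with a proximal subproblem whose solution yields a closed-form gradient:

$$
M_{\lambda f}(\theta) \;=\; \min_{\theta'} \left[\, f(\theta') + \frac{1}{2\lambda}\lVert \theta' - \theta \rVert^2 \,\right],
\qquad
\nabla M_{\lambda f}(\theta) \;=\; \frac{1}{\lambda}\bigl(\theta - \operatorname{prox}_{\lambda f}(\theta)\bigr),
$$

where $\operatorname{prox}_{\lambda f}(\theta)$ is the minimizer of the inner problem (here, a hypothetical proxy for the fine-tuned parameters). This standard property of Moreau envelopes lets one differentiate through a simulated fine-tuning step using only the pre- and post-fine-tuning parameters, without unrolling the inner optimization, which is consistent with the "efficient gradient estimation" the abstract describes.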