Fine-tuning enables large language models (LLMs) to adapt to specific domains, but it often compromises their previously established safety alignment. To mitigate this degradation, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method comprises two simple strategies that modify the training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and preserving its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning maintains model safety without sacrificing strong performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
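To make the data modification concrete, the sketch below shows one way a training example could preview a partial answer prefix in its prompt while keeping the full answer as the target. This is a minimal illustration, not the authors' implementation: the function name `lookahead_example`, the prefix length `m`, and the hint template wording are all assumptions, since the abstract does not specify the exact recipe.

```python
# Minimal sketch (illustrative, not the paper's code) of previewing a
# partial answer prefix in the training prompt. The idea: if the prompt
# already reveals how the answer begins, fine-tuning perturbs the model's
# initial token distribution less, which the abstract links to preserving
# built-in safety behavior.

def lookahead_example(instruction: str, answer: str, m: int = 6) -> dict:
    """Build one training example whose prompt previews the first m
    answer tokens; the completion is still the full answer.
    `m` and the hint wording below are illustrative assumptions."""
    preview = " ".join(answer.split()[:m])  # first m whitespace-split tokens
    prompt = (
        f"{instruction}\n"
        f"(Hint: the answer begins with \"{preview}\")"
    )
    return {"prompt": prompt, "completion": answer}

if __name__ == "__main__":
    ex = lookahead_example(
        "Summarize the theory of relativity in one sentence.",
        "Einstein's theory unifies space and time into a single spacetime.",
    )
    print(ex["prompt"])
    print(ex["completion"])
```

Because only the prompts are rewritten, such a transformation leaves the fine-tuning loss and training loop untouched, which is consistent with the abstract's framing of the method as lightweight and purely data-driven.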