The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under a complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
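To make the triage step concrete, the following is a minimal Python sketch of how existing preference pairs might be scored against a new policy and labeled for inversion, discarding, or preservation. The function names (`judge_under_policy`, `triage`), the threshold, and the decision rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of TRACE-style triage of preference data against a new policy.
# All names and thresholds below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List, Literal, Tuple

@dataclass
class Preference:
    prompt: str
    chosen: str      # response preferred under the OLD policy
    rejected: str    # response dispreferred under the OLD policy

Action = Literal["preserve", "invert", "discard"]

def triage(
    prefs: List[Preference],
    judge_under_policy: Callable[[str, str], float],  # scores a response under the NEW policy
    impact_threshold: float = 0.5,
) -> List[Tuple[Preference, Action]]:
    """Label each preference pair according to how strongly it conflicts with the new policy."""
    labeled: List[Tuple[Preference, Action]] = []
    for p in prefs:
        score_chosen = judge_under_policy(p.prompt, p.chosen)
        score_rejected = judge_under_policy(p.prompt, p.rejected)
        # Alignment impact: how much the old preference disagrees with the new policy.
        impact = score_rejected - score_chosen
        if impact > impact_threshold:
            action: Action = "invert"    # the new policy now favors the previously rejected response
        elif impact < -impact_threshold:
            action = "preserve"          # the old preference remains consistent with the new policy
        else:
            action = "discard"           # ambiguous under the new policy; drop from re-alignment
        labeled.append((p, action))
    return labeled
```

The labeled subsets would then feed the hybrid optimization described above: inverted pairs drive the policy update, preserved pairs anchor retained behavior, and discarded pairs are excluded to avoid noisy gradients.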