When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, operate under a risk-neutral paradigm that is insufficient to address risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into policy optimization by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure whose token-level policy updates are derived from the nested risk measures. This design offers two key benefits: (1) it mitigates risks induced by excessive model shift away from the reference policy, and (2) it explicitly suppresses low-probability yet high-impact harmful behaviors. Moreover, we provide a theoretical analysis of policy optimality under mild assumptions. Experimental results demonstrate that our method achieves high helpfulness while ensuring strong safety and significantly suppressing tail risks, namely low-probability yet high-impact unsafe responses.
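For concreteness, one standard member of the class of nested risk measures is the iterated CVaR, obtained by applying a one-step CVaR mapping recursively along the token sequence. The notation below (per-token safety cost $c_t$, discount $\gamma$, risk level $\alpha$, KL weight $\beta$, and budget $d$) is an illustrative sketch of such a formulation, not the paper's exact definition:
\[
\rho_{0:T}(c) \;=\; c_0 + \gamma\,\mathrm{CVaR}_\alpha\!\Big(c_1 + \gamma\,\mathrm{CVaR}_\alpha\big(c_2 + \cdots + \gamma\,\mathrm{CVaR}_\alpha(c_T)\big)\Big),
\]
where each one-step CVaR admits the Rockafellar–Uryasev representation
\[
\mathrm{CVaR}_\alpha(Z) \;=\; \min_{\eta \in \mathbb{R}} \Big\{ \eta + \tfrac{1}{\alpha}\,\mathbb{E}\big[(Z - \eta)_+\big] \Big\},
\]
so that small $\alpha$ emphasizes the tail of the cost distribution while $\alpha = 1$ recovers the risk-neutral expectation. Under these assumptions, a token-level risk-aware constrained alignment problem of the kind described above would take the schematic form
\[
\max_{\pi}\; \mathbb{E}_{\pi}\big[r(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi \,\Vert\, \pi_{\mathrm{ref}}\big)
\quad \text{s.t.} \quad \rho_{0:T}^{\pi}(c) \;\le\; d,
\]
where the KL term captures benefit (1), penalizing shift away from $\pi_{\mathrm{ref}}$, and the nested risk constraint captures benefit (2), directly bounding tail-risk exposure.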