Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advances in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method fine-tunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Experimental results across multiple models demonstrate that our approach outperforms prior representation-engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense.
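To make the core idea concrete, the snippet below is a minimal, illustrative sketch of a triplet objective with hard negative mining over hidden representations: benign representations act as anchors and positives, harmful (or adversarially perturbed) representations are candidate negatives, and the closest negative to each anchor is selected before applying a margin loss. The function name, tensor shapes, and the specific mining rule are assumptions for illustration and do not reproduce the paper's exact training objective.

```python
import torch
import torch.nn.functional as F


def triplet_loss_with_hard_negatives(anchor, positive, negatives, margin=1.0):
    """Illustrative triplet margin loss with hard negative mining.

    anchor:    (B, D) benign hidden representations used as anchors
    positive:  (B, D) other benign representations paired with the anchors
    negatives: (N, D) harmful / adversarial representations (candidate negatives)
    """
    # Pairwise distances between each anchor and every candidate negative: (B, N)
    dist_to_negatives = torch.cdist(anchor, negatives, p=2)

    # Hard negative mining: for each anchor, keep the closest harmful representation
    hard_negative = negatives[dist_to_negatives.argmin(dim=1)]  # (B, D)

    dist_pos = F.pairwise_distance(anchor, positive, p=2)        # (B,)
    dist_neg = F.pairwise_distance(anchor, hard_negative, p=2)   # (B,)

    # Standard triplet margin objective: pull benign pairs together,
    # push the hardest harmful representation at least `margin` away
    return F.relu(dist_pos - dist_neg + margin).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    benign_a = torch.randn(8, 128)   # stand-in for benign-prompt hidden states
    benign_b = torch.randn(8, 128)
    harmful = torch.randn(32, 128)   # stand-in for harmful-prompt hidden states
    print(triplet_loss_with_hard_negatives(benign_a, benign_b, harmful).item())
```

In a defense setting along these lines, the representations would typically be intermediate hidden states of the LLM on benign versus harmful prompts, and the loss would be added to (or traded off against) a standard language-modeling objective during fine-tuning so that refusal-relevant structure is preserved without degrading benign performance.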