Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, in which adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that turns risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable, yet semantically relevant responses that serve as lures to probe user intent. Alongside the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent over multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and the feasibility of bait responses, and the Defense Efficacy Rate (DER), which balances safety and usability. Initial experiments on the MHJ dataset with recent attack methods against GPT-4o show that our system significantly disrupts jailbreak success while preserving the benign user experience.
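To make the interaction loop described above concrete, the following Python sketch illustrates one defended turn: the protected LLM's safe reply is combined with a bait question from the bait model, and an accumulated intent estimate decides when to escalate. This is a minimal illustration only; the function names (`protected_reply`, `bait_question`, `intent_score`), the threshold, and the escalation rule are hypothetical assumptions and not the paper's implementation.

```python
# Illustrative sketch of the honeypot guardrail loop described in the abstract.
# All components below are placeholder stubs, not the paper's actual models or rules.

def protected_reply(user_turn: str) -> str:
    """Stand-in for the protected LLM's safety-aligned reply."""
    return "I can't help with that directly, but here is some general, safe guidance."

def bait_question(user_turn: str) -> str:
    """Stand-in for the fine-tuned bait model: semantically relevant but
    ambiguous and non-actionable, used to probe the user's underlying intent."""
    return "Could you clarify what outcome you are ultimately trying to achieve?"

def intent_score(dialogue: list[str]) -> float:
    """Stand-in intent estimator; in a real system this would accumulate
    evidence of malicious intent across the multi-turn dialogue."""
    return 0.0  # placeholder value

def guarded_turn(user_turn: str, dialogue: list[str], threshold: float = 0.8) -> str:
    """One defended turn: safe reply plus a proactive bait question, with
    escalation once the accumulated intent score crosses a (hypothetical) threshold."""
    dialogue.append(user_turn)
    if intent_score(dialogue) >= threshold:
        return "This conversation appears to pursue a harmful goal; I have to stop here."
    response = f"{protected_reply(user_turn)} {bait_question(user_turn)}"
    dialogue.append(response)
    return response

if __name__ == "__main__":
    history: list[str] = []
    print(guarded_turn("How do I get started with lock picking?", history))
```

Under these assumptions, the bait question keeps the exchange going so that intent evidence can accumulate over turns, rather than terminating the dialogue with a flat refusal.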