Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.
翻译:大语言模型(LLMs)在文本理解和生成方面表现出色,使其成为代码审查和内容审核等自动化任务的理想选择。然而,我们的研究发现了一个漏洞:LLMs可能被隐藏在输入数据(如简历或代码)中的“对抗性指令”所操纵,导致其偏离预期任务。值得注意的是,尽管在代码审查等成熟领域可能存在防御措施,但在简历筛选和同行评审等其他常见应用中往往缺乏此类防护。本文引入了一个基准来评估简历筛选中的此类漏洞,结果显示特定攻击类型的成功率超过80%。我们评估了两种防御机制:基于提示的防御实现了10.1%的攻击降低,但误拒率增加了12.5%;而我们提出的基于LoRA适配的FIDS(通过分离检测外部指令)方法实现了15.4%的攻击降低,误拒率增加10.4%。组合防御方案可降低26.3%的攻击成功率,证明训练时防御在安全性和效用保持方面均优于推理时缓解措施。