LLMs are now an integral part of information retrieval. As such, their role as question-answering chatbots raises significant concerns, given their demonstrated vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation of LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book, fact-based QA settings, we undermine the correctness of their responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks achieve the highest success rate (up to ~85.3%) while also inducing high uncertainty on incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish attacked from unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupted LLMs is a first checkpoint toward user cyberspace safety.
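To make the defense direction concrete, the following is a minimal sketch, not the paper's implementation, of training a Random Forest on per-response uncertainty features to separate attacked from clean queries. The specific features (mean token entropy, max token entropy, sequence log-likelihood) and the synthetic data are assumptions purely for illustration; in practice the features would be computed from the victim LLM's generation statistics.

```python
# Sketch: Random Forest detector over response-uncertainty features.
# The feature set and synthetic distributions below are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000

# Columns: mean token entropy, max token entropy, sequence log-likelihood
# (assumed features; attacked responses are modeled as more uncertain).
clean = np.column_stack([
    rng.normal(1.0, 0.3, n),
    rng.normal(2.0, 0.5, n),
    rng.normal(-20.0, 5.0, n),
])
attacked = np.column_stack([
    rng.normal(1.8, 0.4, n),
    rng.normal(3.0, 0.6, n),
    rng.normal(-35.0, 8.0, n),
])

X = np.vstack([clean, attacked])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = attacked query

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

With uncertainty features that actually shift under attack, such a detector can flag suspicious responses to the user without requiring access to the model's weights, which is the spirit of the defense described above.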