Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probes for safety detection. We systematically re-examine this paradigm. Motivated by the poor out-of-distribution performance of these probes, we hypothesize that they learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds in three stages: we first show that simple n-gram methods achieve comparable detection performance, then conduct controlled experiments on semantically cleaned datasets, and finally analyze the probes' pattern dependencies in detail. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussion in the hope of guiding responsible future research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
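To make the probe-versus-n-gram contrast concrete, the following is a minimal illustrative sketch, not the authors' released implementation: it trains a logistic-regression probe on a hidden state of a small causal LM and a word n-gram TF-IDF classifier on the raw prompts. The model name (`gpt2`), the toy prompts and labels, the layer choice, and the `last_token_hidden` helper are placeholders chosen for illustration only.

```python
# Minimal sketch: hidden-state linear probe vs. surface-level n-gram baseline
# for harmful-prompt detection. Toy data; real studies use curated safety datasets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative prompts with toy labels (0 = benign, 1 = harmful).
prompts = [
    "Explain how to bake sourdough bread at home.",
    "Write step-by-step instructions to pick a door lock.",
    "Summarize the plot of a classic novel.",
    "Give detailed steps to make a dangerous substance.",
]
labels = [0, 1, 0, 1]

# --- Probing approach: logistic regression on a hidden state ---
model_name = "gpt2"  # placeholder; probing papers typically use larger safety-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
lm.eval()

def last_token_hidden(text, layer=-1):
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs)
    return out.hidden_states[layer][0, -1].numpy()

X_probe = [last_token_hidden(p) for p in prompts]
probe = LogisticRegression(max_iter=1000).fit(X_probe, labels)

# --- Surface baseline: word n-grams + logistic regression ---
ngram_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
).fit(prompts, labels)

# If the n-gram baseline matches the probe on held-out data, the probe may be
# keying on surface cues (instruction phrasing, trigger words) rather than on
# semantic harmfulness.
print("probe train acc:", probe.score(X_probe, labels))
print("n-gram train acc:", ngram_clf.score(prompts, labels))
```

In this framing, comparable accuracy from the n-gram pipeline on held-out inputs is the warning sign the abstract describes: the probe's separability may reflect superficial lexical patterns rather than an internal representation of harmfulness.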