Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.
翻译:现有文献表明,虚假对齐(欺骗性对齐)是大语言模型的一种涌现特性。我们首次提供了实证证据,证明一个小型指令微调模型(具体为LLaMA 3 8B)能够表现出虚假对齐行为。我们进一步证明,仅通过提示干预(包括道义伦理框架提示和思维链推理)即可显著减少此类行为,而无需修改模型内部结构。这一发现挑战了“基于提示的伦理干预是微不足道的”以及“欺骗性对齐需要模型达到一定规模”的假设。我们提出了一个分类法,用以区分浅层欺骗(受上下文影响且可通过提示抑制)与深层欺骗(反映持久性、目标驱动的未对齐状态)。我们的研究结果深化了对语言模型中欺骗行为的理解,并强调了在不同模型规模和部署场景下进行对齐评估的必要性。