Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
翻译:大型语言模型(LLMs)有时能检测到自身正处于评估状态,并调整行为以显得更加符合预期,这影响了安全评估的可靠性。本文研究表明,通过在LLM的激活状态中添加引导向量,可以抑制其评估感知能力,使模型在评估过程中表现出类似部署状态的行为。为研究该引导技术,我们通过两步训练流程使LLM呈现评估感知行为,该流程模拟了此类行为自然产生的过程:首先,基于两类文档进行持续预训练——(1)在评估时使用Python类型提示而部署时不使用;(2)识别特定评估线索始终代表测试场景。随后通过专家迭代训练模型在评估场景中使用Python类型提示。所得模型具有评估感知特性:在评估语境中比部署语境更频繁地使用类型提示。研究发现,激活引导能抑制评估感知,使模型即使存在评估线索时仍表现出部署状态行为。关键的是,我们使用额外训练前的原始模型构建了引导向量。结果表明,AI评估者可通过引导模型模拟部署状态行为来提升安全评估的可靠性。