The deployment of Large Language Models (LLMs) as tool-using agents allows their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior in which models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without the user's instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally raises whistleblowing rates substantially, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.
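To make the activation-probing check concrete, the sketch below shows one common way such a probe can be implemented: a logistic-regression classifier fit on cached hidden-state activations with binary evaluation-awareness labels. The file name, layer choice, and label source are illustrative assumptions, not details from this work.

```python
# Minimal sketch of a linear probe for evaluation awareness, assuming
# activations have already been cached as a (num_transcripts, hidden_dim)
# array paired with binary labels (1 = transcript judged evaluation-aware,
# 0 = not). The file layout and layer choice are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical cached data: residual-stream activations at a chosen layer,
# one row per transcript, plus judge-assigned labels.
data = np.load("activations_layer20.npz")
X, y = data["activations"], data["labels"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Simple L2-regularized logistic-regression probe on the activations.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)

# Higher held-out AUROC on a given scenario set would indicate that the
# model's internal representations more readily separate "evaluation"
# from "deployment-like" transcripts, i.e. greater evaluation awareness.
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe held-out AUROC: {auc:.3f}")
```

Under this framing, a lower probe AUROC on our scenarios than on comparable prior benchmarks would be consistent with the claim that the staged settings read as more realistic to the model.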