Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify the sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise, but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Prompt-based elicitation was not reliable, but training-based elicitation consistently recovered full performance from the sandbagging models using only a single correct demonstration of the evaluation task. However, the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false positives. In the short term, we recommend that developers remove potential sandbagging via on-distribution elicitation training. In the longer term, further research is needed to ensure the efficacy of training-based elicitation and to develop robust methods for sandbagging detection. We open-source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .
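To make the linear-probe idea concrete, here is a minimal, self-contained sketch (not the paper's actual pipeline): a logistic-regression probe is fit on hidden-state activations labelled as coming from sandbagging versus benign behaviour, then used to score new activations. The activations below are synthetic stand-ins; the dimension, the class separation, and the data itself are all assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Synthetic stand-ins for activations collected from known models:
# benign activations ~ N(0, I); "sandbagging" activations are shifted
# along a single direction, mimicking a linearly decodable signal.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
benign = rng.normal(size=(200, d))
sandbag = rng.normal(size=(200, d)) + 3.0 * direction

X = np.vstack([benign, sandbag])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = sandbagging

# Fit the linear probe and report its accuracy on the labelled data.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```

As the abstract notes, the naive version of this approach is fragile: a red team that trains models to alter their internal representations can shift activations off the probe's learned direction, so probe accuracy on cooperative models does not transfer to adversarial ones.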