As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose PRISON, a unified framework that quantifies LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in real events, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.