As military organisations consider integrating large language models (LLMs) into command and control (C2) systems for planning and decision support, understanding their behavioural tendencies is critical. This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour by comparing LLMs acting as agents in multi-turn simulated conflict. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine: Civilian Target Rate (CTR) and Dual-use Target Rate (DTR) assess compliance with legal targeting principles, while Mean and Max Simulated Non-combatant Casualty Value (SNCV) quantify tolerance for civilian harm. We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations across three geographic regions. Our findings reveal that off-the-shelf LLMs exhibit concerning and unpredictable targeting behaviour in simulated conflict environments. All models violated the IHL principle of distinction by targeting civilian objects, with breach rates ranging from 16.7% to 66.7%. Harm tolerance escalated through crisis simulations, with MeanSNCV increasing from 16.5 in early turns to 27.7 in late turns. Significant inter-model variation emerged: LLaMA-3.1 selected an average of 3.47 civilian strikes per simulation with a MeanSNCV of 28.4, while Gemini-2.5 selected 0.90 civilian strikes with a MeanSNCV of 17.6. These differences indicate that model selection for deployment constitutes a choice about acceptable legal and moral risk profiles in military operations. This work seeks to provide a proof-of-concept of the behavioural risks that could emerge from the use of LLMs in AI-based Decision Support Systems (AI DSS), as well as a reproducible benchmarking framework with interpretable metrics for standardising pre-deployment testing.
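To make the four metrics concrete, the sketch below shows one plausible way to compute them from per-simulation strike logs. The exact definitions are not spelled out in the abstract, so the formulas, the `Strike` record, and the target-class labels are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch only: assumes CTR/DTR are the share of strikes directed
# at civilian or dual-use objects, and that each strike carries a simulated
# non-combatant casualty value (SNCV) aggregated per simulation run.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Strike:
    target_class: str   # assumed labels: "military", "dual_use", "civilian"
    sncv: float         # simulated non-combatant casualty value for this strike


def targeting_metrics(strikes: list[Strike]) -> dict[str, float]:
    """Compute CTR, DTR, MeanSNCV and MaxSNCV for one simulation run."""
    if not strikes:
        return {"CTR": 0.0, "DTR": 0.0, "MeanSNCV": 0.0, "MaxSNCV": 0.0}
    n = len(strikes)
    sncvs = [s.sncv for s in strikes]
    return {
        "CTR": sum(s.target_class == "civilian" for s in strikes) / n,
        "DTR": sum(s.target_class == "dual_use" for s in strikes) / n,
        "MeanSNCV": mean(sncvs),
        "MaxSNCV": max(sncvs),
    }


# Example: a run containing one military, one dual-use, and one civilian strike
run = [Strike("military", 2.0), Strike("dual_use", 12.5), Strike("civilian", 30.0)]
print(targeting_metrics(run))
```

Under these assumptions, per-run values would then be averaged across the 90 simulations (and binned by turn) to produce figures such as the early- versus late-turn MeanSNCV comparison reported above.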