As LLM-based agents grow more autonomous and multimodal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior benchmarks have measured agents' propensity for misaligned behavior and shown that agent personas and tool access significantly influence misalignment. Building on these insights, we propose a Verifiability-First architecture that (1) integrates run-time attestation of agent actions using cryptographic and symbolic methods, (2) embeds lightweight Audit Agents that continuously verify agent behavior against deployer intent using constrained reasoning, and (3) enforces challenge-response attestation protocols for high-risk operations. We introduce OPERA (Observability, Provable Execution, Red-team, Attestation), a benchmark suite and evaluation protocol that measures (i) the detectability of misalignment, (ii) time to detection under stealthy strategies, and (iii) the resilience of verifiability mechanisms to adversarial prompt and persona injection. Our approach shifts the evaluation focus from how likely misalignment is to occur to how quickly and reliably it can be detected and remediated.
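To make the first and third architectural components concrete, below is a minimal Python sketch, not the paper's actual implementation, of how run-time attestation and the challenge-response gate might be realized: a hash-chained, HMAC-signed action log that an auditor holding the key can verify, plus a nonce challenge issued before high-risk tool calls. All identifiers here (`ActionAttestor`, `HIGH_RISK_TOOLS`, `challenge`) are hypothetical names introduced for illustration.

```python
import hashlib
import hmac
import json
import os
import secrets
import time

# Illustrative set of tools that trigger the challenge-response gate
# (an assumption for this sketch, not defined by OPERA).
HIGH_RISK_TOOLS = {"transfer_funds", "delete_records", "send_email"}


class ActionAttestor:
    """Signs each agent action and chains it to the previous one, so an
    auditor holding the key can detect forged, modified, reordered, or
    dropped entries."""

    def __init__(self, key: bytes):
        self._key = key
        self._prev_digest = b"\x00" * 32  # genesis value for the chain

    def attest(self, action: dict) -> dict:
        record = {
            "ts": time.time(),
            "action": action,
            "prev": self._prev_digest.hex(),
        }
        payload = json.dumps(record, sort_keys=True).encode()
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        # The next record's "prev" field commits to this payload and MAC.
        self._prev_digest = hashlib.sha256(payload + tag).digest()
        record["mac"] = tag.hex()
        return record

    def verify_chain(self, records: list[dict]) -> bool:
        prev = b"\x00" * 32
        for rec in records:
            if bytes.fromhex(rec["prev"]) != prev:
                return False  # chain broken: entry reordered or removed
            body = {k: rec[k] for k in ("ts", "action", "prev")}
            payload = json.dumps(body, sort_keys=True).encode()
            tag = hmac.new(self._key, payload, hashlib.sha256).digest()
            if not hmac.compare_digest(tag, bytes.fromhex(rec["mac"])):
                return False  # entry forged or modified
            prev = hashlib.sha256(payload + tag).digest()
        return True


def challenge(tool_name: str) -> str | None:
    """Issue a fresh nonce for high-risk tools; the agent must echo it back,
    bound to its stated justification, before the call is executed."""
    if tool_name in HIGH_RISK_TOOLS:
        return secrets.token_hex(16)
    return None


if __name__ == "__main__":
    attestor = ActionAttestor(key=os.urandom(32))
    log = [attestor.attest({"tool": "search", "args": {"q": "quarterly report"}})]
    nonce = challenge("transfer_funds")
    log.append(attestor.attest({"tool": "transfer_funds", "nonce": nonce,
                                "justification": "approved invoice"}))
    print("chain valid:", attestor.verify_chain(log))
```

Under these assumptions, an Audit Agent can replay `verify_chain` over the log at any point; because each entry's MAC covers the digest of its predecessor, a stealthy agent cannot silently rewrite or omit an earlier action without invalidating every subsequent record.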