Field experiments (A/B tests) are often the most credible benchmark for methods in societal systems, but their cost and latency create a major bottleneck for iterative method development. LLM-based persona simulation offers a cheap synthetic alternative, yet it is unclear whether replacing humans with personas preserves the benchmark interface that adaptive methods optimize against. We prove an if-and-only-if characterization: when (i) methods observe only the aggregate outcome (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the algorithm's identity or provenance (algorithm-blind evaluation), swapping humans for personas is, from the method's point of view, merely a panel change, indistinguishable from changing the evaluation population (e.g., New York to Jakarta). We then move from validity to usefulness: we define an information-theoretic discriminability of the induced aggregate channel and show that making persona benchmarking as decision-relevant as a field experiment is fundamentally a sample-size question, yielding explicit bounds on the number of independent persona evaluations required to reliably distinguish meaningfully different methods at a chosen resolution.
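As an illustrative sketch of the kind of sample-size bound the abstract refers to (this is a standard concentration argument under assumed binary outcomes, not the paper's actual statement), suppose each independent persona evaluation returns a Bernoulli outcome and two methods' aggregate success rates differ by at least a resolution $\Delta > 0$. Estimating each rate from $n$ evaluations and picking the larger empirical mean, Hoeffding's inequality plus a union bound gives
\[
\Pr[\text{wrong ranking}] \;\le\; 4\exp\!\left(-\tfrac{n\Delta^2}{2}\right),
\qquad\text{so}\qquad
n \;\ge\; \frac{2}{\Delta^2}\ln\frac{4}{\delta}
\]
independent persona evaluations per method suffice to distinguish them with error probability at most $\delta$. The $1/\Delta^2$ scaling is the sense in which decision relevance at a chosen resolution is fundamentally a sample-size question.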