Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used to study large-scale, machine-learning (ML)-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between a simulator and ground truth remains challenging as these ML-based systems grow increasingly complex. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals; it therefore applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence-interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of competing simulators. We demonstrate the methodology in an application assessing the simulation fidelity of four LLMs on the WorldValueBench dataset.
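For reference, the risk-aware summaries mentioned above follow from the estimated quantile curve via the standard definitions below; this is a sketch using illustrative notation (a scalar discrepancy $D$ with quantile function $Q_D$ and risk level $\alpha$), not notation taken from the paper itself:
\[
  \mathrm{VaR}_{\alpha}(D) \;=\; Q_D(\alpha) \;=\; \inf\{d \in \mathbb{R} : \Pr(D \le d) \ge \alpha\},
  \qquad
  \mathrm{CVaR}_{\alpha}(D) \;=\; \frac{1}{1-\alpha}\int_{\alpha}^{1} Q_D(u)\, du .
\]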