Simulation studies are widely used to evaluate statistical methods. However, new methods are often introduced and evaluated using data-generating mechanisms (DGMs) devised by the same authors. This coupling creates misaligned incentives, such as the pressure to demonstrate the superiority of new methods, potentially compromising the neutrality of simulation studies. Furthermore, the results of simulation studies are often difficult to compare due to differences in DGMs, competing methods, and performance measures. This fragmentation can lead to conflicting conclusions, hinder methodological progress, and delay the adoption of effective methods. To address these challenges, we introduce the concept of living synthetic benchmarks. The key idea is to disentangle method development from simulation study development and to continuously update the benchmark whenever a new DGM, method, or performance measure becomes available. This separation benefits the neutrality of method evaluation, emphasizes the development of both methods and DGMs, and enables systematic comparisons. In this paper, we outline a blueprint for building and maintaining such benchmarks, discuss the technical and organizational challenges of implementation, and demonstrate feasibility with a prototype benchmark for publication bias adjustment methods. We conclude that living synthetic benchmarks have the potential to foster neutral, reproducible, and cumulative evaluation of methods, benefiting both method developers and users.