Prediction scoring of data-driven discoveries for reproducible research

Predictive modeling uncovers knowledge and insights regarding a hypothesized data generating mechanism (DGM). Results from different studies on a complex DGM, derived from different data sets, and using complicated models and algorithms, are hard to quantitatively compare due to random noise and statistical uncertainty in model results. This has been one of the main contributors to the replication crisis in the behavioral sciences. The contribution of this paper is to apply prediction scoring to the problem of comparing two studies, such as can arise when evaluating replications or competing evidence. We examine the role of predictive models in quantitatively assessing agreement between two datasets that are assumed to come from two distinct DGMs. We formalize a distance between the DGMs that is estimated using cross validation. We argue that the resulting prediction scores depend on the predictive models created by cross validation. In this sense, the prediction scores measure the distance between DGMs, along the dimension of the particular predictive model. Using human behavior data from experimental economics, we demonstrate that prediction scores can be used to evaluate preregistered hypotheses and provide insights comparing data from different populations and settings. We examine the asymptotic behavior of the prediction scores using simulated experimental data and demonstrate that leveraging competing predictive models can reveal important differences between underlying DGMs. Our proposed cross-validated prediction scores are capable of quantifying differences between unobserved data generating mechanisms and allow for the validation and assessment of results from complex models.
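To make the idea of a cross-validated distance between DGMs concrete, here is a minimal sketch in Python. It is not the paper's estimator: the function name `cv_prediction_score`, the choice of ordinary least squares as the predictive model, and the use of a mean-squared-error gap as the score are all assumptions made for illustration. The sketch fits the model on training folds of study A, then compares its error on held-out data from study A with its error on study B.

```python
import numpy as np

def cv_prediction_score(X_a, y_a, X_b, y_b, n_folds=5, seed=0):
    """Hypothetical score (an assumption, not the paper's definition):
    excess out-of-sample MSE on study B relative to held-out data from
    study A, averaged over cross-validation folds of A."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_a)), n_folds)

    def add_const(X):
        # Prepend an intercept column to the design matrix.
        return np.column_stack([np.ones(len(X)), X])

    design_b = add_const(X_b)
    gaps = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit OLS on the training folds of study A only.
        beta, *_ = np.linalg.lstsq(add_const(X_a[train]), y_a[train], rcond=None)
        mse_within = np.mean((y_a[test] - add_const(X_a[test]) @ beta) ** 2)
        mse_across = np.mean((y_b - design_b @ beta) ** 2)
        gaps.append(mse_across - mse_within)
    return float(np.mean(gaps))

def simulate(n, slope, rng):
    """Toy DGM: y = slope * x + Gaussian noise (illustrative only)."""
    X = rng.normal(size=(n, 1))
    y = slope * X[:, 0] + rng.normal(scale=0.5, size=n)
    return X, y

rng = np.random.default_rng(42)
X_a, y_a = simulate(400, 1.0, rng)
X_same, y_same = simulate(400, 1.0, rng)   # replication from the same DGM
X_diff, y_diff = simulate(400, 2.0, rng)   # data from a different DGM

score_same = cv_prediction_score(X_a, y_a, X_same, y_same)
score_diff = cv_prediction_score(X_a, y_a, X_diff, y_diff)
print(f"same DGM: {score_same:.3f}, different DGM: {score_diff:.3f}")
```

Under these assumptions the score should sit near zero when both datasets share a DGM and grow when they differ. Because the score is computed through one particular model, a different model class could rank the same pair of datasets differently, which is the sense in which the abstract describes the distance as measured along the dimension of the predictive model.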
