A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data has become more accessible, the question of whether data sets from different sources should be integrated at all has become more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles from the shrinkage estimation literature. In particular, we depart from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend on, for example, the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests and confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation than binary data integration schemes.