Paradoxes and resolutions for semiparametric fusion of individual and summary data

Suppose we have individual-level data from an internal study and various types of summary statistics from relevant external studies. External summary statistics have been used as constraints on the internal data distribution, which promises to improve statistical inference for the internal data. However, the additional use of external summary data may lead to paradoxical results: efficiency loss may occur if the uncertainty of the summary statistics is not negligible, and large estimation bias can emerge even if the bias of the external summary statistics is small. We investigate these paradoxical results in a semiparametric framework. We establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution, which is shown to be no larger than the bound obtained using only internal data. We propose a data-fused efficient estimator that achieves this bound, which resolves the efficiency paradox. We further propose a debiased estimator that uses an adaptive lasso penalty to achieve selection consistency, so that the resulting estimator has the same asymptotic distribution as the oracle estimator that uses only unbiased summary statistics; this resolves the bias paradox. Simulations and an application to a Helicobacter pylori infection dataset illustrate the proposed methods.
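The efficiency paradox can be seen in a toy example that is not the paper's estimator but a simple control-variate sketch under assumptions introduced here for illustration. Assume the target is E[Y], the internal study has n i.i.d. pairs (X, Y) with correlation rho, and an independent external study of size m from the same population reports only its sample mean of X. Assume also that the regression slope of Y on X is known. The "naive" estimator treats the external mean as the exact value of E[X]; the "fused" estimator shrinks the adjustment by m / (n + m) to account for the external sampling error. The function name and arguments below are hypothetical.

```python
def variance_comparison(n, m, rho, sigma_y=1.0):
    """Closed-form variances of three estimators of E[Y] in a toy setting.

    Assumptions (illustrative, not taken from the paper):
      - n internal i.i.d. pairs (X, Y) with Corr(X, Y) = rho, SD(Y) = sigma_y
      - an independent external sample of size m from the same population,
        summarized only by its sample mean of X
      - the slope beta = Cov(X, Y) / Var(X) is treated as known
    """
    if n <= 0 or m <= 0:
        raise ValueError("sample sizes must be positive")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")

    var_y = sigma_y ** 2

    # Internal-only estimator: the sample mean of Y.
    internal = var_y / n

    # Naive constrained estimator: Ybar - beta * (Xbar - external_mean),
    # which ignores the sampling error of the external mean.
    naive = var_y * (1.0 - rho ** 2) / n + var_y * rho ** 2 / m

    # Uncertainty-aware estimator: same form, but with the slope shrunk
    # by m / (n + m), the variance-minimizing coefficient in this setting.
    fused = var_y / n * (1.0 - rho ** 2 * m / (n + m))

    return {"internal": internal, "naive": naive, "fused": fused}


if __name__ == "__main__":
    # A small external study (m < n) makes the naive estimator worse than
    # using internal data alone, while the fused estimator is never worse.
    for name, value in variance_comparison(n=1000, m=100, rho=0.8).items():
        print(f"{name}: {value:.6f}")
```

In this sketch the naive variance exceeds the internal-only variance whenever m < n and rho is nonzero, while the fused variance never exceeds it, which mirrors the qualitative claim in the abstract that the efficiency bound with external summary data is no larger than the internal-only bound.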
