Electronic health records (EHRs) hold great promise for advancing precision medicine but also present significant analytical challenges. In particular, patient-level data in EHRs often cannot be shared across institutions (data sources) because of government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To address these challenges, we propose a novel communication-efficient method that aggregates the local optimal estimates by recasting the problem as a missing data problem. In addition, we propose incorporating posterior samples from remote sites, which provide partial information on the missing quantities and improve the efficiency of parameter estimates while satisfying differential privacy, thereby reducing the risk of information leakage. The proposed approach, without sharing the raw patient-level data, allows for proper statistical inference and can accommodate sparse regressions. We provide a theoretical investigation of the asymptotic properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.