Surrogate variables in electronic health records (EHR) play an important role in biomedical studies because chart-reviewed gold-standard labels are scarce or absent, so supervised methods that use only labeled data perform poorly. Meanwhile, synthesizing multi-site EHR data is crucial for powerful and generalizable statistical learning, but it faces the privacy constraint that individual-level data may not be transferred out of the local sites, a constraint known as DataSHIELD. In this paper, we develop a novel approach named SASH, for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. SASH leverages sizable unlabeled data from multiple local sites, with EHR surrogates predictive of the response, to assist training with labeled data and substantially improve statistical efficiency. It first extracts a preliminary supervised estimator to realize convex training of a regularized single index model for the surrogate at each local site, and then aggregates the fitted local models for accurate learning of the target outcome model. It protects individual-level information from the local sites through summary-statistics-based data aggregation. We show that, under mild conditions, our method attains substantially lower estimation error rates than the supervised or local semi-supervised (SS) methods and is asymptotically equivalent to the ideal individual patient data (IPD) pooled estimator, which is available only in the absence of privacy constraints. Through simulation studies, we demonstrate that SASH outperforms all existing supervised or SS federated approaches and performs comparably to IPD. Finally, we apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale biobank data sets from UK Biobank and Mass General Brigham, where only a small fraction of subjects from the latter have been labeled via chart review.
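To make the idea of summary-statistics-based aggregation concrete, the following is a minimal sketch and not the SASH procedure itself. It assumes a much simpler setting, ridge-type least squares on fully labeled data, in which each site shares only its Gram matrix, cross-product vector, and sample size. The function names (`local_summary`, `aggregate`, `solve_ridge`) and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]
Summary = Tuple[Matrix, Vector, int]


def local_summary(X: Matrix, y: Vector) -> Summary:
    """Run inside one site: return (X^T X, X^T y, n). No individual rows leave the site."""
    p = len(X[0])
    gram = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return gram, xty, len(X)


def aggregate(summaries: Sequence[Summary]) -> Summary:
    """Run at the coordinating center: add up the site-level summaries."""
    p = len(summaries[0][1])
    gram = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    n = 0
    for g, v, m in summaries:
        for j in range(p):
            xty[j] += v[j]
            for k in range(p):
                gram[j][k] += g[j][k]
        n += m
    return gram, xty, n


def solve_ridge(gram: Matrix, xty: Vector, n: int, lam: float = 0.0) -> Vector:
    """Solve (gram / n + lam * I) beta = xty / n by Gauss-Jordan elimination."""
    p = len(xty)
    # Augmented matrix [A | b]
    A = [
        [gram[j][k] / n + (lam if j == k else 0.0) for k in range(p)] + [xty[j] / n]
        for j in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[j][p] / A[j][j] for j in range(p)]


# Toy data (hypothetical): two sites, an intercept column plus one covariate.
site1_X, site1_y = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0]
site2_X, site2_y = [[1.0, 3.0], [1.0, 4.0]], [7.0, 9.0]

# Federated fit: only the summaries cross site boundaries.
beta_fed = solve_ridge(*aggregate([local_summary(site1_X, site1_y),
                                   local_summary(site2_X, site2_y)]))

# Pooled fit, possible only if individual-level data could be shared.
beta_pooled = solve_ridge(*local_summary(site1_X + site2_X, site1_y + site2_y))
```

In this simplified least-squares case the aggregated fit coincides with the pooled fit, because the Gram matrix and cross-product vector are additive across sites. The setting described above, with regularized single index models and surrogates, is more involved, and equivalence to the pooled estimator is claimed there only asymptotically.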