Data fusion enables powerful and generalizable analyses across multiple sources. However, differences in data collection capacity across sources lead to blockwise missingness (BM), which poses challenges in practice. Meanwhile, the high cost of obtaining gold-standard labels leaves the majority of samples unlabeled, a situation known as the semi-supervised (SS) problem. In this paper, we propose a novel Data-adaptive Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE) that handles both BM and SS issues in the presence of distributional shifts across data sources under a missing at random (MAR) mechanism. DEFUSE starts with a complete-data-only estimator derived from the primary data source, and uses data-adaptive and distributional-shift-adjusted procedures to successively incorporate the data with BM covariates and the large unlabeled sample, reducing the estimation variance without incurring bias. To further avoid bias from fusing misaligned data that violate the MAR assumption, we develop a screening method to identify and exclude data sources that are not aligned with the primary source. Compared to existing approaches, DEFUSE offers two main improvements. First, it introduces a new data-adaptive control variate approach to handle BM, which achieves intrinsic efficiency and robustness against distributional shifts. Second, it reveals a more essential role for the unlabeled sample in the BM regression problem, leading to improved estimation. These advantages are theoretically guaranteed and empirically supported by simulation studies and two real-world biomedical applications.