High-quality labeled data are essential for reliable statistical inference, but their availability is often limited by the cost of validation. Surrogate labels offer a cost-effective alternative, but their noise can introduce non-negligible bias. To address this challenge, we propose the surrogate-powered inference (SPI) toolbox, a unified framework that leverages both the validity of high-quality labels and the abundance of surrogates to enable reliable statistical inference. SPI comprises three progressively enhanced versions. Base-SPI integrates validated labels and surrogates through augmentation to improve estimation efficiency. SPI+ incorporates regularized regression to safely handle multiple surrogates, preventing the performance degradation caused by error accumulation. SPI++ further optimizes efficiency under limited validation budgets through an adaptive, multiwave labeling procedure that prioritizes informative subjects for labeling. Compared to traditional methods, SPI substantially reduces estimation error and increases power for identifying risk factors. These results demonstrate the value of SPI for improving reproducibility. Theoretical guarantees and extensive simulation studies further illustrate the properties of our approach.
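To make the augmentation idea behind Base-SPI concrete, the following is a minimal sketch of a generic surrogate-augmented estimator, not the authors' implementation. It assumes the simplest possible target (a population mean rather than regression coefficients), a single surrogate, and a validated subset drawn uniformly at random from the full sample. The function name `augmented_mean` and all variable names are hypothetical. The estimator starts from the mean of the validated labels and subtracts a correction proportional to the gap between the surrogate mean in the validated subset and the surrogate mean in the full sample.

```python
import numpy as np


def augmented_mean(y_labeled, s_labeled, s_all):
    """Surrogate-augmented estimate of the mean of Y (illustrative only).

    y_labeled: validated labels on the labeled subset
    s_labeled: surrogate values on the same labeled subset
    s_all:     surrogate values on the full sample (labeled subset included)
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    s_labeled = np.asarray(s_labeled, dtype=float)
    s_all = np.asarray(s_all, dtype=float)

    # Augmentation weight: slope of Y on S estimated from the labeled subset.
    # This choice minimizes the variance of the estimator below when the
    # labeled subset is a simple random sample of the full sample.
    var_s = np.var(s_labeled, ddof=1)
    if var_s == 0.0:
        gamma = 0.0  # a constant surrogate carries no information
    else:
        gamma = np.cov(y_labeled, s_labeled, ddof=1)[0, 1] / var_s

    # Labeled-only mean, corrected by how far the labeled subset's surrogate
    # mean drifts from the surrogate mean of the full sample.
    return y_labeled.mean() - gamma * (s_labeled.mean() - s_all.mean())


# Simulated usage: binary outcome with a noisy continuous surrogate.
rng = np.random.default_rng(0)
N, n = 5000, 200                               # full sample size, validation budget
y = rng.binomial(1, 0.3, size=N).astype(float)
s = y + rng.normal(0.0, 0.5, size=N)           # surrogate = truth + noise (assumed)
idx = rng.choice(N, size=n, replace=False)     # uniformly sampled validation set

print("labeled-only mean:", y[idx].mean())
print("augmented mean:   ", augmented_mean(y[idx], s[idx], s))
```

Because the correction term has mean zero under uniform sampling, a noisy or even uninformative surrogate does not bias this estimator; it only determines how much variance is removed. Handling several surrogates safely (as in SPI+) or choosing which subjects to validate adaptively (as in SPI++) is beyond this sketch.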