With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many practical situations, poor control of the data acquisition process may jeopardize the outputs of machine learning algorithms, and selection bias issues are now the subject of much attention in the literature. The present article investigates how to extend Empirical Risk Minimization, the principal paradigm of statistical learning, to the case where training observations are generated from biased models, i.e., from distributions that differ from the test/prediction distribution while being absolutely continuous with respect to it. Precisely, we show how to build a "nearly debiased" training statistical population from the biased samples and the related biasing functions, following in the footsteps of the approach originally proposed in Vardi (1985). We then study, from a nonasymptotic perspective, the performance of minimizers of an empirical version of the risk computed from the statistical population thus constructed. Remarkably, the learning rate achieved by this procedure is of the same order as that attained in the absence of selection bias. Beyond these theoretical guarantees, we also present experimental results supporting the relevance of the algorithmic approach promoted in this article.
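To fix ideas, the debiasing step described above can be thought of, in the simplest single-sample case, as reweighting the biased observations by the inverse of the known biasing function. The sketch below is our own minimal illustration, not the procedure of the article: the biasing function `omega`, the uniform target distribution, and the rejection-sampling scheme are assumptions made purely for the example. It shows how self-normalized weights proportional to 1/omega(X_i) recover a statistic of the test distribution from a biased sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def omega(x):
    # Hypothetical biasing function: larger values of x are over-sampled.
    return 1.0 + x

def biased_sample(n):
    # Draw from the biased distribution (density proportional to omega on [0, 1])
    # by rejection sampling from the uniform target distribution.
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 1.0)
        if rng.uniform(0.0, 2.0) < omega(x):  # accept with probability omega(x) / max(omega)
            out.append(x)
    return np.array(out)

x = biased_sample(5000)

# Self-normalized importance weights 1/omega(X_i): they approximately restore
# expectations taken under the (unbiased) test distribution.
w = 1.0 / omega(x)
w /= w.sum()

print("naive mean on the biased sample:", x.mean())       # close to 5/9 (approx. 0.56)
print("reweighted (debiased) mean:     ", np.sum(w * x))  # close to the target mean 0.5
```

The same weights can be plugged into a weighted empirical risk (e.g., via the `sample_weight` argument of standard learners) to obtain an importance-weighted variant of Empirical Risk Minimization in this illustrative setting.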