Bias in datasets can be highly detrimental to accurate statistical estimation. In response to this problem, importance weighting methods have been developed to match any biased distribution to its corresponding unbiased target distribution. The seminal Kernel Mean Matching (KMM) method is still considered state of the art in this research field. However, one of its main drawbacks is the computational burden on large datasets. Building on previous works by Huang et al. (2007) and de Mathelin et al. (2021), we derive a novel importance weighting algorithm that scales to large datasets by using a neural network to predict the instance weights. We show, on multiple public datasets under various sample biases, that our proposed approach drastically reduces the computational time on large datasets while maintaining sample bias correction performance similar to other importance weighting methods. The proposed approach appears to be the only one able to produce relevant reweighting in a reasonable time for large datasets with up to two million instances.
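The abstract describes the core idea of replacing KMM's per-instance quadratic program with a neural network that predicts instance weights. Below is a minimal, hedged sketch of that idea, not the authors' implementation: a small MLP outputs a positive weight for each biased (source) sample and is trained by minibatch gradient descent to minimize a kernel mean discrepancy (MMD) between the weighted source sample and the unbiased target sample. All names, architectures, and hyperparameters (`WeightNet`, `gamma`, batch size, etc.) are illustrative assumptions.

```python
# Sketch only: neural-network-parameterized importance weighting in the spirit of KMM.
import torch
import torch.nn as nn

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of a and b.
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

class WeightNet(nn.Module):
    # Small MLP mapping an input x to a positive scalar weight w(x).
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # Softplus keeps weights positive
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def weighted_mmd(w, Xs, Xt, gamma=1.0):
    # Squared MMD between the w-weighted source sample Xs and the target sample Xt.
    w = w / w.sum()                      # normalize weights to sum to one
    m = Xt.shape[0]
    Kss = rbf_kernel(Xs, Xs, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    return w @ Kss @ w - 2.0 * (w @ Kst).sum() / m + Ktt.sum() / m ** 2

def fit_weights(Xs, Xt, epochs=200, batch=256, lr=1e-3, gamma=1.0):
    # Minibatch training keeps the cost manageable on large datasets,
    # unlike solving the full quadratic program of classical KMM.
    net = WeightNet(Xs.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        idx_s = torch.randint(0, Xs.shape[0], (batch,))
        idx_t = torch.randint(0, Xt.shape[0], (batch,))
        xs, xt = Xs[idx_s], Xt[idx_t]
        loss = weighted_mmd(net(xs), xs, xt, gamma)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(Xs)  # one importance weight per source instance
```

Because the network is trained on minibatches and then applied to every source instance, the cost grows roughly linearly with dataset size rather than with the size of the full kernel matrix, which is the scalability argument made in the abstract.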