A data set sampled from a certain population is biased if the subgroups of the population are sampled at proportions that differ significantly from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for the bias. We consider two commonly used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function. Though the two techniques are statistically equivalent, it has been observed that resampling outperforms reweighting when combined with stochastic gradient algorithms. By analyzing illustrative examples, we explain the reason behind this phenomenon using tools from dynamical stability and stochastic asymptotics. We also present experiments from regression, classification, and off-policy prediction to demonstrate that this is a general phenomenon. We argue that it is imperative to consider the objective function design and the optimization algorithm together when addressing sampling bias.
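To fix ideas about the two corrections being contrasted, here is a minimal sketch of reweighting versus resampling in a stochastic-gradient loop on a toy two-group regression. This is our own illustration, not the paper's code; all names (`p_data`, `reweighted_draw`, etc.) and the toy setup are assumptions. Reweighting samples uniformly from the biased data and scales each stochastic gradient by an importance weight, while resampling draws examples in proportion to those weights and leaves the gradient unscaled, so the two loops share the same expected gradient up to a constant factor.

```python
# Toy sketch (illustrative assumptions throughout): two-group least-squares
# regression, with group 0 over-represented in the biased sample.
import numpy as np

rng = np.random.default_rng(0)

p_true = np.array([0.5, 0.5])   # underlying population proportions
p_data = np.array([0.9, 0.1])   # proportions in the biased data set

n = 10_000
groups = rng.choice(2, size=n, p=p_data)
x = rng.normal(size=n)
slopes = np.where(groups == 0, 1.0, -1.0)   # the two groups disagree on the slope
y = slopes * x + 0.1 * rng.normal(size=n)

# Importance weights that correct the biased sample toward the population.
w = (p_true / p_data)[groups]

def sgd(draw, steps=50_000, lr=0.005):
    """Minimize the (weighted) least-squares loss over a scalar slope theta."""
    theta = 0.0
    for _ in range(steps):
        i, weight = draw()
        grad = weight * 2.0 * (theta * x[i] - y[i]) * x[i]
        theta -= lr * grad
    return theta

# Reweighting: sample uniformly from the biased data, scale the gradient.
def reweighted_draw():
    i = rng.integers(n)
    return i, w[i]

# Resampling: sample in proportion to the weights, use the plain gradient.
probs = w / w.sum()
def resampled_draw():
    i = rng.choice(n, p=probs)
    return i, 1.0

print("reweighting:", sgd(reweighted_draw))   # both estimates approach the
print("resampling: ", sgd(resampled_draw))    # population optimum theta* = 0
```

Both loops target the same population objective; the paper's point is that, despite this statistical equivalence, the dynamical behavior of the iterates differs, and resampling behaves better under stochastic gradient algorithms.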