We study parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about the unknown parameters is tied only to the relatively small number of positive instances, which justifies the use of negative sampling. However, if the negative instances are subsampled down to the same level as the positive instances, information is lost. To retain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve estimation efficiency over the IPW method, we propose a likelihood-based estimator that corrects the log odds for the sampled data, and we prove that this improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as on a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of one month. Both theoretical and empirical results demonstrate the effectiveness of our method.
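The contrast between the two estimators can be illustrated with a minimal sketch, assuming a toy logistic-regression model. Everything named here (`fit_logistic`, `theta_true`, the keep-probability `pi`) is hypothetical and chosen for illustration; in particular, `pi` is an arbitrary nonuniform probability, not the optimal sampling probability derived in the paper. All positives are kept and each negative is kept with probability `pi`. The IPW fit reweights kept negatives by `1/pi`, while the likelihood-based fit adds the offset `-log(pi)` to the log odds of the sampled data.

```python
import numpy as np

def fit_logistic(X, y, offset=None, weights=None, n_iter=200, tol=1e-6):
    """Damped Newton solver for (weighted) logistic regression with an offset."""
    n, d = X.shape
    offset = np.zeros(n) if offset is None else offset
    weights = np.ones(n) if weights is None else weights

    def loglik(theta):
        eta = X @ theta + offset
        return np.sum(weights * (y * eta - np.logaddexp(0.0, eta)))

    theta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ theta + offset
        p = np.exp(-np.logaddexp(0.0, -eta))  # numerically stable sigmoid
        grad = X.T @ (weights * (y - p))
        hess = (X * (weights * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        if np.max(np.abs(step)) < tol:
            break
        # Backtracking (Armijo) line search so each update increases the log-likelihood.
        ll, slope, t = loglik(theta), grad @ step, 1.0
        while loglik(theta + t * step) < ll + 1e-4 * t * slope and t > 1e-12:
            t *= 0.5
        theta = theta + t * step
    return theta

# Simulated imbalanced data (assumed toy setup: intercept -5 makes positives rare).
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-5.0, 1.0])
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

# Illustrative nonuniform keep-probability for negatives (an assumption,
# NOT the optimal probability from the paper). Positives are always kept.
pi = np.clip(0.05 * np.exp(0.5 * x), 0.01, 1.0)
keep = (y == 1) | (rng.random(n) < pi)
X_s, y_s, pi_s = X[keep], y[keep], pi[keep]

# IPW estimator: weight each kept negative by the inverse of its keep-probability.
w = np.where(y_s == 1, 1.0, 1.0 / pi_s)
theta_ipw = fit_logistic(X_s, y_s, weights=w)

# Likelihood-based estimator: among kept rows the log odds are
# x @ theta - log(pi), so fit an unweighted model with offset -log(pi).
theta_lik = fit_logistic(X_s, y_s, offset=-np.log(pi_s))

print("kept rows:", X_s.shape[0], "of", n)
print("IPW estimate:       ", theta_ipw)
print("log-odds corrected: ", theta_lik)
```

On this toy problem both estimates should land near `theta_true` while using only a small fraction of the negatives. A single run cannot show the variance ordering claimed in the abstract; comparing the spread of the two estimators would require repeating the simulation over many seeds.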