Logistic regression remains one of the most widely used tools in applied statistics, machine learning and data science. Practical datasets often have a substantial number of features $d$ relative to the sample size $n$. In these cases, the logistic regression maximum likelihood estimator (MLE) is biased, and its standard large-sample approximation is poor. In this paper, we develop an improved method for debiasing predictions and estimating frequentist uncertainty for such datasets. We build on recent work characterizing the asymptotic statistical behavior of the MLE in the regime where the aspect ratio $d / n$, instead of the number of features $d$, remains fixed as $n$ grows. In principle, this approximation facilitates bias and uncertainty corrections, but in practice, these corrections require an estimate of the signal strength of the predictors. Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude. The bias correction that this facilitates also reduces the variance of the predictions, yielding narrower confidence intervals with higher (valid) coverage of the true underlying probabilities and parameters. We provide an open source package for this method, available at https://github.com/google-research/sloe-logistic.
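The bias described above can be seen in a small simulation. The following is a minimal sketch, not the SLOE estimator and not the API of the linked package: it assumes i.i.d. standard Gaussian features, no intercept, and a fixed signal strength $\mathrm{Var}(x^\top \beta)$, and it fits the unregularized MLE with a plain Newton iteration. The function names `fit_logistic_mle` and `simulate_slope` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_logistic_mle(X, y, n_iter=50, tol=1e-8):
    """Unregularized logistic regression MLE (no intercept) via Newton's method."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        grad = X.T @ (y - p)                    # gradient of the log-likelihood
        w = p * (1.0 - p)                       # per-sample Hessian weights
        hess = X.T @ (X * w[:, None])           # negative Hessian (Fisher information)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate_slope(n, d, signal_var=5.0, seed=0):
    """Return the least-squares slope of the MLE coefficients on the true ones.

    A slope near 1 means the MLE is roughly unbiased; a slope above 1 means
    the coefficients are inflated. Assumed setup: X has i.i.d. N(0, 1) entries
    and half of the true coefficients are nonzero, scaled so that
    Var(x^T beta) = signal_var.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = np.zeros(d)
    k = d // 2
    beta[:k] = np.sqrt(signal_var / k)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = (rng.random(n) < prob).astype(float)
    beta_hat = fit_logistic_mle(X, y)
    return float(beta_hat @ beta / (beta @ beta))
```

Under these assumptions, comparing `simulate_slope(2000, 20)` (small $d/n$) with `simulate_slope(2000, 400)` ($d/n = 0.2$) should show a slope close to 1 in the first case and noticeably above 1 in the second. That inflation is what a bias correction has to undo, and its size depends on the signal strength that has to be estimated in practice.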