Measuring the stability of conclusions derived from Ordinary Least Squares linear regression is critically important, but most metrics either only measure local stability (i.e., against infinitesimal changes in the data) or are only interpretable under statistical assumptions. Recent work proposes a simple, global, finite-sample stability metric: the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion, specifically so that the sign of a particular coefficient of the estimated regressor changes. However, besides the trivial exponential-time algorithm, the only known approach for computing this metric is a greedy heuristic that lacks provable guarantees under reasonable, verifiable assumptions; the heuristic provides a loose upper bound on the stability and cannot certify lower bounds on it. We show that in the low-dimensional regime, where the number of covariates is a constant but the number of samples is large, there are efficient algorithms for provably estimating (a fractional version of) this metric. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where we can estimate the stability up to a factor of $3$ better than the greedy heuristic, and analyses where we can certify stability to dropping even a majority of the samples.
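To make the stability metric and the greedy heuristic concrete, here is a minimal sketch, not the implementation from the work described above: it refits OLS after removing each remaining candidate sample and greedily drops the one that moves the chosen coefficient furthest toward the opposite sign. The function name `greedy_sign_flip` and the refit-all-candidates strategy are illustrative assumptions; the count it returns is only an upper bound on the true (combinatorial) stability.

```python
import numpy as np

def ols_coef(X, y):
    # Ordinary least squares estimate via least-squares solve.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def greedy_sign_flip(X, y, j, max_drops=None):
    """Greedy upper bound on the stability metric: repeatedly drop the
    single remaining sample whose removal moves coefficient j furthest
    toward the opposite sign, refitting after each drop, until the sign
    of the refit coefficient flips. Returns the number of drops, or
    None if no flip occurs within max_drops removals."""
    idx = np.arange(len(y))
    target_sign = -np.sign(ols_coef(X, y)[j])
    if max_drops is None:
        # Keep enough samples for the regression to stay well-posed.
        max_drops = len(y) - X.shape[1] - 1
    drops = 0
    while drops < max_drops:
        best_i, best_val = None, None
        for i in range(len(idx)):
            keep = np.delete(idx, i)
            b = ols_coef(X[keep], y[keep])[j]
            # Larger target_sign * b means closer to the flipped sign.
            if best_val is None or target_sign * b > target_sign * best_val:
                best_i, best_val = i, b
        idx = np.delete(idx, best_i)
        drops += 1
        if np.sign(best_val) == target_sign:
            return drops
    return None
```

On a toy dataset where two high-leverage outliers reverse the sign of the slope, the heuristic removes exactly those two points; on adversarially structured data it can be far from optimal, which is the gap the provable algorithms close.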