Estimating data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression -- a vanilla algorithm for learning quantiles with asymptotic guarantees -- tends to \emph{under-cover} relative to the desired coverage level in practice. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study on the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias, even in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for $\alpha>0.5$ and small $d/n$, the $\alpha$-quantile learned by quantile regression achieves coverage of roughly $\alpha - (\alpha-1/2)\cdot d/n$ regardless of the noise distribution, where $d$ is the input dimension and $n$ is the number of training data. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors, such as sample size and model capacity, on the under-coverage bias in more practical setups.