We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. In this distribution-free regression setting, we show that boundedness of the conditional second moment of the response given the covariates is a necessary and sufficient condition for achieving nontrivial guarantees. As a starting point, we prove an optimal version of the classical in-expectation bound for the truncated least squares estimator due to Gy\"{o}rfi, Kohler, Krzy\.{z}ak, and Walk. However, we show that this procedure fails with constant probability for some distributions despite its optimal in-expectation performance. Then, combining the ideas of truncated least squares, median-of-means procedures, and aggregation theory, we construct a non-linear estimator achieving excess risk of order $d/n$ with an optimal sub-exponential tail. While existing approaches to linear regression for heavy-tailed distributions focus on proper estimators that return linear functions, we highlight that the improperness of our procedure is necessary for attaining nontrivial guarantees in the distribution-free setting.
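As a rough illustration of the starting point named above, the sketch below fits ordinary least squares and then clips its predictions to a fixed interval, which is the basic idea behind a truncated least squares estimator. The function name `truncated_least_squares`, the truncation level `m`, and the synthetic Student-t noise are assumptions made for this example only; the sketch does not reproduce the paper's median-of-means or aggregation steps, and the paper's exact procedure may differ.

```python
import numpy as np


def truncated_least_squares(X, y, m):
    """Fit ordinary least squares, then clip predictions to [-m, m].

    `m` is a hypothetical truncation level chosen by the user.
    """
    # lstsq returns the minimum-norm solution, so a rank-deficient
    # design matrix does not cause a failure here.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(X_new):
        # Clipping makes the predictor non-linear in x, i.e. improper:
        # it is no longer a linear function of the covariates.
        return np.clip(X_new @ w, -m, m)

    return w, predict


# Synthetic data (an assumption for illustration): Gaussian covariates
# and heavy-tailed Student-t noise with 3 degrees of freedom.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_star = np.ones(d)
y = X @ w_star + rng.standard_t(df=3, size=n)

w_hat, predict = truncated_least_squares(X, y, m=3.0)
preds = predict(rng.standard_normal((10, d)))
```

The returned predictor is bounded by construction, which is what distinguishes it from a proper estimator that returns the linear function `x @ w_hat` itself.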