We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. Working in a generic learning setting, we employ two algorithms, depending on whether the considered loss function is gradient-Lipschitz or not. We then instantiate our framework on several applications, including vanilla sparse, group-sparse, and low-rank matrix recovery. For each application, this yields efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla $s$-sparsity, we reach the $s\log(d)/n$ rate under heavy tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source $\mathtt{Python}$ library called $\mathtt{linlearn}$, with which we carry out numerical experiments that confirm our theoretical findings and compare our approach to other recent ones proposed in the literature.
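To give a concrete sense of the kind of robust estimation at stake, the sketch below implements the classical median-of-means mean estimator, a standard building block for robustness to heavy tails and $\eta$-corruption. This is only an illustrative example of the general principle, not the estimators or the $\mathtt{linlearn}$ API from the paper; the function name and block count are our own choices.

```python
import numpy as np

def median_of_means(x, n_blocks=10, rng=None):
    """Median-of-means estimate of the mean of a 1-D sample.

    Shuffles the sample, splits it into `n_blocks` blocks, averages
    each block, and returns the median of the block means. A few
    heavy-tailed or adversarially corrupted points can only corrupt
    a few blocks, so the median of block means stays close to the
    true mean.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    perm = rng.permutation(len(x))
    blocks = np.array_split(x[perm], n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# A heavy-tailed sample (Student t, 2.5 degrees of freedom, mean 0)
# with a handful of gross outliers, mimicking eta-corruption.
rng = np.random.default_rng(0)
sample = rng.standard_t(df=2.5, size=1000)
sample[:5] = 1e6  # 5 corrupted points

robust_mean = median_of_means(sample, n_blocks=20)  # stays near 0
naive_mean = sample.mean()                          # blown up by outliers
```

The contrast between `robust_mean` and `naive_mean` is the basic phenomenon the paper's methods exploit at the level of gradients and losses rather than a single scalar mean.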