Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust list-decodable learning focuses on the more challenging regime where only a minority $\frac{1}{k}$ fraction of the dataset is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new list-decodable mean estimation algorithm for bounded covariance distributions with optimal sample complexity and error rate, running in nearly-PCA time. Assuming the ground truth distribution on $\mathbb{R}^d$ has bounded covariance, our algorithm outputs a list of $O(k)$ candidate means, one of which is within distance $O(\sqrt{k})$ of the true mean. Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k \in O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset. We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee. This runtime matches, up to logarithmic factors, the cost of performing a single $k$-PCA on the data, which is a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest list-decodable mean estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$ and $\widetilde{O}(nd k^{C})$ for $C \ge 6$. Our approach builds on a novel soft downweighting method, $\mathsf{SIFT}$, which is arguably the simplest known polynomial-time mean estimation technique in the list-decodable learning setting. To develop our fast algorithms, we reduce the computational cost of $\mathsf{SIFT}$ via a careful "win-win-win" analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which we believe may be of independent interest.
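To make the "cost of a single $k$-PCA" benchmark concrete, the following is a minimal sketch of block power iteration for the top-$k$ principal subspace. It is not the paper's algorithm and not $\mathsf{SIFT}$; the function name `approx_k_pca`, its parameters, and the synthetic data are assumptions chosen for illustration. Each iteration multiplies the $n \times d$ data matrix by a $d \times k$ block twice, which costs $O(ndk)$ arithmetic operations, and that is the scaling the runtime above is compared against.

```python
import numpy as np

def approx_k_pca(X, k, iters=50, seed=0):
    """Approximate the top-k principal subspace of the rows of X
    by block power iteration (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)  # center the data: O(nd)
    # Random orthonormal starting block of shape (d, k).
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(iters):
        # Apply the empirical covariance implicitly, never forming the d-by-d matrix.
        # Two dense products of an n-by-d matrix with a k-column block: O(ndk).
        Z = Xc.T @ (Xc @ Q) / n
        # Re-orthonormalize the block: O(d k^2).
        Q, _ = np.linalg.qr(Z)
    return Q  # (d, k) matrix with orthonormal columns

# Synthetic example (assumed data): three high-variance coordinates, the rest unit variance.
rng = np.random.default_rng(1)
n, d, k = 2000, 30, 3
scales = np.ones(d)
scales[:k] = [10.0, 8.0, 6.0]
X = rng.standard_normal((n, d)) * scales
Q = approx_k_pca(X, k)
```

With a large gap between the $k$-th and $(k+1)$-th eigenvalues, as in this synthetic example, a few dozen iterations are typically enough for the block to converge. The number of iterations needed in general depends on the spectrum, which is one reason the comparison in the text is stated only up to logarithmic factors.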